Gary L. Horowitz, MD, chair of the CAP Chemistry Resource Committee, didn’t exactly feel he was onto a scintillating topic two years ago when the Clinical and Laboratory Standards Institute asked him to lead an update of its standard document on reference intervals.
To prepare to chair the working group, he sat down with a copy of CLSI’s C28-A2, “Defining, Establishing, and Verifying Reference Intervals in the Clinical Laboratory.” Though it sets the gold standard in the field, it’s not the most exciting reading, he confessed in a talk at the CAP ’08 meeting. But reference intervals, the ranges of values for a laboratory test that are associated with defined states of health, are not a trivial subject: Invalid ones can lead to patients’ being taken off medication when they shouldn’t, being discharged too early, or to other negative outcomes.
C28-A3, the updated version of the reference interval report, not only adds guidelines on decision limits and re-emphasizes the importance of validating reference intervals, but also outlines new and more manageable protocols for labs to accomplish the task, said Dr. Horowitz, director of clinical chemistry at Beth Israel Deaconess Medical Center, Boston, and associate professor of pathology at Harvard Medical School.
“Every lab in this country is more than capable of validating the reference levels they’re using. It is very difficult to establish them, but validating them is becoming very straightforward.” An exciting but little-known CAP program—the Reference Range Service—gives labs even more powerful tools to have confidence in their reference intervals, he added.
One of the important changes in the new standard document is the introduction of the concept of decision limits: for tests with national guidelines, traditional laboratory-specific reference intervals do not apply. Cholesterol, hemoglobin A1c, neonatal bilirubin, glucose, urine albumin, and creatinine are tests “where accuracy trumps peer group assessment, where it’s not good enough to be just as good as your peers, and you don’t want to establish your own reference intervals; they would be counterproductive,” Dr. Horowitz said.
He pointed to CAP Surveys data from anonymized testing platforms for hemoglobin A1c that suggest there is a problem with accuracy in some cases. Using the reference method in the example, the true value was 8.4 percent, yet more than 50 percent of one analyzer’s results were less than 7.9 percent, while another analyzer’s results for 250 labs had more than 50 percent of values over 8.7 percent. (See “Hemoglobin A1c data.”) “That’s not too good, and yet because the CAP has traditionally done peer-group grading, that was deemed acceptable.”
The question, he said, is whether those labs even know that what they’re doing is not that accurate. “There are some instruments out there that are not performing adequately.” It’s not the individual laboratory that is at fault, he stressed. “We’ve identified the manufacturers and we want to encourage them to improve their assays. The best way we [the CAP] can accomplish that is to share our data and to give manufacturers advance warning that, for assays where we’re using real patient material, we’re moving from peer-group grading to accuracy grading.”
Similarly, with a Survey of neonatal bilirubin testing performance, the reference value of the Survey sample was 21.7 mg/dL and Dr. Horowitz’s laboratory’s mean value was 22.8, “so we were okay by the criterion of acceptable clinical performance within 10 percent. But for one platform, more than 50 percent of the laboratories had values that were greater than 10 percent from the truth, even though they were acceptable using peer group criteria.” Again, the pressure should be on the manufacturers to address this, he said. “I can tell you that two platforms over the last two years have had problems where the mean values have been off by roughly 20 percent.”
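The gap between the two grading schemes can be illustrated with a short sketch. Only the 21.7 mg/dL reference value, the 22.8 mg/dL laboratory mean, and the 10 percent criterion come from the talk; the function names, the peer-group criterion, and the biased platform's numbers are hypothetical and are not the CAP's actual grading rules.

```python
def percent_bias(measured, target):
    """Signed percent difference of a measured value from a target value."""
    return 100.0 * (measured - target) / target


def passes_accuracy_grading(measured, reference, limit_pct=10.0):
    """True if the result is within limit_pct of the reference-method value."""
    return abs(percent_bias(measured, reference)) <= limit_pct


def passes_peer_grading(measured, peer_mean, limit_pct=10.0):
    """True if the result is within limit_pct of the peer-group mean.
    Hypothetical criterion, for illustration only."""
    return abs(percent_bias(measured, peer_mean)) <= limit_pct


reference_value = 21.7  # mg/dL, Survey sample reference value (from the talk)
lab_mean = 22.8         # mg/dL, Dr. Horowitz's laboratory mean (from the talk)

lab_bias = percent_bias(lab_mean, reference_value)
lab_ok = passes_accuracy_grading(lab_mean, reference_value)
print(f"bias {lab_bias:.1f} percent, acceptable: {lab_ok}")

# Hypothetical platform whose whole peer group runs about 20 percent high.
peer_mean = 26.0
biased_result = 26.3
peer_ok = passes_peer_grading(biased_result, peer_mean)
accuracy_ok = passes_accuracy_grading(biased_result, reference_value)
print(f"peer-group grading: {peer_ok}, accuracy grading: {accuracy_ok}")
```

In this made-up case the biased result agrees with its peers yet misses the reference value by more than 10 percent, which is the situation the move to accuracy grading is meant to expose.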
With neonatal bilirubin, guidelines came out in 2004 that are used by neonatologists and obstetrician/gynecologists across the country to help decide when to discharge infants from the hospital, Dr. Horowitz pointed out (Management of hyperbilirubinemia in the newborn infant 35 or more weeks of gestation. Pediatrics. 2004;114:297–316). “The authors carefully reviewed a lot of data. What they determined is, for example, that a bilirubin of greater than 8 mg/dL in a 24-hour-old neonate is above the 95th percentile and requires careful monitoring; at 48 hours, the 95th percentile is 13 mg/dL. In the old days, laboratories focused on neonatal bilirubins at three days, where the critical values were close to 20. Now we really do have to be concerned about bilirubins that are much lower, because babies are getting discharged from the hospital much sooner, and we have to make sure our methods match the methods that were used to establish these national guidelines.”
For analytes where national guidelines do not exist, the CLSI document describes three ways to establish reference intervals: traditional methods, multicenter trials, and transference from existing reference intervals, Dr. Horowitz said.
Theoretically, each laboratory should establish its own reference intervals, but the recommended number of 120 reference samples is a high standard to meet. With multicenter trials, data from multiple laboratories can be pooled, for example by having six laboratories each collect 20 samples.
“It’s a good way to get at the magic number to set your own reference intervals. But if you are going to do a multicenter trial, there are a few things you need to do. You need to make sure everyone is using the same criteria for selecting their subjects. You need to make sure the methods used, even if they’re not the same, are all traceable so the values generated are comparable. And you need to have some kind of commutable sample, a sample with no matrix effects, included as quality control so you know everyone is measuring in the same way.”
The third method, transference, applies to laboratories that already have set their own reference intervals. “Let’s say you collected samples from 120 neonates, established your own reference intervals, did a wonderful job, and now you decide to change methods. Do you have to go back and do the study again? The answer may be no.” You may be able to take the data you collected before and transfer it to the new platform, he said.
Selecting reference individuals can require deciding between excluding certain people or partitioning people based on certain criteria, Dr. Horowitz said. Typical partitioning criteria would be gender, age, or race. Others are tobacco use, genetic factors, hospitalization, recent infection, or fasting status. “You may want to say if you’re not fasting you can’t participate, or you may want to look at the effects of fasting on the sample. Similarly with pregnancy, you may want to partition people based on those criteria.”
The CLSI guidelines strongly favor direct sampling techniques over indirect sampling techniques. In the former, strict criteria are applied to individuals, often with the aid of a lengthy questionnaire, who will potentially serve as reference individuals. The questionnaire data are used to exclude and to partition individuals whose samples will be used to establish the reference interval.
Indirect sampling is a legitimate alternative to direct sampling. “A lot of us would like to be able to use data we have on hand to see whether we can ‘back into’ reference intervals. It involves applying statistical techniques to values in a lab database that were collected for other reasons.” The Hoffman technique is an example of such an application. “The director of our hematology lab wanted to know whether the reference intervals for G6PD [glucose-6-phosphate dehydrogenase], which came from the package insert, were appropriate for her laboratory. I said, ‘What if we went back in our database and looked at the G6PD values we generate?’ Since we knew that the vast majority are from normal individuals, I was able to identify deficient patients from the rest of the values by carefully reviewing a histogram of our patient values from the past.”
With the Hoffman technique, if these values are plotted as a cumulative percentage and they end up on a straight line, it means they are normally distributed. The straight line can be used to interpolate whatever central percentage you want for reference intervals. “So using this technique, I could prove that the way we were running the test in our lab, our observed reference interval was comparable to the manufacturer’s.”
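A computerized stand-in for this graphical procedure might look like the sketch below. It is a simplification under stated assumptions, not the procedure Dr. Horowitz used: the function name, the choice to fit only the central 20th to 80th cumulative percentiles, the ordinary least-squares fit, and the synthetic data (arbitrary units) are all assumptions made for illustration.

```python
import random
from statistics import NormalDist


def hoffmann_interval(values, central=0.95, fit_range=(0.20, 0.80)):
    """Estimate a reference interval from mostly-normal routine results.

    Each sorted value is paired with the standard-normal z-score of its
    cumulative fraction (the axes of a normal probability plot). A straight
    line is fitted through the central part of the plot and extended to the
    z-scores that bound the requested central fraction.
    """
    xs = sorted(values)
    n = len(xs)
    std_normal = NormalDist()
    points = []
    for i, x in enumerate(xs, start=1):
        p = (i - 0.5) / n  # cumulative fraction, kept strictly inside (0, 1)
        if fit_range[0] <= p <= fit_range[1]:
            points.append((std_normal.inv_cdf(p), x))

    # Ordinary least squares for: value = intercept + slope * z
    mean_z = sum(z for z, _ in points) / len(points)
    mean_x = sum(x for _, x in points) / len(points)
    szz = sum((z - mean_z) ** 2 for z, _ in points)
    szx = sum((z - mean_z) * (x - mean_x) for z, x in points)
    slope = szx / szz
    intercept = mean_x - slope * mean_z

    z_limit = std_normal.inv_cdf(0.5 + central / 2)
    return intercept - slope * z_limit, intercept + slope * z_limit


# Synthetic database: mostly healthy results plus a few deficient patients.
rng = random.Random(42)
healthy = [rng.gauss(10.0, 2.0) for _ in range(2000)]
deficient = [rng.gauss(2.0, 0.5) for _ in range(40)]

lower, upper = hoffmann_interval(healthy + deficient)
print(f"estimated central 95 percent interval: {lower:.1f} to {upper:.1f}")
```

Because the line is fitted only to the central portion of the plot, the small cluster of deficient results has limited influence on the estimate. Whether the points really fall on a straight line still needs to be checked by eye or by a goodness-of-fit measure before the result is trusted.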
The technique could also be used with blood donor samples or with people screened for lead exposure, such as pediatric patients. “If people are largely normal, then you might consider doing these indirect techniques as a surrogate for getting representative samples.”
Dr. Horowitz cites another experience as evidence of how helpful this technique can be. “A few years ago a clinician called about our bicarbonate data, indicating that he was seeing a high proportion of abnormally high values in his patients. We reviewed our internal QC data and proficiency data, all of which seemed to indicate that our values were fine.” But when the laboratory looked at its outpatient data, it found 30 percent of outpatient samples had high bicarbonates. “Since the vast majority of our outpatients are healthy, we inferred that our method and our reference interval were not in synch.”
This was an important lesson, he said. “Neither QC nor proficiency testing can validate your reference interval. QC can tell you about your precision and whether or not your values are changing over time. Proficiency testing can tell you whether your values match your peer group. But maybe the manufacturer’s reference range was established incorrectly, or maybe your patients are different in some way from the population the manufacturer used to establish its recommendations.” This is why package inserts always say each laboratory must validate its own reference interval, Dr. Horowitz said. “It’s relatively easy to go back to your data. If you took just 10 normal outpatients a day for several days and if 30 percent of them had high bicarbonates, I would venture to say I can tell you what platform you’re using. So you have ways at your disposal to do validation of your reference intervals.” In this case, for a number of reasons, Dr. Horowitz made the decision to change the reference interval for bicarbonate.
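The check described here amounts to counting how many results from presumably healthy outpatients fall outside the interval in use. In the sketch below the function name, the bicarbonate values, and the 22 to 30 mmol/L interval are hypothetical; the talk does not give the laboratory's actual interval.

```python
def fraction_outside(values, lower, upper):
    """Return the fractions of values below the lower and above the upper limit."""
    n = len(values)
    below = sum(1 for v in values if v < lower)
    above = sum(1 for v in values if v > upper)
    return below / n, above / n


# Hypothetical bicarbonate results (mmol/L): 10 healthy outpatients a day for two days.
bicarbonate = [24, 25, 26, 27, 28, 29, 31, 32, 25, 26,
               27, 33, 31, 28, 29, 30, 31, 32, 26, 27]

# Hypothetical reference interval currently in use.
frac_low, frac_high = fraction_outside(bicarbonate, lower=22, upper=30)
print(f"below: {frac_low:.0%}, above: {frac_high:.0%}")
```

For an interval meant to cover the central 95 percent of healthy individuals, roughly 2.5 percent of such results would be expected above the upper limit, so a figure near 30 percent suggests that the method and the interval do not match.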
Preanalytical considerations such as subject and specimen preparation are also important factors in valid reference intervals. “Things as simple as how long the individual sits before blood is drawn can make a tremendous difference on some analytes—as much as 20 percent. At what force were the samples centrifuged? At what temperature, and for how long, were the samples stored before analysis? All of these factors need to be defined in connection with doing reference intervals.”
For the nonparametric method of setting reference intervals, which the CLSI document strongly endorses as the preferred method, Dr. Horowitz explained that 120 samples are needed for each potential partition. That means if there are different intervals for men and women, 240 are needed. “Once you add age groups or ethnic background, you’re talking about hundreds of patients. So when you start thinking about partitions, you’ve got trouble.” Before collecting data for more than one partition, ask whether the groups differ enough that separate reference intervals are required. (Unfortunately, the only way to know they’re not needed is to collect the data and prove it.)
Dr. Horowitz called attention to a recent study where this approach was demonstrated perfectly. (Brewster LM, et al. Distribution of creatine kinase in the general population: implications for statin therapy. Am Heart J. 2007;154:655–661). In the study, the authors collected data from several hundred healthy men and women of different ethnicities and showed convincingly that CK reference intervals are affected not only by gender but also by ethnicity; manufacturers’ reference intervals may be twofold to threefold too low for some ethnic groups. Though CK does not need to be measured routinely in patients taking cholesterol-lowering drugs, many physicians still do it. If labs are using manufacturers’ recommended CK reference intervals, some patients may appear to have abnormal results, when in fact they do not, and be inappropriately taken off these drugs.
One key element gets almost no attention, he said: the confidence with which you know the upper and lower limits. “How tightly do we know those limits? If you look at statistics, you actually need only 39 patients to use the nonparametric approach.” The problem is that the confidence limits are so wide at the upper and lower limits of 39 data points that they’re worthless, he said, because you’re depending on the first and last points to establish the reference interval. Thus, to get adequately narrow confidence limits with the nonparametric technique, statisticians recommend 120 per partition. “And that sounds like a lot to us because it is a lot.”
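The arithmetic behind the figure of 39 can be sketched as follows. The code assumes the common rank convention r = p(n + 1) for the 2.5th and 97.5th percentiles, with linear interpolation between neighboring observations; the function names and the stand-in data (the integers 1 to n, chosen so that each value equals its rank) are illustrative.

```python
def value_at_rank(sorted_values, rank):
    """Value at a 1-based rank, interpolating linearly between neighbors."""
    whole = int(rank)
    frac = rank - whole
    if frac == 0:
        return sorted_values[whole - 1]
    left = sorted_values[whole - 1]
    right = sorted_values[whole]
    return left + frac * (right - left)


def nonparametric_interval(values, lower_p=0.025, upper_p=0.975):
    """Rank-based reference limits using the assumed convention r = p * (n + 1)."""
    xs = sorted(values)
    n = len(xs)
    # Rounding guards against floating-point noise in the rank arithmetic.
    lower_rank = round(lower_p * (n + 1), 9)
    upper_rank = round(upper_p * (n + 1), 9)
    if lower_rank < 1 or upper_rank > n:
        raise ValueError("too few observations for these percentiles")
    return value_at_rank(xs, lower_rank), value_at_rank(xs, upper_rank)


# Stand-in data: each value equals its own rank, so the output shows the ranks used.
limits_39 = nonparametric_interval(list(range(1, 40)))
limits_120 = nonparametric_interval(list(range(1, 121)))
print("n = 39 :", limits_39)
print("n = 120:", limits_120)

try:
    nonparametric_interval(list(range(1, 39)))  # n = 38 is one too few
    too_few_rejected = False
except ValueError:
    too_few_rejected = True
print("n = 38 rejected:", too_few_rejected)
```

With 39 observations the limits land exactly on the smallest and largest values, which is why the estimate is so fragile. With 120, the limits sit a few observations in from each end.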
Robust techniques offer a potential practical alternative, especially in cases where it’s virtually impossible to collect 120 samples per partition. “A robust statistic is one that is less sensitive to extreme points,” Dr. Horowitz explained. For example, a mean is not a robust statistic because if the last point in a distribution is changed, the mean shifts; that value contributes a lot of weight to the mean. But the median doesn’t change at all. It’s a “robust” statistic. Robust techniques are described in more detail in the CLSI document and are available in standard statistical packages, but Dr. Horowitz cautioned that they should be used only in conjunction with knowledgeable statisticians. “So as intimidating as the statistics look, the bottom line is it’s really hard to get reference individuals,” he said. “If you can cut the number you need and determine reference intervals from 40 to 60 to 80 points, it makes life a lot easier.”
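The mean-versus-median contrast can be seen in a few lines. The values are made up, and the sketch illustrates only what “robust” means; it is not the robust reference interval procedure described in the CLSI document.

```python
from statistics import mean, median

# Made-up results in arbitrary units.
values = [4.1, 4.3, 4.4, 4.6, 4.8, 5.0, 5.2]

# Same data, but the last point is replaced by an extreme value.
with_extreme = values[:-1] + [52.0]

mean_before, mean_after = mean(values), mean(with_extreme)
median_before, median_after = median(values), median(with_extreme)

print(f"mean:   {mean_before:.2f} -> {mean_after:.2f}")
print(f"median: {median_before:.2f} -> {median_after:.2f}")
```

Changing a single extreme point moves the mean substantially while the median stays where it was.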
Whenever a laboratory brings a new method in house, it should do a correlation study with its existing method. If it has a currently valid reference interval, it can use the data from the correlation study to transfer the reference interval to the new method. The advantage of this kind of transference, he said, is that the laboratory doesn’t have to obtain samples from reference individuals.
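One way to picture transference is to fit a line through the paired results from the correlation study and map the old limits through it. The sketch below uses ordinary least squares, which is an assumption; method-comparison studies often use other regression models, such as Deming or Passing-Bablok. The function names, the paired data, and the old interval are hypothetical.

```python
def fit_line(old_results, new_results):
    """Least-squares slope and intercept for: new = intercept + slope * old."""
    n = len(old_results)
    mean_old = sum(old_results) / n
    mean_new = sum(new_results) / n
    sxx = sum((x - mean_old) ** 2 for x in old_results)
    sxy = sum((x - mean_old) * (y - mean_new)
              for x, y in zip(old_results, new_results))
    slope = sxy / sxx
    return slope, mean_new - slope * mean_old


def transfer_interval(old_interval, slope, intercept):
    """Map the old method's reference limits onto the new method's scale."""
    low, high = old_interval
    return intercept + slope * low, intercept + slope * high


# Hypothetical paired results: same samples measured on both methods.
old_method = [2.0, 3.5, 5.0, 6.5, 8.0, 9.5]
new_method = [2.7, 4.4, 5.9, 7.7, 9.3, 11.0]

slope, intercept = fit_line(old_method, new_method)
new_low, new_high = transfer_interval((3.0, 9.0), slope, intercept)
print(f"new = {intercept:.2f} + {slope:.2f} * old")
print(f"transferred interval: {new_low:.1f} to {new_high:.1f}")
```

A real correlation study would use many more samples spanning the reference interval, and the transferred interval would still be worth validating on the new method.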
Validating, as opposed to establishing, reference intervals is, in fact, relatively easy and can be done with samples from just 20 reference individuals. If no more than two of those 20 values are outside your lab’s proposed reference interval, you can adopt that reference interval; it’s statistically valid. It’s supported by CLSI, and you have done an excellent job of validating the reference interval. Surprisingly, he said, all the statistical gurus agree that this is true and the validation is not a shortcut. However, “if three out of the 20 are out, you’ve got a problem and must investigate further.”
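The validation rule is simple enough to express directly. In the sketch below the function names, the 20 results, and the proposed interval are hypothetical. The binomial calculation additionally assumes independent samples and an interval that truly excludes 5 percent of healthy individuals.

```python
from math import comb


def validate_reference_interval(values, lower, upper, max_outside=2):
    """Apply the rule above: accept if no more than max_outside values fall outside."""
    outside = [v for v in values if v < lower or v > upper]
    return len(outside) <= max_outside, outside


def prob_at_most(k, n=20, p=0.05):
    """Binomial probability that at most k of n values fall outside the interval."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))


# Hypothetical results from 20 reference individuals (arbitrary analyte and units).
results = [3.9, 4.1, 4.4, 3.6, 4.8, 5.0, 4.2, 4.3, 3.8, 4.6,
           4.7, 4.0, 5.3, 4.5, 4.1, 3.7, 4.9, 4.4, 3.4, 4.2]

accepted, outside = validate_reference_interval(results, lower=3.5, upper=5.1)
print("accepted:", accepted, "outside:", outside)
print(f"chance a correct interval passes: {prob_at_most(2):.3f}")
```

Under those assumptions a correct interval passes the check roughly 92 percent of the time, so a failure is a prompt to investigate further rather than proof that the interval is wrong.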
Dr. Horowitz called on manufacturers to make improvements in the reference interval information in their package inserts. “We need to know whether they looked at common partitioning factors such as age, fasting, and gender. And for tests where there are well-known sub-class differences based on partitioning factors, give us those differences. They should have a statement on the traceability of their method.” And at a minimum, he said, they should always indicate the actual number of individuals studied and the percentiles used for the reference limits. “Believe it or not, in many package inserts you can’t tell if the reference interval represents the central 95 percent of individuals studied [2.5th to 97.5th percentile], the central 90 percent of individuals studied [5th to 95th percentile], or the 99th percentile. It doesn’t say.”
In the interim, laboratories can take advantage of the CAP Reference Range Service. “This is one of the greatest innovations, but hardly anybody knows about it,” Dr. Horowitz said. The program consists of four modules offering service for more than 40 chemistry and hematology analytes. Labs collect their own data on a limited number of reference individuals and submit it to the CAP. The CAP pools the data with comparable data from other labs using the same method. “And they have lots of data, from lots of labs, from lots of instruments. They slice and dice and do magical things with it using the techniques described in the CLSI document.”
Participating laboratories then get a detailed statistical analysis of their own data, as well as the data from all the other labs using their methods. In most cases, there is more than enough pooled data for the CAP to establish recommended reference intervals—often by gender, sometimes by ethnicity. The more laboratories that participate, the more data the CAP has, and the better the results for everybody, Dr. Horowitz said.
In summary, the new CLSI standard document allows labs to validate reference intervals with smaller sample sizes and more practical techniques, it re-asserts the manufacturers’ responsibility for accuracy in the tests with national guidelines, and it provides many resources for performing reference interval studies—all of which will help ensure that reference intervals are valid and that laboratory results are clinically useful, Dr. Horowitz said.
“Clinicians are just overwhelmed with information. One of the most common and important things they do is evaluate whether lab values are ‘normal’ or ‘abnormal.’ The least we can do is make sure our reference intervals are valid.”
Anne Paxton is a writer in Seattle.
Doing It By the Numbers: Making Reference Intervals Manageable
Gary L. Horowitz, MD
Director of Clinical Chemistry
Beth Israel Deaconess Medical Center
- Obtain samples from 20 reference individuals.
- If no more than two of those 20 values are outside your lab’s proposed interval, you can adopt that statistically valid interval.
- If three of the 20 are outside, investigate further.