PAP/NGC Programs Review
Jonathan H. Hughes, MD, PhD,
Nancy A. Young, MD,
David C. Wilbur, MD
The Centers for Medicare and Medicaid Services presented on Feb. 8 an update on the findings of the 2005 national cytology proficiency testing program. The two CMS-approved testing programs for 2005 were the State of Maryland Cytology Proficiency Testing Program and the Midwest Institute for Medical Education Inc., or MIME.
The testing process includes an initial 10-slide examination with subsequent second and third tests for initial failures. As of Jan. 31, 2006, 12,786 individuals (pathologists and cytotechnologists) had taken either the Maryland or MIME initial examinations, and an overall failure rate of nine percent was observed. The first testing event failure rate for cytotechnologists was seven percent, while the failure rates for pathologists who perform primary screening or secondary screening were 33 percent and 10 percent, respectively. For individuals who failed the first testing event, the failure rates for the second testing events were four percent for cytotechnologists, 34 percent for primary screening pathologists, and eight percent for secondary screening pathologists. The third testing event yielded failure rates of 11 percent (cytotechnologists), 20 percent (primary screening pathologists), and 13 percent (secondary screening pathologists). The overall failure rate on the first testing event with the 2005 MIME test was nine percent, which is similar to historical first testing event failure rates observed with the Maryland test (11 percent in 1990 and six percent in 1995). The MIME test demonstrated only minor differences in slide performance among conventional, ThinPrep, and SurePath preparations.
These preliminary results appear to demonstrate obvious trends: Cytotechnologists perform better than pathologists; pathologists who are secondary screeners perform better than those who are primary screeners; and test performance improves with re-testing. Proponents of the current proficiency testing program would probably argue that these trends are proof that PT successfully identifies subsets of individuals who have substandard interpretive abilities, and that the performance improvement that occurs upon repeat testing of these individuals is evidence of the PT program’s ability to encourage self-study and continuing education, with the desired result of improving patient care. However, there are alternative ways to interpret these findings and their implications for public health and patient safety.
The most important issue to consider when drawing conclusions about the 2005 proficiency testing data is whether the test is a valid measure of interpretive ability in the first place. There are many reasons why it may not be. For many pathologists, the testing conditions do not mirror the real-world practice situation. In the interests of good patient care and quality assurance, most pathologists will show challenging cases or high-grade lesions to colleagues, in an effort to reach a consensus or “laboratory” interpretation. However, this type of collegial consultation is not permitted on a PT examination.
Even more worrisome is that reproducible and reliable testing material is extremely difficult to find. It is well documented that the Pap test is associated with considerable interobserver variability, even among so-called experts. Is a test slide consisting of primarily koilocytes and low-grade dysplastic cells, in which occasional cells that possess slightly higher nuclear:cytoplasmic ratios are often found, a straightforward category C (LSIL) designation? Could this be a “tricky” category D (HSIL) slide that is meant to test the ability of the examinee to identify a few high-grade cells? Anecdotal evidence from test takers in the current cycle attests to these types of slides being present in the testing event, with answers being rendered on both sides by otherwise “passing” and hence “qualified” observers.
Of course, some of these problems of interobserver variability could be reduced by using only slides that have been field-validated by a large number of individuals in the PT slide sets. Unfortunately, the 2005 MIME test did not use field-validated slides; the MIME slides are selected by a three-member panel of pathologists. Our experience on the CAP Cytopathology Committee suggests that using a three-person panel is not a scientifically valid method for choosing testing material. A significant number of slides that are felt to be good example cases by three members of the CAP Cytopathology Committee (cytopathology board-certified practitioners) do not perform well when they are circulated in the educational arm of the Pap program. Fortunately, these slides, which do not field validate, never enter graded Pap sets or the slide sets used for CAP-administered proficiency testing. In fact, preliminary data from the CAP mock PT examination administered in 2005, which used only field-validated slides, indicate an initial failure rate that is about half that of the MIME test.
The 2005 proficiency testing data demonstrate that performance improves with repeat testing. Does this mean that the process of participating in PT has the desired effect of improving the diagnostic abilities of people who fail initially, or is the improved performance on repeat testing simply a result of an improvement in test-taking strategies? The phenomenon of improved test performance with repeat testing has been documented on many standardized tests (for example, SAT, MCAT, LSAT) and even on I.Q. tests, and is usually attributed to “gaming” the test rather than an increase in knowledge or ability (or I.Q.!) [1,2]. Preliminary data from the CAP mock proficiency test indicate that test gaming is a real phenomenon. For example, herpes infection, which has always been identified with an extremely high degree of accuracy among cytotechnologists in the CAP Pap program, was over-diagnosed as LSIL and HSIL by cytotechnologists in the CAP mock PT. It is unlikely that this reflects a sudden inability to recognize herpes on the part of cytotechnologists. Rather, because the test penalizes cytotechnologists more heavily for under-diagnosing HSIL than for over-diagnosing infection, it appears that cytotechnologists may purposefully upgrade category B slides to category C and D interpretations, to “hedge” against the possibility of under-diagnosing a category D slide, which can result in a one-slide failure.
Of course, this can be problematic for the pathologist who subsequently receives the slides and the cytotechnologist’s diagnoses. Specifically, in this PT scenario, the pathologist may find himself disagreeing with many of the diagnoses rendered by a cytotechnologist with whom he rarely disagrees in real-life practice. The cytotechnologist cannot be blamed for employing test-taking strategies that maximize the likelihood of passing the test. Nonetheless, these strategies produce a situation for the pathologist that does not reflect daily practice, and that is yet another example of how proficiency testing does not test real-life diagnostic ability.
There is little doubt that pathologists who perform primary screening do not perform very well on the test. But what are the implications for public health if large numbers of primary screening pathologists fail the test and are no longer able to sign out Pap smears? Most pathologists who do primary screening do not do it by choice. Many work in rural or other underserved areas, where they do not have access to a cytotechnologist. The nationwide shortage of cytotechnologists, which seems to be a chronic problem, makes it unlikely that many of these underserved areas will attract qualified cytotechnologists at any time in the foreseeable future. If the primary screening pathologists in these areas are no longer allowed to sign out Pap tests because they fail the regulatory PT, the women they serve may lose access to Pap tests altogether, or specimens will have to be transported long distances to other laboratories where contact with clinicians and turnaround times will be suboptimal.
It is unclear whether PT, as currently administered, achieves the desired goals of identifying individuals who need to sharpen their diagnostic skills and of improving patient care. Current PT does not reflect real-life practice. Moreover, there is little evidence to indicate that a 10-slide test, the “correct” answers for which are determined by a panel of three pathologists, is a scientifically valid measure of performance. Even though it is well established that there is interobserver variability among experts on the diagnoses of LSIL and HSIL, the current test-grading scheme penalizes pathologists (but not cytotechnologists) for LSIL-HSIL discrepancies. The differential grading systems for cytotechnologists and pathologists produce test-taking strategies and issues of test gaming that may decrease performance.
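As a rough illustration of why a 10-slide test may be a noisy measure, the sketch below uses a deliberately simplified model that is our assumption and not the actual CMS scoring scheme: each slide is treated as an independent trial, the examinee matches the reference answer with a fixed probability, and the test is failed when more than one slide is missed. The function name, the parameters, and the one-miss threshold are all hypothetical.

```python
from math import comb


def fail_probability(per_slide_agreement: float,
                     slides: int = 10,
                     max_misses: int = 1) -> float:
    """Chance of failing a short slide test under a simple binomial model.

    Assumptions (illustrative only, not the regulatory grading rules):
    - every slide is an independent trial;
    - the examinee agrees with the reference answer on each slide with
      probability `per_slide_agreement`;
    - the examinee passes if at most `max_misses` slides are missed.
    """
    miss = 1.0 - per_slide_agreement
    # Probability of missing 0, 1, ..., max_misses slides.
    p_pass = sum(
        comb(slides, k) * miss ** k * per_slide_agreement ** (slides - k)
        for k in range(max_misses + 1)
    )
    return 1.0 - p_pass


if __name__ == "__main__":
    # Hypothetical per-slide agreement levels, chosen only for illustration.
    for agreement in (0.99, 0.95, 0.90):
        print(f"agreement {agreement:.2f}: "
              f"failure probability {fail_probability(agreement):.3f}")
```

Under these assumptions, an examinee who agrees with the reference answer on 95 percent of slides would still fail roughly nine percent of the time, and one who agrees on 90 percent would fail about one time in four. The real test weights errors differently by category and by examinee type, so these figures only indicate how sensitive a short test can be to slide-level disagreement.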
If one adds up all of these shortcomings and their compounding negative effects on test performance, there can be little doubt that some extremely well-qualified cytotechnologists and pathologists will fail the test. Even if these individuals pass on a subsequent examination, they will undoubtedly experience some degree of embarrassment or a loss of confidence. If excellent diagnosticians decide that it is “just not worth it,” they might leave the practice of gynecologic cytology. Such a trend could have an adverse impact on the public health of underserved areas, an unintended consequence of regulatory proficiency testing.
1. Kulik JA, Kulik CC, Bangert RL. Effects of practice on aptitude and achievement test scores. American Educational Research
2. Wing H. Practice effects with traditional mental test items. Applied Psychological Measurement. 1980;4:141–155.
Dr. Hughes is staff pathologist, Laboratory Medicine Consultants, Las Vegas; Dr. Young is senior member, Department of Pathology, Fox Chase Cancer Center, Philadelphia; and Dr. Wilbur is director of cytology, Massachusetts General Hospital, Boston. Dr. Wilbur is chair of the CAP Cytopathology Committee; Drs. Hughes and Young are members.