The CAGE questionnaire is commonly used as a brief screening test for alcohol dependence (Ewing, 1984). It is scored on a five-point scale of 0–4. Ask most medical students (and some professors) what a positive test (such as a score of 2 out of 4) means and they are likely to answer that alcohol dependence is present. And if the test is negative (a score of 0, indicating no affirmative answers), then the condition is absent. This reductionist way of interpreting test results, although commonplace in medicine, is wrong: it is quite possible that someone scoring 4/4 on CAGE will not have alcohol dependence and someone scoring 0/4 will. Thankfully, in psychiatry this elemental approach to diagnostic tests is rare, but possibly only because psychiatrists use fewer tests than their colleagues in physical medicine.
The use of tests in psychiatry has always been widespread in research, as they are the main method of measuring outcome in clinical trials. Tests are also increasingly common in clinical practice; for example, neuropsychological testing in the diagnosis of dementia (De Jager et al, 2003) or screening for depression in general practice (Henkel et al, 2003). Therefore, whether for the purpose of interpreting results of clinical trials or for routine clinical practice, an understanding of the use of tests, their interpretation and limitations is helpful.
Tests tend to fall into two broad categories: those used for screening and those used for diagnosis. However, many of the principles used in understanding these different applications are similar.
Screening tests should be easy to administer, acceptable to patients, have high sensitivity (i.e. identify most of the individuals with the disease), identify a treatable disorder and identify a disorder where intervention improves outcome (Wilson & Jungner, 1968).
Diagnostic tests, on the other hand, are meant to provide the user with some surety that a disease is present.
No diagnostic test is 100% accurate, even those based on pathology results, although for the purposes of understanding the interpretation of tests it is necessary to suspend disbelief and assume that the reference standard (or gold standard) diagnostic procedure against which another test is compared is 100% accurate (Warner, 2003). The reference standard test may be another questionnaire, a structured interview to provide diagnoses (e.g. using the DSM or ICD) or an interview with a clinician. Rarely in psychiatry, the reference standard may be derived from pathology, such as brain histopathology in dementia studies.
Example 1: The usefulness of the CAGE
Bernadt et al (1982) studied the usefulness of the CAGE in a psychiatric setting. Out of a sample of 371 patients on a psychiatric unit, 49 were diagnosed as having alcohol dependence on the basis of a comprehensive assessment by clinicians, which was the reference standard in this study. The authors compared these assessments with the sample's CAGE results, using a score of 2 or more to indicate a positive result, i.e. alcohol dependence. The CAGE was positive for 45 of these 49 patients, i.e. it correctly identified 92% of those thought to have alcohol dependence by the clinicians. (This is the CAGE's sensitivity.) However, using this 2/4 cut-off, the CAGE missed 4 patients (8%) defined by the reference standard as having alcohol dependence. Of the 322 patients thought not to have alcohol dependence by a clinician, 248 scored <2, i.e. below the cut-off. Thus, the CAGE correctly excluded 77% of those who did not have alcohol dependence. (This is its specificity.) The CAGE incorrectly suggested that 74 people had alcohol dependence when the reference standard of the clinical assessment said they did not.
These results are shown in Table 1. The format of this table, often referred to as a 2×2 table, is conventional for presenting such data. The results of the reference standard are given in the columns, and those of the diagnostic or screening test in the rows. The results of the reference standard are read vertically, down the columns of the table. To see how the other test (screening or diagnostic) performed, read horizontally, across the rows. Not all authors use this convention of putting the reference standard at the top of the table.
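Table 1 Comparison of clinician-assessed (the reference standard) and CAGE-identified alcohol dependence (n = 371)
| CAGE result | Alcohol dependence (clinician) | No alcohol dependence (clinician) | Total |
|---|---|---|---|
| Positive (≥2) | 45 | 74 | 119 |
| Negative (<2) | 4 | 248 | 252 |
| Total | 49 | 322 | 371 |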
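The arithmetic behind Example 1 can be sketched in a few lines of Python. The function names and the variables `TP`, `FN`, `FP` and `TN` are illustrative labels introduced here, not terms used by Bernadt et al; the counts are those quoted in the example.

```python
def sensitivity(true_pos: int, false_neg: int) -> float:
    """Proportion of reference-standard cases that the test identifies."""
    return true_pos / (true_pos + false_neg)


def specificity(true_neg: int, false_pos: int) -> float:
    """Proportion of reference-standard non-cases that the test excludes."""
    return true_neg / (true_neg + false_pos)


# Counts quoted in Example 1 for a CAGE cut-off of 2 or more.
TP, FN, FP, TN = 45, 4, 74, 248

print(f"sensitivity = {sensitivity(TP, FN):.2f}")  # 45 of 49 cases
print(f"specificity = {specificity(TN, FP):.2f}")  # 248 of 322 non-cases
```

Both measures are read down the columns of the 2×2 table: each uses only the patients in one reference-standard group.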
The ideal test is one that has very high sensitivity and specificity, so that most true cases are identified and most non-cases are excluded. However, sensitivity and specificity change in opposite directions as the cut-off (cut-point) of a test changes, and there is a trade-off between maximising sensitivity or specificity. For example, if the threshold for diagnosing alcohol dependence using the CAGE were to be made more demanding by increasing it to 3 or more positive answers (a score of ≥3), some patients who were truly alcohol dependent would no longer be identified by the test, so its sensitivity would decrease. On the other hand, using a ≥3 cut-off would mean that fewer patients would be mis-identified as possibly having alcohol dependence when they did not, so the specificity would increase. The sensitivity and specificity always have this inverse relationship.
Positive and negative predictive values
Sensitivity and specificity are characteristics of how accurate a test is. This is not particularly important to patients, who are much more likely to want an answer to the questions "If I score positive for a test, what is the likelihood that I have the disease?" or "If I score negative, what is the likelihood I don't have the disease?". To answer these questions, refer again to Table 1, but this time look horizontally, across the rows. Reading along the row marked "Positive", a total of 119 people scored positive on the CAGE, but only 45 (38%) had alcohol dependence as defined by the reference standard. This is the test's positive predictive value (PPV). For the 252 individuals who scored negative on the CAGE, 248 did not have alcohol dependence as defined by the reference standard: a negative predictive value (NPV) of 98%.
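As a rough illustration of reading across the rows, the predictive values can be computed from the same four counts. The function name `predictive_values` is an assumption made for this sketch.

```python
def predictive_values(true_pos: int, false_pos: int, false_neg: int, true_neg: int):
    """Return (PPV, NPV) from the four cells of a 2x2 table."""
    ppv = true_pos / (true_pos + false_pos)   # among all test-positives
    npv = true_neg / (true_neg + false_neg)   # among all test-negatives
    return ppv, npv


# CAGE counts from Example 1: 119 test-positives and 252 test-negatives.
ppv, npv = predictive_values(true_pos=45, false_pos=74, false_neg=4, true_neg=248)
print(f"PPV = {ppv:.0%}, NPV = {npv:.0%}")
```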
Unlike sensitivity and specificity, the PPV and NPV will change with the prevalence of a disease. For rare diseases, the PPV will always be low, even when a test is near perfect in terms of sensitivity and specificity.
Example 2: Near-perfect risk assessment of a rare event
Table 2 compares data on murders committed in England with the results of an imaginary near-perfect (sensitivity and specificity of 99%) test designed to identify who will commit murder. The figures are invented but are close enough to make the point. Unfortunately, a real test of this accuracy is unavailable, and current predictive tests of human behaviour have much lower sensitivities and specificities (closer to water divining!), so the PPV will be much lower than in this example.
Table 2 Number of individuals who commit murder in one year compared with the number predicted by a fictitious, near-perfect test to identify murderers
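The dependence of the PPV on prevalence can be sketched as follows, keeping sensitivity and specificity fixed at 99%. The prevalence values in the loop are illustrative assumptions chosen for this sketch; they are not the figures behind Table 2.

```python
def ppv(sens: float, spec: float, prevalence: float) -> float:
    """Positive predictive value from sensitivity, specificity and prevalence."""
    true_pos = sens * prevalence                # cases correctly flagged
    false_pos = (1 - spec) * (1 - prevalence)   # non-cases wrongly flagged
    return true_pos / (true_pos + false_pos)


# Illustrative prevalences only (hypothetical, not taken from Table 2).
for prevalence in (0.5, 0.1, 0.01, 0.001, 0.00001):
    print(f"prevalence {prevalence:<8}: PPV = {ppv(0.99, 0.99, prevalence):.4f}")
```

Even with a near-perfect test, the false positives drawn from the very large group of non-cases swamp the true positives once the condition becomes rare.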
Example 3: ROC curves
Figure 1 shows ROC curves for two tests of cognitive function. The Mini-Mental State Examination (MMSE) is a brief test of cognitive function used to screen for dementia (Folstein et al, 1975). The MMSE scores between 0 (very impaired cognition) and 30 (very good cognition). The cut-off for suspecting dementia is usually taken as 24. The 3MS is a modified version of the MMSE (McDowell et al, 1997). In Fig. 1, the sensitivity and 1 − specificity of the MMSE and the 3MS, measured against the reference standard, are plotted for each test score (e.g. in the MMSE 30–0). If, say, a cut-off of 29/30 is selected for the MMSE, so that all those scoring below 29 were suspected of having dementia, the sensitivity would be very high, but the specificity would be low (giving high false-positive rates). This point would appear on the graph near the top right. If a cut-off of 5/30 is used, the sensitivity would be low (lots of people scoring 6 and above would be told they did not have dementia) but the specificity high (very few people scoring below 5 would not have dementia). This would be near the bottom left of the graph.
Fig. 1 A receiver operating characteristic curve comparing the MMSE (squares) with the 3MS (bars). The curves follow the plots of sensitivity and 1 − specificity (false positives) for each test score. The 3MS has a greater area under the curve (the space between the 45° diagonal line and the curve) and is closer to the ideal (sensitivity and specificity of 1); it is therefore possibly a better test. (Source: McDowell et al, 1997. © Elsevier Science, with permission.)
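How the points on such a curve arise can be sketched with a small calculation. The scores below are invented for illustration and are not data from McDowell et al; the function name `roc_points` is likewise an assumption. As with the MMSE, a score below the cut-off is treated as a positive test.

```python
def roc_points(case_scores, noncase_scores, cutoffs):
    """Return (cut-off, 1 - specificity, sensitivity) for each cut-off.

    A score below the cut-off counts as a positive test.
    """
    points = []
    for cutoff in cutoffs:
        sens = sum(s < cutoff for s in case_scores) / len(case_scores)
        false_pos_rate = sum(s < cutoff for s in noncase_scores) / len(noncase_scores)
        points.append((cutoff, false_pos_rate, sens))
    return points


# Hypothetical MMSE-style scores (0-30), invented for this sketch.
dementia = [12, 18, 20, 22, 23, 25, 27]
no_dementia = [22, 25, 26, 27, 28, 29, 29, 30]

for cutoff, fpr, sens in roc_points(dementia, no_dementia, [5, 24, 29]):
    print(f"cut-off {cutoff:>2}: sensitivity = {sens:.2f}, 1 - specificity = {fpr:.2f}")
```

With these made-up scores, the low cut-off sits at the bottom left of the plot and the high cut-off near the top right, mirroring the description in Example 3.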
The LR+ is the likelihood of a positive test result when the diagnosis is present divided by the likelihood of a positive test result when the disease is absent. It can be calculated from the formula:
$$\mathrm{LR{+}} = \frac{\text{sensitivity}}{1 - \text{specificity}}$$
which provides a single number that can inform how useful a positive test is in clinical practice.
Similarly, the LR− can be calculated from:
$$\mathrm{LR{-}} = \frac{1 - \text{sensitivity}}{\text{specificity}}$$
Example 4: The LR+ and LR− for ≥2 on the CAGE
Using the sensitivity and specificity calculated in Example 1 from the values in Table 1, we can calculate the LR+ and LR− for scoring 2 or more on the CAGE:
$$\mathrm{LR{+}} = \frac{0.92}{1 - 0.77} = 4$$
$$\mathrm{LR{-}} = \frac{1 - 0.92}{0.77} = 0.1$$
You need calculate the likelihood ratios only once, as they will not change for different patients, provided the setting that the test was validated in is similar to that for the patients in question. The likelihood ratios do not change with different prevalences.
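A minimal sketch of the likelihood-ratio calculation, using the unrounded sensitivity and specificity from Example 1, is given below. The function name `likelihood_ratios` is an assumption made for illustration.

```python
def likelihood_ratios(sens: float, spec: float):
    """Return (LR+, LR-) from sensitivity and specificity."""
    lr_pos = sens / (1 - spec)   # positive result: cases vs non-cases
    lr_neg = (1 - sens) / spec   # negative result: cases vs non-cases
    return lr_pos, lr_neg


# Sensitivity 45/49 and specificity 248/322 for CAGE scores of 2 or more.
lr_pos, lr_neg = likelihood_ratios(45 / 49, 248 / 322)
print(f"LR+ = {lr_pos:.1f}, LR- = {lr_neg:.1f}")
```

Because both ratios are built only from sensitivity and specificity, prevalence does not enter the calculation.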
Pre- and post-test probabilities
Likelihood ratios are a useful way of informing you how much more (or less) likely a condition is, given a positive (or negative) test. To use likelihood ratios, it is important to have a sensible estimate of the probability that a condition is present before the test is done. This pre-test probability may be based on evidence (such as epidemiological studies of prevalence) and/or clinical intuition after assessing a patient.
Example 5: Estimating pre-test probability
Try to answer these questions below, using only the information provided and your clinical experience:
Most people would have a different answer for each scenario, with the probability rising significantly for the last one. Our clinical estimation of pre-test probability is quite subtle and is likely to change with each additional piece of information.
Once some sensible estimate has been made, the use of a screening or diagnostic test with a known likelihood ratio can then provide a post-test probability.
Example 6: Estimating a post-test probability
We found in Example 4 that the LR+ in psychiatric patients scoring ≥2 on the CAGE is 4. Odds is the number of events divided by the number of non-events. From Table 1, the odds of alcohol dependence in the overall sample (the pre-test odds) are 49/322 or 0.15. The odds of alcohol dependence being present in those positive for the test (the post-test odds) are 45/74 or 0.61. These odds and the LR are linked: Bayes' theorem states that
$$\text{post-test odds} = \text{pre-test odds} \times \text{likelihood ratio}$$
In other words, in our example a positive likelihood ratio of 4 means that the odds of a disease being present in those positive for the test are 4 times greater than the odds of diagnosis in the original sample, before the test was applied. Unfortunately, we cannot multiply a probability by a likelihood ratio: probabilities need to be converted to odds before the likelihood ratio can be used.
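As a check on this relationship, the counts from Example 1 can be substituted directly. The working below is a sketch using the unrounded fractions, which is why it gives 0.61 rather than the 0.60 obtained by multiplying the rounded values 0.15 and 4:

$$\text{post-test odds} = \frac{49}{322} \times \frac{45/49}{74/322} = \frac{45}{74} \approx 0.61$$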
Example 7: Using odds and probabilities to improve on clinical assessment
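If statistics don't captivate you, feel free to skip this example.
Can a patient's CAGE score better your clinical assessment of the likelihood that he has alcohol dependence? The following relationships are given:
$$\text{odds} = \frac{\text{probability}}{1 - \text{probability}} \qquad \text{(a)}$$
$$\text{probability} = \frac{\text{odds}}{1 + \text{odds}} \qquad \text{(b)}$$
$$\text{post-test odds} = \text{pre-test odds} \times \text{likelihood ratio} \qquad \text{(c)}$$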
Assume that you see a man in out-patients and after an initial clinical assessment you think there is a 30% chance he has alcohol dependence. He then completes the CAGE with a cut-off of ≥2 and scores positive.
The pre-test probability for our CAGE example is 0.3. Convert this to odds using formula (a) above:
$$\text{pre-test odds} = \frac{0.3}{1 - 0.3} = 0.43$$
Thus, the pre-test odds of this patient having alcohol dependence are 0.43. Applying the LR+ of 4 using formula (c):
$$\text{post-test odds} = 0.43 \times 4 = 1.7$$
i.e. the post-test odds of this patient having alcohol dependence are 1.7.
Converting these odds back to a probability using formula (b):
$$\text{post-test probability} = \frac{1.7}{1 + 1.7} = 0.63$$
Thus, the post-test probability is 0.63. So, by applying the CAGE and finding that the patient scores 2 or more, the likelihood that he has alcohol dependence has risen from 30% (the clinician's initial assessment) to 63%.
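The three steps of Example 7 can be sketched as a short Python function. The names `prob_to_odds`, `odds_to_prob` and `post_test_probability` are assumptions introduced here for illustration.

```python
def prob_to_odds(probability: float) -> float:
    """Formula (a): convert a probability to odds."""
    return probability / (1 - probability)


def odds_to_prob(odds: float) -> float:
    """Formula (b): convert odds back to a probability."""
    return odds / (1 + odds)


def post_test_probability(pre_test_prob: float, likelihood_ratio: float) -> float:
    """Pre-test probability -> odds -> multiply by the LR -> probability."""
    post_test_odds = prob_to_odds(pre_test_prob) * likelihood_ratio
    return odds_to_prob(post_test_odds)


# The out-patient in Example 7: 30% pre-test probability, positive CAGE (LR+ = 4).
print(f"pre-test odds         = {prob_to_odds(0.30):.2f}")
print(f"post-test probability = {post_test_probability(0.30, 4):.2f}")
```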
The likelihood-ratio nomogram
Fortunately, you do not have to do the maths of converting probabilities to odds and back again to use likelihood ratios. The Fagan likelihood-ratio nomogram (Fig. 2) provides a simple non-mathematical method of determining post-test probabilities. The nomogram is widely available in books on critical appraisal. To use the likelihood-ratio nomogram:
Fig. 2 Fagan likelihood-ratio nomogram. (Source: Fagan, 1975. © 1975 Massachusetts Medical Society, with permission.)
Example 8: Using the likelihood-ratio nomogram
Following steps 1 to 4 and using the data from our CAGE example above for a patient with a positive result (a pre-test probability of 30% and an LR+ of 4) and drawing a line through 30% and 4, we read from the right-hand column that the post-test probability is just over 60% (in Example 7 we used maths to show that the result is 63%). Thus, you are now more sure that the patient has alcohol dependence.
The LR− for CAGE is 0.1. So, for someone who scores less than 2 on the CAGE (a negative test) with a pre-test probability of 30%, the post-test probability falls to about 4%, almost ruling out alcohol dependence.
Note what happens if you change the pre-test probability. If the pre-test probability is 1%, given a positive CAGE result, the post-test probability is still less than 5%. This is unlikely to make any difference to diagnosis or management. This is a general property of diagnostic tests: if the pre-test probability is very low, diagnostic tests are of little practical value (cf. Table 2).
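For readers without a nomogram to hand, the same readings can be approximated numerically. This is a sketch only: the function name `post_test` and the choice of pre-test probabilities in the loop are assumptions, and the likelihood ratios are the rounded CAGE values quoted in the text (4 and 0.1).

```python
def post_test(pre_test_prob: float, likelihood_ratio: float) -> float:
    """Post-test probability via odds, as the nomogram does graphically."""
    odds = pre_test_prob / (1 - pre_test_prob) * likelihood_ratio
    return odds / (1 + odds)


LR_POS, LR_NEG = 4, 0.1  # rounded CAGE likelihood ratios from the text

# Illustrative pre-test probabilities, including the 1% and 30% cases above.
for pre in (0.01, 0.10, 0.30, 0.50):
    print(f"pre-test {pre:.0%}: positive CAGE -> {post_test(pre, LR_POS):.1%}, "
          f"negative CAGE -> {post_test(pre, LR_NEG):.1%}")
```

The output makes the same point as the nomogram: the lower the starting probability, the less a positive result moves it in absolute terms.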
More examples of tests and their likelihood ratios are given in Table 3. For all of the examples in that table, I have assumed a pre-test probability of 50%. You could experiment with different pre-test probabilities using the nomogram. Results for home pregnancy-testing kits and laboratory tests for H. pylori infection are also provided; interestingly, they show that some chemical tests fare little better than the tests used in psychiatry!
Table 3 Sensitivities, specificities and likelihood ratios for some tests used in psychiatry and general health. The post-test probabilities are based on a pre-test probability of 50% for each condition
Validity
When deciding whether to use a diagnostic or screening test, it is important to consider the internal validity of the study evaluating the test (Box 1).
Box 1 Critical appraisal: are the results valid?
Was the reference standard appropriate for the study? The reference standard with which a test is compared has to have face validity and be relatively accurate. Remember that no test (including those used as reference standards) is 100% accurate.
Was there an independent blind comparison with the reference standard? The central issue in studies on diagnostic/screening tests is the comparison of the test under scrutiny with the reference standard. The aim is to identify how the test performs in terms of both identifying the disease/condition when the reference standard identifies it as present (sensitivity), and excluding it when the reference standard says it is absent (specificity). For this comparison to be accurate, the person performing/interpreting the test must be blind to the results of the reference standard and vice versa.
Was the test evaluated in a sufficient number of people at risk of the disease? If the study sample was relatively small, the sensitivities and specificities will have low precision and any conclusions on the usefulness of the test may be misleading.
Did the sample include a spectrum of patients appropriate to your purposes? The sensitivity and specificity of a test will change according to the population being tested. Therefore there may be little point relying on these measures if the trial involved a population from a specialist tertiary referral centre but you want to know how the test performs in a primary care setting.
Did the results of the test influence the decision to perform the reference standard? This may seem a strange question at first, but some evaluations of screening tests apply the reference standard only to those testing positive for the screening test. This is particularly common where the reference standard is lengthy. If this is done, you cannot know what the true false-negative rate is (see Box 2).
Was the method for performing the test described in sufficient detail? Subtle changes in the way a test is performed can make significant changes to the results. Look for a clear description of the method of both the reference standard and the test under scrutiny.
Usefulness
The usefulness of the results of a test (Box 2) is best thought of in terms of likelihood ratios, which you may have to calculate yourself, as they are rarely given in articles.
Box 2 Critical appraisal: are the results useful?
Are the likelihood ratios for the test presented (or can you calculate them)? Usually, you can create a 2×2 table (such as Table 1).
Useful related formulae:
The likelihood ratio (LR) is a useful way of combining sensitivity and specificity. Most articles do not report LRs, but they can be calculated as in Example 4.
You can use the LR to identify the post-test probability using the nomogram shown in Fig. 2.
How precise are the sensitivity and specificity? Sensitivities and specificities should be presented with confidence intervals, to provide a measure of precision of the estimates.
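Where a paper reports counts but no confidence intervals, approximate intervals can be computed by hand. The sketch below uses the Wilson score interval, which is one common choice and is an assumption here rather than a method named in this article; the function name `wilson_interval` is also illustrative. The counts are those from Example 1.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.96):
    """Wilson score interval for a proportion (z = 1.96 gives roughly 95%)."""
    p = successes / n
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return centre - half_width, centre + half_width


# Sensitivity rests on 49 cases, specificity on 322 non-cases (Example 1).
sens_lo, sens_hi = wilson_interval(45, 49)
spec_lo, spec_hi = wilson_interval(248, 322)
print(f"sensitivity 95% CI: {sens_lo:.2f} to {sens_hi:.2f}")
print(f"specificity 95% CI: {spec_lo:.2f} to {spec_hi:.2f}")
```

The interval for sensitivity is the wider of the two because it is estimated from far fewer patients, which is the precision issue raised in Box 1.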
Applicability
The applicability of a test in clinical practice depends on several issues. First, the study assessing the test should be based on a sample with socio-demographic and clinical characteristics similar to the people on whom you want to use it. Although likelihood ratios do not change with prevalence, they may change significantly when the test is applied in different populations. An interesting example here is the CAGE questionnaire again; the likelihood ratio was much higher when the CAGE was validated in a primary care setting (LR+ = 12) (Liskow et al, 1995) than in a psychiatric setting (LR+ = 4) (Bernadt et al, 1982). This may be because individuals with mental disorder are more likely to report guilt or get annoyed, even when they do not drink much. Other applicability issues include whether the test is readily available and acceptable to the patient, and whether the results are easy to interpret and lead to a change in management of the patient (Box 3).
Box 3 Critical appraisal: will the results help me in caring for my patient?
Is the test available and easily performed? Look specifically at how you can perform the test in your setting. Do you require any special equipment?
Is there a sensible estimate of pre-test probability? Pre-test probability is the probability a patient has an illness determined before the test is performed. You may use clinical intuition for a particular patient, or base the pre-test probability on existing prevalence data.
Will the results change how I manage this patient? This is worth considering. Is it ever necessary to perform a test: (a) if it has such a low likelihood ratio that it has little or no effect on your decision of whether a condition is present; or (b) if the prevalence of a condition is very low?
Will the patient be better off if the test is performed? Even if the test is valid and reliable, with a high likelihood ratio, if the patient will not benefit from the disease being identified there may be little point in performing it.
Other issues
Other issues that may influence your use or interpretation of a test include interrater reliability, which is a measure of how closely different raters agree when using a particular instrument to assess particular patients. This may be especially important if different members of your team use the same assessment tool. Interrater reliability for categorical variables (such as disease present/disease absent) is measured using kappa, which assesses the degree of concordance after taking into account that some agreement between raters will occur by chance. A kappa value >0.6 is generally held to indicate good interrater agreement (Guyatt & Rennie, 2002). Not all instruments, especially those used in research, have good interrater reliability. For example, the Clinician's Interview-Based Impression of Change (CIBIC), a seven-point scale which is widely used as an outcome measure for dementia studies, has an interrater kappa value of 0.18, indicating very poor reliability (Quinn et al, 2002).
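To show how kappa discounts chance agreement, here is a minimal sketch of Cohen's kappa for two raters making a present/absent judgement. The counts are hypothetical, invented purely for illustration, and the function and parameter names are assumptions.

```python
def cohen_kappa(both_yes: int, only_a: int, only_b: int, both_no: int) -> float:
    """Cohen's kappa for two raters and a yes/no rating."""
    n = both_yes + only_a + only_b + both_no
    observed = (both_yes + both_no) / n      # raw agreement
    a_yes = (both_yes + only_a) / n          # rater A's rate of 'yes'
    b_yes = (both_yes + only_b) / n          # rater B's rate of 'yes'
    expected = a_yes * b_yes + (1 - a_yes) * (1 - b_yes)  # chance agreement
    return (observed - expected) / (1 - expected)


# Hypothetical ratings of 100 patients by two clinicians.
kappa = cohen_kappa(both_yes=20, only_a=5, only_b=10, both_no=65)
print(f"kappa = {kappa:.3f}")
```

In this made-up example the raters agree on 85% of patients, but 60% agreement would be expected by chance alone, so kappa is considerably lower than the raw agreement suggests.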
"We are inclined to believe those whom we do not know because they have never deceived us."
(Samuel Johnson, 1709–1784)
Samuel Johnson's words sum up the usefulness of understanding critical appraisal of diagnostic tests, by emphasising that nothing should be taken at face value without further exploration and assessment. This applies in particular to areas of psychiatry that have remained relatively undisturbed by the evidence-based medicine bandwagon, for example the utility of diagnostic tests. Without appraisal skills in this area, be prepared to be deceived!