I'm a big fan of the NNT or "number needed to treat." But I'd drop my support if it turns out the NNT is not informative or helpful.

The NNT tells you, for a specific therapy and condition, how many patients need to be treated for each additional positive outcome relative to a control group. For instance, if for every 227 people treated for a condition, one additional person is alive after five years compared to a control group, the NNT for this outcome is 227. (Additional description here and, in a video by Aaron, here.)

There is an analogous metric called "number needed to harm" or NNH that measures how many people need to be treated to obtain one additional negative outcome. For example, if for every 97 people treated, one more suffers a stroke within five years, compared to a control group, the NNH for this outcome is 97.

At the website theNNT.com, physicians have amassed a large number of NNTs and NNHs for a wide range of conditions, treatments, and outcomes. These all come from clinical trials. Though they need continuous updating as new evidence comes out, I am not aware of any more comprehensive and accurate collection of such numbers. (More about that site and its creator in this Wired magazine piece.)

The great hope for NNT and NNH is that they motivate more appropriate decision making about which therapies are best for which patients. One patient may be perfectly willing to accept an NNH of 97 for stroke in exchange for an NNT of 227 for five-year survival over that time. Another person might not be. It's subjective. At least the NNT and NNH let patients make that trade-off on the basis of evidence. But there are other metrics of benefit and harm that could be used. Maybe they're better.
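To make that trade-off concrete, here is a minimal sketch (my own illustration, using the hypothetical figures above, not anything from the cited studies) of converting an NNT and an NNH into expected extra outcomes per 1,000 people treated, which is just 1,000 divided by each number:

```python
# Convert NNT and NNH into expected extra outcomes per 1,000 treated.
# The 227 and 95 figures are the hypothetical examples from the text above.

nnt_survival = 227  # NNT for one additional five-year survivor
nnh_stroke = 97     # NNH for one additional stroke within five years

treated = 1000
extra_survivors = treated / nnt_survival  # ~4.4 additional survivors
extra_strokes = treated / nnh_stroke      # ~10.3 additional strokes

print(f"Per {treated} treated: ~{extra_survivors:.1f} more survivors, "
      f"~{extra_strokes:.1f} more strokes")
```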

What's not subjective is how well people comprehend NNTs and NNHs relative to other metrics. Are people able to reason with them rationally, or at least more rationally than with other metrics of treatment effectiveness and harms, like relative and absolute risk reduction? (Relative risk reduction (RRR) is the reduction in risk in the treatment group relative to the control group. Absolute risk reduction (ARR) is the absolute difference in risk. If the risk is 10% in the control group and 5% in the treatment group, the ARR is 5 percentage points (10% - 5%), the RRR is 50% (5%/10%), and the NNT is 20, the reciprocal of the ARR.)
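Here are the same relationships in a few lines of code, a small sketch of the worked example in the parenthetical above:

```python
# ARR, RRR, and NNT from the hypothetical 10% vs. 5% example above.

control_risk = 0.10    # risk in the control group
treatment_risk = 0.05  # risk in the treatment group

arr = control_risk - treatment_risk  # absolute risk reduction: 0.05 (5 points)
rrr = arr / control_risk             # relative risk reduction: 0.50 (50%)
nnt = 1 / arr                        # number needed to treat: 20

print(f"ARR = {arr:.0%}, RRR = {rrr:.0%}, NNT = {nnt:.0f}")
```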

Two systematic reviews addressed this question, among others. One is a 2011 Cochrane review by Akl and colleagues. The other, by Zipkin and colleagues, was published in the Annals of Internal Medicine in August of this year. Both reviews found that NNT (and, because they're basically the same, I would assume NNH) is harder for people to understand — in the sense of successfully using them for probabilistic reasoning — than other metrics of risk reduction (or increase). In addition to comprehension, the reviews also examined perception, persuasiveness, satisfaction, and decision making, none of which I address in this post.

To see more specifically what the reviews' conclusions mean, I took a closer look at the studies that informed them. The Cochrane review's conclusion is based on a single study that compared NNT to RRR, Sheridan et al. (2003). The Zipkin et al. review also cited Sheridan et al. (2003) and three other studies. However, two of those three do not assess probabilistic reasoning; the only other cited study that does is Berry et al. (2006).

Let's look at the two relevant studies in turn.

Sheridan et al. (2003) was a randomized survey of 350 adults, each presented with ARR, RRR, NNT, or all three combined (COMBO) for two drug treatments of a hypothetical disease. Participants were then asked which treatment provided greater benefit and to compute the effect of one treatment for a given baseline risk of disease.

When asked to state which of two treatments provided more benefit, subjects who received the RRR format responded correctly most often (60% correct vs 43% for COMBO, 42% for ARR, and 30% for NNT, P = .001). Most subjects were unable to calculate the effect of drug treatment on the given baseline risk of disease, although subjects receiving the RRR and ARR formats responded correctly more often (21% and 17% compared to 7% for COMBO and 6% for NNT, P = .004).

First of all, it's pretty clear that adults, at least those in this study, are terrible at these tasks. Second, in the realm of terrible, NNT was the worst. Still, we should not be satisfied with any of these approaches to communicating risk.

Berry et al. (2006) was principally a study of whether providing baseline risk improves people's reasoning about risk. It does (no surprise). But within the results, some information about NNT (NNH, actually) can be inferred. The study is based on a convenience sample of 268 adult women. Each was randomized to receive ARR, RRR, or NNH information about a harm from second versus third generation oral contraceptives. Each was also randomized to receive, or not receive, baseline risk information (the risk of harm from the second generation pill). Participants were then asked what they thought the risk of harm was for each of the pills. The correct answers are 0.02% and 0.04% for the second and third generation pills, respectively.

First of all, it's impossible to obtain these answers from ARR, RRR, or NNH without baseline information. I suppose participants might have some baseline in mind, with which they can compute the harm for the third generation pill. But I'm not interested, in this post, in whether providing baseline information is helpful. It seems obvious that it would be, and that's what the authors found.
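To see why the baseline matters, here is a rough reconstruction (my own, not the study's wording, with the specific increases back-calculated from the two correct answers) of how each format, once combined with the 0.02% baseline, implies the correct 0.04% answer for the third generation pill:

```python
# Back-calculated illustration: how each risk format plus the 0.02% baseline
# pins down the correct 0.04% third-generation risk. The specific increase
# values are inferred from the two correct answers, not taken from the study.

baseline = 0.0002                        # second generation risk: 0.02%

# Absolute difference format: risk is 0.02 percentage points higher.
abs_increase = 0.0002
risk_abs = baseline + abs_increase       # 0.0004 -> 0.04%

# Relative difference format: the risk doubles (a 100% relative increase).
rel_increase = 1.0
risk_rel = baseline * (1 + rel_increase) # 0.0004 -> 0.04%

# NNH format: one additional harm per 5,000 women treated (1 / 0.0002).
nnh = 1 / abs_increase                   # 5000
risk_nnh = baseline + 1 / nnh            # 0.0004 -> 0.04%

print(f"{risk_abs:.2%}, {risk_rel:.2%}, {risk_nnh:.2%}")  # all 0.04%
```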

So, let's move right to the comprehension results for the with-baseline sample. Here, those who were provided NNH got closer to the right answers than those who received ARR or RRR information. Still, they were way off. Those receiving NNH information estimated second generation risk of 0.55% and third generation risk of 1.74%, which are 27.5 and 43.5 times larger than the right answers, respectively.
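For what it's worth, those multipliers are just the mean estimates divided by the correct answers:

```python
# Estimated risks from the NNH group divided by the correct answers.
correct = [0.02, 0.04]    # correct risks, in percent
estimated = [0.55, 1.74]  # NNH-group mean estimates, in percent

print([round(e / c, 1) for e, c in zip(estimated, correct)])  # [27.5, 43.5]
```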

These are terrible! But participants who received ARR and RRR information did far, far worse. So, this is not a study that suggests NNH (or, by extension, NNT) leads to worse comprehension than other measures. It performed better (but still awfully).

So, the evidence base is both thin and inconclusive. There are two relevant studies about NNT comprehension included in two systematic reviews. They point in opposite directions.

My conclusion is that NNTs may, in fact, perform quite well relative to other measures. Or maybe they don't. We don't know. They may perform even better for practitioners than for patients, but we don't know that either. What we can't say from the evidence, however, is that NNTs are harder for people to understand than other metrics of risk. From two studies with conflicting findings, we just don't know that.

Austin B. Frakt, PhD, is a health economist with the Department of Veterans Affairs and an associate professor at Boston University’s School of Medicine and School of Public Health. He blogs about health economics and policy at The Incidental Economist and tweets at @afrakt. The views expressed in this post are those of the author and do not necessarily reflect the position of the Department of Veterans Affairs or Boston University.
