March 2nd, 2011

A DOSE of Reality: The Challenges of Comparing Effectiveness

An ideal paper for your next journal club — “Diuretic Strategies in Patients with Acute Decompensated Heart Failure” — was just published in NEJM, by the NHLBI Heart Failure Clinical Research Network.  In this study (called DOSE), patients hospitalized with heart failure were randomized to receive different diuretic regimens based on dose and mode of administration. The authors concluded that “there were no significant differences in patients’ global assessment of symptoms or in the change in renal function when diuretic therapy was administered by bolus as compared with continuous infusion or at a high dose as compared with a low dose.” The editorialist stated, “Since a high-dose regimen may relieve dyspnea more quickly without adverse effects on renal function, that regimen is preferable to a low-dose regimen.”

There are at least five issues here that might be worthy of your attention.

1. The topic. Diuretics were introduced in the 20th century to treat heart failure, replacing the rigid Southey’s tubes that were inserted subcutaneously to drain fluid. Thiazide diuretics were introduced in 1958, and furosemide, the first loop diuretic, was approved in 1966. This agent is now standard therapy for patients with heart failure. What is remarkable — and worth some reflection — is that despite using loop diuretics for 45 years now, we still do not have essential evidence to support decisions about dosing and mode of administration. This study responds to the clamor for comparative effectiveness trials — and should give us pause about what else we are doing in clinical practice with little evidence to guide us.

2. The primary outcomes. The primary efficacy endpoint (there was also a safety endpoint) was the patient’s global assessment of symptoms, which was quantified as the area under the curve (AUC) of serial assessments from baseline to 72 hours. The authors describe the assessment as follows:

Patients were asked to self assess both their general well being (PGA) and their level of dyspnea using a visual analog scale (VAS) method. For PGA, patients marked their global well being on a 10 cm vertical line, with the top labeled “best you have ever felt” and the bottom labeled “worst you have ever felt.” For dyspnea, the labels were “I am not breathless at all” and “I am as breathless as I have ever been.” The VAS was scored from 0 to 100 by measuring the distance in millimeters from the bottom of the line. The patient was unaware of the numerical value of their response.

Kudos to the research team for caring about patient-reported outcomes and attempting to translate those outcomes into a useful metric. However, interpreting the results is a challenge. For the comparison of bolus versus continuous infusion, the mean AUCs were 4236 and 4373, respectively (P=0.47). For the comparison of high- versus low-dose therapy, the values were 4430 and 4171, respectively (P=0.06). The authors considered a 600-point difference in AUC to be clinically important based on prior studies and thus concluded that there were no significant between-group differences in the primary efficacy endpoint, either statistically or clinically. Although these conclusions seem appropriate, it’s difficult to know what a 600-point difference really means in terms of the patient’s experience. A few examples from the authors would have been helpful.
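One rough way to get a feel for a 600-point difference is to compute the AUC by hand. The sketch below uses the trapezoidal rule on a 0-to-100 VAS score over 72 hours. The assessment schedule and the two patient trajectories are hypothetical, made up for illustration, and are not data from the trial. Since the AUC is score multiplied by hours, the ceiling is 100 × 72 = 7200, and a 600-point gap is what you would get if one patient scored about 8 mm higher (600 / 72) than another for the entire 72 hours.

```python
# Illustrative only: the assessment times and VAS scores below are
# hypothetical, not data from the DOSE trial.

def vas_auc(hours, scores):
    """Trapezoidal area under a VAS-score curve (score 0-100, time in hours)."""
    if len(hours) != len(scores) or len(hours) < 2:
        raise ValueError("need matching lists with at least two time points")
    area = 0.0
    for i in range(1, len(hours)):
        width = hours[i] - hours[i - 1]
        # Area of one trapezoid: interval width times the mean of its two scores.
        area += width * (scores[i] + scores[i - 1]) / 2.0
    return area

HOURS = [0, 6, 12, 24, 48, 72]        # hypothetical assessment schedule

patient_a = [40, 50, 55, 60, 65, 70]  # hypothetical VAS scores (mm)
mean_gap_mm = 600 / 72                # constant gap that yields 600 AUC points
patient_b = [score + mean_gap_mm for score in patient_a]

max_auc = vas_auc(HOURS, [100] * len(HOURS))  # ceiling: 100 mm for 72 hours
auc_a = vas_auc(HOURS, patient_a)
auc_b = vas_auc(HOURS, patient_b)

print(f"Maximum possible AUC: {max_auc:.0f}")
print(f"Patient A AUC: {auc_a:.0f}")
print(f"Patient B AUC: {auc_b:.0f}")
print(f"Sustained gap giving a 600-point difference: {mean_gap_mm:.1f} mm")
```

A constant 8 mm shift is only one of many trajectories that produce the same 600 points; a larger gap over a shorter period gives the same area, which is part of why the summary number is hard to interpret.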

3. The power calculation. With only 308 patients, this study was small by the standards of most RCTs measuring patient outcomes in heart failure. The sample size calculation was based on 88% power to detect a 600-point difference in the AUC of global assessment scores — and on 88% power to detect a difference of 0.2 mg/dL in the change in creatinine level between groups (the primary safety endpoint). I have no quibble with these calculations (though I do wonder why they picked 88%), but the small number of patients makes it difficult to do much with exploratory analyses by subgroup or different outcomes. The investigators adjusted the significance level for the primary outcomes, stating that the threshold would be a P value of <0.025. In doing this, they treated each trial within the 2×2 factorial design as a separate study with two endpoints.
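For readers who want to see how such a calculation fits together, here is a minimal sketch using the standard normal-approximation formula for comparing two means. It is not the investigators' actual calculation. It assumes equal allocation (154 patients per arm in each comparison of the factorial design), a two-sided significance level of 0.025, and no dropout. Under those assumptions, it back-calculates the standard deviation of the AUC that would give 88% power for a 600-point difference, and then asks how many patients per arm 80% or 90% power would need with the same standard deviation.

```python
from math import sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()  # mean 0, standard deviation 1

def implied_sd(delta, n_per_group, alpha, power):
    """SD at which n_per_group gives the stated power to detect delta
    (two-sided test, normal approximation, equal group sizes)."""
    z_alpha = STD_NORMAL.inv_cdf(1 - alpha / 2)
    z_beta = STD_NORMAL.inv_cdf(power)
    return delta * sqrt(n_per_group / 2) / (z_alpha + z_beta)

def n_per_group(delta, sd, alpha, power):
    """Patients per arm needed to detect delta with the stated power."""
    z_alpha = STD_NORMAL.inv_cdf(1 - alpha / 2)
    z_beta = STD_NORMAL.inv_cdf(power)
    return 2 * ((z_alpha + z_beta) * sd / delta) ** 2

# Assumptions (not taken from the paper): 308 patients split evenly,
# so each of the two comparisons has 154 per arm.
N_ARM = 308 // 2
ALPHA = 0.025
DELTA = 600

sd = implied_sd(DELTA, N_ARM, ALPHA, power=0.88)
print(f"Implied SD of the AUC: {sd:.0f} points")

for power in (0.80, 0.88, 0.90):
    needed = n_per_group(DELTA, sd, ALPHA, power)
    print(f"Power {power:.0%}: about {needed:.0f} patients per arm")
```

The implied standard deviation comes out on the order of 1,500 points, so the 600-point threshold is well under half a standard deviation.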

One statistic that would have been useful to see in this paper is the confidence intervals for the difference in the primary endpoints, so that we could see what kind of differences cannot be excluded based on the results. Remember, the study was not designed to show that the groups were similar; it was designed to test if they were different. The conclusion, appropriately enough, was that there were no significant differences, but your questions now might be: “Are the treatment groups similar? What kind of differences can be excluded?”
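In the absence of published intervals, a crude approximation can be back-calculated from the reported means and P values: convert the P value to a z statistic, divide the observed difference by it to estimate the standard error, and build the interval from there. The sketch below does this for the two primary efficacy comparisons. It is only an approximation. It assumes the P values came from a test that behaves like a simple two-sided z-test, and the reported P values are rounded, so the limits should be read as rough guides rather than the trial's actual confidence intervals.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()

def ci_from_p(diff, p_value, level=0.95):
    """Approximate confidence interval for a difference, recovered from
    its two-sided P value under a normal approximation."""
    z_observed = STD_NORMAL.inv_cdf(1 - p_value / 2)
    standard_error = abs(diff) / z_observed
    z_critical = STD_NORMAL.inv_cdf(1 - (1 - level) / 2)
    margin = z_critical * standard_error
    return diff - margin, diff + margin

# Mean AUCs and P values as reported in the post.
diff_mode = 4373 - 4236   # continuous infusion minus bolus
diff_dose = 4430 - 4171   # high dose minus low dose

mode_low, mode_high = ci_from_p(diff_mode, 0.47)
dose_low, dose_high = ci_from_p(diff_dose, 0.06)

print(f"Continuous vs. bolus: {diff_mode} ({mode_low:.0f} to {mode_high:.0f})")
print(f"High vs. low dose:    {diff_dose} ({dose_low:.0f} to {dose_high:.0f})")
```

Under these assumptions, both intervals include zero and both upper limits fall below the 600-point threshold, which would suggest that a clinically important benefit on this endpoint is unlikely. Passing `level=0.975` gives intervals that match the trial's adjusted significance threshold.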

4. The secondary analysis and adjustment for multiple comparisons. The investigators conducted many secondary analyses and, for these, set a P value of 0.05 as the threshold for statistical significance. Most of these endpoints did not differ between the groups, but there were some findings that bear discussion:

  • In the high- versus low-dose comparison, the difference in the area under the curve at 72 hours for dyspnea met the criteria for statistical significance (4668 vs. 4478, respectively; P=0.04) — a point highlighted by the authors in the discussion. However, with so many comparisons conducted, a P value of 0.04 should hardly be considered significant. Furthermore, the difference between groups was <200 points, far below the authors’ predefined threshold for a clinically meaningful result.
  • Change in body weight favored the continuous-infusion and high-dose groups, as might be expected.
  • The high-dose group had a significantly higher proportion of patients with creatinine increases of >0.3 mg/dL than did the low-dose group (23% vs. 14%; P=0.04). Again, we should be careful about interpreting the statistical significance of this, but an absolute difference of 9% for a potent risk factor like worsening renal function is hard to ignore and does raise concerns.
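Two small calculations may help in weighing the findings above. The first shows how quickly the chance of at least one false-positive result grows with the number of comparisons tested at 0.05, and what a Bonferroni-style threshold would be. The numbers of comparisons used are hypothetical, since the post does not count them, and the formula assumes independent tests, which secondary endpoints in one trial usually are not. The second converts the 23% versus 14% difference in creatinine increases into a number needed to harm.

```python
def familywise_error(k, alpha=0.05):
    """Chance of at least one false positive across k independent tests."""
    return 1 - (1 - alpha) ** k

def bonferroni_threshold(k, alpha=0.05):
    """Per-test significance threshold that keeps the family-wise rate near alpha."""
    return alpha / k

REPORTED_P = 0.04  # P value reported for both the dyspnea and creatinine findings

# Hypothetical numbers of secondary comparisons; the true count is not given here.
for k in (1, 5, 10, 20):
    threshold = bonferroni_threshold(k)
    verdict = "significant" if REPORTED_P < threshold else "not significant"
    print(f"{k:>2} tests: family-wise error {familywise_error(k):.0%}, "
          f"threshold {threshold:.4f}, P=0.04 is {verdict}")

# Creatinine increase of >0.3 mg/dL: 23% (high dose) vs. 14% (low dose).
risk_high_dose = 0.23
risk_low_dose = 0.14
absolute_risk_increase = risk_high_dose - risk_low_dose
number_needed_to_harm = 1 / absolute_risk_increase

print(f"Absolute risk increase: {absolute_risk_increase:.0%}")
print(f"Number needed to harm: about {number_needed_to_harm:.0f}")
```

With as few as two comparisons, a P value of 0.04 no longer clears a Bonferroni threshold. The 9-point absolute difference corresponds to roughly one additional patient with a creatinine increase for every 11 treated with the high dose, if the difference is real.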

5. The recommendation by the editorialist. The editorialist came out with a strong endorsement for the high-dose regimen, arguing that it reduces dyspnea without worsening renal function. My interpretation is a bit different, but I tend to require better evidence to justify using more of a medication. Here is where the journal club should get interesting: What do you think are the implications of this study? Do you agree with the editorialist that it should change practice? Was the trial designed to address the question you have about how to use diuretics? If not, what would you have done differently? How should the guidelines incorporate this new information, if at all?

I look forward to your thoughts.

For more on the DOSE study, check out Anju Nohria’s Voices blog.

3 Responses to “A DOSE of Reality: The Challenges of Comparing Effectiveness”

  1. Although RCTs rarely provide complete clarity for the issues they are designed to answer, the DOSE study results seem particularly prone to alternative interpretations. This study was well thought out, blinded, multicenter, etc. Yet I wonder how much it will dictate or change practice. The most common responses I’ve heard from colleagues are divergent interpretations that generally support their current practice patterns. I can’t help but feel a little disappointed. Did the study just need to be bigger? Different endpoints? Different patients?

    Competing interests pertaining specifically to this post, comment, or both:
    I previously worked with HFCTN.

  2. The DOSE study was designed to look for real evidence on a class of drugs that has been used in HF patients for so long. That said, I wonder why the investigators did not attempt to enroll a much larger sample to address some more serious questions (e.g., whether any difference in mortality or MACE could be found based on the mode of furosemide administration or its dosing). Even though global assessment of symptoms and dyspnea are important, looking for a difference in mortality or MACE could have been at least as important.

    In addition, there is a pathophysiological rationale for adding a thiazide diuretic in acute heart failure patients who are refractory to loop diuretics (furosemide). It remains to be seen whether combining a thiazide diuretic with furosemide could have led to different outcomes in any of the four major subgroups of the trial.

    Moreover, it has been stated that almost one fourth of participants had a normal left ventricular ejection fraction, in which case these patients either had heart failure with preserved systolic function or were misdiagnosed as having acute decompensated heart failure. It would be interesting to know, in either case, how the different doses and modes of administration of furosemide worked in this patient subset.

  3. The study shows that higher-dose Lasix is better than lower-dose for relief of symptoms and for reducing LOS. A minor increase in creatinine should never have been part of the “safety” profile. We all know that BUN and Cr often increase with adequate diuresis; the increased creatinine is a laboratory finding, not a safety issue, as borne out by the 30-day results. Although p>.05, the question is not use or nonuse of a new, expensive drug, where we want p<0.05. If there is a 90% chance (e.g., with p=0.1) that high-dose Lasix is better than low-dose, with no real safety issues, why would I not use it?