October 1st, 2010
A Case of Exuberance About a Subgroup in a Clinical Trial
Harlan M. Krumholz, MD, SM
In many clinical trials, researchers investigate whether an overall effect of an intervention is consistent across various subgroups, as I discussed in this Journal Club Series last week. Such subgroup analyses require assessment of what is called an interaction — that is, whether the effect in one group differs from that in another. Do the benefits differ, for example, between older and younger patients — or between men and women? That may sound like an easy issue to explore, but penetrating it demands an appetite for methodological nuance. Are you ready? Here we go . . .
The basic principle of subgroup analysis is that before you assess the effect of an intervention in each subgroup of interest, you must assess whether there is strong evidence of statistically meaningful differences among the subgroups. If older patients and younger patients benefit similarly from a drug, for example, there is no reason to look at one group alone, because the overall trial outcomes apply to both groups.
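One common way to test for an interaction between two subgroups is to compare their effect estimates on the log scale. The sketch below assumes each subgroup supplies a hazard ratio and the standard error of its log; the numbers are hypothetical and are not taken from TRITON-TIMI 38, and the function names are illustrative only.

```python
import math

def normal_two_sided_p(z):
    """Two-sided p-value for a standard normal z statistic."""
    return math.erfc(abs(z) / math.sqrt(2))

def interaction_test(hr_a, se_log_a, hr_b, se_log_b):
    """Compare two subgroup hazard ratios on the log scale.

    Returns the ratio of hazard ratios, the z statistic, and the two-sided p-value.
    """
    diff = math.log(hr_a) - math.log(hr_b)
    se_diff = math.sqrt(se_log_a ** 2 + se_log_b ** 2)
    z = diff / se_diff
    return math.exp(diff), z, normal_two_sided_p(z)

# Hypothetical subgroup estimates (assumptions for illustration only).
ratio, z, p = interaction_test(hr_a=0.80, se_log_a=0.10, hr_b=0.85, se_log_b=0.08)
print(f"ratio of HRs = {ratio:.2f}, z = {z:.2f}, interaction p = {p:.2f}")
```

A large interaction p-value, as with these made-up inputs, gives no grounds for reading the two subgroups separately.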
This week, I focus specifically on a study showing that bleeding risk was significantly greater with prasugrel than with clopidogrel in ACS patients scheduled for PCI. The researchers published a subsequent article focusing just on the patients with ST-segment–elevation MI (STEMI) and concluded that there was no excess risk for bleeding in that subgroup, even though the difference in that risk between STEMI and unstable angina/non–STEMI (UA/NSTEMI) patients was not significant. These studies are instructive for thinking about how to interpret subgroup analyses accurately.
Results Recap
The articles are a 2007 NEJM paper and a 2009 Lancet paper from the industry-funded TRITON-TIMI 38 trial, in which 13,608 patients with moderate-to-high risk ACS who were scheduled for PCI were randomized to receive prasugrel or clopidogrel. Incidence of the primary endpoint — death from cardiovascular causes, nonfatal MI, or nonfatal stroke at 15 months — was significantly lower in the prasugrel group than in the clopidogrel group (9.9% vs. 12.1%), mostly because of fewer nonfatal MIs. However, the incidences of major and life-threatening bleeding were significantly higher with prasugrel. In a 2009 NEJM perspective piece, an FDA official characterized the data as follows:
“For each 1000 patients who were given prasugrel instead of clopidogrel, 24 end-point events were prevented — 21 nonfatal myocardial infarctions and 3 cardiovascular deaths. (The rate of stroke was essentially the same with either drug.) The cost was 10 excess major or minor bleeding events, 2 of which were fatal.”
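The per-1000 figures in this quotation convert directly into a number needed to treat (NNT) and a number needed to harm (NNH). The short sketch below does only that arithmetic; the helper name is illustrative, and the inputs are the 24 prevented events and 10 excess bleeding events quoted above.

```python
def number_needed(events_per_1000):
    """Patients treated per one additional event, given events per 1000 treated."""
    return 1000 / events_per_1000

nnt = number_needed(24)  # end-point events prevented per 1000 (from the quotation)
nnh = number_needed(10)  # excess major or minor bleeding events per 1000

print(f"NNT is about {nnt:.0f}; NNH is about {nnh:.0f}")
```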
What the Authors Analyzed
It’s important to emphasize that this trial reported analyses for 7 subgroups on the primary outcome, as well as several secondary outcomes including safety outcomes — quite a few comparisons in all. When there are multiple subgroup comparisons, P values for analyses beyond the primary endpoint must be adjusted to more stringent levels than the usual P<0.05, in order to account for the likelihood that the differences may be due to chance. A common approach is to divide 0.05 by the number of tests performed, but there are other ways to do this adjustment.
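To make the adjustment concrete, the sketch below computes the divide-by-the-number-of-tests threshold and, for comparison, the chance of at least one false-positive result if nothing is adjusted. It assumes the tests are independent, which is a simplification, and it uses 7 tests only because that is the number of subgroups reported for the primary outcome; the total number of comparisons in the trial was larger.

```python
def bonferroni_threshold(alpha, n_tests):
    """Per-test significance level after dividing alpha by the number of tests."""
    return alpha / n_tests

def familywise_error_rate(alpha, n_tests):
    """Chance of at least one false positive among n independent true-null tests."""
    return 1 - (1 - alpha) ** n_tests

for n in (1, 7, 20):  # 20 is an arbitrary illustrative count
    print(n,
          round(bonferroni_threshold(0.05, n), 4),
          round(familywise_error_rate(0.05, n), 3))
```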
When the NEJM findings were published, the authors reported the significant difference between prasugrel and clopidogrel in the overall cohort and reported no evidence of a difference (interaction) between treatment group and type of MI. They then looked specifically at the primary outcome in the UA/NSTEMI and STEMI subgroups and found an advantage of prasugrel, at the P<0.05 level of significance, in each.
In their 2009 Lancet article, the authors focused only on the STEMI subgroup. They declared the primary-endpoint advantage of prasugrel in STEMI patients, reiterating the finding earlier reported in the NEJM. So far, so good: There was no interaction reported in the NEJM paper, and the Lancet paper said the same thing — the result is similar. It’s important to note that even if the effect had been reported as nonsignificant in the Lancet paper, the drug should not be considered ineffective in STEMI patients. Given that the interaction was negative, the efficacy of the intervention in the underpowered STEMI subgroup should be considered the same as it was in the overall cohort. In short, we have no evidence of an intervention-related difference between STEMI and UA/NSTEMI patients.
The authors also looked at safety endpoints. Specifically, they reported in the Lancet that the type of MI did not influence the excess risk for bleeding conferred by prasugrel (P=0.4). Therefore, the interaction between the bleeding risk with prasugrel and type of MI was negative (i.e., no evidence of heterogeneity for type of MI). However, the authors went on to assess the STEMI group alone and did not find a significantly increased risk for bleeding. They write, “TIMI major bleeding unrelated to CABG surgery was similar in the prasugrel and clopidogrel groups at both 30 days and 15 months.” This statement suggests that unlike patients with UA/NSTEMI, those with STEMI are not at increased risk for bleeding.
The problem is twofold: (1) the interaction, as noted, was not significant; and (2) the prasugrel versus clopidogrel hazard ratio for TIMI major bleeding unrelated to CABG was hardly reassuring (HR, 1.11; 95% CI, 0.70–1.77). That HR is not statistically significant, but it is in the direction of harm, and a 77% higher risk with prasugrel cannot be excluded. The best inference is that the excess bleeding risk in the STEMI subgroup is no different than that reported for the overall cohort.
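The width of that confidence interval can be turned into a rough standard error and p-value. This is a back-calculation under the usual assumption that the interval is symmetric on the log scale (log HR plus or minus 1.96 standard errors); it is not how the authors computed their statistics.

```python
import math

hr, lo, hi = 1.11, 0.70, 1.77  # HR and 95% CI quoted above

# Recover the standard error of log(HR) from the interval width.
se_log_hr = (math.log(hi) - math.log(lo)) / (2 * 1.96)
z = math.log(hr) / se_log_hr
p = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal p-value

print(f"SE(log HR) = {se_log_hr:.3f}, z = {z:.2f}, two-sided p = {p:.2f}")
print(f"compatible with up to {100 * (hi - 1):.0f}% higher hazard "
      f"or up to {100 * (1 - lo):.0f}% lower hazard")
```

An interval this wide is uninformative rather than reassuring, which is the point made above.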
How the Authors Interpreted Their Analyses
The authors conclude their Lancet abstract with this sentence: “In patients with STEMI undergoing PCI, prasugrel is more effective than clopidogrel for prevention of ischaemic events, without an apparent excess in bleeding.” And they end the entire Lancet article with this sentence: “Our findings suggest that prasugrel is an especially attractive alternative to clopidogrel to support PCI in the course of management of patients with STEMI.”
At the conclusion of your journal club, participants can decide whether the evidence from these articles is strong enough to say that STEMI patients gain a particular benefit from prasugrel. For now, please share your perspective with your colleagues here on CardioExchange.
Great discussion. Another educational set of papers about the interpretation of subgroup analyses appeared in the July 6, 2010, issue of Annals of Internal Medicine.
See:
http://www.annals.org/content/153/1/8.abstract
Competing interests pertaining specifically to this post, comment, or both:
Christine Laine is the Editor of Annals of Internal Medicine.
This comment is my point of view and, at the same time, essentially two questions for Harlan Krumholz.
1/ Am I right that once you have an underpowered subgroup of STEMI patients, you cannot draw any conclusion about that group, and that it is useless and inappropriate to subject this underpowered subgroup to any analysis?
2/ Is it true that a relevant conclusion could be drawn only from a trial designed primarily to compare two subgroups (i.e., STEMI and NSTEMI/UA), both of them adequately powered?
Thanks.
Milan Kostek, MD, Slovakia (Conflicts of interest: none)
I completely agree with Dr. Harlan M. Krumholz that prasugrel increases the risk of major bleeding over clopidogrel in both STEMI and NSTEMI. Subgroup analyses and composite endpoints are always problematic. Several articles in BMJ and Annals of Internal Medicine have raised these concerns. A P value <0.05 is always used to justify the claimed benefit. Industry-sponsored studies are champions in this respect! Number needed to treat (NNT) and number needed to harm (NNH) may be useful, I think, even though the risk of bleeding in an individual patient is always difficult to predict.
I am afraid that the uncanny liaison between industry and academia decreases the credibility of clinical trial results. Conflict-of-interest disclosures are not sufficient to answer this skepticism. Members of the FDA are sinking into this trench as well. It is now clearly evident how mercantile capitalism corrupts everything, including our morality and ethics.
Me-too drugs, exploitation of research participants from developing countries in multicenter trials, and the 10/90 gap in clinical research make the ugly picture very visible. The only remedy I can envision is a complete break in the relationship between academia and industry, however painful it may be, because the quest for knowledge and the quest for profit cannot coexist.
Ref:
1. The Truth About the Drug Companies by Marcia Angell, MD (former editor-in-chief of NEJM).
2. Between the needy and the greedy: the quest for a just and fair ethics of clinical research. JME 2010;36:500-504.
3. Is a subgroup effect believable? BMJ, 17 April 2010.
4. Clinical Trials: Discerning Hype from Substance. Thomas R. Fleming, PhD. Annals of Internal Medicine 2010;153:400-406.
Thank you very much for your comments. It is true that we need to be astute in reading articles, and there are always issues worthy of deep scrutiny beyond whatever appears obvious at first look. To answer the question from Dr. Kostek: the TRITON trial is a bit unusual in its design because it was powered to test the effect of prasugrel in UA/NSTEMI, and the investigators specifically stated in their Methods article that the study was not designed to assess STEMI. They included STEMI patients, I suppose, so that there would be some experience with that type of MI, but they were always going to be limited in what they could conclude about that group. There appears to be no difference in the effect of prasugrel compared with clopidogrel between patients with STEMI and those with UA/NSTEMI. If you really felt that this was a hypothesis worth testing (that patients with STEMI respond differently), you would need to design a study specifically for that question.
With respect to the comments by Dr. Sur: it is hard to disentangle industry influences, which can be subtle, but I believe these investigators to be honorable. I disagree with what they did and what they concluded, and I believe their approach to be incorrect, but I do not mean to impugn their reputations. These types of discussions are best kept to the science, leaving the politics aside.
So I am perusing my current issue of JACC and I notice that there is an ad for Effient (prasugrel) that is all about the subgroups in TRITON. And the language is misleading, in my opinion. They tout the reductions in thrombotic CV events in diabetes subgroups and state, “The greater reduction in the primary composite endpoint in patients with diabetes treated with Effient plus ASA compared with Plavix plus ASA was consistent with those observed in the overall UA/NSTEMI and STEMI populations.” By saying the ‘greater reduction’ they suggest a signal of great benefit, but the rest of the sentence concedes that the effect in diabetics is no different than in non-diabetics. So why the emphasis on the greater reduction? Many who are not familiar with the study might think the drug is more effective in diabetics, which the trial does not indicate. And there is more. For bleeding they present a beautiful figure showing that Effient has a higher risk of non-CABG-related major bleeding in the entire study group (2.2% vs 1.7%). Beside it is another bar graph suggesting that the rates in diabetes are similar (2.2% vs 2.3%). That would be great, except that there is no evidence that the bleeding risk in diabetics is different than in non-diabetics. And in small font, very small font, under the figures, it says “P value not provided because the trial was not designed to prospectively evaluate bleeding differences in subgroups.” I wish that were in a larger font. Take a look at the ad and see what you think.
The problem with subgroup analysis is that you should specify before the study begins how many subgroups you wish to examine and compare. You are really mathematically increasing the (n) of the group. It’s not unlike specifying you are going to use a two-tailed t-test in case the drug can harm as well as help the patient, and then just doing the one-tailed test because using the two-tailed test did not show any results of significance. As soon as you increase the number of studies in any manner, then you need more “positive” results to reach statistical significance. If you do a separate subgroup analysis, the result, if significant, should be treated as a suggested hypothesis, and then another study must be done to verify the hypothesis.
Dr. Krumholz,
What is the best method of correcting the interaction for multiple testing between subgroups? Is there a standard or does this vary by what gives the “best” result? It seems that the Bonferroni correction (0.05 divided by the number of subgroups or comparisons examined) may be considered too strict (especially if there are a large number of subgroups), and the choice of a p value seems somewhat arbitrary.
Competing interests pertaining specifically to this post, comment, or both:
None
Dr. Rosenstein,
This is a really excellent question – and one that does not have a particularly good answer. The OASIS7 group, in the face of many comparisons (including 13 subgroups), set 0.01 as their significance level without any justification – and they mentioned it in the Discussion instead of the Methods. I agree that Bonferroni is strict. I wonder if we could get one of the NEJM statistical reviewers to give us their sense of what is acceptable to them.
HK
The perils of subgroup analyses are well documented. They are, in essence, cohort analyses within a randomized clinical trial. They are not always pre-specified, and even when they are, the important tests for significance (interaction tests) are often not reported or are not adjusted for multiple comparisons.
Even when subgroup analyses are properly performed, they require further validation in subsequent trials. Wouldn’t it be great if leading journals provided for their advertisements even a fraction of the oversight and rigor they provide for their published manuscripts?
The results from both the standpoint of efficacy and safety in the STEMI population of TRITON are predictable. As noted earlier, the benefits in the STEMI population mirror those of the entire cohort. They are significant in this subgroup analysis because of the size of the subgroup and the frequency of the endpoints ~10% to 12%. The lack of any finding of non-CABG bleeding results from the fact that this outcome is so much more rare (~2% to 3%). This makes the analysis much less likely to detect a difference in safety even if it were to exist.
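The effect of event frequency on the chance of detecting a difference can be sketched with a normal-approximation power calculation for two proportions. Everything numeric here is an assumption chosen for illustration: the per-arm sample size, the event rates, and the shared relative risk of 0.8 are hypothetical, not TRITON data, and the formula ignores the negligible opposite-tail probability.

```python
import math

def normal_cdf(x):
    """Standard normal cumulative distribution function."""
    return 0.5 * math.erfc(-x / math.sqrt(2))

def approx_power(p1, p2, n_per_arm, z_alpha=1.96):
    """Approximate power of a two-sided two-proportion z-test (alpha = 0.05)."""
    se = math.sqrt(p1 * (1 - p1) / n_per_arm + p2 * (1 - p2) / n_per_arm)
    return normal_cdf(abs(p1 - p2) / se - z_alpha)

n = 1750  # hypothetical patients per arm in a subgroup
common = approx_power(0.12, 0.096, n)   # frequent outcome, relative risk 0.8
rare = approx_power(0.025, 0.020, n)    # rare outcome, same relative risk

print(f"power, frequent outcome: {common:.2f}")
print(f"power, rare outcome:     {rare:.2f}")
```

With the same relative effect and the same sample size, the rarer outcome yields far fewer events and therefore much lower power.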
I agree with Paul’s statement above that subgroup analyses have lots of problems and need to be carefully interpreted. We have relied on them too long for understanding the extent of heterogeneity a treatment may have in its risks and benefits. A colleague at my institution, Rod Hayward, has been pushing for more innovative ways to analyze these large trials in light of a patient’s multivariable risk – not just a single-factor in a subgroup. For those who haven’t read it, I would suggest checking out his paper with David Kent in JAMA from a few years ago (JAMA 2007;298:1209-1212). It’s an eye-opener.
Competing interests pertaining specifically to this post, comment, or both:
None.
We write in response to Dr. Rosenstein’s question about the best method for correcting for multiple comparisons when performing several tests for interaction between treatment and variables defining subgroups.
We agree with the general approach described by Dr. Krumholz, namely, that one should always begin a subgroup analysis with a test for interaction and then claim that treatment effects differ between patient subgroups only if the test for interaction is statistically significant. The question then is how to assess tests for interaction when several subgroup analyses are performed.
The Bonferroni and related procedures control the Family-wise Error Rate (FWER), defined as the probability of making one or more type I errors among all the hypotheses tested. For confirmatory testing in randomized clinical trials, it is appropriate to control the FWER (Koch and Gansky 1996). The Bonferroni method, dividing alpha by the number of tests, is a simple procedure that is always valid regardless of the correlation among tests, but it is conservative and the conservatism increases with the number of tests and the correlation among tests. Several modifications of the Bonferroni procedure have been developed (Holm 1979, Simes 1986, Hochberg 1988, Hommel 1988). These procedures offer improvement in power over the Bonferroni method and are in general preferable.
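As an illustration of how one of these modifications gains power, the sketch below places the plain Bonferroni rule next to the Holm (1979) step-down procedure. The p-values are hypothetical interaction tests, and the function names are ours.

```python
def bonferroni_reject(p_values, alpha=0.05):
    """Reject each hypothesis whose p-value is at most alpha / m."""
    m = len(p_values)
    return [p <= alpha / m for p in p_values]

def holm_reject(p_values, alpha=0.05):
    """Holm step-down procedure; returns rejections in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for rank, i in enumerate(order):
        # The k-th smallest p-value (rank = k - 1) is compared with alpha / (m - k + 1).
        if p_values[i] <= alpha / (m - rank):
            reject[i] = True
        else:
            break  # stop at the first failure; all larger p-values are retained
    return reject

p_values = [0.010, 0.40, 0.015, 0.20]  # hypothetical
print("Bonferroni:", bonferroni_reject(p_values))
print("Holm:      ", holm_reject(p_values))
```

Both procedures control the FWER, but here Holm also rejects the second-smallest p-value, which Bonferroni retains.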
We make a sharp distinction between confirmatory and exploratory testing. In any study, one should explore the data thoroughly to discover nuances in the data and generate new hypotheses. In these exploratory analyses, one can consider approaches that control the false discovery rate (FDR), defined as the expected proportion of falsely rejected hypotheses (Benjamini and Hochberg 1995). This approach has been used with some success in analyses of genome-wide association studies.
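For exploratory work, the Benjamini-Hochberg step-up procedure can be sketched in a few lines. The p-values below are hypothetical, and the function name and the choice of q = 0.05 are assumptions for illustration.

```python
def benjamini_hochberg_reject(p_values, q=0.05):
    """Benjamini-Hochberg step-up procedure; returns rejections in input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest rank k (1-based) whose p-value is at most k * q / m.
    cutoff_rank = 0
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= rank * q / m:
            cutoff_rank = rank
    # Reject every hypothesis ranked at or below that cutoff.
    reject = [False] * m
    for i in order[:cutoff_rank]:
        reject[i] = True
    return reject

p_values = [0.028, 0.90, 0.004, 0.50, 0.025]  # hypothetical
print(benjamini_hochberg_reject(p_values))
```

Note the step-up behavior: 0.025 exceeds its own threshold (2 × 0.05 / 5 = 0.02) but is still rejected because the next-ranked p-value, 0.028, meets its threshold of 0.03.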
As an alternative to formal adjustment for multiple comparisons, the investigator can specify how many tests of true null hypotheses would be expected to be significant depending on the number of total tests performed (Lagakos 2006). This can be a useful way to put the statistical tests reported in a paper into proper perspective. This gives a global assessment of whether the overall subgroup findings could be explained by chance, rather than a corrected p-value for each test. If the number of significant subgroup findings is less than or equal to the number we expect by chance, we suspect that these findings are false positives. If we observe more significant subgroup findings than would be expected by chance, we are more confident that some of these findings are true positives. It would then be important to examine each of these findings more carefully, considering effect size, strength of evidence and scientific plausibility, to determine which hypotheses merit further investigation.
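The expected-count idea can be sketched as follows, under the simplifying assumption that the tests are independent and that every null hypothesis is true. The test counts are arbitrary illustrations.

```python
from math import comb

def expected_false_positives(n_tests, alpha=0.05):
    """Expected number of significant results when all n null hypotheses are true."""
    return n_tests * alpha

def prob_at_least(k, n_tests, alpha=0.05):
    """P(at least k of n independent true-null tests are significant at level alpha)."""
    return sum(comb(n_tests, j) * alpha ** j * (1 - alpha) ** (n_tests - j)
               for j in range(k, n_tests + 1))

for n in (10, 20, 50):  # illustrative numbers of subgroup tests
    print(n,
          round(expected_false_positives(n), 2),
          round(prob_at_least(1, n), 3))
```

If a paper reports about as many significant subgroup findings as the first column of output predicts, chance alone is a sufficient explanation.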
Rui Wang, Ph.D.
James H. Ware, Ph.D.
Harvard School of Public Health
References:
Benjamini, Y., & Hochberg, Y. (1995). Controlling the false discovery rate: a practical and powerful approach to multiple testing. Journal of the Royal Statistical Society, Series B 57, 289-300.
Hochberg, Y. (1988). A sharper Bonferroni procedure for multiple tests of significance. Biometrika 75, 800-802.
Holm, S. (1979). A simple sequential rejective multiple test procedure. Scandinavian Journal of Statistics 6, 65-70.
Hommel, G. (1988). A stagewise rejective multiple test procedure on a modified Bonferroni test. Biometrika 75, 383-386.
Koch, G.G., & Gansky, S.A. (1996). Statistical considerations for multiplicity in confirmatory protocols. Drug Information Journal 30, 523-533.
Lagakos, S.W. (2006). The challenges of subgroup analysis: Reporting without distorting. N Engl J Med 354, 1667-1669.
Simes, J.R. (1986). An improved Bonferroni procedure for multiple tests of significance. Biometrika 73, 751-754.
Competing interests pertaining specifically to this post, comment, or both:
None.