March 23rd, 2015
Putting PROMISE in Greater Context
Daniel Mark and Larry Husten, PhD
The PROMISE trial, which was presented and discussed in a Journal Club forum at ACC.15, prompted some interesting discussion when CardioExchange covered it as a news story on March 14. Dr. Daniel B. Mark, one of the PROMISE investigators, now sheds further light on the trial in response to questions from CardioExchange news writer Larry Husten.
Click here for CardioExchange’s news coverage of PROMISE and the related discussion.
Click here for a short interview with Dr. Pamela S. Douglas, PROMISE’s lead investigator.
THE INTERVIEW
Husten: Why did the NIH shorten the follow-up period rather than increase it (or stop the trial)?
Mark: I am not in a position to speak for the NIH, nor am I privy to its decision-making process. But given the recent budgetary pressures it has faced, I suspect that it is almost never in a position to add funding to an existing trial. At least that has been the feedback I have heard repeatedly from the NIH about multiple large RCTs. Also, it’s worth remembering that many of the primary outcome events were seen in the first year, so this may be a good example of a trial where longer follow-up does not add treatment-related events as efficiently as enrolling more patients who are followed for a shorter period. Given a fixed amount of funds, one has to decide the best way to meet the trial objectives, which is never a simple decision in an ongoing trial.
Why would they stop the trial? That question suggests that you believe that the only benefit of a trial is to hit the target p value on the primary endpoint. I would not agree with that position at all.
Husten: Given the failure to establish either superiority or noninferiority, how can the trial support a change in guidelines, as Dr. Douglas suggested several times at ACC?
Mark: As we discussed in the Journal Club session at ACC, to formulate a project as a fundable grant to the NIH, it is almost always advisable to pose it as a test of superiority. Noninferiority may work well enough for FDA drug approval, but NIH trials are very different.
That said, proposing a superiority hypothesis does not mean that the expert consensus was that superiority was likely to be true. It was possibly true, but also possibly not. What the hypothesis does is take a complex question and make it tractable. By specifying a superiority hypothesis with all the associated parameters, we can use the Neyman–Pearson hypothesis-testing model to calculate sample size and power. But most of what goes into that model is a guess. We clothe the process in an aura of scientific rigor, as if that imbues it with some special authority, but in the end it is mostly educated guesswork. That is not a bad thing; it is the reality of how we figure things out. We currently have no better way to decide how big a trial needs to be.
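To make that guesswork concrete, here is a minimal sketch of the standard Neyman–Pearson sample-size calculation for comparing two proportions. The event rates, alpha, and power below are illustrative assumptions for a PROMISE-like composite endpoint, not the trial’s actual design parameters.

```python
import math
from statistics import NormalDist  # standard library, Python 3.8+

def sample_size_per_arm(p_control: float, p_experimental: float,
                        alpha: float = 0.05, power: float = 0.90) -> int:
    """Approximate per-arm sample size for a two-sided superiority test
    of two event proportions, using the normal approximation."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)  # critical value for two-sided alpha
    z_beta = z.inv_cdf(power)           # quantile for the desired power
    variance = (p_control * (1 - p_control)
                + p_experimental * (1 - p_experimental))
    delta = p_control - p_experimental  # the assumed treatment effect
    return math.ceil((z_alpha + z_beta) ** 2 * variance / delta ** 2)

# Illustrative only: assumed composite event rates of 4% vs. 3%
# require roughly 7,100 patients per arm (about 14,200 total).
print(sample_size_per_arm(0.04, 0.03))
```

Every input here (the assumed event rates, the effect size, the choices of alpha and power) is exactly the kind of educated guess described above; change any of them modestly and the “required” trial size swings by thousands of patients.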
There are also budgetary constraints. NIH does not have unlimited funds to invest in every RCT, so one has to negotiate a reasonable sample size for a trial based on both a reasonable power story and an acceptable budget. As you know, trials are enormously expensive and we therefore require much more of them than is probably reasonable. That is to say, we expect each trial to definitely settle all our disputes and resolve all our uncertainties — and that simply is not a realistic expectation. Each trial is, in scientific terms, one set of measurements. To really understand something in science, we need to do repeated sets of measurements (multiple trials), but we mostly can afford only the one set. So we try to find clever ways of statistical argument that make it seem as if the one can serve for the many we actually need.
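When repeated measurements do exist, there is a standard way to make the many serve as one: inverse-variance pooling of the per-trial estimates. A minimal sketch, with entirely hypothetical effect estimates rather than any real trial data:

```python
def pooled_estimate(estimates, variances):
    """Fixed-effect inverse-variance pooling of per-trial effect
    estimates (e.g., log hazard ratios)."""
    weights = [1.0 / v for v in variances]
    pooled = sum(w * e for w, e in zip(weights, estimates)) / sum(weights)
    return pooled, 1.0 / sum(weights)  # pooled estimate and its variance

# Hypothetical log hazard ratios from three trials of the same question:
effect, var = pooled_estimate([-0.10, 0.05, -0.02], [0.04, 0.09, 0.02])
print(effect, var ** 0.5)  # pooled effect and its standard error
```

Each additional trial shrinks the pooled variance; that is the benefit of the repeated sets of measurements we usually cannot afford.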
The FDA set up this whole set of fixed rules years ago, in order to operationalize its mandate from Congress to establish that FDA-approved drugs are safe and effective. Because the FDA rules seemed so rigorous, others started adopting them, and we are now at a place where many clinicians and some statisticians actually think these rules ensure “correct” science. If only it were that simple! The rules are a form of theater. They are probably quite reasonable for FDA decision-making, given the pressure cooker of interests that converge on each decision that agency makes. But there is no reason for the rest of us to adopt those rules as gospel, except that it liberates us from the responsibility to think about things more deeply and insightfully, and to exercise judgment. Judgment is an essential part of the scientific process but is often disparaged by those who seek to make science completely objective, above all need for human interpretation.
PROMISE is incredibly valuable because we enrolled the group of patients we wanted, those with about a 50% pretest probability of disease, and got a sample of 10,000 (more than 50% women) with real outcome data, something never before achieved. They are high-quality randomized data from which we can learn much. Yes, some judgment is required to interpret the findings, but that is always the case. Think about what we know now in the context of the anxieties about CTA before the trial was done. If creation of guidelines were simply a matter of applying the fixed rules, perhaps PROMISE would not be so useful.
But that is not the way guidelines are actually written, so it is quite likely, in my view, that the guideline committees will look at these data and alter their recommendations about the use of CTA in light of PROMISE. Guidelines in this area do not state that one test is better than another. How could they? They state that various tests are reasonable to use for a given problem, given the evidence available. But, of course, we have no knowledge of any actual official guideline action or position, so this is simply my opinion.
Husten: Are you concerned that the higher rate of revascularization with CTA did not lead to improved outcomes?
Mark: With respect to the extra revascularization procedures, PROMISE was not nearly large enough to detect the effect of a shift of this size in revascularization on outcomes. Consider that ISCHEMIA is studying revascularization versus optimal medical therapy in a higher-risk population of 7000 patients with moderate or severe ischemia to get at this answer, and it quickly becomes clear that PROMISE cannot answer the question of how these patients’ outcomes were altered by the incremental procedures. But just as we cannot prove that the patients benefited from the procedures, it would be an error to conclude that they did not benefit. It’s like trying to look at a virus with an optical microscope: you simply do not have the resolving power. But that does not mean that viruses do not exist.
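The microscope analogy can be put in rough numbers. Here is a sketch of the approximate power to detect a small absolute difference in event rates with PROMISE-sized arms; the rates and sample size below are illustrative assumptions, not the trial’s observed data.

```python
from statistics import NormalDist  # standard library, Python 3.8+

def power_two_proportions(p1: float, p2: float, n_per_arm: int,
                          alpha: float = 0.05) -> float:
    """Approximate power of a two-sided test comparing two event
    proportions with n_per_arm patients per group (normal approximation)."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)
    se = ((p1 * (1 - p1) + p2 * (1 - p2)) / n_per_arm) ** 0.5
    return z.cdf(abs(p1 - p2) / se - z_alpha)

# Illustrative only: a 3.0% vs. 3.5% difference with ~5,000 per arm
# would be detected with only about 29% power.
print(round(power_two_proportions(0.030, 0.035, 5000), 2))
```

At that resolution, a null result on such a downstream effect tells us very little in either direction, which is the point of the analogy.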
Husten: Do the findings support a no-imaging/watchful-waiting strategy?
Mark: No imaging/no stress testing was not studied in PROMISE, so the study data do not address that option. Remember that all of these patients went to their doctor with chest pain or dyspnea that the doctor thought could be due to obstructive CAD, and the doctor felt that noninvasive stress testing was indicated to clarify the nature of the patient’s problem. The patients wanted to know whether they were OK, and the doctors did not feel comfortable reassuring them on the basis of an office exam alone.
Husten: If you had known the results of this trial 15 years ago, would we or should we have spent many millions of dollars to make this technology ubiquitous?
Mark: The ability to see clearly inside the body without having to cut patients open is a huge advance in medicine. The 1979 Nobel Prize was awarded for development of CT. It’s clearly worth the money. The question of how to use it wisely is still a work in progress. I believe that PROMISE gets us one big step further along that road. It is also worth noting that the CTA technology is not currently widely reimbursed in the outpatient setting, in large part because of the pre-PROMISE controversies about how this technology would affect practice. The PROMISE data should also provide more comfort on that score.
One final point: It is worth remembering what evidence led us to think that functional stress testing with echo and nuclear methods was the standard for high-quality cardiovascular care. It was primarily sensitivity and specificity data, mostly from different cohorts (echo data in one, nuclear in another). The stress ECG was the only test measured in each of those cohorts, but the selection biases involved in creating the stress-imaging cohorts, relative to a standard stress-ECG cohort, were mostly ignored by people interested in proving that their technology was better.
JOIN THE DISCUSSION
What’s your take on PROMISE and Dr. Mark’s analysis of it?
I think that the study does not allow any conclusive comparison between CTA and functional stress tests. Yes, there are other data, like risk estimates, that are meaningful. This is a “negative trial”, but it is still worth publishing.
I found Dr. Mark’s discussion of trials as scientific measurements very insightful. It will be interesting to see how guidelines change as a result of the trial.
So I take it that PROMISE shows CTA is no better or worse than functional testing, and that even the higher rate of revascularization with CTA did not lead to improved outcomes; one more time, the result follows the contemporary scientific evidence.
Now, after all this, do we need a new trial to evaluate noninferiority for FDA drug approval versus noninferiority for NIH approval? Which is better?
The comments from Dr. Mark are insightful in blending the art and science of clinical trial interpretation.
A few observations come to mind in response.
1. There is a prevailing school of thought that underpowered trials yielding inconclusive results are not only unscientific but also unethical. Despite this, underpowered trials are still quite prevalent. Of course, from a Bayesian perspective (where learning is deemed an iterative process), even an underpowered trial that gives inconclusive results can contribute to the development of a more complete picture.
2. The effect sizes observed in trials seldom match the assumptions that go into the power calculations, highlighting the disconnect between the expectations of theory and the realities of practice. It is no secret that power calculation (and sample size estimation) are often driven more by trial feasibility (and expediency) than realistic expectations.
3. Guidelines are more likely to be adopted universally (by all stakeholders) if they are informed by clinically and statistically persuasive, high-quality evidence. Shifting from a class IIb (‘may be considered’) to a class IIa (‘is reasonable to consider’) recommendation might arguably be justified by the PROMISE results, but it would hardly count as an upgrade from the status quo. The PROMISE results are insufficient for an upgrade to a class I (‘should be considered’ as a ‘first-line tool for patients with stable chest pain’) recommendation. As responsible custodians, the PROMISE investigators should not ‘overinterpret’ the trial results or allow their enthusiasm to exceed the evidence.
The reality remains that revascularization does not change outcomes except in the circumstance of aborting an acute MI.
I question whether CTA would be superior to CAC in assessing risk and stratifying therapy. The use of CAC would avoid the temptation of the unnecessary and valueless revascularization seen with CTA.