In previous postings (Intro, Part 1, Part 2, Part 3), I have provided background on KV’s Makena (17α-hydroxyprogesterone caproate injection, aka 17P). The development and regulatory history contains many lessons. In this posting I’d like to examine the difference between statistical and clinical significance. Please note that this is not meant as a rigorous treatment of statistics, but as a discussion of how FDA evaluates a study.
The original NDA filing was based on a single published study. The NDA was given a normal review by the various disciplines, including statistics and clinical. Let’s start with the statistical review.
All review disciplines within FDA have guidances. The FDA statistician stated she was guided by Providing Clinical Evidence of Effectiveness for Human Drug and Biological Products. The overwhelming obstacle for the reviewer was the single study – FDA regulations generally call for two studies, though recent legislation has defined the basis for accepting a single study.
The objective of the single study was to evaluate 17P in the reduction of preterm births. Thus, in her review, the statistician evaluated the results at delivery <37-, <35- and <32 weeks and summarized her findings in the following table extracted from her review.
The statistician listed her main objections:
- The treatment effect at 37 weeks does not appear to be consistent among groups defined by gestational age at randomization. This finding may be confounded with race and study center.
- Lack of consistency of efficacy results among subgroups defined by race.
  - For subjects who were black, the benefit of 17P compared with Placebo appears to emerge at around 24 weeks.
  - For subjects who were non-blacks, a treatment benefit does not emerge until 35 weeks gestation.
- Lack of consistency of safety results at Week 24 among subgroups defined by race.
  - Among subjects who were black, the estimated rate of fetal and neonatal losses was 6% for subjects, regardless of treatment assignment.
  - Among subjects who were non-black, subjects randomized to Placebo did not have any fetal or neonatal losses compared with an estimated rate of 9% among those randomized to 17P.
- The doubling of the treatment effect from <35 weeks to <37 weeks is likely due to the increased number of deliveries among non-black subjects randomized to Placebo.
The clinical reviewer cited the same table above to conclude in her review that “[t]he Applicant submitted a single phase 3 clinical trial which demonstrated a statistically strong (p<.001) reduction in the incidence of preterm births prior to 37 weeks gestation, the protocol pre-specified primary endpoint.” Further, “[t]he reduction in preterm births at earlier gestational ages (i.e., <35 weeks and < 32 weeks), although statistically significant, did not meet the level of statistical significance generally expected to support approval of a drug product based on the findings from a single clinical trial.”
This difference in interpretation between statistical and clinical reviewers was strong enough that the NDA was presented to the Advisory Committee (see my discussion of this meeting in Part 2). This Committee agreed the findings of this study were strong enough to warrant approval for <37 weeks gestation. Importantly, the clinical reviewer augmented the single study with literature: “There is recent evidence that “late preterm births” (births between 34 0/7 and 36 6/7), which comprise 71.3% of all preterm births, are increasing, and suffer greater neonatal and childhood morbidity and mortality than previously thought (Adams-Chapman 2006, Tomashek 2007, McIntire 2008, Martin 2009, The Consortium on Safe Labor 2010).” Thus, she found clinical evidence to support the statistical finding: the observed effect was both statistically and clinically relevant.
Let’s look at this in another way. It should be apparent that a result can be statistically significant but clinically irrelevant. The converse is also true. The following chart illustrates the point.
In the case of Makena, the results were all statistically significant, but the lower limits of the confidence intervals for delivery <35- and <32 weeks were close to nil (-0.4% and -0.3%, respectively). Hence, an additional, confirmatory study was made a condition of NDA approval.
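For readers who like to see the arithmetic, here is a rough Python sketch of how a confidence interval on the difference in preterm birth rates can separate "statistically significant" from "clinically convincing." To be clear about what is assumed: the counts are hypothetical and are not taken from the Makena trial, the 5-percentage-point minimum clinical benefit is an arbitrary threshold I picked for illustration, and the simple unadjusted Wald interval used here is not necessarily the method the FDA reviewers applied.

```python
import math


def risk_difference_ci(events_placebo, n_placebo, events_treated, n_treated, z=1.96):
    """Absolute risk reduction (placebo rate minus treated rate) with a
    simple, unadjusted Wald interval; z=1.96 gives roughly 95% coverage."""
    p_placebo = events_placebo / n_placebo
    p_treated = events_treated / n_treated
    diff = p_placebo - p_treated
    se = math.sqrt(
        p_placebo * (1 - p_placebo) / n_placebo
        + p_treated * (1 - p_treated) / n_treated
    )
    return diff, diff - z * se, diff + z * se


def interpret(lower, upper, min_clinical_benefit):
    """Return (statistically_significant, clinically_convincing).

    Statistically significant: the interval excludes zero.
    Clinically convincing: even the lower limit meets the chosen
    minimum benefit (a hypothetical threshold, not an FDA standard).
    """
    statistically_significant = lower > 0 or upper < 0
    clinically_convincing = lower >= min_clinical_benefit
    return statistically_significant, clinically_convincing


if __name__ == "__main__":
    # Hypothetical counts only: (placebo events, placebo n, treated events, treated n)
    scenarios = {
        "large trial, small effect": (5200, 10000, 5000, 10000),
        "small trial, large effect": (11, 20, 6, 20),
    }
    for label, counts in scenarios.items():
        diff, lower, upper = risk_difference_ci(*counts)
        significant, convincing = interpret(lower, upper, min_clinical_benefit=0.05)
        print(
            f"{label}: difference {diff:.1%}, interval {lower:.1%} to {upper:.1%}, "
            f"statistically significant={significant}, clinically convincing={convincing}"
        )
```

The first scenario excludes zero but falls short of the chosen clinical threshold, while the second has a large point estimate whose interval still crosses zero. A lower limit sitting near nil, as with the <35- and <32-week results, falls between those extremes, which is why a confirmatory study is the natural request.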