Or: How much should we distrust significant results from tests with low power?
Disclaimer: I’ll explain it further down below, but at no point should this blog post be used as an excuse to conduct underpowered studies! Also, I realized that this post turned out to be more “opinionated” than I planned. So please be aware that what I write here is just my opinion – if you disagree, I would be happy to hear about it in the comments or on Twitter!
Some time ago, a colleague came to me and asked me for advice. He had found a nice and theoretically interesting significant result – but for various reasons, his sample size was rather small, and he deemed it likely that his test had been underpowered. He was therefore worried that the result was not credible (his exact word was “robust”), and asked me if there was any way to show that the test indicated a real effect, despite the low power. Something about his concern felt weird to me, but I couldn’t quite nail it down. I said I didn’t really know how to help at the moment, and the issue resolved itself eventually – but I kept wondering what felt so off for me about his question. Additionally, I started noticing similar concerns more often. Talking with colleagues, following discussions online or reading reviews, time and again I heard or read things like “the result is significant, but the sample is quite small, so it’s likely to be just a fluke”, or “Because this study obviously has very low power, I don’t find these results credible”.
So, at least in my community (psychologists and neuroscientists), there appeared to be something of an unspoken consensus that significant results from low-powered tests were in some way less credible than significant results from highly-powered studies. I guess it did sound right… if living through the replication crisis had taught us one thing, it was that „High power = GOOD! Low power = BAD“. But I realized eventually that the notion felt so weird to me because it conflicted with my intuition of how frequentist significance tests are supposed to work. If test results from small samples are meaningless, why do statistical programs give me critical values of the t-distribution for dfs < 10, instead of telling me „these values don’t make any sense, you idiot“ – or a more polite version of the same? Of course a low-powered test is bad at detecting an existing effect – but if it does deliver a significant result, is it then also more probable that the effect does not really exist? But wasn’t the whole shtick of frequentist significance tests that they would, at least, nicely control the number of false-positives? Something didn‘t add up for me. So, I now finally decided to think these things through a bit, and hope that you find some of my ruminations useful.
First of all, let‘s translate the vague concern about the credibility of low-powered tests into statements that we can talk about in formal statistical terms. I believe that the concerns described above can all be translated into one of the following two propositions:
- Tests with low power are more likely to give a significant result although there is no effect. In other words, tests with low power have an increased type-I-error rate.
- Significant results from low-powered tests are less likely to be due to a real effect. In other words, the probability that H1 is true in view of a significant result from a low-powered test is not as high as it would be if the test had greater power.
In terms of statistical “schools” of inference, proposition #1 is something that a frequentist analyst might worry about, while proposition #2 indicates a more Bayesian perspective. My research community consists mostly of frequentists, with some Bayesianism sprinkled on top whenever useful (how I feel about such statistical eclecticism should become more apparent later on). Therefore, I think it makes sense to look at these concerns from both a frequentist’s and a Bayesian’s perspective, and see what there really is to worry about.
Let‘s first put ourselves into the somewhat worn-out, but comfortable, old-fashioned but solid-until-you-look-closer shoes of a frequentist. To recap, when we perform a frequentist significance test, we …
- … make assumptions regarding what our data should look like if the null hypothesis were true;
- … identify the range of statistics that are so extreme that we would observe one of them in only α% of cases if the null hypothesis were true;
- … check if the statistic from our sample belongs to the α% of cases. If it does, we call the result „significant“, and regard it as evidence that the null hypothesis is not true.
Provided that our assumptions are correct, we can rest assured that we will make a type-I-error in only α% of the cases, meaning that we will obtain a significant result in only so many cases if the null hypothesis is actually true. Now, the thought that keeps every devout frequentist up at night is that the type-I-error rate might actually be higher than α. The frequentist spent so much effort, sacrificed so many things, just to be comfortably wrong in a maximum of α% of cases, that an actually increased error rate would spell certain doom. Does low power invoke this miniature apocalypse, summon forth the Beelzebub of failed error control? Well, in principle, no! The power of the test does not enter into the calculation of critical thresholds or p-values. If the assumptions of your test hold, and the null hypothesis is true, you will obtain a significant result in only α% of the cases, independent of your sample size or power. So, if you believe that the test assumptions are met, and that p < 0.05 tells you something meaningful about the world, you should be as convinced by a significant result from a sample of ten subjects as from a sample of one thousand. However, the “ifs” in the preceding sentences are very important: Most tests that assume normally distributed data (read: nearly all of them) are only robust to deviations from normality if the sample size is reasonably large. If the sample size is too small, violated assumptions are likely to change the type-I-error rate – but whether the error rate increases or decreases depends on the test and the type of deviation. So, if you have good reason to worry that the assumptions of your test are not met, small samples are a problem, and you might end up with an increased type-I-error rate. However, modern statistics provides robust alternatives to most tests, so even this issue can be resolved in most cases.
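The claim that α does not depend on sample size can be checked with a quick simulation. The following is only a minimal sketch, assuming a two-sided one-sample t-test on normally distributed data with H0 true; the function name `false_positive_rate`, the sample sizes, and the number of repetitions are illustrative choices, and the critical values are hardcoded from standard t tables.

```python
import math
import random

# Two-sided 5% critical values of Student's t for df = n - 1,
# hardcoded from standard tables (assumption: one-sample t-test).
T_CRIT = {10: 2.262, 100: 1.984}


def false_positive_rate(n, reps=20000, seed=1):
    """Share of significant t-tests when H0 (population mean = 0) is true."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(reps):
        sample = [rng.gauss(0.0, 1.0) for _ in range(n)]
        mean = sum(sample) / n
        var = sum((x - mean) ** 2 for x in sample) / (n - 1)
        t = mean / math.sqrt(var / n)
        if abs(t) > T_CRIT[n]:
            hits += 1
    return hits / reps


rates = {n: false_positive_rate(n) for n in T_CRIT}
for n, rate in rates.items():
    print(f"n = {n:3d}: false-positive rate = {rate:.3f}")
```

Both rates should land close to 0.05, up to simulation noise. Swapping `rng.gauss` for a strongly skewed distribution would be one way to probe the “ifs” about violated assumptions in small samples.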
So, as long as the assumptions of a test are fulfilled, or a robust alternative is available, a frequentist should not be worried that low power leads to an increased type-I-error rate. But what about the other concern, that significant results from low-powered studies come with a lower probability that the H1 is true? Well, it is true that significant results tell the frequentist little about this probability if the test had had low power – but that’s also the case for a perfectly powered test. In fact, no test can tell the frequentist anything about the probability of H1, because it is not even a valid concept within his/her school of thought. (To the best of my understanding, the only probabilities a frequentist can meaningfully assign to hypotheses are 0 and 1, for false and true hypotheses respectively. Of course, whether or not a hypothesis is true or false is not known in most cases, and if it is, there is no need to perform inference.)
We can therefore now step out of the frequentist’s shoes, and get comfortable in the hip sneakers of a Bayesian. We can safely skip proposition #1 (Type-I-errors are of as much interest to a Bayesian as the Immaculate Conception is to a Buddhist) and proceed with #2: should we be worried that low power decreases the probability that significant results indicate a true effect?
This question was famously, and influentially, discussed by Katherine Button, John Ioannidis and colleagues in their paper Power failure: why small sample size undermines the reliability of neuroscience in Nature Reviews Neuroscience. In fact, I am quite certain that many of the worries about low power that my colleagues and friends voiced in the intro stem directly or indirectly from this article. I’d therefore like to address proposition #2 in the form of a short review of Button et al.‘s argumentation. To summarize in advance: the paper raises, without question, many valid points about the perils of a research field seemingly dedicated to underpowered studies. However, I think that it also incited some major misunderstandings regarding the interpretation of significant results from low-powered tests, mainly due to two reasons:
- The authors are not transparent about the fact that their argumentation is deeply dependent on the (Bayesian) assumption that ascribing probabilities to hypotheses is meaningful – a notion that is not consistent with frequentist thought. At the same time, they make heavy use of frequentist vocabulary, misleading the reader into thinking that their argument is based on canonical frequentist theory, when it really is not.
- Under the assumption that probabilities can be assigned to hypotheses, the authors deduce correctly that significant results from low-powered tests are less likely to indicate a true effect than significant results from tests with higher power. However, the authors downplay the surprisingly strong support that significant results from even very underpowered studies give to the hypothesized effects.
But let’s look at the paper in more detail. After the truism that low power comes with a low probability of detecting a true effect, the authors put the following statement into the summary bullet points:
Perhaps less intuitively, low power also reduces the likelihood that a statistically significant result reflects a true effect.
So, that sounds like a proper Bayesian statement – and to be very honest, I started writing this post with the vague memory that Button et al. explicitly invoked Bayes‘ theorem to support their claim. I was thus very surprised when, upon re-reading, I could not find the name „Bayes“ anywhere in the paper, except for a few off-hand mentions of „alternative Bayesian approaches“. What the authors do refer to are two of Ioannidis‘ own previous papers, neither of which contain any prominent reference to Bayesian theory either. Instead, the concept at the center of the authors‘ argumentation is the „positive predictive value“, or PPV – a term that I believe is not used often in psychology and neuroscience, but appears to be more prominent in epidemiology. The PPV of a diagnostic test is the probability that some condition is really present (e.g., a patient really has some disease), given that the test provided a positive result. Button et al. now simply calculated the „PPV“ of a statistical test, by exchanging „patient has some disease“ with „hypothesis is true“, and „positive result“ with „significant result“, to arrive at the probability that the hypothesis is true, given that the test provided a significant result. If you are now wondering „wait, isn‘t that just the posterior probability of the hypothesis being true?“, you are absolutely right. To see that more clearly, consider Bayes‘ theorem for the posterior probability of H1 being true, given a significant result:
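$$
\begin{aligned}
P(H_1 \mid \text{sig}) &= \frac{P(\text{sig} \mid H_1)\,P(H_1)}{P(\text{sig} \mid H_1)\,P(H_1) + P(\text{sig} \mid H_0)\,P(H_0)} \\
&= \frac{P(\text{sig} \mid H_1)\,\frac{P(H_1)}{P(H_0)}}{P(\text{sig} \mid H_1)\,\frac{P(H_1)}{P(H_0)} + P(\text{sig} \mid H_0)}
\end{aligned}
$$
where in the second line I just divided numerator and denominator by P(H0) to show that it is enough to consider the prior odds P(H1)/P(H0). Now, let’s give the terms in the equation their frequentist names, where possible: P(sig|H0) is just good old α, and P(sig|H1) is the power of the test, commonly written as (1-β), where β is the type-II-error rate. The prior odds (which I’ll call R, to be consistent with Button et al.) don’t have a frequentist equivalent because, of course, prior probabilities are not a meaningful concept in frequentism. We thus arrive at
$$
P(H_1 \mid \text{sig}) = \text{PPV} = \frac{(1-\beta)\,R}{(1-\beta)\,R + \alpha}
$$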
which is the exact formula that Button et al. report – without any mention of Bayes’ theorem, or of the fact that assigning prior odds to hypotheses isn’t exactly a very “frequentist” thing to do. Now, with no reference to Bayesianism in sight, but heavy focus on frequentist significance tests and power, any quick reader can be forgiven for assuming that the authors’ arguments are firmly grounded in frequentist theory. However, no one can tell me that sentences like this are not prime examples of Bayesian thought:
[…] the lower the power of a study, the lower the probability that an observed effect that passes the required threshold of claiming its discovery (that is, reaching nominal statistical significance, such as p < 0.05) actually reflects a true effect.
If it walks like a Bayesian duck, and talks like a Bayesian duck, it is a Bayesian duck… right? Now, I do understand that the PPV can be meaningfully applied to medical diagnostic tests without any further reference to the Bayesian school of inference – because the prior probability of some disease being present in a randomly selected member of the population might be known, or can at least be estimated. (In fact, that’s exactly how Bayes‘ theorem is taught in every introductory course on probability theory: If you have a test with sensitivity X and specificity Y and a prevalence of the disease of Z, what is the probability yada yada yada…). However, I am pretty sure that we cannot just treat the probability of a scientific hypothesis being true the same way as the prevalence of some disease, without taking a huge leap into Bayesian territory – or at least, out of frequentist territory, where the assignment of probabilities to hypotheses is not valid. So, to re-emphasize my first point of criticism, I believe it to be a great oversight by the authors to not make clear to their readers that they are criticizing the use of frequentist inference based on notions that are not consistent with frequentist inference. (I mean, I’m thankful if a doctor tells me I have some problem – but I’d probably like to know if he/she is basing that diagnosis on a medical tradition I believe in.)
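To make the duck comparison concrete, here is a small sketch showing that the textbook diagnostic PPV and the PPV of a significance test are the same computation once the terms are renamed. The mapping is an assumption spelled out in the comments (sensitivity as power, specificity as 1 - α, prevalence as the prior probability R / (1 + R)); the function names and the example numbers are purely illustrative.

```python
import math


def ppv_diagnostic(sensitivity, specificity, prevalence):
    """P(disease | positive test), straight from Bayes' theorem."""
    true_pos = sensitivity * prevalence
    false_pos = (1.0 - specificity) * (1.0 - prevalence)
    return true_pos / (true_pos + false_pos)


def ppv_significance(power, alpha, prior_odds):
    """P(H1 | significant result) in odds form: (1-beta)R / ((1-beta)R + alpha)."""
    return power * prior_odds / (power * prior_odds + alpha)


# Assumed mapping between the two vocabularies:
#   sensitivity -> power (1 - beta)
#   specificity -> 1 - alpha
#   prevalence  -> P(H1) = R / (1 + R)
power, alpha, R = 0.5, 0.05, 0.25  # illustrative values only
prevalence = R / (1.0 + R)

ppv_a = ppv_diagnostic(power, 1.0 - alpha, prevalence)
ppv_b = ppv_significance(power, alpha, R)
print(f"diagnostic form: {ppv_a:.4f}, significance-test form: {ppv_b:.4f}")
```

The two numbers agree, which is the point: the only non-frequentist ingredient is the “prevalence” of true hypotheses.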
But enough of that – let’s give the authors a break, and look at the problem from the perspective of a Bayesian (or anybody who is comfortable with assigning odds to hypotheses). If we take the above formula and hold α and R constant, we get the posterior probability of H1/the PPV as a function of the power of the test. Some quick calculus tells us that this function is monotonically increasing for all allowed values of α and R. Thus, Button et al. are definitely, absolutely, indisputably mathematically right when they say that with lower power comes lower posterior probability (or PPV). However, that doesn’t mean that low power always leads to practically relevant differences in evidence. Let’s just try out the formula with some representative numbers. Let’s use the typical α of 0.05, and say that we have no prior preference for one hypothesis over the other, so R = 1. When we obtain a significant result from a study with the rather low power of 50%, what is your guess for the posterior probability of H1 being true? If it is anything below 90%, reconsider: it is actually 90.09090…%, or H1 is 10 times more likely to be true than H0, given the significant result. The generally agreed-upon “good” power of 80% would raise that number by only about 3 percentage points, to roughly 94%. But okay, you say, a power of 50% isn’t actually that bad, many studies have much lower power than that. So let’s consider a study with the truly abysmal power of 20%. Clearly, a significant result from such a trainwreck of a failure of a bad excuse of a study will not convince me of anything, with its puny posterior probability of … 80%? So, the H1 is 4 times more likely to be true than the H0?
Huh, that doesn’t actually sound that bad… I mean, sure, it’s not as good as 94% or 90%, but from the way people were talking about low power lately, I assumed that the posterior probability of H1 would be so low it would fall out of the graph, stand up, look at me, climb creepily out of my computer screen and kill everyone I ever loved, before slowly devouring me, and then receding back into the monitor, into the void, waiting, waiting, forever waiting… Or, you know, be at least a bit closer to 50%.
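For readers who want to see the “quick calculus” spelled out, here is one way to do it. This is a sketch in my own notation: p stands for the power (1-β), and α and R are treated as fixed positive constants.

$$
\frac{\partial}{\partial p}\,\frac{pR}{pR+\alpha}
= \frac{R\,(pR+\alpha) - pR \cdot R}{(pR+\alpha)^2}
= \frac{\alpha R}{(pR+\alpha)^2} > 0
\quad \text{for } \alpha > 0,\; R > 0 .
$$

The derivative is positive everywhere, so the posterior probability rises with power. It also shrinks as p grows: with α = 0.05 and R = 1 the slope is 0.8 at p = 0.2 but only about 0.17 at p = 0.5, which is why the step from 50% to 80% power buys so little.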
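Here’s the graph (turquoise curve):
[Figure: P(H1|sig) as a function of power for α = 0.05; turquoise curve for R = 1, red curve for R = 0.25]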
Now, of course, a Bayesian would choose the prior odds R very carefully. Button et al. themselves chose to report the numbers for R = 0.25, reflecting a situation in which we give H1 only a 20% chance of being true. This curve is also in the graph (in red). Of course, P(H1|sig) looks worse under this assumption – but also not too bad, considering how little faith we put into H1 to begin with. In any case, I don‘t quite understand why the authors present such low prior odds as „typical“ (in one of his prior papers, Ioannidis presented graphs similar to the one given here, but with R = 1 as the upper extreme of the scale). For me, it is difficult to believe that many (neuro)scientists would take the very great trouble of conducting studies if they did not have the (implicit or explicit) prior belief that their hypothesis had at least a 50% chance of being true.
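The two curves can be reproduced as a small table. This is a minimal sketch, assuming α = 0.05 and the two prior odds discussed in the text; the function name `posterior_h1` and the selection of power values are my own illustrative choices.

```python
def posterior_h1(power, alpha, prior_odds):
    """P(H1 | sig) = (1 - beta) * R / ((1 - beta) * R + alpha)."""
    return power * prior_odds / (power * prior_odds + alpha)


ALPHA = 0.05
powers = [0.1, 0.2, 0.5, 0.8, 0.95]  # illustrative grid of power values

# One curve per prior odds R: 1.0 (no preference) and 0.25 (Button et al.)
curves = {R: [posterior_h1(p, ALPHA, R) for p in powers] for R in (1.0, 0.25)}

print("power   R = 1   R = 0.25")
for i, p in enumerate(powers):
    print(f"{p:5.2f}   {curves[1.0][i]:.3f}   {curves[0.25][i]:.3f}")
```

With R = 0.25, a power of 20% gives a posterior probability of 0.5 and a power of 80% gives 0.8. With R = 1, the same two powers give 0.8 and about 0.94.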
But these mind games about what prior odds a Bayesian would choose bring me to another issue: A “true” Bayesian would also not bother about p-values and significance tests to begin with! Bayesian statistics offer so many “nicer” ways to perform inference (full posterior distributions, model comparison, model averaging, the odd Bayes factor…) that there is not really a good reason to reduce the data to a Bernoulli trial like that. In contrast, a frequentist who takes their own school of thought seriously cannot assign prior probabilities to H0 and H1, turning the whole enterprise into nothing more than a mind game for them.
That is not to say that I don’t see the merit of Button et al.’s perspective: looking at a frequentist field from a Bayesian (or “epidemiological”) point of view, we must definitely come to the conclusion that the research field as a whole has accumulated less evidence for its diverse H1’s than it could have if studies had had higher power. I do see a problem, however, if this argument is used to dismiss any single significant result as a fluke just because of low power. If a frequentist criticizes the result on these grounds, it is inconsistent, because he/she shouldn’t even care about the posterior probability to begin with – and if a Bayesian (or scientific “epidemiologist”) does it, he/she needs additional arguments, because, as we saw, even significant results from very low-powered studies put a surprising amount of posterior probability mass on H1.
So, what to make of this? Importantly, none of these points should ever be used as an excuse to deliberately plan a study with too little power. All of the reasons why low-powered studies are bad a priori still hold: you probably won’t be able to detect your effect of interest, thus you will waste time and resources, and when you don’t find your effect you’ll be tempted to squeeze findings out of a weak study using questionable research practices, etc. etc… I am just trying to argue that, if we “believe” in the frequentist logic of inference, there is little ground to criticize significant results from a low-powered study a posteriori. If a robust test obtains a significant result, however low the power, we cannot just dismiss it as a fluke without implicitly questioning the logic of frequentist inference as a whole. I am sympathetic to the intuition that a significant result from a study of ten people should be less credible than a significant result from a study of ten thousand – but this intuition is not reflected in the logic of significance tests. (The sample size only affects how small effects can be to still be labelled as significant – but the test remains completely agnostic to whether or not the effects are more or less “credible”). Indeed, if frequentists wanted to obtain some form of posterior measure of credibility, they would probably just have to adopt Bayesian inference – but there is limited place for significance tests in Bayes’ brave new world.
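The parenthetical remark about sample size and the smallest significant effect can be illustrated numerically. The following is a rough sketch under a simplifying assumption that is mine, not the post’s: a two-sided one-sample z-test with known standard deviation, so that the threshold is the normal critical value divided by the square root of n. The function name `smallest_significant_effect` is hypothetical.

```python
import math
from statistics import NormalDist


def smallest_significant_effect(n, alpha=0.05):
    """Smallest |sample mean|, in SD units, that a two-sided z-test calls significant.

    Assumes a known population SD (a simplification; a t-test would need a
    slightly larger threshold for small n).
    """
    z_crit = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    return z_crit / math.sqrt(n)


for n in (10, 10_000):
    d = smallest_significant_effect(n)
    print(f"n = {n:6d}: |mean| must exceed {d:.4f} SD to be significant")
```

With ten people only a large observed effect (roughly 0.62 SD) clears the bar, while with ten thousand an effect of about 0.02 SD is enough. In both cases the false-positive rate under H0 stays at α.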