Short(ish) guide to statistical significance testing

Damir Valput

2018-12-12 16:43:40

It is fairly common today to encounter news titles of the following form: Researchers have shown that drinking hot tea might increase your risk of developing cancer. There is evidence that a certain contraceptive doubles the chance of a thrombosis occurring in a vein. An observational study provides new evidence that skipping breakfast might lead to a heart attack. Lastly, there is a popular similar example from aviation, often discussed in the media: It has been shown that one is significantly less likely to die in an airplane accident if one is seated in a middle seat in the back rows of an aircraft. These alarming-sounding pieces of information are often received by the general public far more seriously than they should be. It is safe to assume that many readers believe that a finding produced by a rigorous scientific method ought to be taken as a better-safe-than-sorry rule for one’s daily life. More alarmingly, the reader will find the word “significant” in the text, which carries an ominous connotation of something not to be played with or ignored. However, how many people reading these headlines are aware of the scientific method behind such findings?

New literacies: internet, information, computer… and now data!

Finding ourselves in the midst of the data science and data revolution, more of us are becoming aware of the role that (big?) data and statistics play in our day-to-day lives. With the data revolution, the term ‘data literacy’ has emerged as well. Wikipedia defines data literacy as “the ability to read, understand, create and communicate data as information”. However, it seems to me that data literacy is still something reserved mostly for experts working with data, and difficult to spread outside the “expert zone”. As data is becoming one of the most valuable and highly priced resources in business and, through algorithms, one of the strongest influences on our decision-making, I feel everyone would benefit from putting effort into understanding the most common statistical practices used today (without getting into all the bells and whistles of statistics).

How do you decide whether you should really stop drinking hot tea, never skip breakfast or always sit in the back of a plane? After all, the findings are there and they have been confirmed to be a statistically significant piece of evidence. Only, what exactly is significance testing and why might it be completely – well, pragmatically irrelevant?

How does a statistical significance test work? (Improbable is not impossible.)

To understand statistical significance testing, we first have to understand the null hypothesis. The null hypothesis is basically the assumption of a scenario in which “nothing will happen” (under that assumption, the intervention we are studying introduces no difference and has no effect whatsoever). For example, the null hypothesis could be that sitting in the back of a plane does not affect one’s chance of surviving an airplane accident.

Significance testing is therefore a form of statistical test in which a researcher starts by assuming that something is true (that is the null hypothesis, commonly denoted H0). The researcher then observes, in the data at hand, an outcome O that would be very improbable if H0 were true. As a result, the researcher rejects H0 as implausible. Notice that we are dealing with probabilities: the reasoning shows that something is highly improbable, but not impossible! That truly makes a lot of difference. Certain questions might immediately pop into one’s mind after this explanation: What exactly does “very improbable” mean? What does that mean in practice? Is this really a proof of a claim? Et cetera.

The measure of improbability used in a statistical significance test is called the p-value: the probability, assuming H0 is true, of observing an outcome at least as extreme as the one actually observed. The p-value is compared against a threshold chosen in advance (the significance level), and the threshold traditionally chosen and commonly accepted in the research community today is 5%. A result below that threshold means that, if there were truly no effect, an outcome at least this extreme would appear in fewer than 1 in 20 samples. How was that value chosen? Pretty much ad hoc (more on that later on). Choosing a lower threshold would make the result more meaningful (that is, more significant in the statistical, but still not necessarily in the practical, sense), because noise alone would be less likely to produce the effect we are reporting.
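
To make the mechanics concrete, here is a minimal sketch in Python of one very simple significance test. Everything in it is an assumption made for illustration, not data from any study: the null hypothesis H0 is that a coin is fair, the hypothetical observation is 60 heads in 100 flips, and the names `binomial_p_value` and `ALPHA` are made up for this example. The function adds up the probability, under H0, of every outcome at least as extreme as the one observed.

```python
from math import comb


def binomial_p_value(successes: int, trials: int, p_null: float = 0.5) -> float:
    """One-sided p-value: P(X >= successes) for X ~ Binomial(trials, p_null)."""
    return sum(
        comb(trials, k) * p_null**k * (1 - p_null) ** (trials - k)
        for k in range(successes, trials + 1)
    )


ALPHA = 0.05  # the conventional 5% threshold discussed in the text

# Hypothetical observation: 60 heads in 100 flips of a coin assumed fair under H0.
p_value = binomial_p_value(successes=60, trials=100)
reject_h0 = p_value < ALPHA

print(f"p-value = {p_value:.4f}")
print(f"reject H0 at the {ALPHA:.0%} threshold: {reject_h0}")
```

With these made-up numbers the p-value lands below 0.05, so H0 would be rejected, whereas 58 heads out of 100 would not clear the same threshold. Two heads decide between "significant" and "not significant", which is worth keeping in mind for the pitfalls that follow.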

A few pitfalls of statistical significance

Criticism of this test is widely discussed in research papers, books and various internet sources (to be fair, people are speaking not so much against the test itself as against its interpretation and usage: after all, the test is not to be blamed, it is doing what it was designed for). In what follows, in order to keep this post sort-of short and retain its original purpose as a mind-tingling, awareness-raising read, I list only some of the most common objections to this method. An interesting paper touching on this subject in the area of aviation research was written by D. C. Ison, titled An Analysis of Statistical Power in Aviation Research. Indeed, no data-driven research area, including aviation research, is immune to these pitfalls.

  1. Arbitrary choice of the p-value threshold. As already mentioned, setting the p-value threshold at 0.05 stems from historical convention: R. A. Fisher, the pioneer of statistical significance testing, argued back in 1925 that a value of 0.05 was “good enough”. That value stuck and is prevalent to this day. However, this way we are treating a continuous variable as a binary one: if the p-value of an effect falls below the threshold, the effect is statistically significant; if not, it is not. Opponents of this interpretation suggest, for example, using confidence intervals instead (that is, reporting the estimated parameter together with an interval that conveys how uncertain the estimate is), or adapting the threshold to the project and the effect we are searching for. After all, the way the value of 0.05 was chosen was entirely arbitrary.
  2. Underpowered/overpowered studies. The power of a statistical significance test is the probability that the test will correctly reject the null hypothesis H0, given that H0 is false. It is a valuable piece of information when doing statistical significance testing, yet it is often omitted from research analyses (as has been shown in this paper). For example, an underpowered study may fail to detect effects of practical importance, while an overpowered study can flag as significant effects that are too small to matter. Such mistakes often go unnoticed because of what is commonly called the file drawer problem: when a study does not yield significance (simply because of that binary p-value threshold), it gets forgotten in a drawer and is never published.
  3. Small probabilities and risk ratio. “A study provides new evidence that drug A increases the risk of illness B two times.” Sounds pretty scary, no? But what does that mean in practice? What if the risk of illness B is a very small number, for example, 1 in a million people succumb to B? Multiplying that number by 2 gives a risk of 2 in a million, which you can still easily ignore (for all practical purposes). The consequence of not taking drug A, on the other hand, might be much more damaging than that ignorable risk. This is the case in practice more often than one might think. People rarely think about small numbers and ratios, so it does not occur to us that multiplying something tiny by 2, 5 or even 20 does not make much of a difference in a practical sense. If all you had was 1 cent, and somebody promised to give you 20 times more money than what you have at the moment, would you feel fortunate?
  4. The connotation the word “significance” carries. This one is more of a philosophical conundrum than a mathematical one, but many argue that calling something significant in the colloquial sense has quite a different meaning from the one used in statistics. Hence, the word itself is misleading and sounds unnecessarily threatening. Just because the significance test can detect an effect doesn’t mean the effect practically matters, but the word “significant” draws our attention to it.
  5. The null hypothesis is always false. H0 assumes that some intervention will make no difference. And that is true… never? Often, maybe even always, some difference will appear given enough data. And with a little bit of tweaking of the analysis, which is not an uncommon practice (it even has a name of its own: p-hacking), one can stretch the truth a bit in order to obtain a more appealing, and ultimately publishable, story.
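
Two of the pitfalls above (items 3 and 5) come down to simple arithmetic, so a small Python sketch may help. All numbers are hypothetical and chosen only for illustration: a 1-in-a-million baseline risk that is doubled, and a difference of half a percentage point between two groups (50.5% versus 50.0%) tested at two sample sizes. The helper names `absolute_risk_increase` and `two_proportion_p_value` are made up for this example, and the test used is a standard pooled two-proportion z-test with a normal approximation, picked here for simplicity rather than taken from any of the studies mentioned.

```python
from math import erfc, sqrt


def absolute_risk_increase(baseline_risk: float, risk_ratio: float) -> float:
    """Absolute change in risk when the baseline risk is multiplied by risk_ratio."""
    return baseline_risk * risk_ratio - baseline_risk


def two_proportion_p_value(x1: int, n1: int, x2: int, n2: int) -> float:
    """Two-sided p-value of a pooled two-proportion z-test (normal approximation)."""
    pooled = (x1 + x2) / (n1 + n2)
    std_err = sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (x1 / n1 - x2 / n2) / std_err
    # For a standard normal Z, P(|Z| >= |z|) equals erfc(|z| / sqrt(2)).
    return erfc(abs(z) / sqrt(2))


# Item 3: doubling a hypothetical 1-in-a-million risk.
extra_risk = absolute_risk_increase(baseline_risk=1e-6, risk_ratio=2)
print(f"extra cases per million people: {extra_risk * 1e6:.1f}")

# Item 5: the same hypothetical difference (50.5% vs 50.0%),
# tested with two very different sample sizes per group.
p_small = two_proportion_p_value(505, 1_000, 500, 1_000)
p_large = two_proportion_p_value(505_000, 1_000_000, 500_000, 1_000_000)
print(f"n = 1,000 per group:     p = {p_small:.3g}")
print(f"n = 1,000,000 per group: p = {p_large:.3g}")
```

The doubled risk amounts to one extra case per million people. In the second part the size of the effect is identical in both runs; only the amount of data changes. With a thousand people per group the difference is indistinguishable from noise at the 0.05 threshold, while with a million per group it becomes overwhelmingly "significant", even though its practical importance is exactly the same.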