
Thursday, March 4, 2021

What does “statistically significant” mean?



In recent times, the internet has been abuzz with people sharing research related to numerous aspects of COVID. While this is potentially beneficial, it's evident that many of them lack a solid grasp of basic statistics, especially of the term "statistically significant." In a nutshell, "statistically significant" means that there is a low likelihood that the observed result, or a greater one, could occur in the absence of the factor being tested. Statistical tests help us judge whether a pattern found in a study's sample can reasonably be generalized to the entire population from which the sample was drawn.


Consider an example of a randomized controlled drug trial involving 100 patients. They are divided into two groups of 50, one receiving a placebo and the other the drug, and the number of patients who develop a specific disease over the following month is recorded. In the control group, 20% (10 individuals) contract the disease, while in the treatment group, only 10% (5 patients) do. So, does the drug work?


This is where the confusion often lies. Many would argue that the drug is effective because 10% is lower than 20%. But a statistical test (a chi-square test, in this case) reveals that this result is not statistically significant (P = 0.263), meaning that this study does not provide evidence that the drug prevents the disease. That might seem odd. How can we say the drug doesn't improve outcomes when there was an apparent difference between the groups? How can 10% not be different from 20%?
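
If you'd like to check this result yourself, here is a minimal sketch in Python using scipy (an assumption on my part; the post doesn't say what software produced its numbers). By default, chi2_contingency applies Yates's continuity correction to 2x2 tables, which reproduces the P = 0.263 quoted above.

```python
from scipy.stats import chi2_contingency

# Rows: control, treatment. Columns: got the disease, stayed healthy.
table = [[10, 40],   # control: 10 of 50 got the disease (20%)
         [5, 45]]    # treatment: 5 of 50 got the disease (10%)

chi2, p, dof, expected = chi2_contingency(table)  # Yates correction on by default
print(f"chi-square = {chi2:.3f}, P = {p:.3f}")    # P = 0.263, not significant
```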


To understand this, it's crucial to grasp the distinction between a population and a sample, and why we conduct these tests at all. This hypothetical experiment did show a difference between the groups for the individuals studied: the treatment and control groups differed in this sample. But that's not the big picture. What we actually want to know is whether the result is generalizable: whether, across the entire population, taking the drug reduces a person's chances of contracting the disease.



To elaborate on that, in statistics, we are interested in the population mean (or percentage). This may be a literal population of people (as in my example) but it applies more generally, and is simply the distribution of data from which a sample was taken. The only way to actually know the population mean (or percentage) is to test the entire population, but that is clearly not possible. So instead, we take a sample or subset of the population, and test it, then apply that result to the population. So, in our example, those 100 people are our sample, and the percentages we observed (10% and 20%) are our sample percentages.


I know this is starting to get complicated, but we are almost there, so bear with me. Now that we have sample percentages, we want to know how confident we can be that they accurately represent the population percentages. This is where statistics come in. We need to know how likely it is that we could get a result like ours, or a greater one, if the drug has no effect, and that's precisely what statistical tests measure. They take a data set and look at things like the mean (or proportion), the sample size, and the variation in the data, and they determine how likely it is that a result as great or greater than the observed one could have arisen if the treatment does nothing. In other words, they assume that the treatment (the drug, in this case) has no effect, so that any difference between groups is due to chance, and then calculate how likely it is that chance alone would produce a result at least as large as the one observed. Sample size becomes important here, because the larger the sample, the more confident we can be that a sample result reflects the true population value.
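
To make this concrete, here is a small simulation sketch in Python (my own illustration, not from the original post; the number of repetitions is arbitrary). We pretend the drug does nothing, so both groups share the study's pooled disease rate of 15% (15 cases out of 100 people), and we count how often random sampling alone produces a gap of 10 percentage points or more between two groups of 50.

```python
import numpy as np

rng = np.random.default_rng(42)
n_per_group, null_rate, trials = 50, 0.15, 100_000  # pooled rate: 15 cases / 100 people

# Simulate both groups under the assumption that the drug does nothing.
control = rng.binomial(n_per_group, null_rate, trials) / n_per_group
treated = rng.binomial(n_per_group, null_rate, trials) / n_per_group

# How often does chance alone produce a gap as large as the observed 10 points?
prob = np.mean(np.abs(control - treated) >= 0.10)
print(f"Chance of a gap >= 10 points under no effect: {prob:.3f}")  # roughly 0.2
```

The estimate lands around 0.2, slightly below the quoted 0.263 because the chi-square test above applies a continuity correction, but the lesson is the same: a 10-point gap arises quite easily from chance alone.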


So, in our case, we got P = 0.263. What that means is that if the drug doesn't do anything, there is still a 26.3% chance of getting a result as great or greater than ours. In other words, even if the drug doesn't work, there is a really good chance that we'd see the kind of difference we observed (10% vs. 20%). Thus, we cannot be confident that our results were not due to chance variation, and we cannot confidently apply those percentages to the entire population.


Having said that, let's see what happens if we increase the sample size. Imagine we have 1,000 people (500 per group), but still get 10% for the treatment group and 20% for the control group. Now we get a highly significant result of P = 0.00001. In other words, if the drug doesn't do anything, there is only a 0.001% chance of getting a difference as great or greater than the one we observed. Why? Quite simply, the larger the sample, the more representative it is of the population, and the less likely we are to get spurious results. From this, we'd conclude that the drug probably does have an effect.
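
Scaling the earlier sketch up to 500 people per group (same scipy-based assumption as before) shows the effect of sample size directly:

```python
from scipy.stats import chi2_contingency

# Same percentages as before, but 500 people per group instead of 50.
table = [[100, 400],   # control: 100 of 500 got the disease (20%)
         [50, 450]]    # treatment: 50 of 500 got the disease (10%)

chi2, p, dof, expected = chi2_contingency(table)
print(f"chi-square = {chi2:.2f}, P = {p:.6f}")  # P = 0.000014, i.e. about 0.00001
```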



Another useful way to think about this is to imagine a jar full of red and blue marbles (this is your population). You want to know if there are more of one color than the other, so you reach in and randomly grab several (this is your sample). Suppose you get more blue than red. Can you conclude that there are more blue marbles than red marbles in the jar (the population)? That clearly depends on the size of your sample. The larger it is, the more confident you can be that your sample represents the population.
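
Here is a quick sketch of the marble intuition (the 50/50 jar and the handful sizes are hypothetical numbers of my choosing): if the jar actually holds equal numbers of each color, small handfuls are frequently lopsided, while large ones almost never are.

```python
import numpy as np

rng = np.random.default_rng(0)
trials = 100_000

for handful in (10, 200):  # a small vs. a large sample from a 50/50 jar
    blue_fraction = rng.binomial(handful, 0.5, trials) / handful
    # How often does a 50/50 jar give a handful that is 70% or more one color?
    misleading = np.mean((blue_fraction >= 0.7) | (blue_fraction <= 0.3))
    print(f"n={handful}: P(>=70% one color) = {misleading:.4f}")
```

With handfuls of 10, roughly a third of draws come back at least 70% one color; with handfuls of 200, that essentially never happens.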


Consider the scenario of tossing a coin. To determine whether the coin is biased, imagine flipping it 10 times and getting 4 heads and 6 tails. Those 10 flips are your sample, and 40% heads is your sample percentage. The question now is whether the coin is biased. If you flipped it 10,000 more times, would you expect a similar 40% heads outcome? How certain are you that your sample reflects the whole population?


In another experiment, flip the coin 100 times and record 40 heads and 60 tails. This larger sample might make you a bit more confident that the coin is biased, but you should still acknowledge the possibility of chance variation. An exact binomial test yields a P-value of roughly 0.057, implying about a 5.7% likelihood of observing a split at least as lopsided as 40/60 if the coin isn't biased. That's suggestive, but still above the conventional 0.05 cutoff.


Now, imagine flipping the coin 1,000 times and getting 400 heads and 600 tails. With this many flips, you'd have good reason to believe the coin is biased. The sample very probably mirrors the population, and further flips would likely yield a similar 40/60 ratio. Running the test, we find a P-value of less than 0.00001, indicating a less than 0.001% chance of obtaining a result as great or greater if the coin isn't biased. This suggests the coin (the population) really is biased, and that our results aren't due to sampling variation.
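
Here is a sketch of all three coin experiments (the choice of scipy's exact binomial test is my assumption; the post doesn't say which test produced its numbers, but this one matches the values discussed above):

```python
from scipy.stats import binomtest

# 40% heads at three different sample sizes
for heads, flips in [(4, 10), (40, 100), (400, 1000)]:
    result = binomtest(heads, flips, p=0.5)  # two-sided by default
    print(f"{heads}/{flips} heads: P = {result.pvalue:.3g}")

# 4/10:     P = 0.754    -- no evidence of bias at all
# 40/100:   P = 0.0569   -- suggestive, but not below 0.05
# 400/1000: P < 0.00001  -- very strong evidence the coin is biased
```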


This illustrates what frequentist statistical tests aim to do. We deem a result "statistically significant" when the probability of obtaining a result as great or greater than the observed one, assuming no treatment effect, is below a pre-set threshold, usually P = 0.05.


Note: This discussion doesn't touch on confidence intervals, a closely related concept showing the range of plausible population values for a sample value. If you have a sample mean of 20 with a 95% confidence interval of 10-30, then the procedure that built the interval captures the true population mean 95% of the time; informally, you can be fairly confident the population mean lies somewhere between 10 and 30.
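
As a quick illustration (my own, applying the standard normal-approximation formula for a proportion's 95% confidence interval to the drug trial's treatment group):

```python
import math

successes, n = 5, 50      # treatment group: 5 of 50 got the disease
p_hat = successes / n     # sample proportion: 0.10
se = math.sqrt(p_hat * (1 - p_hat) / n)  # standard error of the proportion
low, high = p_hat - 1.96 * se, p_hat + 1.96 * se
print(f"10% observed; 95% CI roughly {low:.1%} to {high:.1%}")  # about 1.7% to 18.3%
```

The interval is wide because the sample is small; with 500 patients per group, the same 10% result would carry an interval of roughly 7.4% to 12.6%.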


This post was revised to clarify that the P-value indicates the likelihood of a similar or greater outcome, assuming no treatment effect.

