Let’s imagine a randomized controlled drug trial where we take 100 patients, randomly split them into two groups (50 people each), give one group a placebo, give the other group the drug, then record how many of them develop a particular disease over the next month. In the control group, 20% of patients (10 individuals) develop the disease, whereas in the treatment group only 10% (5 patients) developed it. Does the drug work?

This is where the confusion comes in. Many people would look at those results and say, “obviously it helped, because 10% is lower than 20%”. When we do a statistical test on it (in this case a chi-square test), however, we find that it is not statistically significant (P = 0.263), from which we should conclude that this study failed to find evidence that the drug prevents the disease. You may be wondering how that is possible. How can we say that taking the drug doesn’t result in an improvement when there was clearly a difference between our groups? How can 10% not be different from 20%?

To understand this, you need to understand the difference between a population and sample and the reason that we do these tests. This hypothetical experiment did find a difference between the groups for the individuals in the study. In other words, the treatment and control groups were different in this sample, but that’s not very useful. What we really want to know is whether or not this result can be generalized. We really want to know whether, in general, for the entire population, taking the drug will reduce your odds of getting the disease.

To elaborate on that, in statistics, we are interested in the population mean (or percentage). This may be a literal population of people (as in my example) but it applies more generally, and is simply the distribution of data from which a sample was taken. The only way to actually know the population mean (or percentage) is to test the entire population, but that is clearly not possible. So instead, we take a sample or subset of the population, and test it, then apply that result to the population. So, in our example, those 100 people are our sample, and the percentages we observed (10% and 20%) are our sample percentages.

I know this is starting to get complicated, but we are almost there, so bear with me. Now that we have sample percentages we want to know how confident we can be that they accurately represent the population percentages. This is where statistics come in. We need to know how likely it is that we could get a result like ours or greater if there is no effect of the drug, and that’s precisely what statistical tests do. They take a data set and look at things like the mean (or proportion), the sample size, and the variation in the data, and they determine how likely it is that a result as great or greater than the one that was observed could have arisen if there is no effect of treatment. In other words, they assume that the treatment (drug in this case) does not do anything (i.e., they assume that all results are from chance), then they see that how likely it is that a result as great or greater than the observed result could be observed given the assumption that all results are due to chance. Sample size becomes important here, because the larger the sample size, the more confident we can be that a sample result reflects the true population value.

So, in our case, we got P = 0.263. What that means is that if the drug doesn’t do anything, there is still a 26.3% chance of getting a result as great or greater than ours. In other words, even if the drug doesn’t work, there is a really good chance that we’d get the type of difference we observed (10% and 20%). Thus, we cannot be confident that our results were not from chance variation, and we cannot confidently apply those percentages to entire population.

Having said that, let’s see what happens if we increase the sample size. Imagine we have 1,000 people, but still get 10% for the treatment group and 20% for the control group. Now we get a highly significant result of P = 0.00001. In other words, if the drug doesn’t do anything, there is only a 0.001% chance of getting a difference as great or greater than the one we observed. Why? Well, quite simply, the larger the sample the more representative it is of the population, and the less likely we are to get spurious results. From this, we’d conclude that the drug probably does have an effect.

Another useful way to think about this is to imagine a jar full of red and blue marbles (this is your population). You want to know if there are more of one color than the other, so you reach in and randomly grab several (this is your sample). Suppose you get more blue than red, can you conclude that there are more blue marbles than red marbles in the jar (population)? This clearly depends on the size of your sample. The larger it is, the more confident you can be that your sample represents the population.

To try to illustrate all of this, imagine flipping a coin. You want to know if a coin is biased, so you flip it 10 times and get 4 heads and 6 tails. That is your sample: 40% heads. Now, is the coin biased? In other words, if you flipped the coin 10,000 times, would you expect, based on your sample, that you’d get roughly 40% heads? How confident are you that your sample result applies to the population? You probably aren’t very confident. We all intuitively know that it is entirely possible for even a totally fair coin to give 4 heads and 6 tails in a mere 10 flips. No one would scream, “but in your test the coin was biased!” We all realize that the sample may not be representative of the population.

Now, however, imagine that you flip it 100 times and get 40 heads and 60 tails. This is your new sample. Now how confident are you? Probably more confident than before, but also probably not that confident. Again, we all realize that there is chance variation. Indeed, if we ran that actual stats on this, we’d get P = 0.2008. In other words, this test says, “assuming that the coin is not biased, there is a 20.08% chance of getting a difference as great or greater than a 40%/60% split,” but what if we did 1,000 flips and got 400 heads and 600 tails. Now, we’d probably think that the coin truly was biased. At that point, we’d expect that our sample probably does apply to the population and continuing to flip the coin will continue to yield percentages of roughly 40% and 60%. If we actually run the stats on that, our conclusion would be justified. The P values is less than 0.00001, meaning that if the coin is not biased, there is less than a 0.001% chance of getting a result as great or greater than ours. This would be good evidence that the coin itself (the population) is likely biased, and our results are unlikely to be from chance variation in our sample.

That is, in a nutshell, what statistical tests (at least frequentists statistical tests) are doing, and we only consider something to be “statistically significant” when the probability that a result like it (or greater) could arise (given the assumption that there is no effect of treatment) is below some pre-defined threshold. In most fields, that threshold is P = 0.05. In other words, we only conclude that the sample results apply to the population if there is less than a 5% chance of getting a result as great or greater than the one we observed if the thing being tested actually has no effect.

Note: An important topic not covered here is confidence intervals. These show you the range of possible population values for a given sample value. So, for example, if you had a mean of 20 and a 95% confidence interval of 10-30, that would mean that you can be 95% sure that the population mean is between 10 and 30.

Note: This post was revised to change statements that a P value showed the probability that a result as great or greater than the one that was observed could arise by chance to statements that the P value showed the probability that a result as great or greater than the one that was observed could arise if there is no effect of the the treatment.

## No comments:

## Post a Comment