Hundreds of Hypotheses

Say you are testing hundreds of hypotheses, each with a t-test. What considerations would you take into account when doing this?

This is the same question as problem #10 in the Statistics Chapter of Ace the Data Science Interview!

The primary consideration is that, as the number of tests increases, the chance that at least one t-test yields a statistically significant p-value by chance alone becomes very high. As an example, with 100 tests performed and a significance threshold of α = 0.05, you would expect 5 of the tests to come out statistically significant purely by chance, even if every null hypothesis were true.

That is, you have a very high probability of observing at least one significant outcome. Therefore, the chance of incorrectly rejecting a null hypothesis (i.e., committing Type I error) increases.
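To make this concrete, the probability of at least one false positive across *m* independent tests is 1 − (1 − *α*)^*m*. A quick sketch, using the α = 0.05 and 100 tests from the example above:

```python
# Probability of at least one false positive (family-wise error rate)
# across m independent tests, each run at significance level alpha.
alpha = 0.05
m = 100

p_at_least_one = 1 - (1 - alpha) ** m
print(f"P(at least one false positive) = {p_at_least_one:.3f}")  # ≈ 0.994
```

So with 100 tests there is roughly a 99.4% chance of at least one spurious "significant" result.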

To correct for this effect, we can use a method called the Bonferroni correction, wherein we set the significance threshold to *α*/*m*, where *m* is the number of tests being performed. In the above scenario with 100 tests, we can set the significance threshold to instead be 0.05/100 = 0.0005.
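A minimal sketch of applying the correction to a set of p-values (the p-values here are made up purely for illustration):

```python
# Bonferroni correction: compare each p-value against alpha / m
# instead of alpha. The p-values below are hypothetical.
alpha = 0.05
p_values = [0.0001, 0.003, 0.02, 0.04, 0.20]
m = len(p_values)

bonferroni_alpha = alpha / m  # 0.05 / 5 = 0.01
significant = [p for p in p_values if p < bonferroni_alpha]
print(significant)  # [0.0001, 0.003]
```

Note that 0.02 and 0.04 would have passed the uncorrected α = 0.05 threshold but no longer pass the corrected one.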

While this correction helps to protect against Type I errors, the tests become more prone to Type II errors (i.e., failing to reject the null hypothesis when it should be rejected), since the stricter threshold also makes genuine effects harder to detect. In general, the Bonferroni correction is most useful when there is a small number of comparisons, of which only a few are expected to be significant. If the number of tests becomes sufficiently high that many tests yield statistically significant results, the number of Type II errors may also increase significantly.
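The Type I error control can be checked by simulation. The sketch below (all parameters are illustrative) generates p-values under a pure null, where p-values are uniform on [0, 1], and estimates how often at least one test is falsely declared significant with and without the Bonferroni correction:

```python
import random

random.seed(0)

def fwer(m, threshold, trials=2000):
    """Estimate the family-wise error rate: the fraction of trials in
    which at least one of m true-null tests falls below the threshold."""
    errors = 0
    for _ in range(trials):
        # Under the null hypothesis, p-values are uniform on [0, 1].
        if any(random.random() < threshold for _ in range(m)):
            errors += 1
    return errors / trials

m, alpha = 100, 0.05
print(f"Uncorrected FWER: {fwer(m, alpha):.2f}")      # close to 1
print(f"Bonferroni FWER:  {fwer(m, alpha / m):.2f}")  # close to alpha
```

The uncorrected rate is near certainty, while the Bonferroni-corrected rate stays near the nominal α of 0.05, at the cost of reduced power for each individual test.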