Stats for Experiments: the T-Test, Welch's T-Test, and the Paired T-Test

While conducting research comparing different loss functions for machine learning models, I picked up some basic statistics lessons that might be useful to others.

For people who haven't run experiments in a while, this may be a quick and useful refresher. If you are trying to prove that your special technique, architecture, or hyperparameter delivers better results, it's not good enough to find one example of superior performance and declare victory. What if you just picked a lucky example? What you generally want to do is run a large number of experiments with your modification, with the sample data drawn either independently from a larger population or in a very repeatable, simply explainable way.

For example, I did research on imbalanced classes. MNIST is a classic set of 70,000 handwritten digits, roughly 7,000 samples per digit. My approach was to build 100 data sets. Each data set takes one digit, say "0", labels 700 of those samples True, and throws the other roughly 6,300 copies of that digit away; the remaining 63,000 digits are labeled False. The next data set takes a different 700 "0" digits, and so on: ten disjoint sets per digit, times ten digits, gives 100 data sets. In this way, I have 100 experiments to run, and the construction is very explainable - it doesn't sound like I found some contorted way to pick 10 special samples (contorted data-generation schemes make readers suspect you cherrypicked data that flatters your approach). (Note that the above is slightly simplified, because MNIST doesn't have exactly 7,000 samples per digit; and I end up rebuilding and retraining models 200 times, a control and a test model on each of the 100 data sets.)
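As a rough sketch of how such a construction might look (the build_datasets helper, the variable names, and the use of NumPy here are illustrative assumptions, not an exact recipe):

import numpy as np

# Hypothetical sketch: build 100 one-vs-rest data sets from the MNIST labels.
# `labels` is assumed to be an array of the 70,000 digit labels (0-9).
def build_datasets(labels, per_set=700, sets_per_digit=10, seed=0):
    rng = np.random.default_rng(seed)
    datasets = []
    for digit in range(10):
        digit_idx = rng.permutation(np.where(labels == digit)[0])
        other_idx = np.where(labels != digit)[0]
        for s in range(sets_per_digit):
            # 700 copies of the digit become the True class; the rest of that
            # digit's copies are discarded; every other digit is the False class.
            true_idx = digit_idx[s * per_set:(s + 1) * per_set]
            indices = np.concatenate([true_idx, other_idx])
            targets = np.concatenate([np.ones(len(true_idx), dtype=bool),
                                      np.zeros(len(other_idx), dtype=bool)])
            datasets.append((indices, targets))
    return datasets  # 10 digits x 10 disjoint sets each = 100 data sets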

Let's say you have a method that you think increases accuracy. What you should do is run a control model against your 100 data sets, then your test (or experimental) model against the same 100 data sets. Then report the difference in mean accuracy and the number of samples (and maybe some other fun stats like the median, max, min, variance, standard deviation, skew, and kurtosis).
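For instance, if control_acc and test_acc hold the 100 accuracies from the control and test runs (the names and the randomly generated stand-in data below are mine, purely for illustration), the summary might be computed like this:

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Stand-in data; in practice these are the accuracies from your 100 runs.
control_acc = rng.normal(0.950, 0.005, size=100)
test_acc = rng.normal(0.951, 0.005, size=100)

print("n =", len(control_acc))
print("mean difference =", test_acc.mean() - control_acc.mean())
for name, acc in [("control", control_acc), ("test", test_acc)]:
    print(name, "mean", acc.mean(), "median", np.median(acc),
          "min", acc.min(), "max", acc.max(),
          "std", acc.std(ddof=1),
          "skew", stats.skew(acc), "kurtosis", stats.kurtosis(acc))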

But you aren't done yet. What if your control accuracy is 95% and your test accuracy is 95.1%? Are you sure your approach is better, or is it just noise? If the accuracies of individual runs range between, say, 90% and 100%, it's very likely your “improvement” is just noise. But if your control accuracies fall between 94.99% and 95.01%, and your test accuracies fall between 95.09% and 95.11%, then you are in really good shape. How do we formalize this?

The next step is to report a p-value. But there are a couple of ways to generate a p-value.

Student's T-Test

The first method is the Student's T-Test. In Python:

scipy.stats.ttest_ind

This function returns the t-test statistic, which you can throw away or report in your final paper, and a p-value. The p-value has a long and storied history, going back to the battles between Fisher's frequentist school and the Bayesians over how statistical inference should be done, and it's generally misunderstood. In most natural sciences, a p-value of less than 0.05 is considered good. It means that if the control and experimental means were truly the same, there would be less than a 5% chance of seeing a difference at least as large as the one you observed. Loosely speaking, the smaller the p-value, the stronger the evidence that you found a statistically meaningful difference. But in computer science, it's usually cheap to run more and more samples... which, if your effect is real, drives the p-value down. So you should endeavor to run a lot of experiments and try to get a p-value that's ridiculously low, like 0.000001.
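A minimal sketch of the call, again with randomly generated stand-in accuracies in place of real experimental results:

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
control_acc = rng.normal(0.950, 0.005, size=100)  # stand-in control accuracies
test_acc = rng.normal(0.951, 0.005, size=100)     # stand-in test accuracies

# ttest_ind returns the t statistic and the two-sided p-value.
t_stat, p_value = stats.ttest_ind(test_acc, control_acc)
print(f"t = {t_stat:.3f}, p = {p_value:.4f}")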

Welch's T-Test

The generic SciPy test assumes your samples are normally distributed and have the same variance. The equal-variance assumption is often not a good one. What you should do is set the flag that tells SciPy the two samples may have different variances. This is Welch's T-Test. If your variances really are equal, the result will be essentially the same as Student's T-Test, so you should always use Welch's T-Test. Use it like this:

scipy.stats.ttest_ind(a, b, axis=0, equal_var=False)

Note that if your distribution is non-normal, some resources online say Welch's t-test is robust, while others say it requires a normal distribution. So you'll have to do your own research into whether this is an appropriate test for your data.
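Here is a sketch of the two calls side by side, on stand-in data where the two samples deliberately have different spreads (the numbers are made up; only the equal_var flag differs between the calls):

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
control_acc = rng.normal(0.950, 0.002, size=100)  # tight spread
test_acc = rng.normal(0.952, 0.010, size=100)     # much wider spread

student = stats.ttest_ind(test_acc, control_acc)                  # assumes equal variances
welch = stats.ttest_ind(test_acc, control_acc, equal_var=False)   # Welch's t-test
print("Student's p =", student.pvalue)
print("Welch's p   =", welch.pvalue)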

Paired T-Test

Unfortunately, the story isn't over. Fortunately, this third and final approach will generally deliver better p-values (and for valid reasons). When comparing your control and experimental neural networks, you have additional information: the 100 control and 100 test data points are paired up, one pair per data set. Running Welch's test, which assumes the two sets are independent random samples, throws that pairing away and is unnecessarily conservative. Suppose the test accuracy is exactly 0.1% higher than the control accuracy on every one of the 100 data sets. That's clearly pretty good! But what if the accuracies themselves range between 10% and 90%? If you don't tell SciPy that the 200 samples are really 100 control/test pairs, it won't factor that information in, and it will conclude your 0.1% is merely noise.

What the paired T-Test does is analyze the per-pair differences, looking at the mean and variance of those differences rather than the variance of the two samples directly. Use it like this:

scipy.stats.ttest_rel(a, b, axis=0)
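Here is a sketch of the scenario above, with made-up numbers: the per-data-set accuracies swing wildly, but each test run is consistently about 0.1% better than its paired control run. The unpaired Welch's test sees only the huge spread and calls the difference noise, while the paired test detects the consistent improvement. The last two lines compute the paired statistic by hand from the per-pair differences.

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Made-up data: accuracies vary wildly across data sets (easy vs. hard splits),
# but each test run is ~0.1% better than its paired control run.
control_acc = rng.uniform(0.10, 0.90, size=100)
test_acc = control_acc + 0.001 + rng.normal(0, 0.0002, size=100)

print("unpaired (Welch) p =", stats.ttest_ind(test_acc, control_acc, equal_var=False).pvalue)
print("paired p           =", stats.ttest_rel(test_acc, control_acc).pvalue)

# The paired test only looks at the per-pair differences d = test - control:
d = test_acc - control_acc
print("hand-computed t    =", d.mean() / (d.std(ddof=1) / np.sqrt(len(d))))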

Conclusion

You are now ready to report statistically significant results in a rigorous way that academic journals will want to see. Happy Experimenting!