Monday, December 12, 2022

Simpson’s Paradox when analyzing data and making decisions

When we want to study relationships in data (e.g., in observations of the world), we can plot, cross-tabulate, or model that data. In doing so, we may find that two different views of the same dataset lead us to opposing conclusions. These are cases of Simpson’s Paradox.

Finding these cases can help us understand our data better and discover interesting relationships. This article gives some examples of where these cases happen, discusses how and why they happen, and suggests ways to automatically detect these situations in your own data.

Simpson’s Paradox refers to a situation where you believe you understand the direction of a relationship between two variables, but when you consider an additional variable, that direction appears to reverse.
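To make the reversal concrete, here is a minimal sketch in Python. The counts are invented purely for illustration (they are not taken from any study), and the names `counts`, `rate`, and `overall_rate` are hypothetical. Option A has the higher success rate within each subgroup, yet option B has the higher rate once the subgroups are pooled.

```python
# Hypothetical counts, invented for illustration only.
# Each entry is (successes, total) for one option within one subgroup.
counts = {
    "A": {"mild": (18, 20), "severe": (48, 80)},
    "B": {"mild": (64, 80), "severe": (10, 20)},
}


def rate(successes, total):
    """Success rate within a single subgroup."""
    return successes / total


def overall_rate(by_group):
    """Success rate after pooling all subgroups of one option."""
    successes = sum(s for s, _ in by_group.values())
    total = sum(n for _, n in by_group.values())
    return successes / total


for option, by_group in counts.items():
    per_group = {g: round(rate(s, n), 2) for g, (s, n) in by_group.items()}
    print(option, per_group, "overall:", round(overall_rate(by_group), 2))
```

Within "mild" and within "severe", A's rate is higher than B's, but the pooled rate favors B. Which view to act on depends on what the additional variable represents, so the code only exposes the disagreement; it does not resolve it.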


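Simpson’s Paradox happens because the subgroups revealed by disaggregating the data (e.g., splitting it by an additional variable) can be unevenly represented across the categories being compared. An aggregate rate is a weighted average of the subgroup rates, so when those weights differ between categories, the aggregate comparison can point the opposite way from every subgroup comparison. The imbalance might be due to a real relationship between the variables, or simply due to the way the data has been partitioned into subgroups.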
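The role of imbalance can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions, not a definitive detection method: the counts are invented, and the helpers `shares`, `overall_rate`, and `is_reversal` are hypothetical names. It shows how unevenly each option's observations are spread across subgroups, then flags the case where every subgroup comparison points one way and the pooled comparison points the other.

```python
# Hypothetical counts, invented for illustration only: (successes, total).
option_a = {"mild": (18, 20), "severe": (48, 80)}
option_b = {"mild": (64, 80), "severe": (10, 20)}


def shares(by_group):
    """Fraction of an option's observations that fall in each subgroup."""
    total = sum(n for _, n in by_group.values())
    return {g: n / total for g, (_, n) in by_group.items()}


def overall_rate(by_group):
    """Pooled success rate across all subgroups."""
    successes = sum(s for s, _ in by_group.values())
    total = sum(n for _, n in by_group.values())
    return successes / total


def is_reversal(a, b):
    """True if a beats b in every subgroup but loses overall, or vice versa.

    Assumes a and b share the same subgroup keys and no subgroup is empty.
    """
    subgroup_diffs = [a[g][0] / a[g][1] - b[g][0] / b[g][1] for g in a]
    pooled_diff = overall_rate(a) - overall_rate(b)
    all_higher = all(d > 0 for d in subgroup_diffs)
    all_lower = all(d < 0 for d in subgroup_diffs)
    return (all_higher and pooled_diff < 0) or (all_lower and pooled_diff > 0)


print("A shares:", shares(option_a))  # mostly "severe" cases
print("B shares:", shares(option_b))  # mostly "mild" cases
print("reversal:", is_reversal(option_a, option_b))
```

Here most of A's observations sit in the harder "severe" subgroup while most of B's sit in "mild", which drags A's pooled rate down even though A does better within each subgroup. A check like `is_reversal` could be run over each candidate splitting variable in a dataset, though a strict all-subgroups test like this one will miss partial reversals.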

Here is an example: scientific evidence shows that unvaccinated people are more likely to develop severe COVID-19 and die, contrary to claims in viral social media posts about data from Germany.