▮ Leakage
Data leakage occurs when some form of the label “leaks” into the set of features used for making predictions, even though that information is not available at inference time in production.
Data leakage is hard to detect and can cause ML models to perform well during the evaluation phase but fail miserably in production.
So for this post, I’d like to share the common causes of data leakage and how to detect them.
▮ Common Causes
Mistake 1: Scaling before splitting
When you scale your inputs, you often use global statistics such as mean and variance.
Scaling before splitting means that the mean and variance are calculated using data from the test set as well, even though the test set should stay unseen.
The evaluation score then looks better than what you will actually get in production. Always split your data first, then compute the scaling statistics on the train split only and apply them to the other splits.
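As a minimal sketch of the split-then-scale order, the snippet below uses only the Python standard library. The helper names `fit_scaler` and `apply_scaler` and the synthetic data are assumptions made for illustration; in practice you would likely use the fit/transform pattern of a library scaler in the same order.

```python
import random
import statistics

def fit_scaler(values):
    """Compute mean and standard deviation from the given (training) values."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    # Guard against a constant column to avoid division by zero
    return mean, (std if std > 0 else 1.0)

def apply_scaler(values, mean, std):
    """Standardize values with statistics that were computed elsewhere."""
    return [(v - mean) / std for v in values]

random.seed(0)
data = [random.gauss(50, 10) for _ in range(100)]  # synthetic feature column

# 1. Split first
split = int(0.8 * len(data))
train, test = data[:split], data[split:]

# 2. Fit the statistics on the train split only
mean, std = fit_scaler(train)

# 3. Reuse the same statistics for both splits
train_scaled = apply_scaler(train, mean, std)
test_scaled = apply_scaler(test, mean, std)
```

The test split is never passed to `fit_scaler`, so nothing about its distribution reaches the training features.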
Mistake 2: Filling in missing data with statistics from the test split
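As with mistake 1, filling in missing values with a mean or median computed over the whole dataset lets test-set information leak into training. Compute the fill values from the train split only and reuse them for the other splits.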
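The same idea applies to imputation. The sketch below is a hedged illustration with hypothetical helper names (`fit_fill_value`, `impute`) and made-up values: the median is taken from the train split and reused for the test split.

```python
import statistics

def fit_fill_value(train_column):
    """Median of the observed (non-missing) values in the train split."""
    observed = [v for v in train_column if v is not None]
    return statistics.median(observed)

def impute(column, fill_value):
    """Replace missing entries (None) with a precomputed fill value."""
    return [fill_value if v is None else v for v in column]

# Hypothetical "age" column, already split into train and test
train_age = [22, None, 35, 41, None, 29]
test_age = [None, 80, 75]

fill = fit_fill_value(train_age)        # computed from train only
train_filled = impute(train_age, fill)
test_filled = impute(test_age, fill)    # test reuses the train median
```

Note that the test split here has a very different age range, and the fill value deliberately ignores it.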
Mistake 3: Poor data duplication handling
If you have duplicates in your data and fail to remove them, the same sample may end up in both the train and the test split.
Check for duplicates and remove them before splitting the data.
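A minimal sketch of deduplicating before splitting might look like the following. It only catches exact duplicates (near-duplicates need a similarity check), and the function names and toy rows are assumptions for illustration.

```python
import random

def drop_duplicates(rows):
    """Keep the first occurrence of each row, preserving order."""
    seen = set()
    unique = []
    for row in rows:
        key = tuple(row)
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique

def train_test_split(rows, test_fraction=0.25, seed=0):
    """Shuffle a copy of the rows and cut off a test fraction."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

# Toy rows of (feature_1, feature_2, label) with two exact duplicates
rows = [
    (1.0, "a", 0), (2.0, "b", 1), (1.0, "a", 0),
    (3.0, "c", 1), (2.0, "b", 1), (4.0, "d", 0),
]

unique_rows = drop_duplicates(rows)          # deduplicate first
train, test = train_test_split(unique_rows)  # then split

# Sanity check: no row appears in both splits
overlap = set(train) & set(test)
```

Checking `overlap` after the split is a cheap safeguard even when you believe the data is already clean.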
Mistake 4: Data Generation Process
Let’s say we want to develop a model that can identify pills/capsules at a hospital.
After training, the model reaches high accuracy on the pills/capsules from hospital A, but you find that it performs terribly on the capsules from hospitals B and C.
After a lot of debugging, you notice that hospital A uses a different camera for capsules than the other hospitals, and it turns out the model was picking up on the slight difference in how the capsule images were captured.
This type of data leakage, which happens during the data generation process, can be quite hard to avoid, but you can mitigate the risk by keeping track of where the data comes from and how it is collected.
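One way to put source tracking to use, offered here as an assumption rather than something this post prescribes, is to hold out one source at a time during evaluation. The sketch below assumes each sample carries a `source` field; the file names and the function name are hypothetical.

```python
from collections import defaultdict

def leave_one_source_out(samples, source_key="source"):
    """Yield (held_out_source, train, test), holding out each source once."""
    by_source = defaultdict(list)
    for sample in samples:
        by_source[sample[source_key]].append(sample)
    for held_out in sorted(by_source):
        test = by_source[held_out]
        train = [
            sample
            for source, group in by_source.items()
            if source != held_out
            for sample in group
        ]
        yield held_out, train, test

# Hypothetical samples tagged with the hospital they came from
samples = [
    {"image": "a1.png", "source": "hospital_A", "label": "pill"},
    {"image": "a2.png", "source": "hospital_A", "label": "capsule"},
    {"image": "b1.png", "source": "hospital_B", "label": "capsule"},
    {"image": "c1.png", "source": "hospital_C", "label": "pill"},
]

splits = list(leave_one_source_out(samples))
```

If accuracy drops sharply whenever a particular source is held out, the model may be relying on how that source captures its data rather than on the pills themselves.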
▮ Detecting Data Leakage
Data leakage can happen not only to newcomers but also to senior-level data scientists.
Here are some tips to detect data leakage.
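1. If a feature has an unusually high correlation with the label, investigate why
2. If removing a feature changes the model performance dramatically, investigate why
3. If a newly added feature has a high correlation with the label, investigate why
4. Be careful every time you look at the test set: using it for anything other than the final evaluation, such as coming up with feature ideas or tuning hyperparameters, risks leaking test information into training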
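The correlation checks in the tips above can be sketched as a small screening step. This is a rough illustration, not a definitive detector: the 0.95 threshold, the feature names, and the toy values are all assumptions, and a flagged feature is only a prompt to investigate.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant column carries no linear signal
    return cov / math.sqrt(var_x * var_y)

def flag_suspicious(features, labels, threshold=0.95):
    """Return features whose absolute correlation with the label is very high."""
    flagged = {}
    for name, values in features.items():
        r = pearson(values, labels)
        if abs(r) >= threshold:
            flagged[name] = r
    return flagged

labels = [0, 1, 0, 1, 1, 0, 1, 0]
features = {
    "age": [34, 45, 29, 41, 38, 50, 33, 47],
    # Hypothetical field that was recorded after the outcome was known
    "treatment_code": [0, 1, 0, 1, 1, 0, 1, 0],
}

flagged = flag_suspicious(features, labels)
```

Here `treatment_code` mirrors the label exactly and gets flagged, while `age` does not. Running the check on the train split only keeps the screening itself from peeking at the test set.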