403. Data Distribution Shifts

▮ Data Shift

After deploying your model, you’ll need to keep maintaining it because the data it sees is constantly changing.
So for this post, I’d like to share three types of data distribution shift that can occur and degrade your model’s performance.

▮ Concept Drift

This occurs when the relationship between the input features and the target variable changes: P(Y|X) shifts even though the input distribution P(X) may stay the same.

For example, a model was trained to predict housing prices, and at the time, distance from the train station mattered a lot. However, with the recent rise of work-from-home culture, people no longer value proximity to the station as much as before.

[Figure: Concept shift]
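
One practical way to catch concept drift is to monitor the model’s error on recently labeled production data and compare it to the error measured at training time; note that this requires ground-truth labels to arrive eventually. Below is a minimal sketch of this idea in Python. The DriftMonitor class, window size, and tolerance are illustrative assumptions, not any particular library’s API.

```python
# Minimal sketch of concept-drift monitoring: track the model's error over a
# rolling window of recently labeled production examples and flag drift when
# it rises past a tolerance above the training-time baseline.
# The window size and tolerance below are illustrative, not recommendations.
from collections import deque


class DriftMonitor:
    def __init__(self, baseline_error, window=500, tolerance=0.05):
        self.baseline_error = baseline_error  # error rate on the held-out set at training time
        self.errors = deque(maxlen=window)    # rolling record of recent per-example errors
        self.tolerance = tolerance

    def update(self, y_true, y_pred):
        """Record one labeled production example; return True if drift is suspected."""
        self.errors.append(int(y_true != y_pred))
        if len(self.errors) < self.errors.maxlen:
            return False  # not enough recent labels to judge yet
        recent_error = sum(self.errors) / len(self.errors)
        return recent_error > self.baseline_error + self.tolerance


# Usage sketch (labeled_stream is a hypothetical source of (y_true, y_pred) pairs):
# monitor = DriftMonitor(baseline_error=0.12)
# for y_true, y_pred in labeled_stream:
#     if monitor.update(y_true, y_pred):
#         ...  # alert, investigate, or trigger retraining
```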

▮ Covariate Drift

This shift happens when the distribution of the input features, P(X), drifts away from the distribution of the original training data, while the relationship P(Y|X) stays the same.

For example, a model was trained to identify cats using pictures of white cats, but in production it has to make predictions for brown cats.

[Figure: Covariate shift]
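
A nice property of covariate shift is that it can often be detected without any labels by statistically comparing feature distributions. The sketch below runs a two-sample Kolmogorov-Smirnov test (scipy.stats.ks_2samp) per feature; the feature names, synthetic data, and significance level are illustrative assumptions.

```python
# Minimal sketch of covariate-shift detection: compare each feature's
# distribution in recent production data against the training data with a
# two-sample Kolmogorov-Smirnov test.
import numpy as np
from scipy.stats import ks_2samp


def drifted_features(train, production, feature_names, alpha=0.01):
    """Return the features whose production distribution differs from training."""
    flagged = []
    for i, name in enumerate(feature_names):
        stat, p_value = ks_2samp(train[:, i], production[:, i])
        if p_value < alpha:  # distributions significantly differ
            flagged.append((name, stat, p_value))
    return flagged


# Example with synthetic data: the second feature is shifted in production.
rng = np.random.default_rng(0)
train = rng.normal(0, 1, size=(1000, 2))
production = np.column_stack([
    rng.normal(0.0, 1, 1000),  # unchanged feature
    rng.normal(0.5, 1, 1000),  # drifted feature
])
# Hypothetical feature names for a housing-price model:
print(drifted_features(train, production, ["dist_to_station", "sqft"]))
```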

▮ Target Drift

This happens when the distribution of the target variable, P(Y), changes while P(X|Y) stays the same. Also known as label shift or prior probability shift, it can be seen as the reverse of covariate shift.

For example, you train a model that identifies whether an email is spam. Let’s say that 50% of the training set is spam, but in production, only 10% of emails are spam. Instead of impacting the input distribution, this shift influences the output distribution.

This kind of shift is usually framed in Y → X (anticausal) problems, where the label generates the features, and it is commonly associated with generative models such as naive Bayes, which explicitly model P(X|Y). A sketch of a prior-based correction follows below.

[Figure: Target shift]
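
If only P(Y) changes, a probabilistic classifier’s outputs can be corrected without retraining by re-weighting them with the ratio of the new prior to the training prior. Here is a minimal sketch applied to the spam example; the 50%/10% priors mirror the scenario above, and in practice the production prior would itself have to be estimated.

```python
# Minimal sketch of a label-shift correction: when only P(Y) changes,
# re-weight predicted class probabilities p(y|x) by prod_prior / train_prior
# and renormalize. The learned P(X|Y) is left untouched, so no retraining
# is needed. All numbers below are illustrative.
import numpy as np


def correct_for_label_shift(probs, train_prior, prod_prior):
    """Re-weight predicted class probabilities by the prior ratio, then renormalize."""
    weights = np.asarray(prod_prior) / np.asarray(train_prior)
    adjusted = probs * weights  # broadcast over the class axis
    return adjusted / adjusted.sum(axis=1, keepdims=True)


# A borderline email: a model trained on 50/50 data says 60% spam, but with
# a 10% spam base rate in production the corrected estimate drops sharply.
probs = np.array([[0.4, 0.6]])  # [P(ham|x), P(spam|x)]
print(correct_for_label_shift(probs, train_prior=[0.5, 0.5], prod_prior=[0.9, 0.1]))
# -> roughly [[0.857, 0.143]]
```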