383. Test In Production

▮ After Deployment

Previously, I shared how to evaluate your model offline, before production. In this post, I’d like to share several methods for evaluating your model after deployment.

Blog: Model Offline Evaluation
Reference: Designing Machine Learning Systems

▮ Shadow Deployment

This is one of the safest ways to evaluate your model.

Fig.1 – Shadow Deployment

This test takes the following steps.

  1. Deploy new model
  2. Send requests to both existing and new model
  3. Send back prediction to user only from existing model
  4. Log prediction made by the new model

Since you don’t serve the new model’s predictions until everything checks out, it won’t affect any users. However, this can be costly because it simply doubles the number of predictions the system has to generate.
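Below is a minimal sketch of this flow in Python. The ShadowRouter class, the model objects, and their .predict() method are hypothetical placeholders for illustration, not a specific serving framework.

import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("shadow")

class ShadowRouter:
    """Sends every request to both models, returns only the existing
    model's prediction, and logs the new model's prediction."""

    def __init__(self, existing_model, new_model):
        self.existing_model = existing_model
        self.new_model = new_model

    def handle_request(self, features):
        # The user-facing prediction always comes from the existing model.
        live_prediction = self.existing_model.predict(features)
        try:
            # The new model sees the same request, but its output is only logged.
            shadow_prediction = self.new_model.predict(features)
            logger.info("live=%s shadow=%s", live_prediction, shadow_prediction)
        except Exception:
            # A failure in the shadow model must never affect the user.
            logger.exception("shadow model failed")
        return live_prediction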

▮ A/B Testing

This test takes the following steps.

  1. Deploy new model
  2. Route a percentage of traffic to the existing model, and the rest to the new model
  3. Monitor and analyze both results

Fig.2 – A/B Testing

You should note that, in some cases, one model’s predictions may affect the other model’s predictions. In such cases, you might want to serve your variants alternately in time, e.g. serve one model on the first day and the other model on the second day.

When testing, you should make sure of the following.

  1. The traffic routed to each model has to be truly random
  2. This test should be run on a sufficient number of samples
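A common way to satisfy point 1 is to hash a stable user ID into a bucket, which gives an effectively random split across users while keeping each user on the same variant. The sketch below is illustrative; the model objects, their .predict() method, and the metrics_log are assumptions, not a specific framework API.

import hashlib

TREATMENT_FRACTION = 0.1  # share of traffic routed to the new model (B)

def assign_variant(user_id: str) -> str:
    # Hash the user ID into 1000 buckets; assignment is effectively
    # random across users but deterministic per user.
    bucket = int(hashlib.sha256(user_id.encode()).hexdigest(), 16) % 1000
    return "B" if bucket < TREATMENT_FRACTION * 1000 else "A"

def handle_request(user_id, features, model_a, model_b, metrics_log):
    variant = assign_variant(user_id)
    model = model_b if variant == "B" else model_a
    prediction = model.predict(features)
    # Record the variant with each prediction so the two groups can be
    # compared later, once enough samples have accumulated (point 2).
    metrics_log.append({"user_id": user_id, "variant": variant})
    return prediction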

▮ Canary Release

This test takes the following steps.

  1. Deploy new model (canary model)
  2. Route a portion of traffic to the canary model
  3. Adjust traffic
    1. If the canary model performs well, increase its traffic
    2. If not, abort the canary release
  4. Stop when the canary model serves all the traffic

This testing method is quite similar to A/B testing. However, you can do a canary analysis without an A/B test, because the canary release doesn’t require randomizing the incoming traffic between models.

A/B testing’s main objective is to determine which model performs better, while the canary release’s objective is to gradually roll out the new model to a small subset of users before releasing it to all users. This reduces the risk of swapping in the new model within the ML system.
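A gradual rollout can be expressed as a loop over increasing traffic fractions. The stages, the metric comparison, and the .predict() interface in the sketch below are assumptions made for illustration.

import random

ROLLOUT_STAGES = [0.01, 0.05, 0.25, 0.5, 1.0]  # fraction of traffic on the canary

class CanaryRouter:
    def __init__(self, existing_model, canary_model):
        self.existing_model = existing_model
        self.canary_model = canary_model
        self.stage = 0          # start with the smallest traffic fraction
        self.aborted = False

    def handle_request(self, features):
        # Route a fraction of requests to the canary model (step 2).
        if not self.aborted and random.random() < ROLLOUT_STAGES[self.stage]:
            return self.canary_model.predict(features)
        return self.existing_model.predict(features)

    def adjust_traffic(self, canary_metric, baseline_metric, tolerance=0.01):
        if self.aborted:
            return
        if canary_metric >= baseline_metric - tolerance:
            # Step 3-1: performance holds up, so increase the canary's traffic
            # until it serves all requests (step 4).
            self.stage = min(self.stage + 1, len(ROLLOUT_STAGES) - 1)
        else:
            # Step 3-2: performance degraded, so abort the canary and send
            # all traffic back to the existing model.
            self.aborted = True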

▮ Interleaving Experiments

Unlike A/B testing, where you split users into groups and have the corresponding model serve each group, this evaluation method has multiple models make predictions for the same group of users and lets each user pick from the combined set of options, revealing which model’s predictions users prefer.

Fig.3 – Interleaving Experiments
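For ranking or recommendation models, one simplified way to do this is to interleave the two models’ ranked lists into a single list shown to the user and attribute each click back to the model that contributed the item. The sketch below uses a basic alternating scheme with made-up item lists; production systems often use fairer variants such as team-draft interleaving.

def interleave(ranking_a, ranking_b):
    # Alternate items from model A and model B, skipping duplicates, and
    # tag each item with its source so user clicks can be attributed.
    merged, seen = [], set()
    for item_a, item_b in zip(ranking_a, ranking_b):
        for item, source in ((item_a, "A"), (item_b, "B")):
            if item not in seen:
                merged.append((item, source))
                seen.add(item)
    return merged

# Example: the user sees one combined list; clicks on items tagged "B"
# count as wins for the new model.
print(interleave(["x", "y", "z"], ["y", "w", "z"]))
# -> [('x', 'A'), ('y', 'B'), ('w', 'B'), ('z', 'A')]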