383. Test In Production

▮ After Deployment

Previously, I shared how to evaluate your model offline, before production. In this post, I’d like to share several methods for evaluating your model after deployment.

Blog: Model Offline Evaluation
Reference: Designing Machine Learning Systems

▮ Shadow Deployment

This is one of the safest ways to evaluate your model.

Fig.1 – Shadow Deployment

This test takes the following steps.

  1. Deploy new model
  2. Send requests to both existing and new model
  3. Send back prediction to user only from existing model
  4. Log prediction made by the new model

Since you don’t serve the new model’s predictions until everything checks out, it won’t affect any users. However, this can be costly because it simply doubles the number of predictions the system has to generate.
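Below is a minimal sketch of this flow in Python. The ShadowRouter class, the model objects, and their .predict() method are hypothetical placeholders for illustration, not a specific serving framework.

import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("shadow")

class ShadowRouter:
    """Sends every request to both models, returns only the existing
    model's prediction, and logs the new model's prediction."""

    def __init__(self, existing_model, new_model):
        self.existing_model = existing_model
        self.new_model = new_model

    def handle_request(self, features):
        # The user-facing prediction always comes from the existing model.
        live_prediction = self.existing_model.predict(features)
        try:
            # The new model sees the same request, but its output is only logged.
            shadow_prediction = self.new_model.predict(features)
            logger.info("live=%s shadow=%s", live_prediction, shadow_prediction)
        except Exception:
            # A failure in the shadow model must never affect the user.
            logger.exception("shadow model failed")
        return live_prediction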

▮ A/B Testing

This test takes the following steps.

  1. Deploy new model
  2. Route a percentage of traffic to the existing model, and the rest to the new model
  3. Monitor and analyze both results

Fig.2 – A/B Testing

You should note that, in some cases, one model’s predictions may affect the other model’s predictions. In such cases, you might want to serve your variants alternately in time, e.g. serve one model on the first day and the other model on the second day.

When testing, you should make sure of the following.

  1. The traffic routed to each model has to be truly random
  2. This test should be run on a sufficient number of samples
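A common way to satisfy point 1 is to hash a stable user ID into a bucket, which gives an effectively random split across users while keeping each user on the same variant. The sketch below is illustrative; the model objects, their .predict() method, and the metrics_log are assumptions, not a specific framework API.

import hashlib

TREATMENT_FRACTION = 0.1  # share of traffic routed to the new model (B)

def assign_variant(user_id: str) -> str:
    # Hash the user ID into 1000 buckets; assignment is effectively
    # random across users but deterministic per user.
    bucket = int(hashlib.sha256(user_id.encode()).hexdigest(), 16) % 1000
    return "B" if bucket < TREATMENT_FRACTION * 1000 else "A"

def handle_request(user_id, features, model_a, model_b, metrics_log):
    variant = assign_variant(user_id)
    model = model_b if variant == "B" else model_a
    prediction = model.predict(features)
    # Record the variant with each prediction so the two groups can be
    # compared later, once enough samples have accumulated (point 2).
    metrics_log.append({"user_id": user_id, "variant": variant})
    return prediction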

▮ Canary Release

This test takes the following steps.

  1. Deploy new model (canary model)
  2. Route a portion of traffic to the canary model
  3. Adjust traffic
    1. If the canary model performs well, increase its traffic
    2. If not, abort the canary release
  4. Stop when the canary model serves all the traffic

This testing method is quite similar to A/B testing. However, you can do a canary analysis without an A/B test, because the canary release doesn’t require randomizing the incoming traffic between models.

A/B testing’s main objective is to determine which model performs better, while the canary release’s objective is to gradually roll out the new model to a small subset of users before releasing it to all users. This reduces the risk of swapping in the new model within the ML system.
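A gradual rollout can be expressed as a loop over increasing traffic fractions. The stages, the metric comparison, and the .predict() interface in the sketch below are assumptions made for illustration.

import random

ROLLOUT_STAGES = [0.01, 0.05, 0.25, 0.5, 1.0]  # fraction of traffic on the canary

class CanaryRouter:
    def __init__(self, existing_model, canary_model):
        self.existing_model = existing_model
        self.canary_model = canary_model
        self.stage = 0          # start with the smallest traffic fraction
        self.aborted = False

    def handle_request(self, features):
        # Route a fraction of requests to the canary model (step 2).
        if not self.aborted and random.random() < ROLLOUT_STAGES[self.stage]:
            return self.canary_model.predict(features)
        return self.existing_model.predict(features)

    def adjust_traffic(self, canary_metric, baseline_metric, tolerance=0.01):
        if self.aborted:
            return
        if canary_metric >= baseline_metric - tolerance:
            # Step 3-1: performance holds up, so increase the canary's traffic
            # until it serves all requests (step 4).
            self.stage = min(self.stage + 1, len(ROLLOUT_STAGES) - 1)
        else:
            # Step 3-2: performance degraded, so abort the canary and send
            # all traffic back to the existing model.
            self.aborted = True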

▮ Interleaving Experiments

Unlike A/B testing, where you split users into groups and have the corresponding model serve each group, this evaluation method has multiple models make predictions for the same group of users and lets each user pick from the combined set of options, revealing which model’s predictions users prefer.

Fig.3 – Interleaving Experiments
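For ranking or recommendation models, one simplified way to do this is to interleave the two models’ ranked lists into a single list shown to the user and attribute each click back to the model that contributed the item. The sketch below uses a basic alternating scheme with made-up item lists; production systems often use fairer variants such as team-draft interleaving.

def interleave(ranking_a, ranking_b):
    # Alternate items from model A and model B, skipping duplicates, and
    # tag each item with its source so user clicks can be attributed.
    merged, seen = [], set()
    for item_a, item_b in zip(ranking_a, ranking_b):
        for item, source in ((item_a, "A"), (item_b, "B")):
            if item not in seen:
                merged.append((item, source))
                seen.add(item)
    return merged

# Example: the user sees one combined list; clicks on items tagged "B"
# count as wins for the new model.
print(interleave(["x", "y", "z"], ["y", "w", "z"]))
# -> [('x', 'A'), ('y', 'B'), ('w', 'B'), ('z', 'A')]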