379. Model Offline Evaluation

▮ Evaluating your model

How do you know whether your ML model is any good?

Lacking a clear understanding of how to evaluate your model may not immediately mean that your ML project is going to fail, but it makes it almost impossible to find the best solution for your needs, let alone convince your client.

Ideally, evaluation methods for both the development and production phases should be the same. However, that is quite unlikely to happen because you have ground-truth labels during the development phase but usually not during the production phase. So for this post, I’d like to share several methods to evaluate your model before production.

Reference: Designing Machine Learning Systems

▮ Creating baselines

First, we would need to create a baseline to evaluate against.

Evaluation metrics by themselves are not that informative because they don’t tell you how good a result actually is. If someone said they got 95 points on an exam, you can’t tell whether that is good or not because you don’t know if the test is out of 100 points. Maybe it’s out of 1000 points. It is essential to keep in mind that you should always have a baseline to evaluate against when assessing your model.

Which baselines you should use depends on your specific case, so here are several common ways to set one.

Random Baseline

How well would it perform if you just randomly chose the output?
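
As a minimal sketch of such a baseline (the dataset, metric, and strategy here are illustrative assumptions, not from the original post), scikit-learn’s DummyClassifier makes this easy:

```python
# Random baseline sketch using scikit-learn's DummyClassifier.
# The dataset and metric are placeholders for illustration.
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# "stratified" draws predictions from the training label distribution;
# "uniform" would instead predict each class with equal probability.
baseline = DummyClassifier(strategy="stratified", random_state=0)
baseline.fit(X_train, y_train)

print("Random baseline F1:", f1_score(y_test, baseline.predict(X_test)))
```

Any model you build should comfortably beat this number.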

Simple Heuristic

If you just made predictions based on a simple heuristic, how well would it perform?

For example, if you want to build a ranking system that ranks items on a user’s newsfeed so that they spend more time on it, what would happen if you simply ranked all items in reverse chronological order?
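
Here is a minimal sketch of that heuristic; the item structure is made up for illustration:

```python
# Heuristic baseline for newsfeed ranking: newest items first.
# The item fields here are a made-up example.
from datetime import datetime, timedelta

def rank_reverse_chronological(items):
    """Rank items by recency, newest first."""
    return sorted(items, key=lambda item: item["created_at"], reverse=True)

now = datetime.now()
items = [
    {"id": 1, "created_at": now - timedelta(hours=5)},
    {"id": 2, "created_at": now - timedelta(minutes=10)},
    {"id": 3, "created_at": now - timedelta(days=1)},
]

for item in rank_reverse_chronological(items):
    print(item["id"], item["created_at"])
```

If a learned ranking model can’t beat this one-liner on the time-spent metric, the ML system isn’t adding value yet.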

Human Baseline

In most ML cases, the objective is to automate tasks that people would otherwise do, so you can compare your model’s performance with that of human experts.

Existing Solutions

If your ML system is designed to replace an existing solution, it’s crucial to compare your new model to that solution. It would be beside the point if your ML system looked great against the other baselines but still performed worse than the current solution.

▮ Evaluation Methods

Unlike in academic settings, where the performance metrics are usually fixed, in production we also want our model to be robust and, to a certain extent, fair.

Once you have decided on your baselines, here are several methods to evaluate your model.

Perturbation Test

Performing well on your training and test datasets is one thing, and performing well in production with noisy data is another.

Fig.1 – Perturbation

An ML model that is too sensitive to noise is hard to maintain, because the data will inevitably start to change after the model is deployed.

To evaluate how well your model holds up in those situations, you can apply small perturbations to the test split and measure how much the performance changes.
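
As a rough sketch, assuming a tabular model with numeric features and accuracy as the metric (both assumptions for illustration), a perturbation test could look like this:

```python
# Perturbation test sketch: add small Gaussian noise to the test features
# and compare the metric before and after. The noise scale is an assumption.
import numpy as np
from sklearn.metrics import accuracy_score

def perturbation_test(model, X_test, y_test, noise_scale=0.01, seed=0):
    rng = np.random.default_rng(seed)
    clean_score = accuracy_score(y_test, model.predict(X_test))
    X_noisy = X_test + rng.normal(0.0, noise_scale, size=X_test.shape)
    noisy_score = accuracy_score(y_test, model.predict(X_noisy))
    return clean_score, noisy_score
```

A large gap between the two scores suggests the model is overly sensitive to small changes in its inputs.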

Invariance Test

In most cases, when the input changes, the output changes as well. However, there are some input changes that should not change the output at all.

For example, changing the race attribute in the input shouldn’t change a model’s prediction of annual income.

To avoid these biases, one thing you can do is change only the inputs containing sensitive information and see how the outputs change. On top of that, removing sensitive information from the training data in the first place may be an even better approach.
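
As a sketch, assuming a binary sensitive attribute stored in a known column of a NumPy feature matrix (both assumptions for illustration):

```python
# Invariance test sketch: flip a (hypothetical) binary sensitive column
# and check how often the prediction changes. Ideally the rate is ~0.
import numpy as np

SENSITIVE_COL = 3  # hypothetical index of the sensitive attribute

def invariance_violation_rate(model, X_test):
    X_flipped = X_test.copy()
    X_flipped[:, SENSITIVE_COL] = 1 - X_flipped[:, SENSITIVE_COL]
    original = model.predict(X_test)
    flipped = model.predict(X_flipped)
    # Fraction of rows whose prediction changed when only the
    # sensitive attribute was changed.
    return np.mean(original != flipped)
```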

Directional Expectation Test

There are some outputs that are expected to increase or decrease depending on the input.

Fig.2 – Directional expectation

For example, when predicting housing prices, if the AREA of a house increases, the predicted price should probably increase as well.
If the model outputs something different, there might be a problem with the data or the training procedure.
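
A sketch of such a check for the housing example, assuming the area feature lives in a known column and prices come from model.predict (both assumptions for illustration):

```python
# Directional expectation test sketch: increase the (hypothetical) area
# column and check that the predicted price does not go down.
import numpy as np

AREA_COL = 0  # hypothetical index of the area feature

def directional_violation_rate(model, X_test, delta=10.0):
    X_bigger = X_test.copy()
    X_bigger[:, AREA_COL] += delta  # make every house `delta` units larger
    base_price = model.predict(X_test)
    bigger_price = model.predict(X_bigger)
    # Fraction of predictions that moved in the "wrong" direction.
    return np.mean(bigger_price < base_price)
```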

Slice-Based Evaluation

When evaluating a model, focusing too much on overall metrics can give you a misleading picture of model performance.

For example, let’s say a company wants to evaluate its model’s performance and the currently available data has two subgroups: a majority that makes up 90% of the data and a minority that makes up 10%.

Fig.3 – Slice-Based

As the table shows, if you ONLY consider the overall metric, model A seems better, but if you look at the metrics for each subgroup, that may not be the case.

By considering each subgroup separately (slice-based evaluation), you can not only evaluate overall performance but also find potential bias in critical slices of the data.
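
As a sketch, assuming you already have predictions and a group label for each sample (accuracy is just a placeholder metric):

```python
# Slice-based evaluation sketch: the same metric, overall and per subgroup.
import numpy as np
from sklearn.metrics import accuracy_score

def slice_metrics(y_true, y_pred, groups):
    results = {"overall": accuracy_score(y_true, y_pred)}
    for group in np.unique(groups):
        mask = groups == group
        results[f"group={group}"] = accuracy_score(y_true[mask], y_pred[mask])
    return results
```

A model can have high overall accuracy while doing much worse on the 10% minority slice, which is exactly what this kind of breakdown surfaces.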

Model Calibration

This evaluation method may be the most important one when it comes to forecasting.

Checking whether a model is well-calibrated means checking that, if the model predicts there is a 40% chance that A will happen, the actual frequency of A occurring is also close to 40%.

Here is a plot from Scikit-Learn.

Fig.4 – Calibration Curve

To measure a model’s calibration, take the probabilities X that your model outputs and, for each value of X, the frequency Y with which that prediction actually comes true, then plot X against Y. You can easily plot this using tools such as sklearn.calibration.calibration_curve.
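
A minimal sketch of such a plot, using a placeholder dataset and classifier (both assumptions for illustration):

```python
# Calibration curve sketch with scikit-learn and matplotlib.
import matplotlib.pyplot as plt
from sklearn.calibration import calibration_curve
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = LogisticRegression().fit(X_train, y_train)
prob_pos = clf.predict_proba(X_test)[:, 1]

# x-axis: mean predicted probability per bin; y-axis: actual fraction of positives.
frac_pos, mean_pred = calibration_curve(y_test, prob_pos, n_bins=10)

plt.plot(mean_pred, frac_pos, marker="o", label="model")
plt.plot([0, 1], [0, 1], linestyle="--", label="perfectly calibrated")
plt.xlabel("Mean predicted probability")
plt.ylabel("Fraction of positives")
plt.legend()
plt.show()
```

The closer the curve is to the diagonal, the better calibrated the model is.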