▮ Model Explainability
The inner workings of an ML model can easily become a black box. Increasing the explainability of an ML model helps developers debug it and lets them explain to the client why the model predicts a certain outcome.
Here is one way to increase model explainability: visualizing which features contribute most to the outcome using SHAP (SHapley Additive exPlanations) values.
For this example, I will use information about NBA players (name, position, age, etc.) to predict whether a player had a “winning season” or not.
Reference: Practical MLOps
▮ Implementation
Please use a Jupyter notebook to run the code below.
- Import
import shap
import xgboost
from sklearn.model_selection import train_test_split
import pandas as pd
- Download and read the CSV file
#read csv data
player_data = "https://raw.githubusercontent.com/noahgift/socialpowernba/master/data/nba_2017_players_with_salary_wiki_twitter.csv"
df = pd.read_csv(player_data)
df.head()
- Add “winning_season” Column
# Function to generate the new column:
# if the "W" (win) count is larger than 42,
# label that player's season as a winning season
def winning_season(wins):
    if wins > 42:
        return 1
    return 0

# Create new column using the function above
df["winning_season"] = df["W"].apply(winning_season)
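As an aside, the same column can be built without a helper function by vectorizing the comparison. A minimal sketch on a toy frame (the `W` values here are invented for illustration):

```python
import pandas as pd

# Toy frame standing in for the NBA data; "W" holds season win counts
toy = pd.DataFrame({"W": [50, 30, 43, 42]})

# Vectorized equivalent of the winning_season() helper:
# the comparison gives True/False, which astype(int) turns into 1/0
toy["winning_season"] = (toy["W"] > 42).astype(int)

print(toy["winning_season"].tolist())  # [1, 0, 1, 0]
```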
- Clean data and divide it into “target” and “features”.
# Create new data frame and clean data
df2 = df[["AGE", "POINTS", "SALARY_MILLIONS", "PAGEVIEWS",
"TWITTER_FAVORITE_COUNT", "winning_season", "TOV"]]
df = df2.dropna()
# Divide data to target and features
target = df["winning_season"]
features = df[["AGE", "POINTS", "SALARY_MILLIONS", "PAGEVIEWS",
"TWITTER_FAVORITE_COUNT", "TOV"]]
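Before training a classifier it is also worth checking how balanced the target is, which `value_counts()` shows at a glance. A toy sketch with invented labels standing in for `df["winning_season"]`:

```python
import pandas as pd

# Invented 0/1 labels standing in for df["winning_season"]
target = pd.Series([1, 0, 1, 1, 0, 1])

# Count how many seasons fall into each class (most frequent first)
counts = target.value_counts()
print(counts.index.tolist(), counts.tolist())  # [1, 0] [4, 2]
```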
- Split data into train and test sets, then train the model
# Split data
x_train, x_test, y_train, y_test = train_test_split(features, target,
test_size=0.25,
random_state=0)
# Train model (binary:logistic makes this an explicit
# classification task, so predictions are probabilities)
model_xgboost = xgboost.train(
    {"learning_rate": 0.01, "objective": "binary:logistic"},
    xgboost.DMatrix(x_train, label=y_train), 100)
- Initialize SHAP and plot the results
# load JS visualization code to notebook
shap.initjs()
# explain the model's predictions using SHAP values
explainer = shap.TreeExplainer(model_xgboost)
shap_values = explainer.shap_values(features)
shap.summary_plot(shap_values, features, plot_type="bar")
From the plot above, you can tell that PAGEVIEWS and TWITTER_FAVORITE_COUNT have the two highest mean absolute SHAP values, which means these two features contribute the most to the model's output.