387. Model Explainability With SHAP

▮ Model Explainability

An ML model can easily become a black box. Increasing the explainability of an ML model can help developers debug it and also communicate to the client why the model is predicting a certain outcome.

Here is one way to increase model explainability: visualize which features contribute most to the outcome using SHAP values.

For this example, I will be using information about NBA players (name, position, age, etc.) to predict whether a player is in a “winning season” or not.

Reference: Practical MLOps

▮ Implementation

Please use a Jupyter notebook to run the code below.
1. Import libraries

import shap
import xgboost
from sklearn.model_selection import train_test_split
import pandas as pd
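
If any of these libraries are missing from your environment, they can be installed first (this install command is my addition, not part of the original reference):

# Install the required packages (run in a notebook cell)
!pip install shap xgboost pandas scikit-learn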
2. Download and read the CSV file
# Read CSV data
player_data = "https://raw.githubusercontent.com/noahgift/socialpowernba/master/data/nba_2017_players_with_salary_wiki_twitter.csv"
df = pd.read_csv(player_data)
df.head()
Fig.1 – Dataframe
3. Add a “winning_season” column
# Function to generate a new column:
# if a player's win count is greater than 42,
# label that player as having a "winning season"
def winning_season(wins):
    if wins > 42:
        return 1
    return 0

# Create a new column using the function above
df["winning_season"] = df["W"].apply(winning_season)
4. Clean the data and divide it into “target” and “features”
# Create new data frame and clean data
df2 = df[["AGE", "POINTS", "SALARY_MILLIONS", "PAGEVIEWS",
          "TWITTER_FAVORITE_COUNT", "winning_season", "TOV"]]
df = df2.dropna()

# Divide data to target and features
target = df["winning_season"]
features = df[["AGE", "POINTS", "SALARY_MILLIONS", "PAGEVIEWS",
               "TWITTER_FAVORITE_COUNT", "TOV"]]
5. Split the data into train and test sets and train the model
# Split data
x_train, x_test, y_train, y_test = train_test_split(features, target,
                                                    test_size=0.25,
                                                    random_state=0)
# Train model
model_xgboost = xgboost.train(
    {"learning_rate": 0.01}, xgboost.DMatrix(x_train, label=y_train), 100)
6. Initialize SHAP and plot the results
# load JS visualization code to notebook
shap.initjs()

# explain the model's predictions using SHAP values
explainer = shap.TreeExplainer(model_xgboost)
shap_values = explainer.shap_values(features)
shap.summary_plot(shap_values, features, plot_type="bar")
Fig.2 – SHAP value visualization

From the plot above, you can tell that PAGEVIEWS and TWITTER_FAVORITE_COUNT have the two largest mean absolute SHAP values, which means these two features contribute the most to the model's output.
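
The bar plot summarizes overall feature importance. SHAP can also explain individual predictions; here is a minimal sketch using the explainer and shap_values computed above (row 0 is just an arbitrary example player):

# Explain a single player's prediction as a force plot
# (shap.initjs() from earlier is needed for the plot to render in the notebook)
shap.force_plot(explainer.expected_value,
                shap_values[0, :],
                features.iloc[0, :])

You can also call shap.summary_plot(shap_values, features) without plot_type="bar" to get a beeswarm plot that shows the direction of each feature's effect, not just its magnitude.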