Underfitting and Overfitting Exercise

Beginning

This is the fourth part of kaggle's Introduction to Machine Learning tutorial - Overfitting and Underfitting.

Imports

Python

from argparse import Namespace
from datetime import datetime
from functools import partial
from pathlib import Path

PyPi

from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor

import hvplot.pandas
import pandas

Others

from graeae import EmbedHoloviews, EnvironmentLoader, Timer

Set Up

Plotting

SLUG = "underfitting-and-overfitting-exercise"
OUTPUT_PATH = Path("../../files/posts/tutorials/")/SLUG
Embed = partial(EmbedHoloviews, folder_path=OUTPUT_PATH)
Plot = Namespace(
    height=800,
    width=1000,
)

The Timer

TIMER = Timer()

Environment

ENVIRONMENT = EnvironmentLoader()

The Data

data = pandas.read_csv(
    Path(ENVIRONMENT["HOUSE-PRICES-IOWA"]).expanduser()/"train.csv")

Middle

Preliminary 1: Specify Prediction Target

Select the target variable, which corresponds to the sales price. Save this to a new variable called `y`. You'll need to print a list of the columns to find the name of the column you need.

Our target is SalePrice.

Y = data.SalePrice

Preliminary 2: Create X

Now you will create a DataFrame called `X` holding the predictive features.

Since you want only some columns from the original data, you'll first create a list with the names of the columns you want in `X`.

You'll use just the following columns in the list (you can copy and paste the whole list to save some typing, though you'll still need to add quotes):

  • LotArea
  • YearBuilt
  • 1stFlrSF
  • 2ndFlrSF
  • FullBath
  • BedroomAbvGr
  • TotRmsAbvGrd
FEATURES = [
    "LotArea",
    "YearBuilt",
    "1stFlrSF",
    "2ndFlrSF",
    "FullBath",
    "BedroomAbvGr",
    "TotRmsAbvGrd",
]
X = data[FEATURES]

Split up the data into training and validation sets.

x_train, x_validate, y_train, y_validate = train_test_split(X, Y, random_state=1)

Preliminary 3: Specify and Fit Model

A Linear Regression Model

As a baseline, I'll fit a simple Linear Regression (ordinary-least-squares) model.

regression = LinearRegression()
scores = cross_val_score(regression, x_train, y_train, cv=5)
print(f"{scores.mean():0.2f} (+/- {2 * scores.std():0.2f})")
regression = regression.fit(x_train, y_train)
print(f"Training R^2: {regression.score(x_train, y_train): 0.2f}")
print(f"Validation R^2: {regression.score(x_validate, y_validate):0.2f}")
0.66 (+/- 0.17)
Training R^2:  0.68
Validation R^2: 0.77

Decision Tree

Create a DecisionTreeRegressor and save it as iowa_model. Ensure you've done the relevant import from sklearn to run this command.

Then fit the model you just created using the data in X and y that you saved above.

tree = DecisionTreeRegressor()
scores = cross_val_score(tree, x_train, y_train, cv=5)
print(f"{scores.mean():0.2f} (+/- {2 * scores.std():0.2f})")

tree = tree.fit(x_train, y_train)
print(f"Training R^2: {tree.score(x_train, y_train): 0.2f}")
print(f"Validation R^2: {tree.score(x_validate, y_validate):0.2f}")
0.54 (+/- 0.32)
Training R^2:  1.00
Validation R^2: 0.75

So our linear regression actually does better than the tree does. It looks like the tree might be overfitting on the training data.

Preliminary 4: Make Some Predictions

tree_predict = tree.predict(x_validate)
regression_predict = regression.predict(x_validate)

Preliminary 5: Calculate the Mean Absolute Error in Validation Data

tree_mae = mean_absolute_error(y_true=y_validate, y_pred=tree_predict)
regression_mae = mean_absolute_error(y_true=y_validate, y_pred=regression_predict)

print(f"Tree MAE: {tree_mae: 0.2f}")
print(f"Regression MAE: {regression_mae: 0.2f}")
Tree MAE:  29371.52
Regression MAE:  27228.88

The tree's error is a little higher than the regression line's.

Step 1: Compare Different Tree Sizes

Write a loop that tries a set of possible values for max_leaf_nodes.

Call the get_mae function on each value of max_leaf_nodes. Store the output in some way that allows you to select the value of max_leaf_nodes that gives the most accurate model on your data.

def get_mae(max_leaf_nodes, train_X=x_train, val_X=x_validate, train_y=y_train, val_y=y_validate):
    model = DecisionTreeRegressor(max_leaf_nodes=max_leaf_nodes, random_state=0)
    model.fit(train_X, train_y)
    preds_val = model.predict(val_X)
    mae = mean_absolute_error(val_y, preds_val)
    return mae

Write a loop to find the ideal tree size from candidate_max_leaf_nodes.

candidate_max_leaf_nodes = [5, 25, 50, 100, 250, 500]
outcomes = [(get_mae(nodes), nodes) for nodes in candidate_max_leaf_nodes]
best = min(outcomes)
print(best)
best_tree_size = best[1]
(27282.50803885739, 100)
mae = pandas.DataFrame(dict(nodes=candidate_max_leaf_nodes, mae = [outcome[0] for outcome in outcomes]))
plot = mae.hvplot(x="nodes", y="mae").opts(title="Node Mean Absolute Error",
                                           width=Plot.width,
                                           height=Plot.height)
source = Embed(plot=plot, file_name="node_mean_absolute_error")()
print(source)
[Figure: Node Mean Absolute Error - validation MAE vs max_leaf_nodes]

Looking at the plot you can see that the error drops until you hit 100 nodes and then begins to rise again as it overfits the data with more nodes.

Let's see how much this improves our model using \(r^2\).

tree = DecisionTreeRegressor(max_leaf_nodes=best_tree_size)
scores = cross_val_score(tree, x_train, y_train, cv=5)
print(f"{scores.mean():0.2f} (+/- {2 * scores.std():0.2f})")

tree = tree.fit(x_train, y_train)
print(f"Training R^2: {tree.score(x_train, y_train): 0.2f}")
print(f"Validation R^2: {tree.score(x_validate, y_validate):0.2f}")
0.60 (+/- 0.26)
Training R^2:  0.93
Validation R^2: 0.76

We've improved it slightly; it's probably still overfitting the training data, but not as much.
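
For a more direct comparison with the MAE numbers above, here's a quick sketch (using the tuned tree and the validation split already defined) that computes the tuned tree's validation MAE:

# Validation MAE for the tuned tree, to compare with the earlier tree and regression MAEs.
tuned_mae = mean_absolute_error(y_true=y_validate, y_pred=tree.predict(x_validate))
print(f"Tuned Tree MAE: {tuned_mae:0.2f}")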

Step 2: Fit Model Using All Data

You know the best tree size. If you were going to deploy this model in practice, you would make it even more accurate by using all of the data and keeping that tree size. That is, you don't need to hold out the validation data now that you've made all your modeling decisions.

final_model = DecisionTreeRegressor(max_leaf_nodes = best_tree_size)
final_model.fit(X, Y)
predictions_first = tree.predict(X)
predictions_final = final_model.predict(X)

x_y_tree = pandas.DataFrame(dict(predicted=predictions_first, actual=Y))
x_y_line = pandas.DataFrame(dict(predicted=predictions_final, actual=Y))
ideal = pandas.DataFrame(dict(x=Y, y=Y))

tree_plot = x_y_tree.hvplot.scatter(x="actual", y="predicted", label="Default")
line_plot = x_y_line.hvplot.scatter(x="actual", y="predicted", label="Tuned")
ideal_plot = ideal.hvplot(x="x", y="y")
plot = (tree_plot * line_plot * ideal_plot).opts(title="Decision Tree Actual Vs Predictions",
                                    width=Plot.width,
                                    height=Plot.height)
source = Embed(plot=plot, file_name="decision_tree_actual_vs_predicted")()
print(source)
[Figure: Decision Tree Actual Vs Predictions - actual vs predicted sale price]

The final model (labeled "Tuned" in the plot) seems to track the actual values more closely.

End

That's a basic way to tune hyperparameters to improve your model. But our decision tree still isn't doing as well as the regression line. Next up we'll try an ensemble method - Random Forests.
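
As a quick preview of that comparison, here's a minimal sketch; it assumes one extra import (sklearn.ensemble.RandomForestRegressor) that isn't in this post's imports:

# A default random forest on the same training split, scored like the other models.
from sklearn.ensemble import RandomForestRegressor

forest = RandomForestRegressor(random_state=1)
forest.fit(x_train, y_train)
print(f"Forest MAE: {mean_absolute_error(y_validate, forest.predict(x_validate)):0.2f}")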

Model Validation Exercise

Beginning

This is the third part of kaggle's Introduction to Machine Learning tutorial - Model Validation.

Imports

Python

from argparse import Namespace
from datetime import datetime
from functools import partial
from pathlib import Path

PyPi

from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.linear_model import ElasticNet, Lasso, LinearRegression, Ridge
from sklearn.tree import DecisionTreeRegressor

import hvplot.pandas
import pandas

Others

from graeae import EmbedHoloviews, EnvironmentLoader, Timer

Set Up

Plotting

SLUG = "a-first-machine-learning-model"
OUTPUT_PATH = Path("../../files/posts/tutorials/")/SLUG
Embed = partial(EmbedHoloviews, folder_path=OUTPUT_PATH)
Plot = Namespace(
    height=800,
    width=1000,
)

The Timer

TIMER = Timer()

Environment

ENVIRONMENT = EnvironmentLoader()

The Data

data = pandas.read_csv(
    Path(ENVIRONMENT["HOUSE-PRICES-IOWA"]).expanduser()/"train.csv")

Middle

Step 1: Specify Prediction Target

Select the target variable, which corresponds to the sales price. Save this to a new variable called `y`. You'll need to print a list of the columns to find the name of the column you need.

for column in sorted(data.columns):
    print(f"- {column} ({data[column].dtype})")
- 1stFlrSF (int64)
- 2ndFlrSF (int64)
- 3SsnPorch (int64)
- Alley (object)
- BedroomAbvGr (int64)
- BldgType (object)
- BsmtCond (object)
- BsmtExposure (object)
- BsmtFinSF1 (int64)
- BsmtFinSF2 (int64)
- BsmtFinType1 (object)
- BsmtFinType2 (object)
- BsmtFullBath (int64)
- BsmtHalfBath (int64)
- BsmtQual (object)
- BsmtUnfSF (int64)
- CentralAir (object)
- Condition1 (object)
- Condition2 (object)
- Electrical (object)
- EnclosedPorch (int64)
- ExterCond (object)
- ExterQual (object)
- Exterior1st (object)
- Exterior2nd (object)
- Fence (object)
- FireplaceQu (object)
- Fireplaces (int64)
- Foundation (object)
- FullBath (int64)
- Functional (object)
- GarageArea (int64)
- GarageCars (int64)
- GarageCond (object)
- GarageFinish (object)
- GarageQual (object)
- GarageType (object)
- GarageYrBlt (float64)
- GrLivArea (int64)
- HalfBath (int64)
- Heating (object)
- HeatingQC (object)
- HouseStyle (object)
- Id (int64)
- KitchenAbvGr (int64)
- KitchenQual (object)
- LandContour (object)
- LandSlope (object)
- LotArea (int64)
- LotConfig (object)
- LotFrontage (float64)
- LotShape (object)
- LowQualFinSF (int64)
- MSSubClass (int64)
- MSZoning (object)
- MasVnrArea (float64)
- MasVnrType (object)
- MiscFeature (object)
- MiscVal (int64)
- MoSold (int64)
- Neighborhood (object)
- OpenPorchSF (int64)
- OverallCond (int64)
- OverallQual (int64)
- PavedDrive (object)
- PoolArea (int64)
- PoolQC (object)
- RoofMatl (object)
- RoofStyle (object)
- SaleCondition (object)
- SalePrice (int64)
- SaleType (object)
- ScreenPorch (int64)
- Street (object)
- TotRmsAbvGrd (int64)
- TotalBsmtSF (int64)
- Utilities (object)
- WoodDeckSF (int64)
- YearBuilt (int64)
- YearRemodAdd (int64)
- YrSold (int64)

That is a huge number of features.

Our target is SalePrice.

Y = data.SalePrice

Step 2: Create X

Now you will create a DataFrame called `X` holding the predictive features.

Since you want only some columns from the original data, you'll first create a list with the names of the columns you want in `X`.

You'll use just the following columns in the list (you can copy and paste the whole list to save some typing, though you'll still need to add quotes):

  • LotArea
  • YearBuilt
  • 1stFlrSF
  • 2ndFlrSF
  • FullBath
  • BedroomAbvGr
  • TotRmsAbvGrd
FEATURES = [
    "LotArea",
    "YearBuilt",
    "1stFlrSF",
    "2ndFlrSF",
    "FullBath",
    "BedroomAbvGr",
    "TotRmsAbvGrd",
]
X = data[FEATURES]

Split up the data into training and validation sets.

x_train, x_validate, y_train, y_validate = train_test_split(X, Y, random_state=1)

Step 3: Specify and Fit Model

A Linear Regression Model

As a baseline, I'll fit a simple Linear Regression (ordinary-least-squares) model.

regression = LinearRegression()
scores = cross_val_score(regression, x_train, y_train, cv=5)
print(f"{scores.mean():0.2f} (+/- {2 * scores.std():0.2f})")
regression = regression.fit(x_train, y_train)
print(f"Training R^2: {regression.score(x_train, y_train): 0.2f}")
print(f"Validation R^2: {regression.score(x_validate, y_validate):0.2f}")
0.66 (+/- 0.17)
Training R^2:  0.68
Validation R^2: 0.77

Decision Tree

Create a DecisionTreeRegressor and save it as iowa_model. Ensure you've done the relevant import from sklearn to run this command.

Then fit the model you just created using the data in X and y that you saved above.

tree = DecisionTreeRegressor()
scores = cross_val_score(tree, x_train, y_train, cv=5)
print(f"{scores.mean():0.2f} (+/- {2 * scores.std():0.2f})")

tree = tree.fit(x_train, y_train)
print(f"Training R^2: {tree.score(x_train, y_train): 0.2f}")
print(f"Validation R^2: {tree.score(x_validate, y_validate):0.2f}")
0.53 (+/- 0.32)
Training R^2:  1.00
Validation R^2: 0.73

So our linear regression actually does better than the tree does. It looks like the tree might be overfitting on the training data.

Make Some Predictions

tree_predict = tree.predict(x_validate)
regression_predict = regression.predict(x_validate)

Step 4: Calculate the Mean Absolute Error in Validation Data

tree_mae = mean_absolute_error(y_true=y_validate, y_pred=tree_predict)
regression_mae = mean_absolute_error(y_true=y_validate, y_pred=regression_predict)

print(f"Tree MAE: {tree_mae: 0.2f}")
print(f"Regression MAE: {regression_mae: 0.2f}")
Tree MAE:  29853.86
Regression MAE:  27228.88

The tree's error is a little higher than the regression line's.

End

Next up is underfitting and overfitting.

A First Machine Learning Model

Beginning

This is part 2 of a re-do of the kaggle Intro to Machine Learning tutorial - Your First Machine Learning Model.

Imports

Python

from argparse import Namespace
from datetime import datetime
from functools import partial
from pathlib import Path

PyPi

from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.linear_model import ElasticNet, Lasso, LinearRegression, Ridge
from sklearn.tree import DecisionTreeRegressor

import hvplot.pandas
import pandas

Others

from graeae import EmbedHoloviews, EnvironmentLoader, Timer

Set Up

Plotting

SLUG = "a-first-machine-learning-model"
OUTPUT_PATH = Path("../../files/posts/tutorials/")/SLUG
Embed = partial(EmbedHoloviews, folder_path=OUTPUT_PATH)
Plot = Namespace(
    height=800,
    width=1000,
)

The Timer

TIMER = Timer()

Environment

ENVIRONMENT = EnvironmentLoader()

The Data

data = pandas.read_csv(
    Path(ENVIRONMENT["HOUSE-PRICES-IOWA"]).expanduser()/"train.csv")

Middle

Step 1: Specify Prediction Target

Select the target variable, which corresponds to the sales price. Save this to a new variable called `y`. You'll need to print a list of the columns to find the name of the column you need.

for column in sorted(data.columns):
    print(f"- {column} ({data[column].dtype})")
- 1stFlrSF (int64)
- 2ndFlrSF (int64)
- 3SsnPorch (int64)
- Alley (object)
- BedroomAbvGr (int64)
- BldgType (object)
- BsmtCond (object)
- BsmtExposure (object)
- BsmtFinSF1 (int64)
- BsmtFinSF2 (int64)
- BsmtFinType1 (object)
- BsmtFinType2 (object)
- BsmtFullBath (int64)
- BsmtHalfBath (int64)
- BsmtQual (object)
- BsmtUnfSF (int64)
- CentralAir (object)
- Condition1 (object)
- Condition2 (object)
- Electrical (object)
- EnclosedPorch (int64)
- ExterCond (object)
- ExterQual (object)
- Exterior1st (object)
- Exterior2nd (object)
- Fence (object)
- FireplaceQu (object)
- Fireplaces (int64)
- Foundation (object)
- FullBath (int64)
- Functional (object)
- GarageArea (int64)
- GarageCars (int64)
- GarageCond (object)
- GarageFinish (object)
- GarageQual (object)
- GarageType (object)
- GarageYrBlt (float64)
- GrLivArea (int64)
- HalfBath (int64)
- Heating (object)
- HeatingQC (object)
- HouseStyle (object)
- Id (int64)
- KitchenAbvGr (int64)
- KitchenQual (object)
- LandContour (object)
- LandSlope (object)
- LotArea (int64)
- LotConfig (object)
- LotFrontage (float64)
- LotShape (object)
- LowQualFinSF (int64)
- MSSubClass (int64)
- MSZoning (object)
- MasVnrArea (float64)
- MasVnrType (object)
- MiscFeature (object)
- MiscVal (int64)
- MoSold (int64)
- Neighborhood (object)
- OpenPorchSF (int64)
- OverallCond (int64)
- OverallQual (int64)
- PavedDrive (object)
- PoolArea (int64)
- PoolQC (object)
- RoofMatl (object)
- RoofStyle (object)
- SaleCondition (object)
- SalePrice (int64)
- SaleType (object)
- ScreenPorch (int64)
- Street (object)
- TotRmsAbvGrd (int64)
- TotalBsmtSF (int64)
- Utilities (object)
- WoodDeckSF (int64)
- YearBuilt (int64)
- YearRemodAdd (int64)
- YrSold (int64)

That is a huge number of features.

Our target is SalePrice.

Y = data.SalePrice

Step 2: Create X

Now you will create a DataFrame called `X` holding the predictive features.

Since you want only some columns from the original data, you'll first create a list with the names of the columns you want in `X`.

You'll use just the following columns in the list (you can copy and paste the whole list to save some typing, though you'll still need to add quotes):

  • LotArea
  • YearBuilt
  • 1stFlrSF
  • 2ndFlrSF
  • FullBath
  • BedroomAbvGr
  • TotRmsAbvGrd
FEATURES = [
    "1stFlrSF",
    "2ndFlrSF",
    "BedroomAbvGr",
    "FullBath",
    "LotArea",
    "TotRmsAbvGrd",
    "YearBuilt",
]
X = data[FEATURES]

Split up the data into training and validation sets.

x_train, x_validate, y_train, y_validate = train_test_split(X, Y, random_state=1)

Step 3: Specify and Fit Model

A Linear Regression Model

As a baseline, I'll fit a simple Linear Regression (ordinary-least-squares) model.

regression = LinearRegression()
scores = cross_val_score(regression, x_train, y_train, cv=5)
print(f"{scores.mean():0.2f} (+/- {2 * scores.std():0.2f})")
regression = regression.fit(x_train, y_train)
print(f"Training R^2: {regression.score(x_train, y_train): 0.2f}")
print(f"Validation R^2: {regression.score(x_validate, y_validate):0.2f}")
0.66 (+/- 0.17)
Training R^2:  0.68
Validation R^2: 0.77

Decision Tree

Create a DecisionTreeRegressor and save it as iowa_model. Ensure you've done the relevant import from sklearn to run this command.

Then fit the model you just created using the data in X and y that you saved above.

tree = DecisionTreeRegressor()
scores = cross_val_score(tree, x_train, y_train, cv=5)
print(f"{scores.mean():0.2f} (+/- {2 * scores.std():0.2f})")

tree = tree.fit(x_train, y_train)
print(f"Training R^2: {tree.score(x_train, y_train): 0.2f}")
print(f"Validation R^2: {tree.score(x_validate, y_validate):0.2f}")
0.54 (+/- 0.30)
Training R^2:  1.00
Validation R^2: 0.74

So our linear regression actually does better than the tree does. It looks like the tree might be overfitting on the training data.

Make Predictions

Make predictions with the model's predict command using X as the data. Save the results to a variable called predictions.

The whole data set

predictions_tree = tree.predict(X)
predictions_line = regression.predict(X)

x_y_tree = pandas.DataFrame(dict(predicted=predictions_tree, actual=Y))
x_y_line = pandas.DataFrame(dict(predicted=predictions_line, actual=Y))
ideal = pandas.DataFrame(dict(x=Y, y=Y))

tree_plot = x_y_tree.hvplot.scatter(x="actual", y="predicted", label="Decision Tree")
line_plot = x_y_line.hvplot.scatter(x="actual", y="predicted", label="Linear Regression")
ideal_plot = ideal.hvplot(x="x", y="y")
plot = (tree_plot * line_plot * ideal_plot).opts(title="Actual Vs Predictions",
                                    width=Plot.width,
                                    height=Plot.height)
source = Embed(plot=plot, file_name="actual_vs_predicted")()
print(source)
[Figure: Actual Vs Predictions - decision tree and linear regression vs the ideal line]

Despite the \(r^2\) values, the decision tree looks like it did better than the linear model here - although since these predictions include the data the tree was trained on, its near-perfect fit mostly reflects memorization.

The Validation Set

Think About Your Results

predictions_tree = tree.predict(x_validate)
predictions_line = regression.predict(x_validate)

x_y_tree = pandas.DataFrame(dict(predicted=predictions_tree, actual=y_validate))
x_y_line = pandas.DataFrame(dict(predicted=predictions_line, actual=y_validate))
ideal = pandas.DataFrame(dict(x=y_validate, y=y_validate))
tree_plot = x_y_tree.hvplot.scatter(x="actual", y="predicted", label="Decision Tree")
line_plot = x_y_line.hvplot.scatter(x="actual", y="predicted", label="Linear Regression")
ideal_plot = ideal.hvplot(x="x", y="y")
plot = (tree_plot * line_plot * ideal_plot).opts(title="Actual Vs Predictions (Validation Set)",
                                    width=Plot.width,
                                    height=Plot.height)
source = Embed(plot=plot, file_name="actual_vs_predicted_validation")()
print(source)
[Figure: Actual Vs Predictions (Validation Set)]

It's hard to see which model really does better here - based on the \(r^2\) value the linear model did better, but in the plot its predictions seem to drift away from the actual values as they get bigger.

Step 4: Calculate the Mean Absolute Error in Validation Data

# This was left as a blank in the exercise; a minimal completion using the tree
# and validation split from above. Note: it needs mean_absolute_error from
# sklearn.metrics, which isn't among this post's imports.
val_predictions = tree.predict(x_validate)
val_mae = mean_absolute_error(y_validate, val_predictions)
print(val_mae)

End

Sources

  • De Cock D. Ames, Iowa: Alternative to the Boston housing data as an end of semester regression project. Journal of Statistics Education. 2011 Nov 1;19(3). Link to PDF

Exploring Housing Data

Beginning

This is a re-do of the kaggle Intro to Machine Learning tutorial - Explore Your data.

Imports

Python

from argparse import Namespace
from datetime import datetime
from functools import partial
from pathlib import Path

PyPi

from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.linear_model import ElasticNet, Lasso, LinearRegression, Ridge
from sklearn.tree import DecisionTreeRegressor

import hvplot.pandas
import pandas

Others

from graeae import EmbedHoloviews, EnvironmentLoader, Timer

Set Up

Plotting

SLUG = "exploring-housing-data"
OUTPUT_PATH = Path("../../files/posts/tutorials/")/SLUG
Embed = partial(EmbedHoloviews, folder_path=OUTPUT_PATH)
Plot = Namespace(
    height=800,
    width=1000,
)

The Timer

TIMER = Timer()

Environment

ENVIRONMENT = EnvironmentLoader()

The Data

data = pandas.read_csv(
    Path(ENVIRONMENT["HOUSE-PRICES-IOWA"]).expanduser()/"train.csv")

Middle

Summary Statistics

print(data.describe())
                Id   MSSubClass  LotFrontage        LotArea  OverallQual  \
count  1460.000000  1460.000000  1201.000000    1460.000000  1460.000000   
mean    730.500000    56.897260    70.049958   10516.828082     6.099315   
std     421.610009    42.300571    24.284752    9981.264932     1.382997   
min       1.000000    20.000000    21.000000    1300.000000     1.000000   
25%     365.750000    20.000000    59.000000    7553.500000     5.000000   
50%     730.500000    50.000000    69.000000    9478.500000     6.000000   
75%    1095.250000    70.000000    80.000000   11601.500000     7.000000   
max    1460.000000   190.000000   313.000000  215245.000000    10.000000   

       OverallCond    YearBuilt  YearRemodAdd   MasVnrArea   BsmtFinSF1  ...  \
count  1460.000000  1460.000000   1460.000000  1452.000000  1460.000000  ...   
mean      5.575342  1971.267808   1984.865753   103.685262   443.639726  ...   
std       1.112799    30.202904     20.645407   181.066207   456.098091  ...   
min       1.000000  1872.000000   1950.000000     0.000000     0.000000  ...   
25%       5.000000  1954.000000   1967.000000     0.000000     0.000000  ...   
50%       5.000000  1973.000000   1994.000000     0.000000   383.500000  ...   
75%       6.000000  2000.000000   2004.000000   166.000000   712.250000  ...   
max       9.000000  2010.000000   2010.000000  1600.000000  5644.000000  ...   

        WoodDeckSF  OpenPorchSF  EnclosedPorch    3SsnPorch  ScreenPorch  \
count  1460.000000  1460.000000    1460.000000  1460.000000  1460.000000   
mean     94.244521    46.660274      21.954110     3.409589    15.060959   
std     125.338794    66.256028      61.119149    29.317331    55.757415   
min       0.000000     0.000000       0.000000     0.000000     0.000000   
25%       0.000000     0.000000       0.000000     0.000000     0.000000   
50%       0.000000    25.000000       0.000000     0.000000     0.000000   
75%     168.000000    68.000000       0.000000     0.000000     0.000000   
max     857.000000   547.000000     552.000000   508.000000   480.000000   

          PoolArea       MiscVal       MoSold       YrSold      SalePrice  
count  1460.000000   1460.000000  1460.000000  1460.000000    1460.000000  
mean      2.758904     43.489041     6.321918  2007.815753  180921.195890  
std      40.177307    496.123024     2.703626     1.328095   79442.502883  
min       0.000000      0.000000     1.000000  2006.000000   34900.000000  
25%       0.000000      0.000000     5.000000  2007.000000  129975.000000  
50%       0.000000      0.000000     6.000000  2008.000000  163000.000000  
75%       0.000000      0.000000     8.000000  2009.000000  214000.000000  
max     738.000000  15500.000000    12.000000  2010.000000  755000.000000  

[8 rows x 38 columns]

What is the average lot size (rounded to nearest integer)?

print(f"{data.LotArea.mean().round().astype(int):,}")
10,517

As of today, how old is the newest home (current year - the date in which it was built)?

print(f"{datetime.now().year - data.YearBuilt.max()} years")
10 years

Think About Your Data

The newest house in your data isn't that new. A few potential explanations for this:

  1. They haven't built new houses where this data was collected.
  2. The data was collected a long time ago. Houses built after the data publication wouldn't show up.

If the reason is explanation #1 above, does that affect your trust in the model you build with this data? What about if it is reason #2?

If they haven't built new homes but the data is current, then that's not a problem; the problem would come if the data is old.

How could you dig into the data to see which explanation is more plausible?

print(data.YrSold.max())
2010

The fact that there haven't been any sales in the last ten years lends credence to the idea that the dataset is kind of old.
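
Another way to dig in, sketched below, is to check how many houses in the data were built in the last few years it covers - if construction had simply stopped, recent sale years would still show up even though recent build years wouldn't:

# Count the houses built in each of the last few years covered by the data.
print(data[data.YearBuilt >= 2005].YearBuilt.value_counts().sort_index())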

End

Sources

  • De Cock D. Ames, Iowa: Alternative to the Boston housing data as an end of semester regression project. Journal of Statistics Education. 2011 Nov 1;19(3). Link to PDF

Advanced Uses Of SHAP Exercise

Beginning

These are my notes on the Advanced Uses of SHAP Exercise that's part of the Machine Learning Explainability tutorial on kaggle.

Imports

Python

from pathlib import Path

PyPi

from matplotlib import pyplot
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split, RandomizedSearchCV

import numpy
import pandas
import seaborn
import shap

Others

from graeae import EnvironmentLoader, Timer

Set Up

Plotting

SLUG = "advanced-uses-of-shap-exercise"
OUTPUT_PATH = Path("../../files/posts/tutorials")/SLUG
if not OUTPUT_PATH.is_dir():
    OUTPUT_PATH.mkdir()

get_ipython().run_line_magic('matplotlib', 'inline')
get_ipython().run_line_magic('config', "InlineBackend.figure_format = 'retina'")
FIGURE_SIZE = (20, 10)
seaborn.set(style="whitegrid",
            rc={"axes.grid": False,
                "font.family": ["sans-serif"],
                "font.sans-serif": ["Open Sans", "Latin Modern Sans", "Lato"],
                "font.size": 22,
                "figure.figsize": FIGURE_SIZE},
            font_scale=1)

The Timer

TIMER = Timer()

The Environment

ENVIRONMENT = EnvironmentLoader()

The Data

Once again we're going to use the Medical Data and Hospital Readmissions dataset from kaggle.

path = Path(ENVIRONMENT["HOSPITAL-READMISSIONS"]).expanduser()
data = pandas.read_csv(path)

We are only going to use a subset of the features.

FEATURES = [
    'A1Cresult_None',
    'diabetesMed_Yes',
    'diag_1_414', 
    'diag_1_428',
    'gender_Female',
    'medical_specialty_?',
    'num_lab_procedures', 
    'num_medications',
    'num_procedures',
    'number_diagnoses',
    'number_emergency', 
    'number_inpatient',
    'number_outpatient',
    'payer_code_?',
    'time_in_hospital',
]

Now we'll split up the features and the target. According to the notebook some versions of shap don't work when you combine numeric and boolean types so we'll convert all the features to floats.

X = data[FEATURES].astype(float)
Y = data.readmitted

Now the training and validation split.

x_train, x_validation, y_train, y_validation = train_test_split(X, Y, random_state=1)

Middle

Build the Model

For some reason the notebook switches from a classifier to a regressor, even though this seems to be a classification problem. Since I don't have an explanation for it I'll try it using a classifier instead.

# model = RandomForestRegressor()
model = RandomForestClassifier()
estimators = list(range(30, 300, 10))
max_depth = list(range(10, 100, 10)) + [None]

grid = dict(n_estimators=estimators,
            max_depth=max_depth)
search = RandomizedSearchCV(estimator=model,
                            param_distributions=grid,
                            n_iter=40,
                            n_jobs=-1,
                            random_state=1)
with TIMER:
    search.fit(x_train, y_train)
model = search.best_estimator_
print(f"CV Training Accuracy: {search.best_score_:0.2f}")
print(f"Training Accuracy: {model.score(x_train, y_train): 0.2f}")
print(f"Validation Accuracy: {model.score(x_validation, y_validation):0.2f}")
print(search.best_params_)
CV Training Accuracy: 0.63
Training Accuracy:  0.70
Validation Accuracy: 0.63
{'n_estimators': 140, 'max_depth': 10}

Shap Values

READMITTED = 1
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(x_validation)

Summary Plot

shap.summary_plot(shap_values[READMITTED], x_validation)
figure = pyplot.gcf()
output = "shap_summary.png"
#figure.subplots_adjust(left=0.3)
figure.savefig(OUTPUT_PATH/output)
print(f"[[file:{output}]]")


Question 1

Which of the following features has a bigger range of effects on predictions (i.e. larger difference between most positive and most negative effect)?

  • diag_1_428 or
  • payer_code_?

diag_1_428 appears to have a bigger spread than payer_code.
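
To double-check that reading of the summary plot, here's a small sketch (re-using the shap_values computed above) that measures the range of effects for each of the two features directly:

# Range of SHAP effects (most positive minus most negative) for the two features.
for feature in ("diag_1_428", "payer_code_?"):
    column = list(x_validation.columns).index(feature)
    effects = shap_values[READMITTED][:, column]
    print(f"{feature}: {effects.max() - effects.min():0.3f}")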

Question 2

Do you believe the range of effects sizes (distance between smallest effect and largest effect) is a good indication of which feature will have a higher permutation importance? Why or why not?

If the range of effect sizes measures something different from permutation importance: which is a better answer for the question "Which of these two features does the model say is more important for us to understand when discussing readmission risks in the population?"


I don't think that the range of effect size is a good indication of which feature will have a higher permutation importance, because it just indicates the spread of the values, not their overall effect.

I think that diag_1_428 is more important because it has a higher Feature Value than payer_code_?.

Note: Although the question says "which of these two features does the model say is more important", the answer given is that permutation importance is a better measure of what's important to the model, since it's a robust measurement, while the range of effects is influenced by outliers.
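
This post doesn't actually compute permutation importance, but a quick sketch using scikit-learn's permutation_importance (an extra import, not used above) would let you check the claim for these two features:

# Permutation importance on the validation set for the two features in question.
from sklearn.inspection import permutation_importance

result = permutation_importance(model, x_validation, y_validation,
                                n_repeats=5, random_state=1)
for feature in ("diag_1_428", "payer_code_?"):
    column = list(x_validation.columns).index(feature)
    print(f"{feature}: {result.importances_mean[column]:0.4f}")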

Question 3

Both diag_1_428 and payer_code_? are binary variables, taking values of 0 or 1.

From the graph, which do you think would typically have a bigger impact on predicted readmission risk:

  • Changing diag_1_428 from 0 to 1
  • Changing payer_code_? from 0 to 1

I think that changing diag_1_428 to 1 would have a bigger impact, since the dots are more heavily right skewed for it than is the case with payer_code_?.

Question 4

Some features (like number_inpatient) have reasonably clear separation between the blue and pink dots. Other variables like num_lab_procedures have blue and pink dots jumbled together, even though the SHAP values (or impacts on prediction) aren't all 0.

What do you think you learn from the fact that num_lab_procedures has blue and pink dots jumbled together?

I think that this means that it is more difficult to make predictions based on the num_lab_procedures - there isn't a clear separation of outcomes based on the value of num_lab_procedures. The feature probably has interacting effects with other features.

Question 5

Consider the following SHAP contribution dependence plot.

The x-axis shows feature_of_interest and the points are colored based on other_feature.

[Figure: SHAP dependence contribution plot - feature_of_interest on the x-axis, points colored by other_feature]

Is there an interaction between feature_of_interest and other_feature?

If so, does feature_of_interest have a more positive impact on predictions when other_feature is high or when other_feature is low?

It looks like feature_of_interest has a more positive impact on predictions when other_feature is high.

Question 6

Both num_medications and num_lab_procedures share that jumbling of pink and blue dots.

Aside from num_medications having effects of greater magnitude (both more positive and more negative), it's hard to see a meaningful difference between how these two features affect readmission risk. Create the SHAP dependence contribution plots for each variable, and describe what you think is different between how these two variables affect predictions.

figure, axe = pyplot.subplots(figsize=FIGURE_SIZE)
output = "num_medications.png"
shap.dependence_plot("num_medications", shap_values[READMITTED], x_validation,
                     ax=axe)
figure.savefig(OUTPUT_PATH/output)
print(f"[[file:{output}]]")


figure, axe = pyplot.subplots(figsize=FIGURE_SIZE)
output = "num_lab_procedures.png"
shap.dependence_plot("num_lab_procedures", shap_values[READMITTED], x_validation,
                     interaction_index="num_medications", ax=axe)
figure.savefig(OUTPUT_PATH/output)
print(f"[[file:{output}]]")


The SHAP value for num_medications appears to peak at around 20 medications and then starts to go down as num_medications goes up, while num_lab_procedures makes a more gradual climb and doesn't appear to have this inverted-curve shape.

End

Advanced Uses Of SHAP

Beginning

This is a re-do of the Kaggle tutorial on Advanced Uses of SHAP Values. SHAP values let us see how much a given feature changed our prediction for a given row of data, but they can also be used to get a bigger-picture look at our model.

Imports

Python

from pathlib import Path

PyPi

from matplotlib import pyplot
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier

import numpy
import pandas
import seaborn
import shap

Others

from graeae import EnvironmentLoader

Set Up

Plotting

SLUG = "advanced-uses-of-shap"
OUTPUT_PATH = Path("../../files/posts/tutorials")/SLUG
if not OUTPUT_PATH.is_dir():
    OUTPUT_PATH.mkdir()

get_ipython().run_line_magic('matplotlib', 'inline')
get_ipython().run_line_magic('config', "InlineBackend.figure_format = 'retina'")
FIGURE_SIZE = (20, 10)
seaborn.set(style="whitegrid",
            rc={"axes.grid": False,
                "font.family": ["sans-serif"],
                "font.sans-serif": ["Open Sans", "Latin Modern Sans", "Lato"],
                "font.size": 22,
                "figure.figsize": FIGURE_SIZE},
            font_scale=1)

The Environment

ENVIRONMENT = EnvironmentLoader()

The Dataset

path = Path(ENVIRONMENT["FIFA-2018"]).expanduser()
data = pandas.read_csv(path)

y = data["Man of the Match"] == "Yes"

FEATURES = [column for column in data.columns
            if data[column].dtype == numpy.int64]
x = data[FEATURES]
x_train, x_validation, y_train, y_validation = train_test_split(
    x, y, random_state=1)
model = RandomForestClassifier(random_state=0).fit(x_train, y_train)

Middle

Setup the SHAP Values

explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(x_validation)

Summary Plots

Like Permutation Importance, SHAP Summary Plots show you the relative importance of different features for your model, but unlike Permutation Importance, Summary Plots also show you the direction and spread of each feature's effect on the predictions.

MAN_OF_THE_MATCH = 1
shap.summary_plot(shap_values[MAN_OF_THE_MATCH], x_validation, show=False)
figure = pyplot.gcf()
# figure.set_size_inches(FIGURE_SIZE)
output = "shap_summary.png"
figure.subplots_adjust(left=0.3)
figure.savefig(OUTPUT_PATH/output)
print(f"[[file:{output}]]")


As with permutation importance, Goal Scored was the most important feature, with a greater impact on not being the man of the match than being the man of the match. This looks sort of like a combination of Permutation importance and Partial Dependence Plots (PDP) except all in one place.
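
For a view that's closer to a plain importance ranking (and so easier to line up with permutation importance), the summary plot can also be drawn as a bar chart of mean absolute SHAP values - a small sketch, assuming the same explainer output as above:

# Bar version of the summary plot: mean(|SHAP value|) per feature.
shap.summary_plot(shap_values[MAN_OF_THE_MATCH], x_validation,
                  plot_type="bar", show=False)
figure = pyplot.gcf()
figure.savefig(OUTPUT_PATH/"shap_summary_bar.png")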

SHAP Dependence Contribution Plots

Dependence Contribution Plots work much like PDP plots, but with more detail. Rather than showing you the means, the Dependence Contribution Plot shows you the individual rows, so you can get a sense of the shape of the outcomes.

figure, axe = pyplot.subplots(figsize=FIGURE_SIZE)
output = "ball_possession_dcp.png"
shap.dependence_plot("Ball Possession %", shap_values[MAN_OF_THE_MATCH], x_validation,
                     interaction_index="Goal Scored", ax=axe)
figure.savefig(OUTPUT_PATH/output)
print(f"[[file:{output}]]")


Interestingly, our plot doesn't look like the one in the tutorial; we appear to have far fewer points and less spread in the values. In any case, the way to interpret it is that as you move from left to right you increase the percentage of the time that the team had the ball, and as you move up you increase the likelihood that the team had a man of the match. Additionally, the colors tell you how many goals the team scored.

Looking closer, it looks like the tutorial went with the entire dataset rather than just the validation set, probably because there are so few data points. While that does reveal a more interesting pattern, I'm not sure that that's what you would do, normally.

Anyway, this plot shows that having a very low ball possession percentage decreases the likelihood of having the man of the match, and, generally speaking, as possession goes up so does that likelihood, although it seems to level off at around 45%.
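
To check the hunch that the tutorial used the whole dataset, here's a sketch that recomputes the SHAP values over all of x rather than just the validation set and redraws the same dependence plot:

# SHAP values over the full feature set, to mimic the tutorial's (denser) plot.
shap_values_all = explainer.shap_values(x)
figure, axe = pyplot.subplots(figsize=FIGURE_SIZE)
shap.dependence_plot("Ball Possession %", shap_values_all[MAN_OF_THE_MATCH], x,
                     interaction_index="Goal Scored", ax=axe)
figure.savefig(OUTPUT_PATH/"ball_possession_dcp_all.png")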

End

SHAP Values Exercise

Beginning

This is the SHAP Exercise from the kaggle Machine Learning Explainability Tutorial. It's using the kaggle Hospital Readmissions dataset (I think).

The Scenario

A hospital has struggled with "readmissions," where they release a patient before the patient has recovered enough, and the patient returns with health complications.

The hospital wants your help identifying patients at highest risk of being readmitted. Doctors (rather than your model) will make the final decision about when to release each patient; but they hope your model will highlight issues the doctors should consider when releasing a patient.

The hospital has given you relevant patient medical information.

Imports

Python

from argparse import Namespace
from functools import partial
from pathlib import Path

import random

PyPi

from eli5.sklearn import PermutationImportance
from matplotlib import pyplot
from pdpbox import pdp
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split, RandomizedSearchCV
from tabulate import tabulate

import eli5
import hvplot.pandas
import pandas
import shap

Others

from graeae import EmbedHoloviews, EnvironmentLoader, Timer

Set Up

Plotting

SLUG = "shap-values-exercise"
OUTPUT_PATH = Path("../../files/posts/tutorials/")/SLUG
if not OUTPUT_PATH.is_dir():
    OUTPUT_PATH.mkdir()

Plot = Namespace(
    height=800,
    width=1000,
)
Embed = partial(EmbedHoloviews, folder_path=OUTPUT_PATH)

The Table Printer

TABLE = partial(tabulate, tablefmt="orgtbl", headers="keys", showindex=False)

The Timer

TIMER = Timer()

The Environment

ENVIRONMENT = EnvironmentLoader()

The Data

PATH = Path(ENVIRONMENT["HOSPITAL-READMISSIONS"]).expanduser()
assert PATH.is_file()
data = pandas.read_csv(PATH)

Middle

Looking At the Data

for column in sorted(data.columns):
    print(f" - {column}")
  • A1Cresult_None
  • acarbose_No
  • acetohexamide_No
  • age_[40-50)
  • age_[50-60)
  • age_[60-70)
  • age_[70-80)
  • age_[80-90)
  • change_No
  • chlorpropamide_No
  • citoglipton_No
  • diabetesMed_Yes
  • diag_1_414
  • diag_1_428
  • diag_1_786
  • diag_2_250
  • diag_2_276
  • diag_2_427
  • diag_2_428
  • diag_3_250
  • diag_3_276
  • diag_3_401
  • diag_3_428
  • examide_No
  • gender_Female
  • glimepiride-pioglitazone_No
  • glimepiride_No
  • glipizide-metformin_No
  • glipizide_No
  • glyburide-metformin_No
  • glyburide_No
  • insulin_No
  • max_glu_serum_None
  • medical_specialty_?
  • medical_specialty_Cardiology
  • medical_specialty_Emergency/Trauma
  • medical_specialty_Family/GeneralPractice
  • medical_specialty_InternalMedicine
  • metformin-pioglitazone_No
  • metformin-rosiglitazone_No
  • metformin_No
  • miglitol_No
  • nateglinide_No
  • num_lab_procedures
  • num_medications
  • num_procedures
  • number_diagnoses
  • number_emergency
  • number_inpatient
  • number_outpatient
  • payer_code_?
  • payer_code_BC
  • payer_code_HM
  • payer_code_MC
  • payer_code_SP
  • pioglitazone_No
  • race_AfricanAmerican
  • race_Caucasian
  • readmitted
  • repaglinide_No
  • rosiglitazone_No
  • time_in_hospital
  • tolazamide_No
  • tolbutamide_No
  • troglitazone_No

So there are a lot of columns.

Here are some quick hints at interpreting the field names:

  • Your prediction target is readmitted
  • Columns with the word diag indicate the diagnostic code of the illness or illnesses the patient was admitted with. For example, diag_1_428 means the doctor said their first illness diagnosis is number "428". What illness does 428 correspond to? You could look it up in a codebook, but without more medical background it wouldn't mean anything to you anyway.
  • A column name like glimepiride_No means the patient did not have the medicine glimepiride. If this feature had a value of False, then the patient did take the drug glimepiride
  • Features whose names begin with medical_specialty describe the specialty of the doctor seeing the patient. The values in these fields are all True or False.
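
A quick sanity check on those hints is to look at the values a couple of the indicator-style columns actually take - a sketch:

# Confirm that the indicator-style columns really are binary.
print(data.diag_1_428.unique())
print(data["medical_specialty_?"].unique())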

Set Up X and Y

y = data.readmitted
TARGET = "readmitted"
FEATURES = data.columns[data.columns != TARGET]

X = data[FEATURES]

Split the Data

x_train, x_validate, y_train, y_validate = train_test_split(X, y,
                                                            random_state=1)

A Simple Model

You have built a simple model, but the doctors say they don't know how to evaluate a model, and they'd like you to show them some evidence the model is doing something in line with their medical intuition. Create any graphics or tables that will show them a quick overview of what the model is doing.

They are very busy. So they want you to condense your model overview into just 1 or 2 graphics, rather than a long string of graphics.

Train the Model

with TIMER:
    model_1 = RandomForestClassifier(n_estimators=30, random_state=1).fit(
        x_train, y_train)

print(f"Training R^2: {model_1.score(x_train, y_train): 0.2f}")
print(f"Validation R^2: {model_1.score(x_validate, y_validate):0.2f}")
2020-02-14 14:02:08,392 graeae.timers.timer start: Started: 2020-02-14 14:02:08.391737
2020-02-14 14:02:09,011 graeae.timers.timer end: Ended: 2020-02-14 14:02:09.011136
2020-02-14 14:02:09,011 graeae.timers.timer end: Elapsed: 0:00:00.619399
Training Accuracy:  1.00
Validation Accuracy: 0.60
model_2 = RandomForestClassifier()

estimators = list(range(50, 300, 10))
max_depth = list(range(10, 100, 10)) + [None]

grid = dict(n_estimators=estimators,
            max_depth=max_depth)
search = RandomizedSearchCV(estimator=model_2,
                            param_distributions=grid,
                            n_iter=40,
                            n_jobs=-1,
                            random_state=1)
with TIMER:
    search.fit(x_train, y_train)
first_model = search.best_estimator_
print(f"CV Training R^2: {search.best_score_:0.2f}")
print(f"Training R^2: {first_model.score(x_train, y_train): 0.2f}")
print(f"Validation R^2: {first_model.score(x_validate, y_validate):0.2f}")
print(search.best_params_)
2020-02-14 14:02:10,418 graeae.timers.timer start: Started: 2020-02-14 14:02:10.418784
2020-02-14 14:04:09,802 graeae.timers.timer end: Ended: 2020-02-14 14:04:09.802077
2020-02-14 14:04:09,802 graeae.timers.timer end: Elapsed: 0:01:59.383293
CV Training Accuracy: 0.63
Training Accuracy:  0.70
Validation Accuracy: 0.63
{'n_estimators': 90, 'max_depth': 10}

We get a slight improvement with a much more complex model, but not a lot. Our validation accuracy is nearly as good as the training accuracy, so it looks like we aren't overfitting.

Permutation Importance

permutor = PermutationImportance(first_model, random_state=1).fit(
        x_validate, y_validate)
ipython_html = eli5.show_weights(permutor,
                                 feature_names=x_validate.columns.tolist())
table = pandas.read_html(ipython_html.data)[0]
print(TABLE(table))
Weight Feature
0.0640 ± 0.0071 number_inpatient
0.0108 ± 0.0046 number_emergency
0.0084 ± 0.0058 number_outpatient
0.0025 ± 0.0034 number_diagnoses
0.0021 ± 0.0010 diabetesMed_Yes
0.0020 ± 0.0017 payer_code_?
0.0019 ± 0.0015 race_AfricanAmerican
0.0015 ± 0.0015 num_lab_procedures
0.0013 ± 0.0008 age_[80-90)
0.0011 ± 0.0030 diag_1_428
0.0011 ± 0.0023 medical_specialty_?
0.0010 ± 0.0012 payer_code_HM
0.0008 ± 0.0043 num_procedures
0.0008 ± 0.0012 payer_code_BC
0.0008 ± 0.0008 age_[40-50)
0.0008 ± 0.0010 max_glu_serum_None
0.0008 ± 0.0015 race_Caucasian
0.0005 ± 0.0032 num_medications
0.0005 ± 0.0022 diag_2_250
0.0004 ± 0.0006 rosiglitazone_No
… 44 more … … 44 more …

The most important features weren't in the data description. What is number_inpatient?
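
Since the data description doesn't cover it, one quick way to get a feel for number_inpatient (presumably a count of prior inpatient visits) is to look at its distribution and the raw readmission rate at each value - a sketch using the full data frame:

# Distribution of number_inpatient and the raw readmission rate for each of its values.
print(data.number_inpatient.describe())
print(data.groupby("number_inpatient").readmitted.mean())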

Partial Dependence Plot

Since the most important feature is number_inpatient, let's see how much it changes the re-admissions.

FEATURE = "number_inpatient"
pdp_dist = pdp.pdp_isolate(model=first_model,
                           dataset=x_validate,
                           model_features=FEATURES,
                           feature=FEATURE)
pdp.pdp_plot(pdp_dist, FEATURE)
output = f"{FEATURE}_pdp_plot.png"
figure = pyplot.gcf()
figure.subplots_adjust(top=0.2)
figure.savefig(OUTPUT_PATH/output)
print(f"[[file:{output}]]")


SHAP Values

random.seed(0)
ROW = random.randrange(len(x_validate))
row_data = x_validate.iloc[ROW]
row_data_matrix = row_data.values.reshape(1, -1)
print(first_model.classes_)
print(first_model.predict_proba(row_data_matrix))
[0 1]
[[0.49342394 0.50657606]]
explainer = shap.TreeExplainer(first_model)
shap_values = explainer.shap_values(row_data_matrix)
READMIT = 1
figure = shap.force_plot(explainer.expected_value[READMIT],
                         shap_values[READMIT],
                         row_data_matrix,
                         feature_names=FEATURES,
                         matplotlib=True, show=False)
filename = "shap_zero.png"

figure.savefig(OUTPUT_PATH/filename)
print(f"[[file:{filename}]]")


  • Try one where num_inpatients was 1
    row_data = x_validate[x_validate.number_inpatient==1].sample().iloc[0]
    row_data_matrix = row_data.values.reshape(1, -1)
    print(first_model.classes_)
    print(first_model.predict_proba(row_data_matrix))
    
    [0 1]
    [[0.36731038 0.63268962]]
    
    explainer = shap.TreeExplainer(first_model)
    shap_values = explainer.shap_values(row_data_matrix)
    figure = shap.force_plot(explainer.expected_value[READMIT],
                             shap_values[READMIT],
                             row_data_matrix,
                             feature_names=FEATURES,
                             matplotlib=True, show=False)
    filename = "shap_one.png"
    
    figure.savefig(OUTPUT_PATH/filename)
    print(f"[[file:{filename}]]")
    


  • Try one where num_inpatients was 2
    INPATIENTS = 2
    row_data = x_validate[x_validate.number_inpatient==INPATIENTS].sample().iloc[0]
    row_data_matrix = row_data.values.reshape(1, -1)
    print(first_model.classes_)
    print(first_model.predict_proba(row_data_matrix))
    
    [0 1]
    [[0.38485607 0.61514393]]
    
    explainer = shap.TreeExplainer(first_model)
    shap_values = explainer.shap_values(row_data_matrix)
    figure = shap.force_plot(explainer.expected_value[READMIT],
                             shap_values[READMIT],
                             row_data_matrix,
                             feature_names=FEATURES,
                             matplotlib=True, show=False)
    filename = "shap_two.png"
    
    figure.savefig(OUTPUT_PATH/filename)
    print(f"[[file:{filename}]]")
    


  • Try one where num_inpatients was 3
    INPATIENTS = 3
    row_data = x_validate[x_validate.number_inpatient==INPATIENTS].sample().iloc[0]
    row_data_matrix = row_data.values.reshape(1, -1)
    print(first_model.classes_)
    print(first_model.predict_proba(row_data_matrix))
    
    [0 1]
    [[0.47141351 0.52858649]]
    
    explainer = shap.TreeExplainer(first_model)
    shap_values = explainer.shap_values(row_data_matrix)
    figure = shap.force_plot(explainer.expected_value[READMIT],
                             shap_values[READMIT],
                             row_data_matrix,
                             feature_names=FEATURES,
                             matplotlib=True, show=False)
    filename = "shap_three.png"
    
    figure.savefig(OUTPUT_PATH/filename)
    print(f"[[file:{filename}]]")
    


  • Try one where num_inpatients was the Maximum
    INPATIENTS = x_validate.number_inpatient.max()
    row_data = x_validate[x_validate.number_inpatient==INPATIENTS].sample().iloc[0]
    row_data_matrix = row_data.values.reshape(1, -1)
    print(first_model.classes_)
    print(first_model.predict_proba(row_data_matrix))
    
    [0 1]
    [[0.19208238 0.80791762]]
    
    explainer = shap.TreeExplainer(first_model)
    shap_values = explainer.shap_values(row_data_matrix)
    figure = shap.force_plot(explainer.expected_value[READMIT],
                             shap_values[READMIT],
                             row_data_matrix,
                             feature_names=FEATURES,
                             matplotlib=True, show=False)
    filename = "shap_max.png"
    
    figure.savefig(OUTPUT_PATH/filename)
    print(f"[[file:{filename}]]")
    


    So it seems that the greater number_inpatient is, the more it contributes to re-admission (although note that since we're only looking at one row at a time, the cases with small values don't always look like what I plotted above - it depends on the patient).
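
Since a single sampled row can be misleading, here's a sketch that averages the SHAP contribution of number_inpatient over the whole validation set, grouped by its value (it re-uses the explainer defined above):

# Mean SHAP contribution of number_inpatient at each of its values, across the validation set.
all_shap = explainer.shap_values(x_validate)
column = list(x_validate.columns).index("number_inpatient")
contributions = pandas.Series(all_shap[READMIT][:, column], index=x_validate.index)
print(contributions.groupby(x_validate.number_inpatient).mean())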

Time In Hospital

The doctors think it's a good sign that increasing the number of inpatient procedures leads to increased predictions. But they can't tell from this plot whether that change in the plot is big or small. They'd like you to create something similar for time_in_hospital to see how that compares.

FEATURE = "time_in_hospital"
pdp_dist = pdp.pdp_isolate(model=first_model,
                           dataset=x_validate,
                           model_features=FEATURES,
                           feature=FEATURE)
pdp.pdp_plot(pdp_dist, FEATURE)
output = f"{FEATURE}_pdp_plot.png"
pyplot.savefig(OUTPUT_PATH/output)
print(f"[[file:{output}]]")


TIME_IN_HOSPITAL = 8

row_data = x_validate[x_validate.time_in_hospital==TIME_IN_HOSPITAL].sample().iloc[0]
row_data_matrix = row_data.values.reshape(1, -1)
print(first_model.classes_)
print(first_model.predict_proba(row_data_matrix))
[0 1]
[[0.39445183 0.60554817]]
explainer = shap.TreeExplainer(first_model)
shap_values = explainer.shap_values(row_data_matrix)
figure = shap.force_plot(explainer.expected_value[READMIT],
                         shap_values[READMIT],
                         row_data_matrix,
                         feature_names=FEATURES,
                         matplotlib=True, show=False)
filename = "shap_hospital_one.png"

figure.savefig(OUTPUT_PATH/filename)
print(f"[[file:{filename}]]")


It looks like time in hospital has an effect - but it is a small one.

Raw Readmissions

Whoa! It seems like time_in_hospital doesn't matter at all. The difference between the lowest value on the partial dependence plot and the highest value is about 5%.

If that is what your model concluded, the doctors will believe it. But it seems so low. Could the data be wrong, or is your model doing something more complex than they expect?

They'd like you to show them the raw readmission rate for each value of time_in_hospital to see how it compares to the partial dependence plot.

training = pandas.concat([x_train, y_train], axis="columns")
grouped = training.groupby(["time_in_hospital"]).agg({"readmitted": "mean"})
plot = grouped.hvplot.bar().opts(
    title="Readmission Rates for time_in_hospital",
    width=Plot.width,
    height=Plot.height
)

source = Embed(plot=plot, file_name="time_in_hospital_readmission_rates")()
print(source)
[Figure: Readmission Rates for time_in_hospital]

It sort of looks like time_in_hospital does affect readmission rates, just not as much as the number of admissions, I guess.
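
It might also help to see how many patients fall into each time_in_hospital bucket, since sparsely populated bars are noisier - a sketch using the training frame built above:

# Readmission rate and sample size for each time_in_hospital value.
print(training.groupby("time_in_hospital").readmitted.agg(["mean", "count"]))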

SHAP Creator

Now the doctors are convinced you have the right data, and the model overview looked reasonable. It's time to turn this into a finished product they can use. Specifically, the hospital wants you to create a function patient_risk_factors that does the following

  • Takes a single row with patient data (of the same format as your raw data)
  • Creates a visualization showing what features of that patient increased their risk of readmission, what features decreased it, and how much those features mattered.

It's not important to show every feature with every minuscule impact on the readmission risk. It's fine to focus on only the most important features for that patient.

def patient_risk_factors(row_data: pandas.Series, tag: str):
    row_data_matrix = row_data.values.reshape(1, -1)
    explainer = shap.TreeExplainer(first_model)
    shap_values = explainer.shap_values(row_data_matrix)
    figure = shap.force_plot(explainer.expected_value[READMIT],
                             shap_values[READMIT],
                             row_data_matrix,
                             feature_names=FEATURES,
                             matplotlib=True, show=False)
    filename = f"shap_{tag}.png"
    figure.savefig(OUTPUT_PATH/filename)
    print(f"[[file:{filename}]]")
    return row_data_matrix
row = x_validate.sample()
row_matrix = patient_risk_factors(row, "random_one")


print(first_model.predict_proba(row_matrix))
[[0.70043997 0.29956003]]

In this case it looks like the number of diagnoses was the most important - having many diagnoses with no hospital admissions predicts you won't be readmitted. Should people who were never admitted be considered? Perhaps "readmission" just means admitted later.
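
One way to poke at that question is to compare the raw readmission rate for patients with no prior inpatient visits against everyone else - a sketch:

# Readmission rate for patients with zero prior inpatient visits vs. at least one.
print(data.groupby(data.number_inpatient == 0).readmitted.mean())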

End


SHAP Values

Beginning

SHAP values interpret the impact of a certain value for a given feature when compared to the prediction you'd make if that feature instead took a baseline value. This helps us interpret predictions given specific values for our features. We'll do this using the SHAP library, naturally.

Imports

Python

from argparse import Namespace
from pathlib import Path

PyPi

from matplotlib import pyplot
from sklearn.model_selection import train_test_split, RandomizedSearchCV
from sklearn.ensemble import RandomForestClassifier

import numpy
import pandas
import seaborn
import shap

Others

from graeae import EnvironmentLoader, Timer

Set Up

Plotting

SLUG = "shap-values"
OUTPUT_PATH = Path("../../files/posts/tutorials/")/SLUG
get_ipython().run_line_magic('matplotlib', 'inline')
get_ipython().run_line_magic('config', "InlineBackend.figure_format = 'retina'")

The Timer

TIMER = Timer()

Environment

ENVIRONMENT = EnvironmentLoader()

The Data

data = pandas.read_csv(Path(ENVIRONMENT["FIFA-2018"]).expanduser())
y = data["Man of the Match"] == "Yes"
FEATURES = [column for column in data.columns if data[column].dtype == numpy.int64]
X = data[FEATURES]
x_train, x_validate, y_train, y_validate = train_test_split(X, y, random_state=1)


model = RandomForestClassifier()

estimators = list(range(50, 200, 10))
max_depth = list(range(10, 100, 10)) + [None]

grid = dict(n_estimators=estimators,
            max_depth=max_depth)
search = RandomizedSearchCV(estimator=model,
                            param_distributions=grid,
                            n_iter=40,
                            n_jobs=-1,
                            random_state=1)
with TIMER:
    search.fit(x_train, y_train)
first_model = search.best_estimator_
print(f"CV Training R^2: {search.best_score_:0.2f}")
print(f"Training R^2: {first_model.score(x_train, y_train): 0.2f}")
print(f"Validation R^2: {first_model.score(x_validate, y_validate):0.2f}")
print(search.best_params_)
2020-02-10 18:09:33,716 graeae.timers.timer start: Started: 2020-02-10 18:09:33.716565
The default value of cv will change from 3 to 5 in version 0.22. Specify it explicitly to silence this warning.
2020-02-10 18:09:36,830 graeae.timers.timer end: Ended: 2020-02-10 18:09:36.830883
2020-02-10 18:09:36,832 graeae.timers.timer end: Elapsed: 0:00:03.114318
CV Training Accuracy: 0.70
Training Accuracy:  1.00
Validation Accuracy: 0.69
{'n_estimators': 100, 'max_depth': 30}

Middle

A Single Row

ROW = 5
row_data = x_validate.iloc[ROW]
row_data_matrix = row_data.values.reshape(1, -1)
print(first_model.classes_)
print(first_model.predict_proba(row_data_matrix))
[False  True]
[[0.25 0.75]]

The predict_proba method gives us the predicted probability for each class. So this team has a 75% chance of having the man of the match.

explainer = shap.TreeExplainer(first_model)
shap_values = explainer.shap_values(row_data_matrix)
print(explainer.shap_values.__doc__)
Estimate the SHAP values for a set of samples.

       Parameters
       ----------
       X : numpy.array, pandas.DataFrame or catboost.Pool (for catboost)
           A matrix of samples (# samples x # features) on which to explain the model's output.

       y : numpy.array
           An array of label values for each sample. Used when explaining loss functions.

       tree_limit : None (default) or int
           Limit the number of trees used by the model. By default None means no use the limit of the
           original model, and -1 means no limit.

       approximate : bool
           Run fast, but only roughly approximate the Tree SHAP values. This runs a method
           previously proposed by Saabas which only considers a single feature ordering. Take care
           since this does not have the consistency guarantees of Shapley values and places too
           much weight on lower splits in the tree.

       check_additivity : bool
           Run a validation check that the sum of the SHAP values equals the output of the model. This
           check takes only a small amount of time, and will catch potential unforeseen errors.
           Note that this check only runs right now when explaining the margin of the model.

       Returns
       -------
       For models with a single output this returns a matrix of SHAP values
       (# samples x # features). Each row sums to the difference between the model output for that
       sample and the expected value of the model output (which is stored in the expected_value
       attribute of the explainer when it is constant). For models with vector outputs this returns
       a list of such matrices, one for each output.

The value returned by the shap_values method is a list with two arrays - one for each of our target classes. In this case we're asking whether a team had a man of the match, so we'll just look at the second array.
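
A quick look at the structure (a sketch; the exact container types can vary between SHAP releases, but for this two-class random forest we should get a list of two 1 x number-of-features arrays and a matching pair of expected values):

print(type(shap_values), len(shap_values))
print(shap_values[0].shape, shap_values[1].shape)
print(explainer.expected_value)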

shap.initjs()
figure = shap.force_plot(explainer.expected_value[1],
                         shap_values[1],
                         row_data_matrix,
                         feature_names=FEATURES,
                         matplotlib=True, show=False)
filename = "shap_one.png"

figure.savefig(OUTPUT_PATH/filename)
print(f"[[file:{filename}]]")

[[file:shap_one.png]]

Our predicted probability that this team had a man of the match was 0.75, but the base value for the set as a whole is 0.4979 (you can't see it in this plot for some reason), so our team is more likely to have one than most teams. The pink section of the plot shows the features that increased the probability and the part in blue shows the features that decreased it. The size of each feature's block is proportional to the amount that feature contributed, so the biggest block (Goal Scored) contributed the most. The greatest negative feature was Ball Possession %.
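
As a sanity check on that reading (a sketch of my own, not part of the original tutorial): for a scikit-learn random forest the TreeExplainer works in probability space, so the base value plus this row's SHAP values should reproduce the 0.75 predicted probability, up to rounding.

print(explainer.expected_value[1] + shap_values[1].sum())
print(first_model.predict_proba(row_data_matrix)[0, 1])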

End

See Also

  • The SHAP Documentation
  • The SHAP github repository
  • Lundberg SM, Lee SI. A unified approach to interpreting model predictions. In: Advances in Neural Information Processing Systems 2017 (pp. 4765-4774). (PDF Available Here)

Exercise In Partial Dependence Plots

Beginning

This is my re-working of the Kaggle Machine Learning Explainability exercise in Partial Dependence Plots. It uses data from the Taxi Fare Prediction competition.

Imports

Python

from pathlib import Path

PyPi

from matplotlib import pyplot
from matplotlib.pyplot import rcParams
from pdpbox import pdp
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split, RandomizedSearchCV

import pandas

Others

from graeae import EnvironmentLoader, Timer

Set Up

The Environment

ENVIRONMENT = EnvironmentLoader()

The Timer

TIMER = Timer()

Plotting

SLUG = "exercise-in-partial-dependence-plots"
OUTPUT_PATH = Path("../../files/posts/tutorials/")/SLUG

rcParams["figure.figsize"] = (6, 4)

The Data

ROWS = 5 * 10**4
data = pandas.read_csv(Path(ENVIRONMENT["NY-TAXI"]).expanduser()/"train.csv",
                       nrows=ROWS)
data = data.query('pickup_latitude > 40.7 and pickup_latitude < 40.8 and ' +
                  'dropoff_latitude > 40.7 and dropoff_latitude < 40.8 and ' +
                  'pickup_longitude > -74 and pickup_longitude < -73.9 and ' +
                  'dropoff_longitude > -74 and dropoff_longitude < -73.9 and ' +
                  'fare_amount > 0'
                  )
y = data.fare_amount
base_features = ['pickup_longitude',
                 'pickup_latitude',
                 'dropoff_longitude',
                 'dropoff_latitude']

X = data[base_features]

x_train, x_validate, y_train, y_validate = train_test_split(X, y, random_state=1)

Middle

The First Model

first_model = RandomForestRegressor(n_estimators=180,
                                    max_depth=50, random_state=1).fit(x_train, y_train)
print(f"Training R^2: {first_model.score(x_train, y_train):0.2f}")
print(f"Validation R^2: {first_model.score(x_validate, y_validate):0.2f}")
Training R^2: 0.92
Validation R^2: 0.43

Question 1

FEATURE = 'pickup_longitude'
pdp_dist = pdp.pdp_isolate(model=first_model,
                           dataset=x_validate,
                           model_features=base_features,
                           feature=FEATURE)
pdp.pdp_plot(pdp_dist, FEATURE)
output = f"{FEATURE}_pdp_plot.png"
pyplot.savefig(OUTPUT_PATH/output)
print(f"[[file:{output}]]")
[[file:pickup_longitude_pdp_plot.png]]


Why does the partial dependence plot have this U-shape?

At the extremes you have the locations that can potentially travel the furthest, creating the biggest fares, while moving toward the center reduces how far you can possibly travel. The change isn't symmetric, though, so this can't be the whole explanation - if it were, you would expect the two ends to have the highest values and the nadir to sit at the halfway point (-73.95).

Does your explanation suggest what shape to expect in the partial dependence plots for the other features?

Create all other partial plots.

for FEATURE in base_features:
    pdp_dist = pdp.pdp_isolate(model=first_model,
                               dataset=x_validate,
                               model_features=base_features,
                               feature=FEATURE)
    pdp.pdp_plot(pdp_dist, FEATURE)
    output = f"{FEATURE}_pdp_plot.png"
    pyplot.savefig(OUTPUT_PATH/output)
    print(f"[[file:{output}]]")
[[file:pickup_longitude_pdp_plot.png]]
[[file:pickup_latitude_pdp_plot.png]]
[[file:dropoff_longitude_pdp_plot.png]]
[[file:dropoff_latitude_pdp_plot.png]]


Question 2

Now you will run a 2D partial dependence plot. As a reminder, here is the code from the tutorial.

Create a 2D plot for the features pickup_longitude and dropoff_longitude.

FEATURES = "pickup_longitude dropoff_longitude".split()
interaction  =  pdp.pdp_interact(model=first_model,
                                 dataset=x_validate,
                                 model_features=base_features,
                                 features=FEATURES)
pdp.pdp_interact_plot(pdp_interact_out=interaction,
                      feature_names=FEATURES, plot_type='contour')
output = "longitude_interaction.png"
pyplot.savefig(OUTPUT_PATH/output)
print(f"[[file:{output}]]")
[[file:longitude_interaction.png]]


Our plot shows that the fares are highest at the top-left and bottom-right corners, as you might expect, since this would be the furthest distance from pickup to dropoff.

Question 3

Consider a ride starting at longitude -73.92 and ending at longitude -74. Using the graph from the last question, estimate how much money the rider would have saved if they'd started the ride at longitude -73.98 instead.

I don't exactly agree with the interpretation given by the kaggle notebook. Looking at the plot, -73.92 to -74 appears to cost 27, while -73.98 to -74 would cost 9 - but the notebook says that -73.92 to -74 costs 24. So I would say there would be a saving of 18, while the given answer is 15. To (kind of) reconcile the difference we might say that -73.98 to -74 costs 12, not 9 - it's not really easy to tell from the plot - in which case I would also say the savings is 15.
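
Rather than squinting at the contour, here's a rough numeric version of the same estimate (a sketch of my own, not the kaggle notebook's answer): pin the two longitudes at the values in question, keep the other columns at their observed values, and average the model's predicted fares over the validation rows - the same averaging the partial dependence plot does. The numbers won't match the plot exactly.

def estimated_fare(pickup_lon: float, dropoff_lon: float) -> float:
    # Fix the two longitude columns and average the predicted fares.
    frame = x_validate.copy()
    frame["pickup_longitude"] = pickup_lon
    frame["dropoff_longitude"] = dropoff_lon
    return first_model.predict(frame).mean()

long_start = estimated_fare(-73.92, -74)
short_start = estimated_fare(-73.98, -74)
print(f"Estimated saving: {long_start - short_start:0.2f}")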

Question 4

In the PDP's you've seen so far, location features have primarily served as a proxy to capture distance traveled. In the permutation importance lessons, you added the features `abs_lon_change` and `abs_lat_change` as a more direct measure of distance.

Create these features again here.

After you run it, identify the most important difference between this partial dependence plot and the one you got without absolute value features. The code to generate the PDP without absolute value features is at the top of this code cell.

FEATURE = 'pickup_longitude'
pdp_dist_original = pdp.pdp_isolate(model=first_model,
                                    dataset=x_validate,
                                    model_features=base_features,
                                    feature=FEATURE)
pdp.pdp_plot(pdp_dist_original, FEATURE)
output = "pre_distance.png"
pyplot.savefig(OUTPUT_PATH/output)
print(f"[[file:{output}]]")

[[file:pre_distance.png]]

data['abs_lon_change'] = abs(data.pickup_longitude - data.dropoff_longitude)
data['abs_lat_change'] = abs(data.pickup_latitude - data.dropoff_latitude)

features_2  = ['pickup_longitude',
               'pickup_latitude',
               'dropoff_longitude',
               'dropoff_latitude',
               'abs_lat_change',
               'abs_lon_change']

X = data[features_2]
new_x_train, new_x_validate, new_y_train, new_y_validate = train_test_split(X, y, random_state=1)

estimators = list(range(50, 200, 10))
max_depth = list(range(10, 100, 10)) + [None]

grid = dict(n_estimators=estimators,
            max_depth=max_depth)

model = RandomForestRegressor()
search = RandomizedSearchCV(estimator=model,
                            param_distributions=grid,
                            n_iter=40,
                            n_jobs=-1,
                            random_state=1)
with TIMER:
    search.fit(new_x_train, new_y_train)
second_model = search.best_estimator_
print(f"CV Training R^2: {search.best_score_:0.2f}")
print(f"Training R^2: {first_model.score(x_train, y_train): 0.2f}")
print(f"Validation R^2: {first_model.score(x_validate, y_validate):0.2f}")
print(search.best_params_)
2020-02-09 17:36:14,915 graeae.timers.timer start: Started: 2020-02-09 17:36:14.915123
2020-02-09 17:43:16,053 graeae.timers.timer end: Ended: 2020-02-09 17:43:16.053011
2020-02-09 17:43:16,053 graeae.timers.timer end: Elapsed: 0:07:01.137888
CV Training R^2: 0.46
Training R^2:  0.92
Validation R^2: 0.43
{'n_estimators': 190, 'max_depth': 30}

FEATURE = 'pickup_longitude'
pdp_dist = pdp.pdp_isolate(model=second_model,
                           dataset=new_x_validate,
                           model_features=features_2,
                           feature=FEATURE)

pdp.pdp_plot(pdp_dist, FEATURE)
output = "pickup_longitude_with_distance_added.png"
pyplot.savefig(OUTPUT_PATH/output)
print(f"[[file:{output}]]")

[[file:pickup_longitude_with_distance_added.png]]

Question 5

Consider a scenario where you have only 2 predictive features, which we will call `feat_A` and `feat_B`. Both features have minimum values of -1 and maximum values of 1. The partial dependence plot for `feat_A` increases steeply over its whole range, whereas the partial dependence plot for `feat_B` increases at a slower rate (less steeply) over its whole range. Does this guarantee that `feat_A` will have a higher permutation importance than `feat_B`? Why or why not?

It doesn't - a partial dependence plot shows how the prediction changes as you vary one feature while holding everything else fixed, but permutation importance also depends on how that feature is actually distributed in the data. If `feat_A`'s values cluster in a narrow band, shuffling them barely changes any prediction, so its permutation importance can end up lower than `feat_B`'s despite the steeper partial dependence plot.
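
To make that concrete, here's a small synthetic illustration of my own (not part of the kaggle exercise): feat_A has a much steeper per-unit effect, but its values barely vary in the data, so shuffling it costs the model far less than shuffling feat_B. The score drop from shuffling a column is a hand-rolled stand-in for permutation importance.

import numpy
from sklearn.ensemble import RandomForestRegressor

rng = numpy.random.RandomState(1)
samples = 2000
# feat_A: steep effect (slope 10) but its values huddle near zero in the data.
feat_a = numpy.clip(rng.normal(0, 0.02, samples), -1, 1)
# feat_B: gentle effect (slope 1) but spread across the whole [-1, 1] range.
feat_b = rng.uniform(-1, 1, samples)
target = 10 * feat_a + feat_b
synthetic = numpy.column_stack([feat_a, feat_b])

model = RandomForestRegressor(n_estimators=100, random_state=1).fit(synthetic, target)
baseline = model.score(synthetic, target)
for index, name in enumerate(["feat_A", "feat_B"]):
    shuffled = synthetic.copy()
    shuffled[:, index] = rng.permutation(shuffled[:, index])
    print(f"{name}: score drop {baseline - model.score(shuffled, target):0.3f}")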

Partial Dependence Plots

Beginning

What is this about?

These are my notes/re-write of the Partial Dependence Plots tutorial on kaggle. Partial Dependence Plots complement Permutation Importance: Permutation Importance can tell you which features contribute to the model, but it doesn't tell you how the predictions change as a feature's value changes. Given that they have the word "Plot" in their name, you can guess that this is a visualization method, and since humans have a hard time visualizing more than three dimensions, using them requires selecting a subset of features (maybe just one) to visualize at a time - so the fact that Permutation Importance ranks features by their contribution might come in handy in deciding which features to plot.

How does it work?

In a simplistic view we might say that it takes input data and then repeatedly alters the values in our feature of choice, freezing the other values so that we can see how varying the feature changes the prediction. You can then repeat this for multiple rows of data and take the average prediction for each value of our feature.

Imports

Python

from pathlib import Path

import os

PyPi

from dotenv import load_dotenv
from matplotlib import pyplot
from pdpbox import pdp, get_dataset, info_plots
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn import tree

import graphviz
import numpy
import pandas

Set Up

Plotting

SLUG = "partial-dependence-plots"
OUTPUT_PATH = Path("../../files/posts/tutorials/")/SLUG

The Data

path = Path("~/.env").expanduser()
load_dotenv(path, override=True)
data = pandas.read_csv(Path(os.getenv("FIFA-2018")).expanduser())

Middle

A Decision Tree Model

The data is the same set from the Permutation Importance exercise (team data to predict whether one of a team's players would become the Budweiser Man of the Match). Our initial model will be a shallow decision tree so that we can compare its structure to the plots.

y = data["Man of the Match"] == "Yes"
features = [column for column in data.columns if data[column].dtype == numpy.int64]
X = data[features]

x_train, x_validate, y_train, y_validate = train_test_split(X, y, random_state=1)

tree_model = DecisionTreeClassifier(random_state=0, max_depth=5, min_samples_split=5).fit(x_train, y_train)

Visualizing the Decision Tree

tree_graph = tree.export_graphviz(tree_model, out_file=None, feature_names=features)
graph = graphviz.Source(tree_graph)
output = "decision_tree.dot"
graph.render(OUTPUT_PATH/output, format="png")
print(f"[[file:{output}.png]]")

[[file:decision_tree.dot.png]]

Partial Dependency Plot

To create the plot we can use the PDPBox library.

  • Create the Data to Plot
    FEATURE = "Goal Scored"
    pdp_goals = pdp.pdp_isolate(model=tree_model, dataset=x_validate,
                                model_features=features, feature=FEATURE)
    
  • Plot It
    pdp.pdp_plot(pdp_goals, FEATURE)
    output = "pdp_goals_scored.png"
    pyplot.savefig(OUTPUT_PATH/output)
    print(f"[[file:{output}]]")
    

    [[file:pdp_goals_scored.png]]

    Looking at the plot we can interpret it as saying that scoring that first goal helps get someone on the team the Man of the Match, but scoring more goals after that doesn't help. The PDPBox documentation has no real information about interpreting the output (or anything they provide, really), but according to the Kaggle tutorial the y-axis is the change in the prediction from the baseline as x changes (there's a hand-rolled check of this below).
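
To see what pdp_isolate is doing under the hood (the "How does it work?" description above), here's a hand-rolled version for the same feature - a sketch of my own that freezes Goal Scored at each grid value, averages the tree model's predicted probability over the validation rows, and reports the change from the first grid value. The numbers won't match PDPBox exactly, since it centers the curve and picks its own grid.

def manual_pdp(model, frame, feature, grid):
    averages = []
    for value in grid:
        modified = frame.copy()
        modified[feature] = value           # freeze the feature at this value
        averages.append(model.predict_proba(modified)[:, 1].mean())
    return numpy.array(averages)

grid = numpy.sort(x_validate["Goal Scored"].unique())
curve = manual_pdp(tree_model, x_validate, "Goal Scored", grid)
print(dict(zip(grid, (curve - curve[0]).round(3))))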

Distance Covered

We can also look at how much the distance the players cover on the field matters.

FEATURE = "Distance Covered (Kms)"
pdp_distance = pdp.pdp_isolate(model=tree_model, dataset=x_validate,
                               model_features=features, feature=FEATURE)
pdp.pdp_plot(pdp_distance, FEATURE)
output = "pdp_distance.png"
pyplot.savefig(OUTPUT_PATH/output)
print(f"[[file:{output}]]")

[[file:pdp_distance.png]]

It looks like there's one distance at which the probabilities increase and then going further doesn't matter.

print(pdp_distance.feature_grids[3])
102.0

So you want your team to cover at least 102 Kilometers, but covering more won't help you.
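
To see the whole step function rather than a single grid point, we can pair each grid value with its effect, reusing the two attributes of the pdp_isolate result that this post already touches (feature_grids and pdp) - a small sketch, and the exact attribute layout may differ in other PDPBox versions.

for grid_value, effect in zip(pdp_distance.feature_grids, pdp_distance.pdp):
    print(f"{grid_value}: {effect:0.3f}")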

A Random Forest Model

The Decision Tree was useful because the simplicity made it easy to interpret, but an ensemble method like a Random Forest probably makes a better model so let's see what it reveals.

forest = RandomForestClassifier(random_state=0).fit(x_train, y_train)

FEATURE = "Goal Scored"
pdp_goals = pdp.pdp_isolate(model=forest, dataset=x_validate,
                            model_features=features, feature=FEATURE)
pdp.pdp_plot(pdp_goals, FEATURE)
output = "pdp_forest_goals_scored.png"
pyplot.savefig(OUTPUT_PATH/output)
print(f"[[file:{output}]]")

[[file:pdp_forest_goals_scored.png]]

So our random forest agrees that the greatest gain comes from the first goal, but unlike the tree it shows the probability continuing to rise as the number of goals scored increases.

What about distance covered?

FEATURE = "Distance Covered (Kms)"
pdp_distance = pdp.pdp_isolate(model=forest, dataset=x_validate,
                               model_features=features, feature=FEATURE)
pdp.pdp_plot(pdp_distance, FEATURE)
output = "pdp_forest_distance.png"
pyplot.savefig(OUTPUT_PATH/output)
print(f"[[file:{output}]]")

[[file:pdp_forest_distance.png]]

print(pdp_distance.feature_grids[pdp_distance.pdp.argmax()])
102.0

Our forest says that, once again, covering 102 kilometers maximizes the probability, but in this case it appears that going beyond that actually decreases the probability that your team would have the man of the match.

2D Partial Dependence Plots

Rather than plot a single feature against the probability of becoming the man of the match you can plot how two features interact to affect the probability.

FEATURES = ["Goal Scored", "Distance Covered (Kms)"]
interaction = pdp.pdp_interact(model=tree_model, dataset=x_validate,
                               model_features=features,
                               features=FEATURES)

pdp.pdp_interact_plot(pdp_interact_out=interaction, feature_names=FEATURES,
                      plot_type="contour")
output = "goals_vs_distance.png"
pyplot.savefig(OUTPUT_PATH/output)
print(f"[[file:{output}]]")

[[file:goals_vs_distance.png]]

Note: The version of PDPBox on pypi currently has a bug (as of February 8, 2020) that won't let you create the previous plot; install it from github instead.

The plot shows a more succinct version of the two plots we had seen before - the optimal point for these two features is 1 goal and 102 Kms covered.

Forest From the Trees

Let's re-run the same plot using the Random Forest instead of the Decision Tree.

FEATURES = ["Goal Scored", "Distance Covered (Kms)"]
interaction = pdp.pdp_interact(model=forest, dataset=x_validate,
                               model_features=features,
                               features=FEATURES)

pdp.pdp_interact_plot(pdp_interact_out=interaction, feature_names=FEATURES,
                      plot_type="contour")
output = "forest_goals_vs_distance.png"
pyplot.savefig(OUTPUT_PATH/output)
print(f"[[file:{output}]]")

[[file:forest_goals_vs_distance.png]]

The random forest suggests a slightly different scenario - here you want 2 goals and between 100 and 110 Kms (or thereabouts) covered to get the best probability.

End

This showed how to use the PDPBox library to create Partial Dependence Plots to look at how the input values for features affect the predicted outcomes. Unfortunately it looks like PDPBox might have been abandoned, but while it works it's a nice library.