Exercise in Permutation Importance
Beginning
This is my re-do of the Machine Learning Explainability Permutation Importance exercise on Kaggle. It uses the data from the New York City Taxi Fare Prediction dataset, also on Kaggle.
Imports
From Python
from argparse import Namespace
from functools import partial
from pathlib import Path
From PyPi
from eli5.sklearn import PermutationImportance
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split, RandomizedSearchCV
from tabulate import tabulate
import eli5
import hvplot.pandas
import numpy
import pandas
Others
from graeae import EmbedHoloviews, EnvironmentLoader, Timer
Set Up
A Timer
TIMER = Timer()
Plotting
SLUG = "exercise-in-permutation-importance"
PATH = Path("../../files/posts/tutorials/")/SLUG
Plot = Namespace(
width=1000,
height=800,
)
Embed = partial(EmbedHoloviews, folder_path=PATH)
The Environment
ENVIRONMENT = EnvironmentLoader()
Table Printer
TABLE = partial(tabulate, tablefmt="orgtbl", headers="keys", showindex=False)
Middle
The Dataset
Since this is about permutation importance we're just going to load a subset (there are over five million rows in the dataset) and use previously discovered values to get rid of outliers.
ROWS = 5 * 10**4
PATH = Path(ENVIRONMENT["NY-TAXI"]).expanduser()
assert PATH.is_dir()
data = pandas.read_csv(PATH/"train.csv", nrows=ROWS)
print(TABLE(data.iloc[0].reset_index().rename(columns={
"index": "Column", 0: "Value"})))
Column | Value |
---|---|
key | 2009-06-15 17:26:21.0000001 |
fare_amount | 4.5 |
pickup_datetime | 2009-06-15 17:26:21 UTC |
pickup_longitude | -73.844311 |
pickup_latitude | 40.721319 |
dropoff_longitude | -73.84161 |
dropoff_latitude | 40.712278000000005 |
passenger_count | 1 |
- Trim Outliers
print(TABLE(data.describe(), showindex=True))
Statistic | fare_amount | pickup_longitude | pickup_latitude | dropoff_longitude | dropoff_latitude | passenger_count |
---|---|---|---|---|---|---|
count | 50000 | 50000 | 50000 | 50000 | 50000 | 50000 |
mean | 11.3642 | -72.5098 | 39.9338 | -72.5046 | 39.9263 | 1.66784 |
std | 9.68556 | 10.3939 | 6.22486 | 10.4076 | 6.01474 | 1.28919 |
min | -5 | -75.4238 | -74.0069 | -84.6542 | -74.0064 | 0 |
25% | 6 | -73.9921 | 40.7349 | -73.9912 | 40.7344 | 1 |
50% | 8.5 | -73.9818 | 40.7527 | -73.9801 | 40.7534 | 1 |
75% | 12.5 | -73.9671 | 40.7674 | -73.9636 | 40.7682 | 2 |
max | 200 | 40.7835 | 401.083 | 40.851 | 43.4152 | 6 |
to_plot = data[[column for column in data.columns
                if "latitude" in column or "longitude" in column]]
plot = to_plot.hvplot.box().opts(title="Column Box-Plots",
                                 width=Plot.width,
                                 height=Plot.height)
Embed(plot=plot, file_name="column_box_plots")()
So you can see that there are negative fares, which seems wrong, and some extreme outliers (a maximum pickup latitude of 401, for instance).
This uses the pandas query method, which lets you write slightly more readable code for boolean slicing.
print(f"{len(data):,}")
data = data.query("pickup_latitude > 40.7 and pickup_latitude < 40.8 and "
                  "dropoff_latitude > 40.7 and dropoff_latitude < 40.8 and "
                  "pickup_longitude > -74 and pickup_longitude < -73.9 and "
                  "dropoff_longitude > -74 and dropoff_longitude < -73.9 and "
                  "fare_amount > 0")
print(f"{len(data):,}")
50,000
31,289
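As a quick illustration with a made-up three-row frame, query is just a more readable spelling of boolean-mask slicing; the two approaches select exactly the same rows.

```python
import pandas as pd

frame = pd.DataFrame({"fare_amount": [4.5, -5.0, 12.0],
                      "passenger_count": [1, 2, 0]})

# Boolean-mask slicing and query() keep the same rows; query() just
# reads more like the condition you have in mind.
masked = frame[(frame.fare_amount > 0) & (frame.passenger_count > 0)]
queried = frame.query("fare_amount > 0 and passenger_count > 0")
```

Only the first row survives both conditions, whichever spelling you use.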
Set Up the Training and Test Sets
y = data.fare_amount
base_features = ['pickup_longitude',
'pickup_latitude',
'dropoff_longitude',
'dropoff_latitude',
'passenger_count']
X = data[base_features]
x_train, x_validate, y_train, y_validate = train_test_split(X, y, random_state=1)
print(f"{len(x_train):,}")
print(f"{len(x_validate):,}")
23,466
7,823
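The split above holds out a quarter of the rows by default; the mechanics can be sketched with nothing but the standard library (split_indices and its rounding are my own simplification, not sklearn's exact implementation).

```python
import random

def split_indices(rows, test_fraction=0.25, seed=1):
    """Sketch of train_test_split: shuffle the row indices and carve off
    a fraction for validation. (sklearn uses ceil for the test size and a
    different shuffling scheme; this is only an approximation.)"""
    indices = list(range(rows))
    random.Random(seed).shuffle(indices)
    cut = round(rows * test_fraction)
    return indices[cut:], indices[:cut]

train, validate = split_indices(31289)
```

Every row lands in exactly one of the two sets, which is the property the later scoring depends on.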
Build and Train the Model
estimators = list(range(50, 200, 10))
max_depth = list(range(10, 100, 10)) + [None]
grid = dict(n_estimators=estimators,
max_depth=max_depth)
model = RandomForestRegressor()
search = RandomizedSearchCV(estimator=model,
param_distributions=grid,
n_iter=40,
n_jobs=-1,
random_state=1)
with TIMER:
search.fit(x_train, y_train)
first_model = search.best_estimator_
print(f"CV Training R^2: {search.best_score_:0.2f}")
print(f"Training R^2: {first_model.score(x_train, y_train): 0.2f}")
print(f"Validation R^2: {first_model.score(x_validate, y_validate):0.2f}")
print(search.best_params_)
2020-02-10 13:40:59,617 graeae.timers.timer start: Started: 2020-02-10 13:40:59.616742
/home/brunhilde/.virtualenvs/Visions-Voices-Data/lib/python3.7/site-packages/sklearn/model_selection/_split.py:1978: FutureWarning: The default value of cv will change from 3 to 5 in version 0.22. Specify it explicitly to silence this warning.
  warnings.warn(CV_WARNING, FutureWarning)
/home/brunhilde/.virtualenvs/Visions-Voices-Data/lib/python3.7/site-packages/joblib/externals/loky/process_executor.py:706: UserWarning: A worker stopped while some jobs were given to the executor. This can be caused by a too short worker timeout or by a memory leak.
  "timeout or by a memory leak.", UserWarning
2020-02-10 13:42:04,387 graeae.timers.timer end: Ended: 2020-02-10 13:42:04.387425
2020-02-10 13:42:04,390 graeae.timers.timer end: Elapsed: 0:01:04.770683
CV Training R^2: 0.45
Training R^2: 0.92
Validation R^2: 0.42
{'n_estimators': 170, 'max_depth': 70}
So it isn't really a great model, but we'll ignore that for now.
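For context, the sampling half of what RandomizedSearchCV does is easy to sketch (sample_candidates is a made-up name, and the cross-validated scoring of each candidate is omitted here).

```python
import random

param_distributions = {"n_estimators": list(range(50, 200, 10)),
                       "max_depth": list(range(10, 100, 10)) + [None]}

def sample_candidates(distributions, n_iter=40, seed=1):
    """Draw n_iter random parameter combinations from the grid; in the
    real search each candidate is then cross-validated and the best kept."""
    chooser = random.Random(seed)
    return [{name: chooser.choice(values)
             for name, values in distributions.items()}
            for _ in range(n_iter)]

candidates = sample_candidates(param_distributions)
```

With only 40 draws from a 150-combination grid, the search trades exhaustiveness for speed, which is why the best parameters can differ between runs.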
Questions
Question 1
The first model uses the following features:
- pickup_longitude
- pickup_latitude
- dropoff_longitude
- dropoff_latitude
- passenger_count
Before running any code… which variables seem potentially useful for predicting taxi fares? Do you think permutation importance will necessarily identify these features as important?
I think that pickup and dropoff latitude might be important, since this would reflect where in the city the person was and wanted to go. Passenger count might make a difference as well, but I don't know if there's a greater charge for more people. Longitude might also be useful, but my guess would be that the North-South location is more indicative of the type of place you are in (uptown or downtown) and thus how far you have to travel (I have a vague notion that New York City is longer vertically than horizontally, but I don't know if this is true). This would be even more important if the fares change by location, but I don't know if that's the case.
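Before handing things to eli5, the algorithm itself is worth seeing: permutation importance shuffles one column at a time and records how much the model's error grows. A minimal sketch with a toy model (every name here is made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy regression data: y depends only on column 0; column 1 is pure noise.
X = rng.normal(size=(500, 2))
y = 3 * X[:, 0]

def predict(X):
    # A stand-in "fitted model" that already knows the true relationship.
    return 3 * X[:, 0]

def mse(y_true, y_pred):
    return float(np.mean((y_true - y_pred) ** 2))

def permutation_importance(predict, X, y, n_repeats=10):
    """Score each column by how much shuffling it inflates the error."""
    baseline = mse(y, predict(X))
    importances = []
    for column in range(X.shape[1]):
        increases = []
        for _ in range(n_repeats):
            shuffled = X.copy()
            rng.shuffle(shuffled[:, column])  # break this column's link to y
            increases.append(mse(y, predict(shuffled)) - baseline)
        importances.append(float(np.mean(increases)))
    return importances

importances = permutation_importance(predict, X, y)
```

Shuffling the informative column wrecks the predictions while shuffling the noise column changes nothing, which is exactly the pattern the eli5 tables below report.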
with TIMER:
permutor = PermutationImportance(first_model, random_state=1).fit(
x_validate, y_validate)
ipython_html = eli5.show_weights(permutor,
feature_names=x_validate.columns.tolist())
table = pandas.read_html(ipython_html.data)[0]
print(TABLE(table))
Weight | Feature |
---|---|
0.8413 ± 0.0171 | dropoff_latitude |
0.8135 ± 0.0223 | pickup_latitude |
0.5723 ± 0.0370 | pickup_longitude |
0.5324 ± 0.0257 | dropoff_longitude |
-0.0014 ± 0.0015 | passenger_count |
So it looks like latitude and longitude are important, with latitude a little more important than longitude, while passenger count isn't important.
A New Model
Question 4
Without detailed knowledge of New York City, it's difficult to rule out most hypotheses about why latitude features matter more than longitude.
A good next step is to disentangle the effect of being in certain parts of the city from the effect of total distance traveled.
The code below creates new features for longitudinal and latitudinal distance. It then builds a model that adds these new features to those you already had.
Feature Engineering
We're going to estimate the distance traveled by using the differences in latitude and longitude from the pickup to the dropoff. This should give us a taxicab-distance estimate.
data['absolute_change_longitude'] = abs(
data.dropoff_longitude - data.pickup_longitude)
data['absolute_change_latitude'] = abs(
data.dropoff_latitude - data.pickup_latitude)
features_2 = ['pickup_longitude',
'pickup_latitude',
'dropoff_longitude',
'dropoff_latitude',
'absolute_change_latitude',
'absolute_change_longitude']
X = data[features_2]
new_x_train, new_x_validate, new_y_train, new_y_validate = train_test_split(
X, y, random_state=1)
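One caveat about these features: a degree of latitude is roughly 111 km everywhere, but a degree of longitude shrinks with the cosine of the latitude, so the two absolute-change columns aren't on a common physical scale. A hedged sketch of a kilometer-based taxicab distance (approximate_km and the 111 km constant are illustrative approximations):

```python
import math

KM_PER_DEGREE_LATITUDE = 111  # rough global average, an approximation

def approximate_km(pickup_lat, pickup_lon, dropoff_lat, dropoff_lon):
    """Taxicab distance in kilometers, shrinking the longitude term by the
    cosine of the mean latitude of the trip."""
    mean_latitude = math.radians((pickup_lat + dropoff_lat) / 2)
    dlat_km = abs(dropoff_lat - pickup_lat) * KM_PER_DEGREE_LATITUDE
    dlon_km = (abs(dropoff_lon - pickup_lon)
               * KM_PER_DEGREE_LATITUDE * math.cos(mean_latitude))
    return dlat_km + dlon_km

# The first row of the dataset works out to roughly a 1.2 km trip.
trip = approximate_km(40.721319, -73.844311, 40.712278, -73.84161)
```

The degree-based features used here are fine for a tree model, since trees only care about the ordering within each column, not the units.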
estimators = list(range(100, 250, 10))
max_depth = list(range(10, 50, 10)) + [None]
grid = dict(n_estimators=estimators,
            max_depth=max_depth)
model = RandomForestRegressor()
search = RandomizedSearchCV(estimator=model,
param_distributions=grid,
n_jobs=-1,
random_state=1)
with TIMER:
search.fit(new_x_train, new_y_train)
second_model = search.best_estimator_
print(f"Mean Cross-Validation Training R^2: {search.best_score_:0.2f}")
print(f"Training R^2: {second_model.score(new_x_train, new_y_train): 0.2f}")
print("Validation R^2: "
f"{second_model.score(new_x_validate, new_y_validate):0.2f}")
print(search.best_params_)
2020-02-10 13:42:52,104 graeae.timers.timer start: Started: 2020-02-10 13:42:52.104493
/home/brunhilde/.virtualenvs/Visions-Voices-Data/lib/python3.7/site-packages/sklearn/model_selection/_split.py:1978: FutureWarning: The default value of cv will change from 3 to 5 in version 0.22. Specify it explicitly to silence this warning.
  warnings.warn(CV_WARNING, FutureWarning)
2020-02-10 13:43:24,554 graeae.timers.timer end: Ended: 2020-02-10 13:43:24.554581
2020-02-10 13:43:24,556 graeae.timers.timer end: Elapsed: 0:00:32.450088
Mean Cross-Validation Training R^2: 0.48
Training R^2: 0.70
Validation R^2: 0.47
{'n_estimators': 190, 'max_depth': 10}
Still a pretty bad model, but that's not the point, I guess.
The Permutation Importance
permutor = PermutationImportance(second_model, random_state=1).fit(
new_x_validate, new_y_validate)
ipython_html = eli5.show_weights(permutor,
feature_names=new_x_validate.columns.tolist())
table = pandas.read_html(ipython_html.data)[0]
print(TABLE(table))
Weight | Feature |
---|---|
0.5976 ± 0.0306 | absolute_change_latitude |
0.4429 ± 0.0496 | absolute_change_longitude |
0.0339 ± 0.0216 | pickup_latitude |
0.0232 ± 0.0032 | dropoff_latitude |
0.0214 ± 0.0068 | dropoff_longitude |
0.0159 ± 0.0055 | pickup_longitude |
The distance traveled seems to be the most important feature for the fare, even more than the actual locations, probably because taxis charge by distance.
Question 5
This question is about the scale of the parameters. Here's a sample.
print(TABLE(new_x_train.sample(random_state=1).iloc[0].reset_index()))
index | 31975 |
---|---|
pickup_longitude | -73.9706 |
pickup_latitude | 40.7613 |
dropoff_longitude | -73.9806 |
dropoff_latitude | 40.7483 |
absolute_change_latitude | 0.01302 |
absolute_change_longitude | 0.010067 |
And here's some statistics about each.
print(new_x_validate.describe())
Statistic | pickup_longitude | pickup_latitude | dropoff_longitude | dropoff_latitude | absolute_change_latitude | absolute_change_longitude |
---|---|---|---|---|---|---|
count | 7823 | 7823 | 7823 | 7823 | 7823 | 7823 |
mean | -73.976957 | 40.756877 | -73.975293 | 40.757591 | 0.015091 | 0.013029 |
std | 0.014663 | 0.018064 | 0.015877 | 0.018669 | 0.012508 | 0.011554 |
min | -73.999977 | 40.700400 | -73.999992 | 40.700293 | 0.000000 | 0.000000 |
25% | -73.988180 | 40.745044 | -73.987078 | 40.746345 | 0.006089 | 0.004968 |
50% | -73.979933 | 40.757881 | -73.978427 | 40.758602 | 0.011745 | 0.010110 |
75% | -73.968008 | 40.769486 | -73.966296 | 40.770561 | 0.020781 | 0.017798 |
max | -73.900123 | 40.799865 | -73.901790 | 40.799984 | 0.084413 | 0.087337 |
A colleague observes that the values for absolute_change_longitude and absolute_change_latitude are pretty small (all values are between -0.1 and 0.1), whereas other variables have larger values. Do you think this could explain why those features had larger permutation importance values in this case?
Consider an alternative where you created and used a feature that was 100X as large for these features, and used that larger feature for training and importance calculations. Would this change the outputted permutation importance values?
Why or why not?
for column in ("pickup_longitude pickup_latitude dropoff_longitude "
"dropoff_latitude absolute_change_latitude "
"absolute_change_longitude").split():
print(f"{column}: {new_x_validate[column].max() - new_x_validate[column].min():0.3f}")
pickup_longitude: 0.100
pickup_latitude: 0.099
dropoff_longitude: 0.098
dropoff_latitude: 0.100
absolute_change_latitude: 0.084
absolute_change_longitude: 0.087
Intuitively, I would expect the difference in scales to matter, so let's test it by scaling two of the features up by a factor of 100.
data["bigger_pickup_longitude"] = data.pickup_longitude * 100
data["bigger_absolute_change_longitude"] = data.absolute_change_longitude * 100
features_3 = ['pickup_longitude',
'pickup_latitude',
'dropoff_longitude',
'dropoff_latitude',
'absolute_change_latitude',
'absolute_change_longitude',
'bigger_pickup_longitude',
'bigger_absolute_change_longitude'
]
X = data[features_3]
big_x_train, big_x_validate, big_y_train, big_y_validate = train_test_split(
X, y, random_state=1)
model = RandomForestRegressor()
search = RandomizedSearchCV(estimator=model,
param_distributions=grid,
n_jobs=-1,
random_state=1)
with TIMER:
search.fit(big_x_train, big_y_train)
big_model = search.best_estimator_
print(f"Mean Cross-Validation Training R^2: {search.best_score_:0.2f}")
print(f"Training R^2: {big_model.score(big_x_train, big_y_train): 0.2f}")
print("Validation R^2: "
f"{big_model.score(big_x_validate, big_y_validate):0.2f}")
print(search.best_params_)
2020-02-09 15:06:45,693 graeae.timers.timer start: Started: 2020-02-09 15:06:45.693742
/home/athena/.virtualenvs/Visions-Voices-Data/lib/python3.7/site-packages/joblib/externals/loky/process_executor.py:706: UserWarning: A worker stopped while some jobs were given to the executor. This can be caused by a too short worker timeout or by a memory leak.
  "timeout or by a memory leak.", UserWarning
2020-02-09 15:09:15,559 graeae.timers.timer end: Ended: 2020-02-09 15:09:15.559561
2020-02-09 15:09:15,560 graeae.timers.timer end: Elapsed: 0:02:29.865819
Mean Cross-Validation Training R^2: 0.49
Training R^2: 0.70
Validation R^2: 0.47
{'n_estimators': 190, 'max_depth': 10}
permutor = PermutationImportance(big_model, random_state=1).fit(
big_x_validate, big_y_validate)
ipython_html = eli5.show_weights(permutor,
feature_names=big_x_validate.columns.tolist())
table = pandas.read_html(ipython_html.data)[0]
print(TABLE(table))
Weight | Feature |
---|---|
0.6034 ± 0.0436 | absolute_change_latitude |
0.1794 ± 0.0126 | bigger_absolute_change_longitude |
0.1366 ± 0.0062 | absolute_change_longitude |
0.0326 ± 0.0217 | pickup_latitude |
0.0242 ± 0.0040 | dropoff_latitude |
0.0194 ± 0.0083 | dropoff_longitude |
0.0188 ± 0.0085 | pickup_longitude |
0.0116 ± 0.0018 | bigger_pickup_longitude |
Making the pickup longitude 100 times bigger didn't change its ranking relative to the other features, so I wouldn't say that the scale had an effect. Note, though, that adding a scaled copy of absolute_change_longitude split that feature's importance between the original and the copy, which is what tends to happen when two features duplicate each other's information.
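This matches how tree-based models work: splits compare a feature against a threshold, and a positive rescaling preserves the ordering of the values, so the same splits get chosen. A hand-rolled regression stump (best_split is a toy, not sklearn's implementation) makes this concrete:

```python
import numpy as np

def best_split(x, y):
    """Index (in sorted order of x) of the split minimizing squared error."""
    order = np.argsort(x)
    y_sorted = y[order]
    best_index, best_cost = None, np.inf
    for i in range(1, len(y_sorted)):
        left, right = y_sorted[:i], y_sorted[i:]
        cost = (((left - left.mean()) ** 2).sum()
                + ((right - right.mean()) ** 2).sum())
        if cost < best_cost:
            best_index, best_cost = i, cost
    return best_index

rng = np.random.default_rng(1)
x = rng.normal(size=200)
y = (x > 0.3).astype(float) + rng.normal(scale=0.1, size=200)

# Multiplying x by 100 preserves the ordering of its values, so the
# stump picks the same split position and makes the same predictions.
```

This is why permutation importance on a random forest is largely indifferent to feature scale, unlike coefficient-based importances for linear models.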
Question 6
You've seen that the feature importance for latitudinal distance is greater than the importance of longitudinal distance. From this, can we conclude whether travelling a fixed latitudinal distance tends to be more expensive than traveling the same longitudinal distance?
No. The feature importance indicates that latitudinal distance is useful in predicting fares, but it doesn't automatically mean that fares increase more with a change in latitude. It might be the case that the change in longitude affects the cost of a change in latitude as well, so the price of a fixed latitudinal distance might vary depending on the longitude or the latitude and longitude combination.
Euclidean Distance
Instead of keeping latitudinal and longitudinal distances separate, what if we used the euclidean distance?
data["euclidean_distance"] = numpy.sqrt(data.absolute_change_latitude**2
+ data.absolute_change_longitude**2)
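An equivalent spelling uses numpy.hypot, which computes the same root-sum-of-squares while guarding against the intermediate squares overflowing for large inputs.

```python
import numpy

# hypot(a, b) is sqrt(a**2 + b**2), element-wise and overflow-safe;
# the operands here echo typical absolute-change values from the data.
distance = numpy.hypot(0.0130, 0.0101)
```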
features = ['pickup_longitude',
'pickup_latitude',
'dropoff_longitude',
'dropoff_latitude',
'absolute_change_latitude',
'absolute_change_longitude',
"euclidean_distance",
]
X = data[features]
euclid_x_train, euclid_x_validate, euclid_y_train, euclid_y_validate = train_test_split(X, y, random_state=1)
model = RandomForestRegressor()
search = RandomizedSearchCV(estimator=model,
param_distributions=grid,
n_jobs=-1,
random_state=1)
with TIMER:
search.fit(euclid_x_train, euclid_y_train)
euclidean_model = search.best_estimator_
print(f"Mean Cross-Validation Training R^2: {search.best_score_:0.2f}")
print(f"Training R^2: {euclidean_model.score(euclid_x_train, euclid_y_train): 0.2f}")
print("Validation R^2: "
f"{euclidean_model.score(euclid_x_validate, euclid_y_validate):0.2f}")
print(search.best_params_)
2020-02-10 14:23:41,605 graeae.timers.timer start: Started: 2020-02-10 14:23:41.605310
/home/brunhilde/.virtualenvs/Visions-Voices-Data/lib/python3.7/site-packages/sklearn/model_selection/_split.py:1978: FutureWarning: The default value of cv will change from 3 to 5 in version 0.22. Specify it explicitly to silence this warning.
  warnings.warn(CV_WARNING, FutureWarning)
2020-02-10 14:24:20,385 graeae.timers.timer end: Ended: 2020-02-10 14:24:20.385580
2020-02-10 14:24:20,388 graeae.timers.timer end: Elapsed: 0:00:38.780270
Mean Cross-Validation Training R^2: 0.48
Training R^2: 0.74
Validation R^2: 0.48
{'n_estimators': 190, 'max_depth': 10}
Interestingly, the training \(R^2\) score went down although there was a slight improvement in the validation \(R^2\).
permutor = PermutationImportance(euclidean_model, random_state=1).fit(
euclid_x_validate, euclid_y_validate)
ipython_html = eli5.show_weights(
permutor,
feature_names=euclid_x_validate.columns.tolist())
table = pandas.read_html(ipython_html.data)[0]
print(TABLE(table))
Weight | Feature |
---|---|
1.3370 ± 0.0469 | euclidean_distance |
0.3031 ± 0.0076 | absolute_change_latitude |
0.1179 ± 0.0046 | absolute_change_longitude |
0.0261 ± 0.0045 | dropoff_latitude |
0.0224 ± 0.0140 | dropoff_longitude |
0.0219 ± 0.0041 | pickup_longitude |
0.0183 ± 0.0051 | pickup_latitude |
According to the permutation importance, euclidean distance is much more important than the separate distances.
End
The suggested next tutorial is about Partial Dependence Plots.