Underfitting and Overfitting Exercise
Beginning
This is the fourth part of Kaggle's Introduction to Machine Learning tutorial - Overfitting and Underfitting.
Imports
Python
from argparse import Namespace
from datetime import datetime
from functools import partial
from pathlib import Path
PyPi
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
import hvplot.pandas
import pandas
Others
from graeae import EmbedHoloviews, EnvironmentLoader, Timer
Set Up
Plotting
SLUG = "underfitting-and-overfitting-exercise"
OUTPUT_PATH = Path("../../files/posts/tutorials/")/SLUG
Embed = partial(EmbedHoloviews, folder_path=OUTPUT_PATH)
Plot = Namespace(
height=800,
width=1000,
)
The Timer
TIMER = Timer()
Environment
ENVIRONMENT = EnvironmentLoader()
The Data
data = pandas.read_csv(
Path(ENVIRONMENT["HOUSE-PRICES-IOWA"]).expanduser()/"train.csv")
Middle
Preliminary 1: Specify Prediction Target
Select the target variable, which corresponds to the sales price. Save this to a new variable called `y`. You'll need to print a list of the columns to find the name of the column you need.
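As the prompt suggests, you can list the columns to find the target's name. A minimal sketch of the idea, using a hypothetical two-column stand-in for the Iowa training frame:

```python
import pandas

# hypothetical two-column stand-in for the Iowa training frame
data = pandas.DataFrame({"LotArea": [8450, 9600], "SalePrice": [208500, 181500]})

# print the column names and pick out the one holding the sale price
print(list(data.columns))
price_columns = [name for name in data.columns if "Price" in name]
```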
Our target is SalePrice.
Y = data.SalePrice
Preliminary 2: Create X
Now you will create a DataFrame called `X` holding the predictive features.
Since you want only some columns from the original data, you'll first create a list with the names of the columns you want in `X`.
You'll use just the following columns in the list (you can copy and paste the whole list to save some typing, though you'll still need to add quotes):
- LotArea
- YearBuilt
- 1stFlrSF
- 2ndFlrSF
- FullBath
- BedroomAbvGr
- TotRmsAbvGrd
FEATURES = [
"LotArea",
"YearBuilt",
"1stFlrSF",
"2ndFlrSF",
"FullBath",
"BedroomAbvGr",
"TotRmsAbvGrd",
]
X = data[FEATURES]
Split up the data into training and validation sets.
x_train, x_validate, y_train, y_validate = train_test_split(X, Y, random_state=1)
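For reference, `train_test_split` holds out 25% of the rows for validation by default. A quick sketch with toy arrays (not the housing data) to show the split sizes:

```python
import numpy
from sklearn.model_selection import train_test_split

x = numpy.arange(100).reshape(-1, 1)
y = numpy.arange(100)

# with no test_size argument, 25% of the rows go to the validation set
x_train, x_validate, y_train, y_validate = train_test_split(x, y, random_state=1)
```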
Preliminary 3: Specify and Fit Model
A Linear Regression Model
As a baseline, I'll fit a simple Linear Regression (ordinary-least-squares) model.
regression = LinearRegression()
scores = cross_val_score(regression, x_train, y_train, cv=5)
print(f"{scores.mean():0.2f} (+/- {2 * scores.std():0.2f})")
regression = regression.fit(x_train, y_train)
print(f"Training R^2: {regression.score(x_train, y_train): 0.2f}")
print(f"Validation R^2: {regression.score(x_validate, y_validate):0.2f}")
0.66 (+/- 0.17)
Training R^2:  0.68
Validation R^2: 0.77
Decision Tree
Create a `DecisionTreeRegressor` and save it as `iowa_model`. Ensure you've done the relevant import from sklearn to run this command. Then fit the model you just created using the data in `X` and `y` that you saved above.
tree = DecisionTreeRegressor()
scores = cross_val_score(tree, x_train, y_train, cv=5)
print(f"{scores.mean():0.2f} (+/- {2 * scores.std():0.2f})")
tree = tree.fit(x_train, y_train)
print(f"Training R^2: {tree.score(x_train, y_train): 0.2f}")
print(f"Validation R^2: {tree.score(x_validate, y_validate):0.2f}")
0.54 (+/- 0.32)
Training R^2:  1.00
Validation R^2: 0.75
So our linear regression actually does better than the tree. With a perfect training R^2 but a lower validation R^2, it looks like the tree is overfitting the training data.
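That telltale signature - a (near-)perfect training score alongside a weaker validation score - is easy to reproduce. A sketch on synthetic data (a noisy sine curve, not the housing data):

```python
import numpy
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# synthetic 1-D regression problem: a sine curve plus noise
rng = numpy.random.default_rng(1)
x = rng.uniform(0, 10, size=(200, 1))
y = numpy.sin(x).ravel() + rng.normal(scale=0.3, size=200)

x_train, x_test, y_train, y_test = train_test_split(x, y, random_state=1)

# an unconstrained tree grows until every training point is memorized
tree = DecisionTreeRegressor(random_state=0).fit(x_train, y_train)
train_r2 = tree.score(x_train, y_train)
test_r2 = tree.score(x_test, y_test)
```

Because every training point gets its own leaf, the training score is perfect while the validation score pays for the memorized noise.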
Preliminary 4: Make Some Predictions
tree_predict = tree.predict(x_validate)
regression_predict = regression.predict(x_validate)
Preliminary 5: Calculate the Mean Absolute Error in Validation Data
tree_mae = mean_absolute_error(y_true=y_validate, y_pred=tree_predict)
regression_mae = mean_absolute_error(y_true=y_validate, y_pred=regression_predict)
print(f"Tree MAE: {tree_mae: 0.2f}")
print(f"Regression MAE: {regression_mae: 0.2f}")
Tree MAE:  29371.52
Regression MAE:  27228.88
The tree's error is a little higher than the regression line's.
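Since the MAE values land on a dollar scale, it's worth remembering what's being computed: the mean of the absolute residuals. A quick check of the definition against sklearn, with made-up prices:

```python
import numpy
from sklearn.metrics import mean_absolute_error

# hypothetical sale prices in dollars
actual = numpy.array([200_000.0, 150_000.0, 300_000.0])
predicted = numpy.array([210_000.0, 140_000.0, 290_000.0])

# MAE is just the mean of the absolute residuals
by_hand = numpy.mean(numpy.abs(actual - predicted))
by_sklearn = mean_absolute_error(y_true=actual, y_pred=predicted)
```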
Step 1: Compare Different Tree Sizes
Write a loop that tries the following values for `max_leaf_nodes` from a set of possible values. Call the `get_mae` function on each value of `max_leaf_nodes`. Store the output in some way that allows you to select the value of `max_leaf_nodes` that gives the most accurate model on your data.
def get_mae(max_leaf_nodes, train_X=x_train, val_X=x_validate, train_y=y_train, val_y=y_validate):
model = DecisionTreeRegressor(max_leaf_nodes=max_leaf_nodes, random_state=0)
model.fit(train_X, train_y)
preds_val = model.predict(val_X)
mae = mean_absolute_error(val_y, preds_val)
return mae
Write a loop to find the ideal tree size from `candidate_max_leaf_nodes`.
candidate_max_leaf_nodes = [5, 25, 50, 100, 250, 500]
outcomes = [(get_mae(nodes), nodes) for nodes in candidate_max_leaf_nodes]
best = min(outcomes)
print(best)
best_tree_size = best[1]
(27282.50803885739, 100)
mae = pandas.DataFrame(dict(nodes=candidate_max_leaf_nodes, mae=[outcome[0] for outcome in outcomes]))
plot = mae.hvplot(x="nodes", y="mae").opts(title="Node Mean Absolute Error",
width=Plot.width,
height=Plot.height)
source = Embed(plot=plot, file_name="node_mean_absolute_error")()
print(source)
Looking at the plot you can see that the error drops until you hit 100 leaf nodes and then begins to rise again as the tree overfits the data with more nodes.
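As an aside, sklearn's `GridSearchCV` automates the same search, scoring each candidate with cross-validation instead of a single hold-out split. A sketch with synthetic data standing in for `x_train` and `y_train` so it runs on its own:

```python
import numpy
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeRegressor

# synthetic stand-in for x_train/y_train so the sketch is self-contained
rng = numpy.random.default_rng(1)
x_train = rng.uniform(0, 10, size=(300, 1))
y_train = numpy.sin(x_train).ravel() + rng.normal(scale=0.3, size=300)

# try each candidate leaf count, scoring by (negated) mean absolute error
search = GridSearchCV(
    DecisionTreeRegressor(random_state=0),
    param_grid={"max_leaf_nodes": [5, 25, 50, 100, 250, 500]},
    scoring="neg_mean_absolute_error",
    cv=5,
)
search.fit(x_train, y_train)
best_size = search.best_params_["max_leaf_nodes"]
```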
Let's see how much this improves our model using \(r^2\).
tree = DecisionTreeRegressor(max_leaf_nodes=best_tree_size)
scores = cross_val_score(tree, x_train, y_train, cv=5)
print(f"{scores.mean():0.2f} (+/- {2 * scores.std():0.2f})")
tree = tree.fit(x_train, y_train)
print(f"Training R^2: {tree.score(x_train, y_train): 0.2f}")
print(f"Validation R^2: {tree.score(x_validate, y_validate):0.2f}")
0.60 (+/- 0.26)
Training R^2:  0.93
Validation R^2: 0.76
We've improved it slightly; it's probably still overfitting the training data, but not as much.
Step 2: Fit Model Using All Data
You know the best tree size. If you were going to deploy this model in practice, you would make it even more accurate by using all of the data and keeping that tree size. That is, you don't need to hold out the validation data now that you've made all your modeling decisions.
final_model = DecisionTreeRegressor(max_leaf_nodes = best_tree_size)
final_model.fit(X, Y)
predictions_first = tree.predict(X)
predictions_final = final_model.predict(X)
x_y_tree = pandas.DataFrame(dict(predicted=predictions_first, actual=Y))
x_y_line = pandas.DataFrame(dict(predicted=predictions_final, actual=Y))
ideal = pandas.DataFrame(dict(x=Y, y=Y))
tree_plot = x_y_tree.hvplot.scatter(x="actual", y="predicted", label="Default")
line_plot = x_y_line.hvplot.scatter(x="actual", y="predicted", label="Tuned")
ideal_plot = ideal.hvplot(x="x", y="y")
plot = (tree_plot * line_plot * ideal_plot).opts(title="Decision Tree Actual Vs Predictions",
width=Plot.width,
height=Plot.height)
source = Embed(plot=plot, file_name="decision_tree_actual_vs_predicted")()
print(source)
The tuned model's predictions sit closer to the actual values.
End
That's a basic way to tune hyperparameters to improve your model. But our decision tree still isn't doing as well as the regression line. Next up we'll try an ensemble method - Random Forests.