Pulling Oregon Covid-19 Data

Beginning

This is a look at pulling Oregon Covid-19 data from their weekly update PDFs. The Oregon Health Authority publishes datasets for Oregon Covid-19 cases, but for some reason I can't find raw data sources that match what's in their weekly reports, so instead of searching endlessly I thought I'd pull the numbers from the PDFs themselves.

Set Up

Imports

# python
from pathlib import Path

import calendar
import datetime
import os

# pypi
from dateutil.relativedelta import relativedelta
from dotenv import load_dotenv

import requests

The Path

This is where I'm going to save the PDFs. To make it portable I usually keep the path in a file that dotenv loads as an environment variable. This way, if things get moved around, I can edit that file to change the path and the code that uses it doesn't have to change.

load_dotenv(Path("~/.local/share/env").expanduser())
FOLDER = Path(os.environ["OREGON_COVID"]).expanduser()
if not FOLDER.is_dir():
    FOLDER.mkdir(parents=True)

Middle

Starting the Dates

I could only find a link to the current PDF on their page (as of August 4, 2020). The URL embeds the date in it so to work backwards (and maybe later forwards) we need an easy way to set it. Here's what the URL looks like.

EXAMPLE_URL = "https://www.oregon.gov/oha/PH/DISEASESCONDITIONS/DISEASESAZ/Emerging%20Respitory%20Infections/COVID-19-Weekly-Report-2020-07-29-FINAL.pdf"

So to work with it I'll find out what day of the week July 29, 2020 was (the date in the URL).

START = datetime.date(2020, 7, 29)
WEEKDAY = START.weekday()
assert WEEKDAY == calendar.WEDNESDAY
print(WEEKDAY)
2

So it looks like it came out on a Wednesday, which I'm going to assume is the same every week. Now we can find the most recent Wednesday, which will be the start (or end, depending on how you look at it) of our date range.

TODAY = datetime.date.today()
LAST = TODAY + relativedelta(weekday=WEEKDAY) - relativedelta(weeks=1) 
print(LAST)
2020-08-12

Now I'm going to work backwards one week at a time to create the URLs and file names. Using relativedelta makes it easy-peasy to get prior weeks from a given date. Here's the Wednesday before the most recent one.

one_week = relativedelta(weeks = 1)
print(LAST - one_week)
2020-08-05

Now, I don't actually know when these PDFs started so I'm just going to work backwards until I get an error message from my HTTP request and then stop.

FILE_NAME = "COVID-19-Weekly-Report-{}-FINAL.pdf"
BASE_URL = "https://www.oregon.gov/oha/PH/DISEASESCONDITIONS/DISEASESAZ/Emerging%20Respitory%20Infections/{}"

for week in range(20):
    print(f"Checking back {week} weeks from this past Wednesday")
    date = LAST - (one_week * week)
    filename = FILE_NAME.format(date)
    output_path = FOLDER/filename
    if not output_path.is_file():
        url = BASE_URL.format(filename)
        response = requests.get(url)
        if not response.ok:
            print(f"Bad week: {week}\tDate: {date}")
            break
        print(f"Saving {filename}")
        with output_path.open("wb") as writer:            
            writer.write(response.content)
Checking back 0 weeks from this past Wednesday
Saving COVID-19-Weekly-Report-2020-08-12-FINAL.pdf
Checking back 1 weeks from this past Wednesday
Saving COVID-19-Weekly-Report-2020-08-05-FINAL.pdf
Checking back 2 weeks from this past Wednesday
Saving COVID-19-Weekly-Report-2020-07-29-FINAL.pdf
Checking back 3 weeks from this past Wednesday
Saving COVID-19-Weekly-Report-2020-07-22-FINAL.pdf
Checking back 4 weeks from this past Wednesday
Saving COVID-19-Weekly-Report-2020-07-15-FINAL.pdf
Checking back 5 weeks from this past Wednesday
Saving COVID-19-Weekly-Report-2020-07-08-FINAL.pdf
Checking back 6 weeks from this past Wednesday
Saving COVID-19-Weekly-Report-2020-07-01-FINAL.pdf
Checking back 7 weeks from this past Wednesday
Saving COVID-19-Weekly-Report-2020-06-24-FINAL.pdf
Checking back 8 weeks from this past Wednesday
Saving COVID-19-Weekly-Report-2020-06-17-FINAL.pdf
Checking back 9 weeks from this past Wednesday
Saving COVID-19-Weekly-Report-2020-06-10-FINAL.pdf
Checking back 10 weeks from this past Wednesday
Saving COVID-19-Weekly-Report-2020-06-03-FINAL.pdf
Checking back 11 weeks from this past Wednesday
Saving COVID-19-Weekly-Report-2020-05-27-FINAL.pdf
Checking back 12 weeks from this past Wednesday
Bad week: 12    Date: 2020-05-20

So there are twelve weekly PDFs, going back to May 27, 2020. That seems a little short, given that Oregon started telling people to stay home in March, but maybe they have the earlier data somewhere else. Anyway, next up is pulling the tables out of the PDFs using tabula-py.
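
As a preview, reading tables out of one of the downloaded reports might look something like the following sketch (tabula-py needs Java installed, and the page numbers and table layouts in the real reports still need checking).

# a sketch only: the table structure in the actual reports needs inspecting
import tabula

sample = FOLDER/"COVID-19-Weekly-Report-2020-08-12-FINAL.pdf"

# read_pdf returns a list of DataFrames, one per table it detects
tables = tabula.read_pdf(str(sample), pages="all")
print(f"Found {len(tables)} tables")
print(tables[0].head())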

Data Leakage

Beginning

Most people find target leakage very tricky until they've thought about it for a long time.

So, before trying to think about leakage in the housing price example, we'll go through a few examples in other applications. Things will feel more familiar once you come back to a question about house prices.

1. The Data Science of Shoelaces

Nike has hired you as a data science consultant to help them save money on shoe materials. Your first assignment is to review a model one of their employees built to predict how many shoelaces they'll need each month. The features going into the machine learning model include:

  • The current month (January, February, etc)
  • Advertising expenditures in the previous month
  • Various macroeconomic features (like the unemployment rate) as of the beginning of the current month
  • The amount of leather they ended up using in the current month

The results show the model is almost perfectly accurate if you include the feature about how much leather they used. But it is only moderately accurate if you leave that feature out. You realize this is because the amount of leather they use is a perfect indicator of how many shoes they produce, which in turn tells you how many shoelaces they need.

Do you think the leather used feature constitutes a source of data leakage? If your answer is "it depends," what does it depend on?

leather_used does seem like leakage, but it depends on whether the value is known before the shoelace predictions are made. If you won't know it in time for the predictions, then it is a leak. If there's a reason why you would always know the amount of leather used but somehow not know how many shoes were made, it might not be a leak, but it seems odd that it would be useful. It could be usable if it's a forecast of the amount of leather that will be needed, but it seems odd to use that to predict shoelaces.

2. Return of the Shoelaces

You have a new idea. You could use the amount of leather Nike ordered (rather than the amount they actually used) leading up to a given month as a predictor in your shoelace model.

Does this change your answer about whether there is a leakage problem? If you answer "it depends," what does it depend on?

Whether it is a leak will depend on whether the leather is always ordered before the shoelaces or not. If they are always ordered before shoelaces then it wouldn't be a leak.

3. Getting Rich With Cryptocurrencies?

You saved Nike so much money that they gave you a bonus. Congratulations.

Your friend, who is also a data scientist, says he has built a model that will let you turn your bonus into millions of dollars. Specifically, his model predicts the price of a new cryptocurrency (like Bitcoin, but a newer one) one day ahead of the moment of prediction. His plan is to purchase the cryptocurrency whenever the model says the price of the currency (in dollars) is about to go up.

The most important features in his model are:

  • Current price of the currency
  • Amount of the currency sold in the last 24 hours
  • Change in the currency price in the last 24 hours
  • Change in the currency price in the last 1 hour
  • Number of new tweets in the last 24 hours that mention the currency

The value of the cryptocurrency in dollars has fluctuated up and down by over $100 in the last year, and yet his model's average error is less than $1. He says this is proof his model is accurate, and you should invest with him, buying the currency whenever the model says it is about to go up.

Is he right? If there is a problem with his model, what is it?

No, the data isn't leaking; all of the features are available at the moment of prediction.

4. Preventing Infections

An agency that provides healthcare wants to predict which patients from a rare surgery are at risk of infection, so it can alert the nurses to be especially careful when following up with those patients.

You want to build a model. Each row in the modeling dataset will be a single patient who received the surgery, and the prediction target will be whether they got an infection.

Some surgeons may do the procedure in a manner that raises or lowers the risk of infection. But how can you best incorporate the surgeon information into the model?

You have a clever idea.

  1. Take all surgeries by each surgeon and calculate the infection rate among those surgeons.
  2. For each patient in the data, find out who the surgeon was and plug in that surgeon's average infection rate as a feature.

Does this pose any target leakage issues? Does it pose any train-test contamination issues?

The infection rate would have a target leak if the calculated value includes the patient whose row it is added to.

You would have train-test contamination if you calculated this value using both the train and test set. You would have to calculate it only on the training set to avoid contamination.
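
To make that concrete, here's a small made-up example (the surgeon and infection columns are hypothetical, there's no surgery dataset in this post) that computes the rate from the training rows only and then maps it onto both sets.

import pandas

surgeries = pandas.DataFrame({
    "surgeon": ["A", "A", "B", "B", "B"],
    "infection": [1, 0, 0, 1, 0],
})
new_patients = pandas.DataFrame({"surgeon": ["A", "B", "C"]})

# the rate is computed from the training rows only, so the test rows can't
# contaminate it (a leave-one-out version would also exclude each patient's
# own outcome to avoid target leakage)
surgeon_rate = surgeries.groupby("surgeon")["infection"].mean()

surgeries["surgeon_rate"] = surgeries["surgeon"].map(surgeon_rate)
new_patients["surgeon_rate"] = (new_patients["surgeon"]
                                .map(surgeon_rate)
                                .fillna(surgeries["infection"].mean()))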

5. Housing Prices

You will build a model to predict housing prices. The model will be deployed on an ongoing basis, to predict the price of a new house when a description is added to a website. Here are four features that could be used as predictors.

  1. Size of the house (in square meters)
  2. Average sales price of homes in the same neighborhood
  3. Latitude and longitude of the house
  4. Whether the house has a basement

You have historic data to train and validate the model.

Which of the features is most likely to be a source of leakage?

Average sales price of homes in the same neighborhood. If the home was sold in the past, then it would contribute to the average.
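
As a sketch of how you might avoid that (the column names here are hypothetical, not from the Iowa data used later), you could compute each home's neighborhood average using only sales that happened before it.

import pandas

sales = pandas.DataFrame({
    "neighborhood": ["N1", "N1", "N1", "N2"],
    "sale_date": pandas.to_datetime(
        ["2019-01-01", "2019-06-01", "2020-01-01", "2019-03-01"]),
    "price": [200_000, 210_000, 250_000, 150_000],
}).sort_values("sale_date")

# expanding mean of *prior* sales in the same neighborhood; shift(1) keeps each
# sale's own price (the target) out of its feature, and the first sale in a
# neighborhood gets NaN because there's nothing to average yet
sales["neighborhood_average"] = (
    sales.groupby("neighborhood")["price"]
    .transform(lambda prices: prices.expanding().mean().shift(1)))
print(sales)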

Leakage is a hard and subtle issue. You should be proud if you picked up on the issues in these examples.

Now you have the tools to make highly accurate models, and pick up on the most difficult practical problems that arise with applying these models to solve real problems.

XGBoost

Beginning

In this exercise, you will leverage what you've learned to build and tune a gradient boosting model with XGBoost.

Imports

Python

from argparse import Namespace
from functools import partial
from pathlib import Path

import random

PyPi

from eli5.sklearn import PermutationImportance
from matplotlib import pyplot

from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer
from sklearn.impute import SimpleImputer
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import cross_val_score, train_test_split, RandomizedSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

from xgboost import XGBRegressor

from tabulate import tabulate

import eli5
import hvplot.pandas
import numpy
import pandas
import shap

Others

from graeae import CountPercentage, EmbedHoloviews, EnvironmentLoader, Timer

Set Up

Table

TABLE = partial(tabulate, tablefmt="orgtbl", showindex=False, headers="keys")

Plotting

SLUG = "xgboost"
OUTPUT_PATH = Path("../../files/posts/tutorials/")/SLUG
if not OUTPUT_PATH.is_dir():
    OUTPUT_PATH.mkdir()
Embed = partial(EmbedHoloviews, folder_path=OUTPUT_PATH)
Plot = Namespace(
    height=800,
    width=1000,
)

The Timer

TIMER = Timer()

Environment

ENVIRONMENT = EnvironmentLoader()

The Data

DATA_PATH = Path(ENVIRONMENT["HOUSE-PRICES-IOWA"]).expanduser()
train_data = pandas.read_csv(
    DATA_PATH/"train.csv", index_col="Id")

test_data = pandas.read_csv(
    DATA_PATH/"test.csv", index_col="Id"
)
X_test = test_data

Some Constants

Data = Namespace(
    target="SalePrice",
    train_size=0.8,
    test_size=0.2,
    random_seed=0,
)

Setup The Data

Split up the target and features.

assert not train_data[Data.target].hasnans
y = train_data[Data.target]
X = train_data.drop([Data.target], axis="columns")
X_train, X_validate, y_train, y_validate = train_test_split(
    X, y,
    train_size=Data.train_size, test_size=Data.test_size,
    random_state=Data.random_seed)

Selecting columns

"Cardinality" means the number of unique values in a column. Select categorical columns with relatively low cardinality (convenient but arbitrary).

low_cardinality_cols = [
    cname for cname in X_train.columns if X_train[cname].nunique() < 10 and
    X_train[cname].dtype == "object"]

Select numeric columns.

numeric_columns = [column for column in X_train.columns
                   if X_train[column].dtype in ['int64', 'float64']]

Keep selected columns only

keep_columns = low_cardinality_cols + numeric_columns
X_train = X_train[keep_columns].copy()
X_validate = X_validate[keep_columns].copy()
X_test = X_test[keep_columns].copy()

One-Hot Encoding

One-hot encode the data (to shorten the code, we use pandas).

X_train = pandas.get_dummies(X_train)
X_validate = pandas.get_dummies(X_validate)
X_test = pandas.get_dummies(X_test)
X_train, X_validate = X_train.align(X_validate, join='left', axis=1)
X_train, X_test = X_train.align(X_test, join='left', axis=1)

Step 1: Build model

In this step, you'll build and train your first model with gradient boosting.

  • Begin by setting my_model_1 to an XGBoost model. Use the XGBRegressor class, and set the random seed to 0 (random_state=0). Leave all other parameters as default.
  • Then, fit the model to the training data in X_train and y_train.
model = XGBRegressor(random_state=Data.random_seed)
model.fit(X_train, y_train)
predictions_1 = model.predict(X_validate)

Finally, use the mean_absolute_error() function to calculate the mean absolute error (MAE) corresponding to the predictions for the validation set. Recall that the labels for the validation data are stored in y_valid.

mae_1 = mean_absolute_error(predictions_1, y_validate)
print(f"Mean Absolute Error: {mae_1}")
Mean Absolute Error: 17662.736729452055

Step 2: Improve the model

Now that you've trained a default model as baseline, it's time to tinker with the parameters, to see if you can get better performance.

  • Begin by setting my_model_2 to an XGBoost model, using the XGBRegressor class. Use what you learned in the previous tutorial to figure out how to change the default parameters (like n_estimators and learning_rate) to get better results.
  • Then, fit the model to the training data in X_train and y_train.
  • Set predictions_2 to the model's predictions for the validation data. Recall that the validation features are stored in X_valid.
  • Finally, use the mean_absolute_error() function to calculate the mean absolute error (MAE) corresponding to the predictions on the validation set. Recall that the labels for the validation data are stored in y_valid.
estimators = list(range(50, 200, 10))
max_depth = list(range(10, 100, 10)) + [None]
learning_rate = 0.05 * numpy.array(range(1, 10))

grid = dict(n_estimators=estimators,
            max_depth=max_depth)
            #learning_rate=learning_rate)

model = XGBRegressor(random_state=Data.random_seed, learning_rate=0.05)
search = RandomizedSearchCV(estimator=model,
                            param_distributions=grid,
                            n_iter=40,
                            scoring="neg_mean_absolute_error",
                            n_jobs=-1,
                            random_state=1)

X_cv = pandas.concat([X_train, X_validate])
y_cv = pandas.concat([y_train, y_validate])
with TIMER:
    search.fit(X_cv, y_cv)
first_model = search.best_estimator_
print(f"CV Training MAE: {-search.best_score_:0.2f}")
print(search.best_params_)
2020-03-01 21:34:37,418 graeae.timers.timer start: Started: 2020-03-01 21:34:37.418449
2020-03-01 21:36:24,107 graeae.timers.timer end: Ended: 2020-03-01 21:36:24.107671
2020-03-01 21:36:24,109 graeae.timers.timer end: Elapsed: 0:01:46.689222
CV Training MAE: 16048.16
{'n_estimators': 160, 'max_depth': None}
outcome = pandas.DataFrame({"Score": search.cv_results_["mean_test_score"]})
early_stopping_model = XGBRegressor(random_state=Data.random_seed,
                                    learning_rate=0.05,
                                    early_stopping_rounds=5,
                                    n_estimators=500)
early_stopping_model.fit(X_train, y_train, eval_set=[(X_validate, y_validate)],
                         verbose=False)
print(mean_absolute_error(early_stopping_model.predict(X_validate),
                          y_validate))
16728.27523009418
params = model.get_params()
print(f"Trees: {params['n_estimators']}")
Trees: 100

Step 3: Break the model

In this step, you will create a model that performs worse than the original model in Step 1. This will help you to develop your intuition for how to set parameters. You might even find that you accidentally get better performance, which is ultimately a nice problem to have and a valuable learning experience!

  • Begin by setting my_model_3 to an XGBoost model, using the XGBRegressor class. Use what you learned in the previous tutorial to figure out how to change the default parameters (like n_estimators and learning_rate) to design a model to get high MAE.
  • Then, fit the model to the training data in X_train and y_train.
  • Set predictions_3 to the model's predictions for the validation data. Recall that the validation features are stored in X_valid.
  • Finally, use the mean_absolute_error() function to calculate the mean absolute error (MAE) corresponding to the predictions on the validation set. Recall that the labels for the validation data are stored in y_valid.
parameters = random.choice(search.cv_results_["params"])
print(parameters)
model = XGBRegressor(random_state=Data.random_seed, **parameters)
with TIMER:
    model.fit(X_train, y_train)
print(f"MAE: {mean_absolute_error(model.predict(X_validate), y_validate)}")
2020-02-29 20:55:54,573 graeae.timers.timer start: Started: 2020-02-29 20:55:54.573008
{'n_estimators': 60, 'max_depth': 20, 'learning_rate': 0.7000000000000001}
2020-02-29 20:55:55,019 graeae.timers.timer end: Ended: 2020-02-29 20:55:55.019335
2020-02-29 20:55:55,020 graeae.timers.timer end: Elapsed: 0:00:00.446327
MAE: 25077.38005672089

Numeric Values Only

imputer = IterativeImputer(random_state=Data.random_seed)

final_x_train = pandas.DataFrame(imputer.fit_transform(X_train),
                                 columns=X_train.columns)
early_stopping_model_2 = XGBRegressor(random_state=Data.random_seed,
                                      learning_rate=0.05,
                                      early_stopping_rounds=5,
                                      n_estimators=500)
early_stopping_model_2.fit(final_x_train, y_train, eval_set=[(X_validate, y_validate)],
                         verbose=False)
print(mean_absolute_error(early_stopping_model_2.predict(X_validate),
                          y_validate))
16516.399373929795

End

Make a Submission using the XGB early-stopping model

predictions = early_stopping_model.predict(X_test)
output = pandas.DataFrame({'Id': X_test.index,
                           'SalePrice': predictions})
output.to_csv(DATA_PATH/'submission.csv', index=False)

This got a score of 14777.96266 on the kaggle submissions page.

Random Search CV

predictions = search.predict(X_test)
output = pandas.DataFrame({'Id': X_test.index,
                           'SalePrice': predictions})
output.to_csv(DATA_PATH/'submission.csv', index=False)

This one got a score of 14976.55345, so the early stopping model is the best one so far… It ended up with fewer trees than the model the RandomizedSearchCV chose, so maybe the random search overfit the data.

Early Stopping with Imputation

predictions = early_stopping_model_2.predict(X_test)
output = pandas.DataFrame({'Id': X_test.index,
                           'SalePrice': predictions})
output.to_csv(DATA_PATH/'submission.csv', index=False)

This gets a score of 14965.20801 so it looks like the XGBoost model without imputation is the best one.

Cross Validation

Beginning

In this exercise, you will leverage what you've learned to tune a machine learning model with cross-validation.

Imports

Python

from argparse import Namespace
from functools import partial
from pathlib import Path

PyPi

from eli5.sklearn import PermutationImportance
from matplotlib import pyplot

from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer
from sklearn.impute import SimpleImputer
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

from tabulate import tabulate

import eli5
import hvplot.pandas
import pandas
import shap

Others

from graeae import CountPercentage, EmbedHoloviews, EnvironmentLoader, Timer

Set Up

Table

TABLE = partial(tabulate, tablefmt="orgtbl", showindex=False, headers="keys")

Plotting

SLUG = "cross-validation"
OUTPUT_PATH = Path("../../files/posts/tutorials/")/SLUG
if not OUTPUT_PATH.is_dir():
    OUTPUT_PATH.mkdir()
Embed = partial(EmbedHoloviews, folder_path=OUTPUT_PATH)
Plot = Namespace(
    height=800,
    width=1000,
)

The Timer

TIMER = Timer()

Environment

ENVIRONMENT = EnvironmentLoader()

The Data

DATA_PATH = Path(ENVIRONMENT["HOUSE-PRICES-IOWA"]).expanduser()
train_data = pandas.read_csv(
    DATA_PATH/"train.csv", index_col="Id")

test_data = pandas.read_csv(
    DATA_PATH/"test.csv", index_col="Id"
)
X_test = test_data

Some Constants

Data = Namespace(
    target="SalePrice",
    train_size=0.8,
    test_size=0.2,
    random_seed=0,
)

Setup The Data

Split up the target and features.

assert not train_data[Data.target].hasnans
y = train_data[Data.target]
X = train_data.drop([Data.target], axis="columns")

Drop Categorical Columns

To make it simpler (since we're only looking at cross-validation, and having them didn't seem to help) we're going to drop the categorical columns.

numeric_columns = [column for column in X.columns if not X[column].dtype == "object"]
rows_0, columns_0 = X.shape
X = X[numeric_columns].copy()
row, columns = X.shape
print(f"Keeping {columns} columns, dropped ({columns_0 - columns})")
Keeping 36 columns, dropped (43)

Middle

Some Pipelines

So far, you've learned how to build pipelines with scikit-learn. For instance, the pipeline below will use SimpleImputer() to replace missing values in the data, before using RandomForestRegressor() to train a random forest model to make predictions. We set the number of trees in the random forest model with the n_estimators parameter, and setting random_state ensures reproducibility.

pipeline = Pipeline(steps=[
    ('preprocessor', SimpleImputer()),
    ('model', RandomForestRegressor(n_estimators=50, random_state=Data.random_seed))
])

You have also learned how to use pipelines in cross-validation. The code below uses the cross_val_score() function to obtain the mean absolute error (MAE), averaged across five different folds. Recall we set the number of folds with the cv parameter.

# Multiply by -1 since sklearn calculates *negative* MAE
scores = -1 * cross_val_score(pipeline, X, y,
                              cv=5,
                              scoring='neg_mean_absolute_error')

print("Average MAE score:", scores.mean())
Average MAE score: 18276.410356164386

Step 1: Write a useful function

In this exercise, you'll use cross-validation to select parameters for a machine learning model.

Begin by writing a function get_score() that reports the average (over three cross-validation folds) MAE of a machine learning pipeline that uses:

  • the data in X and y to create folds,
  • SimpleImputer() (with all parameters left as default) to replace missing values, and
  • RandomForestRegressor() (with random_state=0) to fit a random forest model.

The n_estimators parameter supplied to get_score() is used when setting the number of trees in the random forest model.

def get_score(n_estimators):
    """Return the average MAE over 3 CV folds of random forest model.

    Args:
     n_estimators: the number of trees in the forest
    """
    pipeline = Pipeline(steps=[
        ('preprocessor', SimpleImputer()),
        ('model', RandomForestRegressor(n_estimators=n_estimators,
                                        random_state=Data.random_seed))
    ])
    scores = -1 * cross_val_score(pipeline, X, y,
                                  cv=3,
                                  scoring='neg_mean_absolute_error')
    return scores.mean()

Step 2: Test different parameter values

Now, you will use the function that you defined in Step 1 to evaluate the model performance corresponding to eight different values for the number of trees in the random forest: 50, 100, 150, …, 300, 350, 400. Store your results in a Python dictionary results, where results[i] is the average MAE returned by get_score(i).

results = {trees: get_score(trees) for trees in range(50, 450, 50)}

results_frame = pandas.DataFrame.from_dict({"Trees": list(results.keys()), "MAE": list(results.values())})

Step 3: Find the best parameter value

plot = results_frame.hvplot(x="Trees", y="MAE").opts(
    title="Cross-Validation Mean Absolute Error",
    width=Plot.width,
    height=Plot.height)

source = Embed(plot=plot, file_name="mean_absolute_error")()
print(source)
Figure Missing (Cross-Validation Mean Absolute Error plot)

200 appears to be the best number of trees for our forest.
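
Since the embedded figure may not render, it's easy to double-check by pulling the minimum straight out of the results dictionary built in Step 2.

# pick the tree count with the lowest cross-validated MAE directly
best_trees = min(results, key=results.get)
print(f"Best: {best_trees} trees (MAE: {results[best_trees]:,.2f})")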

Pipelines

Beginning

In this exercise, you will use pipelines to improve the efficiency of your machine learning code.

Imports

Python

from argparse import Namespace
from functools import partial
from pathlib import Path

PyPi

from eli5.sklearn import PermutationImportance
from matplotlib import pyplot

from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer
from sklearn.impute import SimpleImputer
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

from tabulate import tabulate

import eli5
import pandas
import shap

Others

from graeae import CountPercentage, EmbedHoloviews, EnvironmentLoader, Timer

Set Up

Table

TABLE = partial(tabulate, tablefmt="orgtbl", showindex=False, headers="keys")

Plotting

SLUG = "kaggle-intermediate-machine-learning-pipelines"
OUTPUT_PATH = Path("../../files/posts/tutorials/")/SLUG
if not OUTPUT_PATH.is_dir():
    OUTPUT_PATH.mkdir()
Embed = partial(EmbedHoloviews, folder_path=OUTPUT_PATH)
Plot = Namespace(
    height=800,
    width=1000,
)

The Timer

TIMER = Timer()

Environment

ENVIRONMENT = EnvironmentLoader()

The Data

DATA_PATH = Path(ENVIRONMENT["HOUSE-PRICES-IOWA"]).expanduser()
train_data = pandas.read_csv(
    DATA_PATH/"train.csv", index_col="Id")

test_data = pandas.read_csv(
    DATA_PATH/"test.csv", index_col="Id"
)
X_test = test_data

Some Constants

Data = Namespace(
    target="SalePrice",
    train_size=0.8,
    test_size=0.2,
    random_seed=0,
)

Setup The Data

Split up the target and features.

assert not train_data[Data.target].hasnans
y = train_data[Data.target]
X = train_data.drop([Data.target], axis="columns")
X_train, X_validate, y_train, y_validate = train_test_split(
    X, y,
    train_size=Data.train_size, test_size=Data.test_size,
    random_state=Data.random_seed)

Trim the columns

# Select categorical columns with relatively low cardinality (convenient but arbitrary)
categorical_columns = [column for column in X_train.columns if
                       X_train[column].nunique() < 10 and 
                       X_train[column].dtype == object]

# Select numerical columns
numerical_columns = [column for column in X_train.columns if 
                     X_train[column].dtype in ['int64', 'float64']]

# Keep selected columns only
columns = categorical_columns + numerical_columns
X_train = X_train[columns].copy()
X_validate = X_validate[columns].copy()
X_test = X_test[columns].copy()

Middle

Preprocess Data and Train the Model

The missing numeric values will be filled in with a simple imputer. When the strategy is set to constant, it fills missing values with a single value (0 by default).

numerical_transformer = SimpleImputer(strategy='constant')

Now the categorical data transformer. We'll fill missing values in each column with its most frequent value and then do one-hot encoding.

categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])

Now we can bundle them together into a single transformer.

preprocessor = ColumnTransformer(
    transformers=[
        ('num', numerical_transformer, numerical_columns),
        ('cat', categorical_transformer, categorical_columns)
    ])

Define The Model

model = RandomForestRegressor(n_estimators=100, random_state=0)

Build the Pipeline

pipeline = Pipeline(steps=[('preprocessor', preprocessor),
                      ('model', model)
                     ])

Fit the Model

# Preprocessing of training data, fit model 
pipeline.fit(X_train, y_train)

Score the Model

preds = pipeline.predict(X_validate)

print(f"MAE: {mean_absolute_error(y_validate, preds):,}")
MAE: 17,861.780102739725

Improving the Performance

Now, it's your turn! In the code cell below, define your own preprocessing steps and random forest model. Fill in values for the following variables:

  • numerical_transformer
  • categorical_transformer
  • model
numerical_transformer = IterativeImputer(random_state=Data.random_seed)

I'll use the same categorical imputer.

categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])

And now we bundle them together.

preprocessor = ColumnTransformer(
    transformers=[
        ('numeric', numerical_transformer, numerical_columns),
        ('categorical', categorical_transformer, categorical_columns)
    ])

Now build and train the model.

model = RandomForestRegressor(n_estimators=50, max_depth=60, random_state=0)
pipeline = Pipeline(steps=[('preprocessor', preprocessor),
                           ('model', model)
                           ])
pipeline.fit(X_train, y_train)
predictions = pipeline.predict(X_validate)

print(f"MAE: {mean_absolute_error(y_validate, predictions):,}")
MAE: 17,556.51404109589

So we improved slightly, but we're still not doing as well as with the numeric only dataset.

SHAP

with TIMER:
    training = X_train[numerical_columns + categorical_columns]
    data = preprocessor.fit_transform(training)
    columns = (numerical_columns
               + list(preprocessor.named_transformers_["categorical"]["onehot"].get_feature_names()))
    data = pandas.DataFrame(
        data,
        columns=columns)
    explainer = shap.TreeExplainer(model)
    shap_values = explainer.shap_values(data)
2020-03-01 19:25:58,515 graeae.timers.timer start: Started: 2020-03-01 19:25:58.514631
Setting feature_perturbation = "tree_path_dependent" because no background data was given.
2020-03-01 19:26:16,103 graeae.timers.timer end: Ended: 2020-03-01 19:26:16.103820
2020-03-01 19:26:16,104 graeae.timers.timer end: Elapsed: 0:00:17.589189
shap.summary_plot(shap_values, data)
figure = pyplot.gcf()
output = "shap_summary.png"

figure.savefig(OUTPUT_PATH/output)
print(f"[[file:{output}]]")

[[file:shap_summary.png]]

shap.initjs()
html = shap.force_plot(explainer.expected_value, shap_values, data)
output_file = "force_plot.html"
output = OUTPUT_PATH/output_file
with output.open("w") as writer:
    shap.save_html(writer, html, False)

print(f"""
#+begin_export html
: <object type="text/html" data="{output_file}" style="width:100%" height=800>
:   <p>Figure Missing</p>
: </object>
#+end_export
""")
Figure Missing (interactive SHAP force plot)

End

Categorical Values

Beginning

Now it's your turn to test your new knowledge of handling categorical variables. You'll probably find it makes a big difference.

Imports

Python

from argparse import Namespace
from datetime import datetime
from functools import partial
from pathlib import Path

PyPi

from eli5.sklearn import PermutationImportance
from matplotlib import pyplot
from pdpbox import pdp

from sklearn.ensemble import RandomForestRegressor
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import cross_val_score, train_test_split, RandomizedSearchCV
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

from tabulate import tabulate

import eli5
import hvplot.pandas
import pandas
import shap

Others

from graeae import CountPercentage, EmbedHoloviews, EnvironmentLoader, Timer

Set Up

Table

TABLE = partial(tabulate, tablefmt="orgtbl", showindex=False, headers="keys")

Plotting

SLUG = "categorical-values"
OUTPUT_PATH = Path("../../files/posts/tutorials/")/SLUG
Embed = partial(EmbedHoloviews, folder_path=OUTPUT_PATH)
Plot = Namespace(
    height=800,
    width=1000,
)

The Timer

TIMER = Timer()

Environment

ENVIRONMENT = EnvironmentLoader()

The Data

DATA_PATH = Path(ENVIRONMENT["HOUSE-PRICES-IOWA"]).expanduser()
train_data = pandas.read_csv(
    DATA_PATH/"train.csv", index_col="Id")

test_data = pandas.read_csv(
    DATA_PATH/"test.csv", index_col="Id"
)

Some Constants

Data = Namespace(
    target="SalePrice",
    train_size=0.8,
    test_size=0.2,
    random_seed=0,
)

Middle

Setup The Data

Split up the target and features.

assert not train_data[Data.target].hasnans
y = train_data[Data.target]
X = train_data.drop([Data.target], axis="columns")

We know that there's missing data, but since this is about handling categorical data, not missing data, we'll just drop the columns that have missing values.

print(X.shape)
X = X[[column for column in X.columns if not X[column].hasnans]]
test_data = test_data[X.columns]
print(X.shape)
(1460, 79)
(1460, 60)

So we lost 19 columns - more than I was expecting.

Now do the train-test split.

X_train, X_validate, y_train, y_validate = train_test_split(
    X, y,
    train_size=Data.train_size, test_size=Data.test_size,
    random_state=Data.random_seed)
print(X_train.info())

Notice that the dataset contains both numerical and categorical variables. You'll need to encode the categorical data before training a model.

Score Dataset

This is the same function used in the missing-values tutorial. It's used to compare different models' Mean Absolute Error (MAE) as we make changes.

def score_dataset(X_train, X_valid, y_train, y_valid):
    model = RandomForestRegressor(n_estimators=100, random_state=0)
    model.fit(X_train, y_train)
    preds = model.predict(X_valid)
    return mean_absolute_error(y_valid, preds)

Step 1: Drop Categorical Columns

The first approach is to just drop all the non-numeric columns.

columns = [column for column in X_train.columns if X_train[column].dtype != object]
drop_X_train = X_train[columns]
drop_X_validate = X_validate[columns]

print("MAE from Approach 1 (Drop categorical variables):")
print(f"{score_dataset(drop_X_train, drop_X_validate, y_train, y_validate):,}")

Using all the numeric columns does better than we did with our initial subset of columns (20,928.5), but not as well as we did with imputed values (16,656.3).

Step 2: Label encoding

Before jumping into label encoding, we'll investigate the dataset. Specifically, we'll look at the 'Condition2' column. The code cell below prints the unique entries in both the training and validation sets.

train_counter = CountPercentage(X_train.Condition2, dropna=False)
validate_counter = CountPercentage(X_validate.Condition2, dropna=False)
train_counter()
| Value  | Count | Percent (%) |
|--------+-------+-------------|
| Norm   |  1160 |       99.32 |
| Feedr  |     4 |        0.34 |
| PosA   |     1 |        0.09 |
| Artery |     1 |        0.09 |
| RRAe   |     1 |        0.09 |
| PosN   |     1 |        0.09 |
validate_counter()
| Value  | Count | Percent (%) |
|--------+-------+-------------|
| Norm   |   285 |       97.60 |
| Feedr  |     2 |        0.68 |
| RRNn   |     2 |        0.68 |
| RRAn   |     1 |        0.34 |
| Artery |     1 |        0.34 |
| PosN   |     1 |        0.34 |

It looks like the validation data has values that aren't in the training data (and vice versa), e.g. RRNn, so encoding the training set won't work with the validation set.

This is a common problem that you'll encounter with real-world data, and there are many approaches to fixing this issue. For instance, you can write a custom label encoder to deal with new categories. The simplest approach, however, is to drop the problematic categorical columns.
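
Just to sketch what that custom-encoder idea could look like (this isn't used anywhere below), something along these lines would map categories unseen in training to a reserved code:

def encode_with_default(train_series, other_series, unknown=-1):
    """Integer-encode a column, sending categories unseen in training to `unknown`."""
    mapping = {value: code for code, value in enumerate(sorted(train_series.unique()))}
    return (train_series.map(mapping),
            other_series.map(mapping).fillna(unknown).astype(int))

train_encoded, validate_encoded = encode_with_default(
    X_train["Condition2"], X_validate["Condition2"])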

Run the code cell below to save the problematic columns to a Python list bad_label_cols. Likewise, columns that can be safely label encoded are stored in good_label_cols.

# All categorical columns
object_columns = [column for column in X_train.columns if X_train[column].dtype == "object"]

# Columns that can be safely label encoded
good_label_columns = [column for column in object_columns if 
                      set(X_train[column]) == set(X_validate[column])]

# Problematic columns that will be dropped from the dataset
bad_label_columns = list(set(object_columns)-set(good_label_columns))

print('Categorical columns that will be label encoded:')
for column in  good_label_columns:
    print(f" - {column}")

print('\nCategorical columns that will be dropped from the dataset:')
for column in bad_label_columns:
    print(f" - {column}")

Categorical columns that will be label encoded:

  • MSZoning
  • Street
  • LotShape
  • LandContour
  • LotConfig
  • BldgType
  • HouseStyle
  • ExterQual
  • CentralAir
  • KitchenQual
  • PavedDrive
  • SaleCondition

Categorical columns that will be dropped from the dataset:

  • Condition1
  • RoofMatl
  • HeatingQC
  • ExterCond
  • RoofStyle
  • SaleType
  • Foundation
  • Condition2
  • Exterior2nd
  • Neighborhood
  • Heating
  • LandSlope
  • Utilities
  • Functional
  • Exterior1st

Drop the Bad Columns

label_X_train = X_train.drop(bad_label_columns, axis="columns")
label_X_validate = X_validate.drop(bad_label_columns, axis="columns")

Encode the Categorical Values

We're going to use sklearn's LabelEncoder.

Note: Sklearn's documentation says that this is meant only for categorical target data (the labels), not the input data like we're doing here. Later on we're going to use one-hot-encoding, which is what sklearn recommends (the LabelEncoder method implies that the numbers are values, not just numeric codes for strings).

It's going to create integer values for each of the unique values in each column.

for column in good_label_columns:
    encoder = LabelEncoder()
    label_X_train.loc[:, column] = encoder.fit_transform(label_X_train[column])
    # re-use the encoder fit on the training column; this is safe here because
    # the "good" columns share the same value sets in both splits
    label_X_validate.loc[:, column] = encoder.transform(label_X_validate[column])
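
As an aside (not used in the rest of this post), sklearn's OrdinalEncoder is the feature-oriented counterpart to LabelEncoder and handles all the columns in one shot; a rough equivalent would be:

from sklearn.preprocessing import OrdinalEncoder

# fits one mapping per column on the training data and re-uses it for validation
ordinal = OrdinalEncoder()
ordinal_X_train = label_X_train.copy()
ordinal_X_validate = label_X_validate.copy()
ordinal_X_train[good_label_columns] = ordinal.fit_transform(X_train[good_label_columns])
ordinal_X_validate[good_label_columns] = ordinal.transform(X_validate[good_label_columns])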

Now check how it did.

print("MAE from Approach 2 (Label Encoding):") 
print(f"{score_dataset(label_X_train, label_X_validate, y_train, y_validate):,}")
MAE from Approach 2 (Label Encoding):
17,575.291883561644

So it does a little better than the previous approach of just dropping all the categorical data, but not as well as it did when we imputed the missing numeric values.

Step 3: Investigating cardinality

So far, you've tried two different approaches to dealing with categorical variables. And, you've seen that encoding categorical data yields better results than removing columns from the dataset.

Soon, you'll try one-hot encoding. Before then, there's one additional topic we need to cover. Begin by running the next code cell without changes.

Get number of unique entries in each column with categorical data

object_nunique = [X_train[column].nunique() for column in object_columns]

# Print the number of unique entries by column, in descending order
cardinality = pandas.DataFrame(dict(Column=object_columns,
                                    Cardinality=object_nunique)
                     ).sort_values(by="Cardinality", ascending=False)
print(TABLE(cardinality))
| Column        | Cardinality |
|---------------+-------------|
| Neighborhood  |          25 |
| Exterior2nd   |          16 |
| Exterior1st   |          15 |
| SaleType      |           9 |
| Condition1    |           9 |
| HouseStyle    |           8 |
| RoofMatl      |           7 |
| Functional    |           6 |
| Heating       |           6 |
| Foundation    |           6 |
| RoofStyle     |           6 |
| SaleCondition |           6 |
| Condition2    |           6 |
| BldgType      |           5 |
| ExterCond     |           5 |
| LotConfig     |           5 |
| HeatingQC     |           5 |
| MSZoning      |           5 |
| ExterQual     |           4 |
| KitchenQual   |           4 |
| LandContour   |           4 |
| LotShape      |           4 |
| LandSlope     |           3 |
| PavedDrive    |           3 |
| Street        |           2 |
| Utilities     |           2 |
| CentralAir    |           2 |

The output above shows, for each column with categorical data, the number of unique values in the column. For instance, the 'Street' column in the training data has two unique values: 'Grvl' and 'Pave', corresponding to a gravel road and a paved road, respectively.

We refer to the number of unique entries of a categorical variable as the cardinality of that categorical variable. For instance, the 'Street' variable has cardinality 2.

Questions

How many categorical variables in the training data have cardinality greater than 10?

print(len(cardinality[cardinality.Cardinality > 10]))
3

How many columns are needed to one-hot encode the 'Neighborhood' variable in the training data?

print(cardinality[cardinality.Column=="Neighborhood"].Cardinality.iloc[0])
25

For large datasets with many rows, one-hot encoding can greatly expand the size of the dataset. For this reason, we typically will only one-hot encode columns with relatively low cardinality. Then, high cardinality columns can either be dropped from the dataset, or we can use label encoding.

As an example, consider a dataset with 10,000 rows, and containing one categorical column with 100 unique entries.

  • If this column is replaced with the corresponding one-hot encoding, how many entries are added to the dataset?
  • If we instead replace the column with the label encoding, how many entries are added?
print(10000 * 100 - 10000)
990000

Label encoding just replaces the column with a single column of integer codes, so it adds no new entries.

Step 4: One-hot encoding

In this step, you'll experiment with one-hot encoding. But, instead of encoding all of the categorical variables in the dataset, you'll only create a one-hot encoding for columns with cardinality less than 10.

Run the code cell below without changes to set low_cardinality_cols to a Python list containing the columns that will be one-hot encoded. Likewise, high_cardinality_cols contains a list of categorical columns that will be dropped from the dataset.

low_cardinality_columns = cardinality[cardinality.Cardinality < 10].Column
high_cardinality_columns = cardinality[~cardinality.Column.isin(low_cardinality_columns)].Column
print("Categorical columns that will be one-hot encoded:")
for column in low_cardinality_columns:
    print(f" - {column}")

Categorical columns that will be one-hot encoded:

  • SaleType
  • Condition1
  • HouseStyle
  • RoofMatl
  • Functional
  • Heating
  • Foundation
  • RoofStyle
  • SaleCondition
  • Condition2
  • BldgType
  • ExterCond
  • LotConfig
  • HeatingQC
  • MSZoning
  • ExterQual
  • KitchenQual
  • LandContour
  • LotShape
  • LandSlope
  • PavedDrive
  • Street
  • Utilities
  • CentralAir
print('Categorical columns that will be dropped from the dataset:')
for column in high_cardinality_columns:
    print(f" - {column}")

Categorical columns that will be dropped from the dataset:

  • Neighborhood
  • Exterior2nd
  • Exterior1st

Use the next code cell to one-hot encode the data in X_train and X_valid. Set the preprocessed DataFrames to OH_X_train and OH_X_valid, respectively.

  • The full list of categorical columns in the dataset can be found in the Python list object_cols.
  • You should only one-hot encode the categorical columns in low_cardinality_cols. All other categorical columns should be dropped from the dataset.
OH_train = X_train[low_cardinality_columns]
OH_validate = X_validate[low_cardinality_columns]

encoder = OneHotEncoder(sparse=False, handle_unknown="ignore")

OH_train = pandas.DataFrame(encoder.fit_transform(OH_train))
OH_validate = pandas.DataFrame(encoder.transform(OH_validate))

OH_train.index = X_train.index
OH_validate.index = X_validate.index

X_train_numeric = X_train.drop(object_columns, axis="columns")
X_validate_numeric = X_validate.drop(object_columns, axis="columns")

OH_X_train = pandas.concat([X_train_numeric, OH_train], axis="columns")
OH_X_validate = pandas.concat([X_validate_numeric, OH_validate], axis="columns")
print("MAE from Approach 3 (One-Hot Encoding):") 
print(f"{score_dataset(OH_X_train, OH_X_validate, y_train, y_validate):,}")
MAE from Approach 3 (One-Hot Encoding):
17,429.93404109589

So we've improved slightly, but still not as much as with the all-numeric data and imputed values.

Step 5: Generate test predictions and submit your results

After you complete Step 4, if you'd like to use what you've learned to submit your results to the leaderboard, you'll need to preprocess the test data before generating predictions.

To get the imputation working again we need to re-add the columns with missing values. I'm also going to encode the entire dataset before splitting so that everything is encoded, rather than ignoring the values in the validation set that aren't in the training set.

X_2 = train_data.drop([Data.target], axis="columns")
objects = [column for column in X_2.columns if X_2[column].dtype==object]
missing = [column for column in objects if X_2[column].hasnans]

X_2 = X_2.drop(missing, axis="columns")
OH_X = X_2.drop(high_cardinality_columns, axis="columns").reset_index(drop=True)
for column in low_cardinality_columns:
    encoder = OneHotEncoder(sparse=False)
    encoded = encoder.fit_transform(
        OH_X[column].to_numpy().reshape(-1, 1)
    )
    reencoded = pandas.DataFrame(encoded, columns=encoder.get_feature_names())
    OH_X = pandas.concat([OH_X, reencoded], axis="columns").drop(
        column, axis="columns")

imputer = IterativeImputer(random_state=Data.random_seed)
OH_X = pandas.DataFrame(imputer.fit_transform(OH_X), columns=OH_X.columns)
OH_X_train, OH_X_validate, y_train, y_validate = train_test_split(
    OH_X, y,
    train_size=Data.train_size, test_size=Data.test_size,
    random_state=Data.random_seed)

model = RandomForestRegressor(n_estimators=100, random_state=Data.random_seed)
model.fit(OH_X_train, y_train)

preds_valid = model.predict(OH_X_validate)
error = mean_absolute_error(y_validate, preds_valid)
print(f"MAE: {error:0.2f}")

It seems to have gotten worse… but maybe that's because we tuned the hyperparameters to the numeric-only model.

Hyperparameter Tuning

estimators = list(range(50, 200, 10))
max_depth = list(range(10, 100, 10)) + [None]

grid = dict(n_estimators=estimators,
            max_depth=max_depth)

model = RandomForestRegressor()
search = RandomizedSearchCV(estimator=model,
                            param_distributions=grid,
                            n_iter=40,
                            n_jobs=-1,
                            random_state=1)

search.fit(OH_X_train, y_train)
model = search.best_estimator_
print(f"CV Training R^2: {search.best_score_:0.2f}")
print(f"Training R^2: {model.score(OH_X_train, y_train): 0.2f}")
print(f"Validation R^2: {model.score(OH_X_validate, y_validate):0.2f}")
predictions = model.predict(OH_X_validate)
print(f"Mean Absolute Error: {mean_absolute_error(y_validate, predictions)}")
print(search.best_params_)
CV Training R^2: 0.86
Training R^2:  0.98
Validation R^2: 0.84
Mean Absolute Error: 17615.31526418787
{'n_estimators': 140, 'max_depth': 60}

So it can get a little better, but it doesn't do as well as with just the numeric features. Maybe we don't have enough data to make it work.

permutor = PermutationImportance(model, random_state=Data.random_seed).fit(
    OH_X_validate, y_validate)
ipython_html = eli5.show_weights(
    permutor,
    feature_names=OH_X_validate.columns.tolist())
table = pandas.read_html(ipython_html.data)[0]
print(TABLE(table))
| Weight           | Feature      |
|------------------+--------------|
| 0.4825  ± 0.1218 | OverallQual  |
| 0.1019  ± 0.0350 | GrLivArea    |
| 0.0199  ± 0.0082 | TotalBsmtSF  |
| 0.0171  ± 0.0067 | BsmtFinSF1   |
| 0.0131  ± 0.0062 | 1stFlrSF     |
| 0.0096  ± 0.0078 | GarageCars   |
| 0.0084  ± 0.0011 | 2ndFlrSF     |
| 0.0074  ± 0.0026 | LotArea      |
| 0.0051  ± 0.0022 | YearRemodAdd |
| 0.0050  ± 0.0034 | GarageArea   |
| 0.0048  ± 0.0047 | BedroomAbvGr |
| 0.0036  ± 0.0009 | LotFrontage  |
| 0.0031  ± 0.0021 | YearBuilt    |
| 0.0031  ± 0.0014 | OverallCond  |
| 0.0029  ± 0.0012 | WoodDeckSF   |
| 0.0021  ± 0.0021 | MasVnrArea   |
| 0.0014  ± 0.0016 | OpenPorchSF  |
| 0.0012  ± 0.0009 | FullBath     |
| 0.0009  ± 0.0008 | x0_RM        |
| 0.0008  ± 0.0036 | GarageYrBlt  |
| … 142 more …     | … 142 more … |

It looks like the most significant categorical features are LandContour (Bnk and Lvl), either Condition1 or Condition2 (Norm), and ExterCond (TA). From a quick look they don't seem to contribute a whole lot to the model.

End

This was a brief look at handling categorical data.

Missing Values

Beginning

Now it's your turn to test your new knowledge of missing values handling. You'll probably find it makes a big difference.

Imports

Python

from argparse import Namespace
from datetime import datetime
from functools import partial
from pathlib import Path

PyPi

from eli5.sklearn import PermutationImportance
from matplotlib import pyplot
from pdpbox import pdp

import shap
from sklearn.ensemble import RandomForestRegressor
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer, KNNImputer, SimpleImputer
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import cross_val_score, train_test_split
from tabulate import tabulate

import eli5
import hvplot.pandas
import pandas

Others

from graeae import EmbedHoloviews, EnvironmentLoader, Timer

Set Up

Table

TABLE = partial(tabulate, tablefmt="orgtbl", showindex=False, headers="keys")

Plotting

SLUG = "missing-values"
OUTPUT_PATH = Path("../../files/posts/tutorials/")/SLUG
Embed = partial(EmbedHoloviews, folder_path=OUTPUT_PATH)
Plot = Namespace(
    height=800,
    width=1000,
)

The Timer

TIMER = Timer()

Environment

ENVIRONMENT = EnvironmentLoader()

The Data

DATA_PATH = Path(ENVIRONMENT["HOUSE-PRICES-IOWA"]).expanduser()
train_data = pandas.read_csv(
    DATA_PATH/"train.csv", index_col="Id")

test_data = pandas.read_csv(
    DATA_PATH/"test.csv", index_col="Id"
)

Some Constants

Data = Namespace(
    target="SalePrice",
    train_size=0.8,
    test_size=0.2,
    random_seed=0,
)

Middle

Remove Training Data with no Target

print(f"{len(train_data):,}")
train_data = train_data.dropna(axis="rows", subset=[Data.target])
print(f"{len(train_data):,}")
1,460
1,460

Doesn't look like there were any missing target values.

y = train_data[Data.target]
x_train = train_data.drop([Data.target], axis="columns")

Numeric Data Only

To keep things simple, we'll use only numerical predictors

print(len(x_train.columns))
X = x_train.select_dtypes(exclude=["object"])
print(len(X.columns))
79
36
print(len(test_data.columns))
x_test = test_data.select_dtypes(exclude=["object"])
print(len(x_test.columns))
79
36

Split the Training and Validation Data

x_train, x_validate, y_train, y_validate = train_test_split(
    X, y,
    train_size=Data.train_size,
    test_size=Data.test_size,
    random_state=Data.random_seed)

Step 1: Preliminary investigation

missing_by_column = x_train.isna().sum()
missed = missing_by_column[missing_by_column > 0]
print(TABLE(missed.reset_index().rename(columns={"index": "Feature", 0:"Missing"})))
| Feature     | Missing |
|-------------+---------|
| LotFrontage |     212 |
| MasVnrArea  |       6 |
| GarageYrBlt |      58 |

According to the data description these features are:

  • LotFrontage: Linear feet of street connected to property
  • MasVnrArea: Masonry veneer area in square feet
  • GarageYrBlt: Year garage was built

Missing = Namespace(
    frontage = "LotFrontage",
    masonry = "MasVnrArea",
    garage = "GarageYrBlt",
    columns = ["LotFrontage", "MasVnrArea", "GarageYrBlt"],
)

Part A

  • How many rows are in the training data?
     print(f"{len(x_train):,}")
    
    1,168
    
  • Fill in the line below: How many columns in the training data have missing values?
     print(f"{sum([1 for column in x_train.columns if x_train[column].hasnans])}")
    
    3
    
  • Fill in the line below: How many missing entries are contained in all of the training data?
     print(f"{missed.sum()}")
    
    276
    

    Note: For some reason it doesn't appear to be explicitly mentioned in the notebook, but if you don't deal with the missing values and try to fit the trees to the data you'll end up with an error.

    try:
        score_dataset(x_train, x_validate, y_train, y_validate)
    except ValueError as error:
        print(error)
    
    Input contains NaN, infinity or a value too large for dtype('float32').
    

Part B

Considering your answers above, what do you think is likely the best approach to dealing with the missing values?

For the cases where there are few missing values I would drop them - e.g. MasVnrArea. For GarageYrBlt I would use the most common value in the same neighborhood and for the LotFrontage I would use the mean or median.
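
A rough sketch of that idea (not the approach actually used below; Neighborhood isn't in the numeric-only x_train, so it has to come from train_data):

# a sketch only, not the approach used in the rest of the post
neighborhoods = train_data.loc[x_train.index, "Neighborhood"]

sketch = x_train.copy()
# most common garage year within the same neighborhood (assumes every
# neighborhood in the training split has at least one known garage year)
sketch["GarageYrBlt"] = sketch.groupby(neighborhoods)["GarageYrBlt"].transform(
    lambda years: years.fillna(years.mode().iloc[0]))
# overall median for the lot frontage
sketch["LotFrontage"] = sketch["LotFrontage"].fillna(sketch["LotFrontage"].median())
# drop the handful of rows missing the masonry area (y_train would need the same rows dropped)
sketch = sketch.dropna(subset=["MasVnrArea"])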

score_dataset

This function will help check the Mean Absolute Error (MAE) as we make changes to the dataset.

def score_dataset(X_train, X_valid, y_train, y_valid):
    model = RandomForestRegressor(n_estimators=100, random_state=0)
    model.fit(X_train, y_train)
    preds = model.predict(X_valid)
    return mean_absolute_error(y_valid, preds)

Step 2: Drop columns with missing values

We'll try dropping the columns in the training and validation.

missing_columns = missed.index
keep = [column for column in x_train.columns if not x_train[column].hasnans]
reduced_X_train = x_train[keep]
reduced_X_valid = x_validate[keep]

print("MAE (Drop columns with missing values):")
drop_columns_error = score_dataset(reduced_X_train, reduced_X_valid, y_train, y_validate)
print(f"{drop_columns_error:0.2f}")
MAE (Drop columns with missing values):
17837.83

Step 3: Imputation

Part A

Use the next code cell to impute missing values with the mean value along each column. Set the preprocessed DataFrames to imputed_X_train and imputed_X_valid. Make sure that the column names match those in X_train and X_valid.

Here we'll use sklearn's SimpleImputer which fills missing values with the means of the columns (by default). It accepts pandas DataFrames but returns a numpy array so we need to rebuild the DataFrame afterward. The notebook suggests you can just re-set the columns, but I don't know what they're expecting, since it isn't a DataFrame. As long as we end up with the same thing in the end I guess it's okay.

imputer = SimpleImputer()
imputed_X_train = pandas.DataFrame(imputer.fit_transform(x_train),
                                   columns=x_train.columns)
imputed_X_valid = pandas.DataFrame(imputer.transform(x_validate),
                                   columns=x_validate.columns)

Now check the Mean Absolute Error for our imputed frames.

print("MAE (Imputation):")
impute_mean_error = score_dataset(imputed_X_train, imputed_X_valid, y_train, y_validate)
print(f"{impute_mean_error:0.2f}")
print(f"Improvement: {drop_columns_error - impute_mean_error:0.2f}")
MAE (Imputation):
18056.85
Improvement: -219.03

So we actually got a little worse using mean imputation.

Part B

Compare the MAE from each approach. Does anything surprise you about the results? Why do you think one approach performed better than the other?

As noted previously, the imputation did worse than discarding the columns did. It might be that using the mean threw the values off so much that it did worse than just throwing the columns away. This might indicate that the values aren't symmetrically distributed, so using a central tendency doesn't reflect the data very well.

plot = x_train.hvplot.box(y="LotFrontage").opts(
    title="LotFrontage",
    width=Plot.width,
    height=Plot.height,
)
source = Embed(plot=plot, file_name="lot_frontage_box")()
print(source)
Figure Missing (LotFrontage box plot)

Looking at the plot you can see that it's right-skewed, with an extreme point over 300 feet, well above the mean:

print(f"Mean: {x_train.LotFrontage.mean():0.2f} sq ft")
print(f"Max: {x_train.LotFrontage.max():0.2f} sq ft")
Mean: 69.61 sq ft
Max: 313.00 sq ft
plot = x_train.hvplot.box(y="GarageYrBlt").opts(
    title="GarageYrBlt",
    width=Plot.width,
    height=Plot.height,
)
source = Embed(plot=plot, file_name="garage_year_built_box")()
print(source)
[Figure Missing: GarageYrBlt box plot]

This also looks skewed, but fewer of its points are missing, so I don't know if it had as much of an effect.
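
Given that skew, the median looks like a fairer central value than the mean, so here's a minimal sketch of what a median-based run would look like, using SimpleImputer's strategy parameter. This wasn't part of the original exercise, so there's no output to compare against.

# Median imputation as an alternative to the mean (strategy="median" is a
# standard SimpleImputer option); not run as part of this post.
median_imputer = SimpleImputer(strategy="median")
median_X_train = pandas.DataFrame(median_imputer.fit_transform(x_train),
                                  columns=x_train.columns)
median_X_valid = pandas.DataFrame(median_imputer.transform(x_validate),
                                  columns=x_validate.columns)

print("MAE (Median Imputation):")
print(f"{score_dataset(median_X_train, median_X_valid, y_train, y_validate):0.2f}")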

Step 4: Generate test predictions

In this final step, you'll use any approach of your choosing to deal with missing values. Once you've preprocessed the training and validation features, you'll train and evaluate a random forest model. Then, you'll preprocess the test data before generating predictions that can be submitted to the competition.

Part A

Use the next code cell to preprocess the training and validation data. Set the preprocessed DataFrames to final_X_train and final_X_valid. You can use any approach of your choosing here! In order for this step to be marked as correct, you need only ensure:

  • the preprocessed DataFrames have the same number of columns,
  • the preprocessed DataFrames have no missing values,
  • final_X_train and y_train have the same number of rows, and
  • final_X_valid and y_valid have the same number of rows.

  • KNN

    Let's try using K-Nearest Neighbors to estimate missing values.

    imputer = KNNImputer()
    final_x_train = pandas.DataFrame(imputer.fit_transform(x_train),
                                     columns=x_train.columns)
    final_x_validate = pandas.DataFrame(imputer.transform(x_validate),
                                     columns=x_validate.columns)
    
    • One Last Try

      Run the next code cell to train and evaluate a random forest model. (Note that we don't use the score_dataset() function above, because we will soon use the trained model to generate test predictions!)

      Define and fit the model.

      model = RandomForestRegressor(n_estimators=100, random_state=0)
      model.fit(final_x_train, y_train)
      

      Get validation predictions and MAE.

      preds_valid = model.predict(final_x_validate)
      print("MAE (Your approach):")
      final_error = mean_absolute_error(y_validate, preds_valid)
      print(f"{final_error:.2f}")
      print(f"Improvement Over Dropping Columns: {drop_columns_error - final_error:0.2f}")
      print(f"Improvement Over Mean: {impute_mean_error - final_error:0.2f}")
      
      MAE (Your approach):
      17834.40
      Improvement Over Dropping Columns: 3.43
      Improvement Over Mean: 222.46
      

      So it does a little better than dropping the columns altogether.

  • Iterative

    This is an experimental imputer from sklearn, inspired by R packages like MICE; it models each feature that has missing values as a function of the other features.
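
    Since it's still experimental, sklearn makes you opt in before the class can be imported. The enabling import presumably lives in this post's setup section, but for reference it looks like this.

    # IterativeImputer has to be explicitly enabled before it can be imported.
    from sklearn.experimental import enable_iterative_imputer  # noqa: F401
    from sklearn.impute import IterativeImputer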

    imputer_2 = IterativeImputer(random_state=Data.random_seed)
    final_x_train_2 = pandas.DataFrame(imputer_2.fit_transform(x_train),
                                       columns=x_train.columns)
    final_x_validate_2 = pandas.DataFrame(imputer_2.transform(x_validate),
                                          columns=x_validate.columns)
    
    • One Last Try

      Run the next code cell to train and evaluate a random forest model. (Note that we don't use the score_dataset() function above, because we will soon use the trained model to generate test predictions!)

      Define and fit the model.

      model = RandomForestRegressor(n_estimators=100, random_state=0)
      model.fit(final_x_train_2, y_train)
      

      Get validation predictions and MAE.

      preds_valid = model.predict(final_x_validate_2)
      final_error_2 = mean_absolute_error(y_validate, preds_valid)
      print(f"MAE (Your approach): {final_error_2:0.2f}")
      print(f"Improvement Over Dropping Columns: {drop_columns_error - final_error_2:0.2f}")
      print(f"Improvement Over Mean: {impute_mean_error - final_error_2:0.2f}")
      print(f"Improvement over KNN: {final_error - final_error_2:0.2f}")

      MAE (Your approach): 17812.88
      Improvement Over Dropping Columns: 24.95
      Improvement Over Mean: 243.97
      Improvement over KNN: 21.51
      

      There's a slight improvement once again (the imputers have hyperparameters themselves that aren't being tuned so they might be even better than what I'm getting).
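
      As a minimal sketch of what tuning one of those knobs might look like, here's a sweep over KNNImputer's n_neighbors using the score_dataset helper (the candidate values are arbitrary and this wasn't run as part of the post, so there are no numbers to report).

      # Sweep KNNImputer's n_neighbors to see whether a different neighborhood
      # size imputes better; the candidate values here are arbitrary.
      for neighbors in (2, 5, 10, 20):
          imputer = KNNImputer(n_neighbors=neighbors)
          train = pandas.DataFrame(imputer.fit_transform(x_train),
                                   columns=x_train.columns)
          validate = pandas.DataFrame(imputer.transform(x_validate),
                                      columns=x_validate.columns)
          print(f"n_neighbors={neighbors}: "
                f"MAE={score_dataset(train, validate, y_train, y_validate):0.2f}")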

  • Permutation Importance

    Let's look at the features that were the most important in our model.

    permutor = PermutationImportance(model, random_state=Data.random_seed).fit(
        final_x_validate_2, y_validate)
    ipython_html = eli5.show_weights(
        permutor,
        feature_names=final_x_validate_2.columns.tolist())
    table = pandas.read_html(ipython_html.data)[0]
    print(TABLE(table))
    
    Weight Feature
    0.4717 ± 0.0741 OverallQual
    0.1163 ± 0.0182 GrLivArea
    0.0254 ± 0.0066 TotalBsmtSF
    0.0209 ± 0.0040 BsmtFinSF1
    0.0127 ± 0.0006 2ndFlrSF
    0.0112 ± 0.0042 1stFlrSF
    0.0090 ± 0.0072 YearRemodAdd
    0.0079 ± 0.0022 YearBuilt
    0.0069 ± 0.0036 LotArea
    0.0046 ± 0.0018 GarageCars
    0.0039 ± 0.0006 WoodDeckSF
    0.0034 ± 0.0026 GarageYrBlt
    0.0031 ± 0.0009 OverallCond
    0.0027 ± 0.0032 OpenPorchSF
    0.0026 ± 0.0008 LotFrontage
    0.0026 ± 0.0011 Fireplaces
    0.0016 ± 0.0009 FullBath
    0.0015 ± 0.0010 BedroomAbvGr
    0.0014 ± 0.0028 BsmtUnfSF
    0.0012 ± 0.0025 TotRmsAbvGrd
    … 16 more … … 16 more …

    It's interesting, but the four most important features (OverallQual, GrLivArea, TotalBsmtSF, and BsmtFinSF1) weren't in our first models. And LotFrontage, which we spent all that time in this post filling in, is only fifteenth - but looking at the improvements the imputation made, even these seemingly low-contributing features helped.

    ipython_html = eli5.show_weights(
        permutor,
        top = None,
        feature_names=final_x_validate_2.columns.tolist())
    table = pandas.read_html(ipython_html.data)[0]
    
    table["Weight"] = table.Weight.str.split(expand=True)[0].astype(float)
    plot = table.hvplot.bar(x="Feature", y="Weight").opts(
        title="Permutation Importance",
        width=Plot.width,
        height=Plot.height,
        xrotation=45)
    source = Embed(plot=plot, file_name="permutation_importance")()
    
    print(source)
    
    [Figure Missing: Permutation Importance bar plot]

    Looking at the plot you can see that there's a huge drop from the influence of OverallQual to the influence of the rest of the features.

    Let's look at the features that didn't contribute to the model.

    print("| Feature | Weight|")
    print("|-+-|")
    for row in table[table.Weight <= 0].itertuples():
        print(f"|{row.Feature}| {row.Weight}|")
    
    Feature Weight
    MiscVal 0.0
    LowQualFinSF 0.0
    PoolArea 0.0
    EnclosedPorch -0.0003
    HalfBath -0.0004
    YrSold -0.0005
    MasVnrArea -0.0015
    MoSold -0.0058

    So, no point enclosing that porch or expanding your pool, I guess.

  • The Most Important Feature

    This is the data_description entry for OverallQual:

    OverallQual: Rates the overall material and finish of the house

    Value Description
    10 Very Excellent
    9 Excellent
    8 Very Good
    7 Good
    6 Above Average
    5 Average
    4 Below Average
    3 Fair
    2 Poor
    1 Very Poor

    This appears to be an ordinal rather than a continuous variable; it's interesting how much it dominates the model.

    • PDP Plot

      Here's the amount that the feature changes the sales price as it changes.

      FEATURE = "OverallQual"
      pdp_dist = pdp.pdp_isolate(model=model,
                                 dataset=final_x_validate_2,
                                 model_features=final_x_validate_2.columns,
                                 feature=FEATURE)
      pdp.pdp_plot(pdp_dist, FEATURE)
      output = f"{FEATURE}_pdp_plot.png"
      figure = pyplot.gcf()
      figure.subplots_adjust(top=0.5)
      figure.savefig(OUTPUT_PATH/output)
      print(f"[[file:{output}]]")
      

      [Figure Missing: OverallQual partial dependence plot]

      So it looks like once you hit "Above Average" it's pretty much a linear relationship between the overall quality and the sale price.

    • SHAP Summary

      Let's take a more visual look at the importance of each feature.

      explainer = shap.TreeExplainer(model)
      shap_values = explainer.shap_values(final_x_validate_2)
      
      shap.summary_plot(shap_values, final_x_validate_2)
      figure = pyplot.gcf()
      output = "shap_summary.png"
      
      figure.savefig(OUTPUT_PATH/output)
      print(f"[[file:{output}]]")
      

      [Figure Missing: SHAP summary plot]

      Besides reinforcing the importance of OverallQual, the plot shows how much spread its influence covers. The odd bunches might reflect the fact that it's a discrete, ordinal feature, not a continuous one.
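
      A natural follow-up (just a sketch, not something run in this post, and the output file name is made up) would be a SHAP dependence plot for OverallQual, which shows how its SHAP value moves across the quality levels.

      # Dependence plot for the dominant feature: how its SHAP value varies
      # with the quality rating.
      shap.dependence_plot("OverallQual", shap_values, final_x_validate_2)
      figure = pyplot.gcf()
      figure.savefig(OUTPUT_PATH/"shap_overall_qual_dependence.png")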

End

Make a kaggle submission.

final_x_test = imputer_2.transform(x_test)
preds_test = model.predict(final_x_test)
output = pandas.DataFrame({'Id': x_test.index,
                           'SalePrice': preds_test})
output.to_csv(DATA_PATH/'submission.csv', index=False)

This model gives us an error of 16,656.25822, an improvement over our previous submission, where we used a smaller subset of the features.

introduction = 27217.91640
previous = 20928.54621
current = 16656.25822

print(f"Latest Error: {current:,.5f}")
print(f"Improvement over the introduction ({introduction:,.5f}): {introduction - current:,.5f}")
print(f"Improvement over the previous model ({previous:,.5f}): {previous - current:,.5f}")
Latest Error: 16,656.25822
Improvement over the introduction (27,217.91640): 10,561.65818
Improvement over the previous model (20,928.54621): 4,272.28799

So by adding in the remaining features we were able to reduce our error by quite a bit.

Introduction to the Kaggle Intermediate Machine Learning Tutorial

Beginning

This is the introduction to kaggle's intermediate machine learning tutorial.

Imports

Python

from argparse import Namespace
from datetime import datetime
from functools import partial
from pathlib import Path

PyPi

from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import cross_val_score, train_test_split

import hvplot.pandas
import pandas

Others

from graeae import EmbedHoloviews, EnvironmentLoader, Timer

Set Up

Plotting

SLUG = "introduction-intermediate-machine-learning"
OUTPUT_PATH = Path("../../files/posts/tutorials/")/SLUG
Embed = partial(EmbedHoloviews, folder_path=OUTPUT_PATH)
Plot = Namespace(
    height=800,
    width=1000,
)

The Timer

TIMER = Timer()

Environment

ENVIRONMENT = EnvironmentLoader()

The Data

DATA_PATH = Path(ENVIRONMENT["HOUSE-PRICES-IOWA"]).expanduser()
training_data = pandas.read_csv(
    DATA_PATH/"train.csv")

testing_data = pandas.read_csv(
    DATA_PATH/"test.csv"
)

Middle

Preliminary 1: Specify Prediction Target

Select the target variable, which corresponds to the sales price. Save this to a new variable called `y`. You'll need to print a list of the columns to find the name of the column you need.

Our target is SalePrice.

Y = training_data.SalePrice

Preliminary 2: Create X

Now you will create a DataFrame called `X` holding the predictive features.

Since you want only some columns from the original data, you'll first create a list with the names of the columns you want in `X`.

You'll use just the following columns in the list (you can copy and paste the whole list to save some typing, though you'll still need to add quotes):

  • LotArea
  • YearBuilt
  • 1stFlrSF
  • 2ndFlrSF
  • FullBath
  • BedroomAbvGr
  • TotRmsAbvGrd
Training = Namespace(
    features = [
        "1stFlrSF",
        "2ndFlrSF",
        "BedroomAbvGr",
        "FullBath",
        "LotArea",
        "TotRmsAbvGrd",
        "YearBuilt",
    ],
    random_seed=0,
    train_size=0.8,
    test_size=0.2,
)
X = training_data[Training.features]
x_submit = testing_data[Training.features]

Split up the data into training and validation sets.

x_train, x_validate, y_train, y_validate = train_test_split(
    X, Y,
    random_state=Training.random_seed,
    train_size=Training.train_size,
    test_size=Training.test_size)

Preliminary 3: Evaluate Some Models

hyperparameters = dict(
    model_1=dict(n_estimators=50, random_state=0),
    model_2 = dict(n_estimators=100, random_state=0),
    model_3 = dict(n_estimators=100, criterion='mae', random_state=0),
    model_4 = dict(n_estimators=200, min_samples_split=20, random_state=0),
    model_5 = dict(n_estimators=100, max_depth=7, random_state=0),
    )

models = [RandomForestRegressor(**parameters) for parameters in hyperparameters.values()]
def score_model(model, X_t=x_train, X_v=x_validate, y_t=y_train, y_v=y_validate):
    model.fit(X_t, y_t)
    preds = model.predict(X_v)
    return mean_absolute_error(y_v, preds)

scores = sorted([(score_model(model), model, index) for index, model in enumerate(models)])

for score, model, index in scores:
    print(f"Model {index} MAE: {score:0.2f}")

best = min(scores)
print()
print(f"Best Model: {best}")
# the hyperparameters dict keys are 1-based (model_1 through model_5) while the
# index stored in the score tuples is 0-based, so shift by one
best_model = f"model_{best[2] + 1}"
best_hyperparameters = hyperparameters[best_model]
Model 2 MAE: 23528.78
Model 4 MAE: 23706.67
Model 1 MAE: 23740.98
Model 3 MAE: 23996.68
Model 0 MAE: 24015.49

Best Model: (23528.78421232877, RandomForestRegressor(bootstrap=True, criterion='mae', max_depth=None,
                      max_features='auto', max_leaf_nodes=None,
                      min_impurity_decrease=0.0, min_impurity_split=None,
                      min_samples_leaf=1, min_samples_split=2,
                      min_weight_fraction_leaf=0.0, n_estimators=100,
                      n_jobs=None, oob_score=False, random_state=0, verbose=0,
                      warm_start=False), 2)

Preliminary 4: Make Some Predictions

model = RandomForestRegressor(**best_hyperparameters)
model.fit(X, Y)

test_predictions = model.predict(x_submit)

submission = pandas.DataFrame(dict(Id=testing_data.Id,
                                   SalePrice=test_predictions))
submission.to_csv(DATA_PATH/"submission.csv", index=False)

This gets a score of 20,928.54621, compared to the previous error score of 27,217.91640, so the error is improving.

End

Now we're back at the point we were at the end of the introduction to machine learning tutorial, except with a slightly improved model.

Machine Learning Competitions

Beginning

In this exercise, you will create and submit predictions for a Kaggle competition. You can then improve your model (e.g. by adding features) and see how you stack up against others taking this micro-course.

The steps in this notebook are:

  1. Build a Random Forest model with all of your data (X and y)
  2. Read in the "test" data, which doesn't include values for the target. Predict home values in the test data with your Random Forest model.
  3. Submit those predictions to the competition and see your score.
  4. Optionally, come back to see if you can improve your model by adding features or changing your model. Then you can resubmit to see how that stacks up on the competition leaderboard.

Imports

Python

from argparse import Namespace
from datetime import datetime
from functools import partial
from pathlib import Path

PyPi

from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor

import hvplot.pandas
import pandas

Others

from graeae import EmbedHoloviews, EnvironmentLoader, Timer

Set Up

Plotting

SLUG = "underfitting-and-overfitting-exercise"
OUTPUT_PATH = Path("../../files/posts/tutorials/")/SLUG
Embed = partial(EmbedHoloviews, folder_path=OUTPUT_PATH)
Plot = Namespace(
    height=800,
    width=1000,
)

The Timer

TIMER = Timer()

Environment

ENVIRONMENT = EnvironmentLoader()

The Data

DATA_PATH = Path(ENVIRONMENT["HOUSE-PRICES-IOWA"]).expanduser()
data = pandas.read_csv(
    DATA_PATH/"train.csv")

test_data = pandas.read_csv(
    DATA_PATH/"test.csv"
)

Middle

Preliminary 1: Specify Prediction Target

Select the target variable, which corresponds to the sales price. Save this to a new variable called `y`. You'll need to print a list of the columns to find the name of the column you need.

Our target is SalePrice.

Y = data.SalePrice

Preliminary 2: Create X

Now you will create a DataFrame called `X` holding the predictive features.

Since you want only some columns from the original data, you'll first create a list with the names of the columns you want in `X`.

You'll use just the following columns in the list (you can copy and paste the whole list to save some typing, though you'll still need to add quotes):

  • LotArea
  • YearBuilt
  • 1stFlrSF
  • 2ndFlrSF
  • FullBath
  • BedroomAbvGr
  • TotRmsAbvGrd
FEATURES = [
    "LotArea",
    "YearBuilt",
    "1stFlrSF",
    "2ndFlrSF",
    "FullBath",
    "BedroomAbvGr",
    "TotRmsAbvGrd",
]
X = data[FEATURES]

Split up the data into training and validation sets.

x_train, x_validate, y_train, y_validate = train_test_split(X, Y, random_state=1)

Preliminary 3: Specify and Fit Model

A Linear Regression Model

As a baseline, I'll fit a simple Linear Regression (ordinary-least-squares) model.

regression = LinearRegression()
scores = cross_val_score(regression, x_train, y_train, cv=5)
print(f"{scores.mean():0.2f} (+/- {2 * scores.std():0.2f})")
regression = regression.fit(x_train, y_train)
print(f"Training R^2: {regression.score(x_train, y_train): 0.2f}")
print(f"Validation R^2: {regression.score(x_validate, y_validate):0.2f}")
0.66 (+/- 0.17)
Training R^2:  0.68
Validation R^2: 0.77

Decision Tree

Create a DecisionTreeRegressor and save it as iowa_model. Ensure you've done the relevant import from sklearn to run this command.

Then fit the model you just created using the data in X and y that you saved above.

tree = DecisionTreeRegressor()
scores = cross_val_score(tree, x_train, y_train, cv=5)
print(f"{scores.mean():0.2f} (+/- {2 * scores.std():0.2f})")

tree = tree.fit(x_train, y_train)
print(f"Training R^2: {tree.score(x_train, y_train): 0.2f}")
print(f"Validation R^2: {tree.score(x_validate, y_validate):0.2f}")
0.52 (+/- 0.34)
Training R^2:  1.00
Validation R^2: 0.75

So our linear regression actually does better than the tree does. It looks like the tree might be overfitting on the training data.

Preliminary 4: Make Some Predictions

tree_predict = tree.predict(x_validate)
regression_predict = regression.predict(x_validate)

Preliminary 5: Calculate the Mean Absolute Error in Validation Data

tree_mae = mean_absolute_error(y_true=y_validate, y_pred=tree_predict)
regression_mae = mean_absolute_error(y_true=y_validate, y_pred=regression_predict)

print(f"Tree MAE: {tree_mae: 0.2f}")
print(f"Regression MAE: {regression_mae: 0.2f}")
Tree MAE:  28969.44
Regression MAE:  27228.88

The tree's error is a little higher than the regression line's.

Preliminary 6: Compare Different Tree Sizes

Write a loop that tries the following values for max_leaf_nodes from a set of possible values.

Call the get_mae function on each value of max_leaf_nodes. Store the output in some way that allows you to select the value of max_leaf_nodes that gives the most accurate model on your data.

def get_mae(max_leaf_nodes, train_X=x_train, val_X=x_validate, train_y=y_train, val_y=y_validate):
    model = DecisionTreeRegressor(max_leaf_nodes=max_leaf_nodes, random_state=0)
    model.fit(train_X, train_y)
    preds_val = model.predict(val_X)
    mae = mean_absolute_error(val_y, preds_val)
    return mae

Write a loop to find the ideal tree size from candidate_max_leaf_nodes.

candidate_max_leaf_nodes = [5, 25, 50, 100, 250, 500]
outcomes = [(get_mae(nodes), nodes) for nodes in candidate_max_leaf_nodes]
best = min(outcomes)
print(best)
best_tree_size = best[1]
(27282.50803885739, 100)
mae = pandas.DataFrame(dict(nodes=candidate_max_leaf_nodes, mae = [outcome[0] for outcome in outcomes]))
plot = mae.hvplot(x="nodes", y="mae").opts(title="Node Mean Absolute Error",
                                           width=Plot.width,
                                           height=Plot.height)
source = Embed(plot=plot, file_name="node_mean_absolute_error")()
print(source)
[Figure Missing: Node Mean Absolute Error plot]

Looking at the plot you can see that the error drops until you hit 100 nodes and then begins to rise again as it overfits the data with more nodes.

Let's see how much this improves our model using \(r^2\).

tree = DecisionTreeRegressor(max_leaf_nodes=best_tree_size)
scores = cross_val_score(tree, x_train, y_train, cv=5)
print(f"{scores.mean():0.2f} (+/- {2 * scores.std():0.2f})")

tree = tree.fit(x_train, y_train)
print(f"Training R^2: {tree.score(x_train, y_train): 0.2f}")
print(f"Validation R^2: {tree.score(x_validate, y_validate):0.2f}")
0.58 (+/- 0.32)
Training R^2:  0.93
Validation R^2: 0.76

We've improved it slightly; it's probably still overfitting the data, but not as much.

Preliminary 7: Use a Random Forest

forest = RandomForestRegressor(random_state=1, n_estimators=100)

scores = cross_val_score(forest, x_train, y_train, cv=5)
print(f"{scores.mean():0.2f} (+/- {2 * scores.std():0.2f})")

forest.fit(x_train, y_train)
print(f"Training R^2: {forest.score(x_train, y_train): 0.2f}")
print(f"Validation R^2: {forest.score(x_validate, y_validate):0.2f}")
0.76 (+/- 0.11)
Training R^2:  0.97
Validation R^2: 0.85

So the defaults already beat the regression and decision tree model.

forest_predictions = forest.predict(x_validate)
forest_mae = mean_absolute_error(y_true=y_validate, y_pred=forest_predictions)

tree_mae = mean_absolute_error(y_true=y_validate, y_pred=tree_predict)
regression_mae = mean_absolute_error(y_true=y_validate, y_pred=regression_predict)

print(f"Tree MAE: {tree_mae: 0.2f}")
print(f"Regression MAE: {regression_mae: 0.2f}")
print(f"Forest MAE: {forest_mae:0.2f}")
Tree MAE:  28969.44
Regression MAE:  27228.88
Forest MAE: 21857.16

So the forest also has a much better Mean Absolute Error than the other two models.

Step 1: Creating a Model For the Competition

Build a Random Forest model and train it on all of X and y.

forest = RandomForestRegressor(random_state=1, n_estimators=100)
forest.fit(X, Y)

Step 2: Make Predictions

Read the file of "test" data. And apply your model to make predictions. Then create test_X which comes from test_data but includes only the columns you used for prediction. The list of columns is stored in a variable called features.

x_test = test_data[FEATURES]

Make predictions which we will submit.

predictions = forest.predict(x_test)

Step 3: Save the Submission

submission = pandas.DataFrame(dict(Id=test_data.Id, SalePrice=predictions))
submission.to_csv(DATA_PATH/"submission.csv", index=False)

Step 4: Test Your Work

To test your results, you'll need to join the competition (if you haven't already). So open a new window by clicking on this link. Then click on the Join Competition button.

[Image Missing: the Join Competition button]

Next, follow the instructions below:

  1. Begin by clicking on the blue COMMIT button in the top right corner of this window. This will generate a pop-up window.
  2. After your code has finished running, click on the blue Open Version button in the top right of the pop-up window. This brings you into view mode of the same page. You will need to scroll down to get back to these instructions.
  3. Click on the Output tab on the left of the screen. Then, click on the Submit to Competition button to submit your results to the leaderboard.

You have now successfully submitted to the competition.

Step 5: Continuing Your Progress

There are many ways to improve your model, and experimenting is a great way to learn at this point.

The best way to improve your model is to add features. Look at the list of columns and think about what might affect home prices. Some features will cause errors because of issues like missing values or non-numeric data types.
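
As a minimal sketch of that first pass (not part of the exercise), one way to find candidate columns is to keep every numeric column that has no missing values in either the training or the test data.

# Numeric columns with no missing values in either frame are safe to add as-is.
numeric = data.select_dtypes(include="number").drop(columns=["Id", "SalePrice"])
candidates = [column for column in numeric.columns
              if not data[column].hasnans and not test_data[column].hasnans]
print(candidates)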

End

The submission is evaluated using the Root Mean Squared Error of the logarithms of the sale prices. Taking the logarithm means that errors on cheap houses and on expensive houses are penalized equally.
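
In other words the metric is \(\sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(\log \hat{y}_i - \log y_i\right)^2}\), sometimes called RMSLE. As a sketch, it could be computed locally on the validation split using the forest's earlier validation predictions - this isn't part of the kaggle notebook.

import numpy

# Root mean squared error of the log sale prices on the validation split,
# using the validation predictions from the forest fit earlier.
rmsle = numpy.sqrt(numpy.mean(
    (numpy.log(forest_predictions) - numpy.log(y_validate))**2))
print(f"Validation RMSLE: {rmsle:0.5f}")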

For this model you get a score of 27,217.91640. The current leader has a score of 0, which would seem to imply that they downloaded the original dataset and memorized it; the second-best score is 8,830.

The Intermediate Machine Learning micro-course will teach you how to handle these types of features. You will also learn to use xgboost, a technique giving even better accuracy than Random Forest.
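
As a preview (and only a sketch - xgboost isn't imported or used anywhere in this post, and the hyperparameter values are arbitrary), a gradient-boosted drop-in for the random forest looks something like this.

from xgboost import XGBRegressor

# A minimal gradient-boosted alternative to the random forest above.
booster = XGBRegressor(n_estimators=500, learning_rate=0.05, random_state=1)
booster.fit(x_train, y_train)
print(f"XGBoost MAE: {mean_absolute_error(y_validate, booster.predict(x_validate)):0.2f}")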

Random Forests

Beginning

This is a later part of kaggle's Introduction to Machine Learning tutorial - the Random Forests exercise, which repeats the Underfitting and Overfitting work as preliminaries before introducing the random forest.

Imports

Python

from argparse import Namespace
from datetime import datetime
from functools import partial
from pathlib import Path

PyPi

from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor

import hvplot.pandas
import pandas

Others

from graeae import EmbedHoloviews, EnvironmentLoader, Timer

Set Up

Plotting

SLUG = "underfitting-and-overfitting-exercise"
OUTPUT_PATH = Path("../../files/posts/tutorials/")/SLUG
Embed = partial(EmbedHoloviews, folder_path=OUTPUT_PATH)
Plot = Namespace(
    height=800,
    width=1000,
)

The Timer

TIMER = Timer()

Environment

ENVIRONMENT = EnvironmentLoader()

The Data

data = pandas.read_csv(
    Path(ENVIRONMENT["HOUSE-PRICES-IOWA"]).expanduser()/"train.csv")

Middle

Preliminary 1: Specify Prediction Target

Select the target variable, which corresponds to the sales price. Save this to a new variable called `y`. You'll need to print a list of the columns to find the name of the column you need.

Our target is SalePrice.

Y = data.SalePrice

Preliminary 2: Create X

Now you will create a DataFrame called `X` holding the predictive features.

Since you want only some columns from the original data, you'll first create a list with the names of the columns you want in `X`.

You'll use just the following columns in the list (you can copy and paste the whole list to save some typing, though you'll still need to add quotes):

  • LotArea
  • YearBuilt
  • 1stFlrSF
  • 2ndFlrSF
  • FullBath
  • BedroomAbvGr
  • TotRmsAbvGrd
FEATURES = [
    "LotArea",
    "YearBuilt",
    "1stFlrSF",
    "2ndFlrSF",
    "FullBath",
    "BedroomAbvGr",
    "TotRmsAbvGrd",
]
X = data[FEATURES]

Split up the data into training and validation sets.

x_train, x_validate, y_train, y_validate = train_test_split(X, Y, random_state=1)

Preliminary 3: Specify and Fit Model

A Linear Regression Model

As a baseline, I'll fit a simple Linear Regression (ordinary-least-squares) model.

regression = LinearRegression()
scores = cross_val_score(regression, x_train, y_train, cv=5)
print(f"{scores.mean():0.2f} (+/- {2 * scores.std():0.2f})")
regression = regression.fit(x_train, y_train)
print(f"Training R^2: {regression.score(x_train, y_train): 0.2f}")
print(f"Validation R^2: {regression.score(x_validate, y_validate):0.2f}")
0.66 (+/- 0.17)
Training R^2:  0.68
Validation R^2: 0.77

Decision Tree

Create a DecisionTreeRegressor and save it as iowa_model. Ensure you've done the relevant import from sklearn to run this command.

Then fit the model you just created using the data in X and y that you saved above.

tree = DecisionTreeRegressor()
scores = cross_val_score(tree, x_train, y_train, cv=5)
print(f"{scores.mean():0.2f} (+/- {2 * scores.std():0.2f})")

tree = tree.fit(x_train, y_train)
print(f"Training R^2: {tree.score(x_train, y_train): 0.2f}")
print(f"Validation R^2: {tree.score(x_validate, y_validate):0.2f}")
0.54 (+/- 0.33)
Training R^2:  1.00
Validation R^2: 0.72

So our linear regression actually does better than the tree does. It looks like the tree might be overfitting on the training data.

Preliminary 4: Make Some Predictions

tree_predict = tree.predict(x_validate)
regression_predict = regression.predict(x_validate)

Preliminary 5: Calculate the Mean Absolute Error in Validation Data

tree_mae = mean_absolute_error(y_true=y_validate, y_pred=tree_predict)
regression_mae = mean_absolute_error(y_true=y_validate, y_pred=regression_predict)

print(f"Tree MAE: {tree_mae: 0.2f}")
print(f"Regression MAE: {regression_mae: 0.2f}")
Tree MAE:  30219.18
Regression MAE:  27228.88

The tree's error is a little higher than the regression line's.

Preliminary 6: Compare Different Tree Sizes

Write a loop that tries the following values for max_leaf_nodes from a set of possible values.

Call the get_mae function on each value of max_leaf_nodes. Store the output in some way that allows you to select the value of max_leaf_nodes that gives the most accurate model on your data.

def get_mae(max_leaf_nodes, train_X=x_train, val_X=x_validate, train_y=y_train, val_y=y_validate):
    model = DecisionTreeRegressor(max_leaf_nodes=max_leaf_nodes, random_state=0)
    model.fit(train_X, train_y)
    preds_val = model.predict(val_X)
    mae = mean_absolute_error(val_y, preds_val)
    return mae

Write a loop to find the ideal tree size from candidate_max_leaf_nodes.

candidate_max_leaf_nodes = [5, 25, 50, 100, 250, 500]
outcomes = [(get_mae(nodes), nodes) for nodes in candidate_max_leaf_nodes]
best = min(outcomes)
print(best)
best_tree_size = best[1]
(27282.50803885739, 100)
mae = pandas.DataFrame(dict(nodes=candidate_max_leaf_nodes, mae = [outcome[0] for outcome in outcomes]))
plot = mae.hvplot(x="nodes", y="mae").opts(title="Node Mean Absolute Error",
                                           width=Plot.width,
                                           height=Plot.height)
source = Embed(plot=plot, file_name="node_mean_absolute_error")()
print(source)
[Figure Missing: Node Mean Absolute Error plot]

Looking at the plot you can see that the error drops until you hit 100 nodes and then begins to rise again as it overfits the data with more nodes.

Let's see how much this improves our model using \(r^2\).

tree = DecisionTreeRegressor(max_leaf_nodes=best_tree_size)
scores = cross_val_score(tree, x_train, y_train, cv=5)
print(f"{scores.mean():0.2f} (+/- {2 * scores.std():0.2f})")

tree = tree.fit(x_train, y_train)
print(f"Training R^2: {tree.score(x_train, y_train): 0.2f}")
print(f"Validation R^2: {tree.score(x_validate, y_validate):0.2f}")
0.58 (+/- 0.31)
Training R^2:  0.93
Validation R^2: 0.77

We've improved it slightly; it's probably still overfitting the data, but not as much.

Step 1: Use a Random Forest

forest = RandomForestRegressor(random_state=1, n_estimators=100)

scores = cross_val_score(forest, x_train, y_train, cv=5)
print(f"{scores.mean():0.2f} (+/- {2 * scores.std():0.2f})")

forest.fit(x_train, y_train)
print(f"Training R^2: {forest.score(x_train, y_train): 0.2f}")
print(f"Validation R^2: {forest.score(x_validate, y_validate):0.2f}")
0.76 (+/- 0.11)
Training R^2:  0.97
Validation R^2: 0.85

So the defaults already beat the regression and decision tree model.

forest_predictions = forest.predict(x_validate)
forest_mae = mean_absolute_error(y_true=y_validate, y_pred=forest_predictions)

tree_mae = mean_absolute_error(y_true=y_validate, y_pred=tree_predict)
regression_mae = mean_absolute_error(y_true=y_validate, y_pred=regression_predict)

print(f"Tree MAE: {tree_mae: 0.2f}")
print(f"Regression MAE: {regression_mae: 0.2f}")
print(f"Forest MAE: {forest_mae:0.2f}")
Tree MAE:  30219.18
Regression MAE:  27228.88
Forest MAE: 21857.16

So the forest also has a much better Mean Absolute Error than the other two models.

End

This is the end of the tutorial as far as how to build models. Next is a bit on entering a kaggle competition.