Opportunity Cost Neglect

Citation

  • Frederick S, Novemsky N, Wang J, Dhar R, Nowlis S. Opportunity cost neglect. Journal of Consumer Research. 2009 Dec 1;36(4):553-61.

Abstract

And

To properly consider the opportunity costs of a purchase, consumers must actively generate the alternatives that it would displace.

But

The current research suggests that consumers often fail to do so. Even under conditions promoting cognitive effort, various cues to consider opportunity costs reduce purchase rates and increase the choice share of more affordable options. Sensitivity to such cues varies with chronic dispositional differences in spending attitudes.

Therefore

We discuss the implications of these results for the marketing strategies of economy and premium brands.

Opportunity Cost

Beginning

We want to answer the question of whether reminding college students that spending money now deprives them of money in the future will affect their spending behavior.

Imports

# python
from argparse import Namespace
from functools import partial

# pypi
from expects import (
    equal,
    expect,
)
from numpy.random import default_rng
from tabulate import tabulate

import holoviews
import hvplot.pandas
import numpy
import pandas

# my stuff
from graeae import EmbedHoloviews, Timer

Set Up

Randomness

numpy_random = default_rng()

Table

TABLE = partial(tabulate, tablefmt="orgtbl", headers="keys")

Timer and Plotting

TIMER = Timer()

SLUG = "opportunity-cost"
Embed = partial(EmbedHoloviews,
                folder_path=f"files/posts/{SLUG}")

Plot = Namespace(
    width=990,
    height=780,
    fontscale=2,
    tan="#ddb377",
    blue="#4687b7",
    red="#ce7b6d",
    color_cycle = holoviews.Cycle(["#4687b7", "#ce7b6d"])
)

Middle

Hypotheses

  • \(H_0:\) Null Hypothesis. Reminding students that they can save money for future purchases won't impact their spending decisions.
  • \(H_A:\) Alternative Hypothesis. Reminding students that they can save money for future purchases will reduce their spending.

I'll use a significance level of 0.05 (a 95% confidence level).

ALPHA = 0.05

The Study Data

150 students were given the following statement:

Imagine that you have been saving some extra money on the side to make some purchases and on your most recent visit to the video store you come across a special sale on a new video. This video is one with your favorite actor or actress, and your favorite type of movie (such as comedy, drama, thriller, etc.). This particular video that you are considering is one you have been thinking about buying for a long time. It is available for a special sale price of $14.99.

What would you do in this situation? Please circle one of the options below.

Video store? Are we going to have to invent a time machine to run this study? No, we'll just swipe the data published by Frederick et al. in 2009. They conducted their study at Arizona State University, which presumably still had some video stores around back then. The study built on prior work showing that people focus on explicitly presented details and ignore facts that those statements imply but don't state. (Also, I just looked it up and there are quite a few video stores around me here in Portland; I guess I just haven't been to one in a while.)

The options that the subjects were given to circle depended on which group they belonged to. The control group (75 students) were given these options:

A. Buy this entertaining video.
B. Not buy this entertaining video.

The treatment group (also 75 students) were given these options:

A. Buy this entertaining video.
B. Not buy this entertaining video. Keep the $14.99 for other purchases.

Here's the outcome of the study.

Outcome = Namespace(
    buy="Buy DVD",
    dont_buy="Not Buy DVD",
    control="Control Group",
    treatment="Treatment Group",
    total="Total"
)

data = pandas.DataFrame.from_dict({Outcome.buy:[56, 41],
                                   Outcome.dont_buy: [19, 34]})
data.index = [Outcome.control, Outcome.treatment]
row = data.sum()
row.name = Outcome.total
data = data.append(row)
data[Outcome.total] = data.sum(axis="columns")
print(TABLE(data))
|                 |   Buy DVD |   Not Buy DVD |   Total |
|-----------------+-----------+---------------+---------|
| Control Group   |        56 |            19 |      75 |
| Treatment Group |        41 |            34 |      75 |
| Total           |        97 |            53 |     150 |

Total = Namespace(
    buy = data.loc[Outcome.total][Outcome.buy],
    dont_buy = data.loc[Outcome.total][Outcome.dont_buy],
    group = data.loc[Outcome.control][Outcome.total],
    students = data.loc[Outcome.total][Outcome.total],
)

Now a little plotting.

# put the groups into a column
plotter = data.reset_index()

# get rid of the totals
del(plotter["Total"])
plotter = plotter.iloc[:-1]

# move the outcome headers into a column
plotter = plotter.melt(id_vars=["index"],
                       value_vars=[Outcome.buy, Outcome.dont_buy])
plotter = plotter.rename(columns=dict(
    index="Group",
    variable="Outcome",
    value="Count"))
plot = plotter.hvplot.bar(x="Group", y="Count", by="Outcome").opts(
    title="Buy Or Don't Buy DVD",
    width=Plot.width,
    height=Plot.height,
    color=Plot.color_cycle,
    fontscale=Plot.fontscale,
)

outcome = Embed(plot=plot, file_name="buy_dont_buy")()
print(outcome)

Figure Missing

It looks like there may be a real difference: a large majority of the control group chose to buy the DVD, while the treatment group's split was much closer, with nearly half opting not to buy.

Now as row-proportions.

proportions = data.divide(data[Outcome.total], axis="rows")
print(TABLE(proportions))
|                 |   Buy DVD |   Not Buy DVD |   Total |
|-----------------+-----------+---------------+---------|
| Control Group   |  0.746667 |      0.253333 |       1 |
| Treatment Group |  0.546667 |      0.453333 |       1 |
| Total           |  0.646667 |      0.353333 |       1 |

Looking at the proportions, noticeably more of the treatment group than the control group said they wouldn't buy the DVD, but let's run the simulation and see whether a difference this large could plausibly be due to chance.

Point Estimate of Effect

The thing we are interested in is whether the wording of the second option to not buy the DVD made a difference, so our statistic of interest is the difference in proportions of the control and treatment group participants who didn't buy the DVD.

\begin{align}
\hat{p}_{control} &= \frac{\textrm{control group who wouldn't buy}}{\textrm{size of control group}}\\
\hat{p}_{treatment} &= \frac{\textrm{treatment group who wouldn't buy}}{\textrm{size of treatment group}}\\
\hat{p} &= \hat{p}_{treatment} - \hat{p}_{control}
\end{align}

POINT_ESTIMATE = (proportions[Outcome.dont_buy].loc[Outcome.treatment]
                  - proportions[Outcome.dont_buy].loc[Outcome.control])
print(f"{POINT_ESTIMATE * 100:.2f} %")
20.00 %

About twenty percentage points more of the treatment group said they would abstain from the purchase than the control group.

Simulating Random Chance

In a previous post looking at Gender Discrimination Inference I set up the Urn with a 50-50 split of males and females to see if there was gender bias in deciding whom to promote. In that case we didn't have a control group, but here we do, so it's going to work a little differently.

We are asking whether our treatment had an effect. If it didn't, then the overall split of "buy" and "don't buy" that we saw reflects the underlying population of ASU students, so we fill our urn to match the totals in the "Buy" and "Don't Buy" columns and then randomly split it into two equal groups. If the difference we saw was just random chance, then the difference in the "don't buy" proportions between the two simulated groups should usually be close to zero, and a gap as large as the study's should be rare. This doesn't seem as intuitive a way to set it up as the previous method, but we'll see how it goes.

Set Up the Urn

buy = data.loc[Outcome.total][Outcome.buy]
dont = data.loc[Outcome.total][Outcome.dont_buy]

bought = numpy.zeros(buy)
didnt = numpy.ones(dont)

URN = numpy.append(bought, didnt)

expect(len(URN)).to(equal(Total.students))
expect(buy).to(equal(97))
expect(dont).to(equal(53))
expect(URN.sum()).to(equal(dont))

Simulate

The Process:

  1. Shuffle the urn
  2. Split the shuffled urn in half (control and treatment)
  3. Count the proportion of each half that said they would not buy the DVD
  4. Record the differences between the proportions of the control that wouldn't buy and the treatment that wouldn't buy

But, since I'm doing this with numpy it works a little differently.

My process:

  1. Randomly choose half the urn (without replacement) to get the control group
  2. Sum the choices (this gives the count of the control that said they wouldn't buy)
  3. Subtract the previous sum from the total that said they wouldn't buy (to get the treatment count)
  4. Calculate the proportion of each group that said they wouldn't buy
  5. Find the difference between the treatment and control proportions
TRIALS = 1 * 10**6

with TIMER:
    controls = numpy.array([numpy_random.choice(URN,
                                                Total.group,
                                                replace=False).sum()
                            for trial in range(TRIALS)])
    treatments = Total.dont_buy - controls

    control_proportions = controls/Total.group
    treatment_proportions = treatments/Total.group

    differences = treatment_proportions - control_proportions
    simulation = pandas.DataFrame.from_dict({
        "Point Estimate": differences,
        "Control": control_proportions,
        "Treatment": treatment_proportions,
    })
2020-09-26 20:02:52,224 graeae.timers.timer start: Started: 2020-09-26 20:02:52.224515
2020-09-26 20:03:29,174 graeae.timers.timer end: Ended: 2020-09-26 20:03:29.174373
2020-09-26 20:03:29,175 graeae.timers.timer end: Elapsed: 0:00:36.949858

Plot the Distribution

plot = simulation.hvplot.hist("Point Estimate",
                              bins=25).opts(
    title="Opportunity Cost Don't Buy Difference",
    width=Plot.width,
    height=Plot.height,
    color=Plot.tan,
    )
outcome = Embed(plot=plot, file_name="opportunity_cost_simulation")()
print(outcome)

Figure Missing

The distribution appears to be reasonably normal, if a bit peaked in the middle.

matched = len(simulation[simulation["Point Estimate"] >= POINT_ESTIMATE])/len(simulation)
print(f"Percent of trials >= Point Estimate of Study: {100 * matched:0.3f} %")
Percent of trials >= Point Estimate of Study: 0.817 %

According to our simulation, we would see a difference as large as the one in the study less than one percent of the time by random chance alone.
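
As a sanity check on the simulation, here's a minimal sketch of the same comparison done with a pooled two-proportion z-test (the normal approximation). Note that scipy isn't among the imports above, so it's an extra dependency here; the one-sided p-value should land in the same sub-one-percent ballpark as the simulation.

# sanity check: pooled two-proportion z-test (normal approximation)
from scipy import stats

pooled = Total.dont_buy / Total.students
standard_error = numpy.sqrt(pooled * (1 - pooled) * (2 / Total.group))
z_score = POINT_ESTIMATE / standard_error
print(f"z = {z_score:0.2f}, one-sided p-value = {1 - stats.norm.cdf(z_score):0.4f}")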

Test Our Hypothesis

print(f"Null Hypothesis is {matched >= POINT_ESTIMATE}")
Null Hypothesis is False

Since the simulated probability is well below our significance level, we reject the Null Hypothesis and conclude, per the Alternative Hypothesis, that telling students about the opportunity cost of buying a DVD does have an effect on whether they buy it.

End

So here we have a walkthrough of inference using simulation and an experiment with Control and Treatment groups. Although the conclusion reached is that reminding students of the money they would have in the future if they didn't spend it is "causal", since this was an experiment, I'm not 100% convinced that asking students what they would do in a hypothetical scenario tells us what they would actually do with their money.

In the Abstract

  1. Do you have control and treatment groups?
  2. Create an urn where the ones and zeros are equal to the totals for each of the outcomes
  3. Sample from the urn a simulated control group and treatment group
  4. Find the difference between the proportion of the control group and the treatment group that match the "success" outcome
  5. Calculate the fraction of the differences that are equal to or greater than the differences in the study
  6. Check if the fraction in the simulation meets your confidence level
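
Wrapped up as a function, the checklist above might look something like this - a minimal, generic sketch (the function and argument names are my own, not from any library) of a one-sided permutation test for the difference in "success" proportions between two equal-sized groups.

def permutation_p_value(control_successes, treatment_successes, group_size,
                        trials=100000, rng=None):
    """Estimate P(difference >= observed) under the null of no treatment effect."""
    rng = rng if rng is not None else default_rng()
    total_successes = control_successes + treatment_successes
    observed = (treatment_successes - control_successes) / group_size
    # the urn holds a one for every "success" and a zero for everything else
    urn = numpy.append(numpy.ones(total_successes),
                       numpy.zeros(2 * group_size - total_successes))
    simulated_control = numpy.array(
        [rng.choice(urn, group_size, replace=False).sum() for _ in range(trials)])
    differences = ((total_successes - simulated_control) - simulated_control) / group_size
    return (differences >= observed).mean()

# for the DVD study this should land near the simulated value of roughly 0.8%
print(permutation_p_value(control_successes=19, treatment_successes=34, group_size=75))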

Gender Discrimination Inference

Beginning

Imports

# python
from argparse import Namespace
from functools import partial

# pypi
from expects import (
    equal,
    expect,    
)
from numpy.random import default_rng
from tabulate import tabulate

import holoviews
import hvplot.pandas
import numpy
import pandas

# my stuff
from graeae import EmbedHoloviews, Timer

Randomness

numpy_random = default_rng()

Plotting, Tables, and a Timer

Plotting

Embed = partial(EmbedHoloviews,
                folder_path="files/posts/gender-discrimination/")
Plot = Namespace(
    width=990,
    height=780,
    fontscale=2,
    tan="#ddb377",
    blue="#4687b7",
    red="#ce7b6d",
    color_cycle = holoviews.Cycle(["#4687b7", "#ce7b6d"])
)

The Table

TABLE = partial(tabulate, tablefmt="orgtbl", headers="keys")

The Timer

TIMER = Timer()

Middle

The Original Study

This starts with a study in which bank managers were each given a personnel file and asked to decide if they would promote the person represented by the file to branch manager. The files were identical except that half of them were labeled male and half female. The researchers wanted to know if the managers were biased in favor of men. The table below shows their results.

study = pandas.DataFrame({"Promoted": [21, 14],
                          "Not Promoted":[3, 10]},
                         index=["Male", "Female"])
row = study.sum()
row.name = "Total"
study = study.append(row)
column = study.sum(axis="columns")
study["Total"] = column
print(TABLE(study))
|        |   Promoted |   Not Promoted |   Total |
|--------+------------+----------------+---------|
| Male   |         21 |              3 |      24 |
| Female |         14 |             10 |      24 |
| Total  |         35 |             13 |      48 |

Let's look at the values (without the totals) as a plot.

plotter = study.reset_index()
del(plotter["Total"])
plotter = plotter.iloc[:-1]
plotter = plotter.melt(id_vars=["index"],
                       value_vars=["Promoted", "Not Promoted"])
plotter = plotter.rename(columns=dict(index="Gender",
                                      value="Count",
                                      variable="Decision"))
plot = plotter.hvplot.bar(x="Gender", y="Count", by="Decision").opts(
    title="Outcome by Gender",
    height=Plot.height,
    width=Plot.width,
    fontscale=Plot.fontscale,
    color=Plot.color_cycle,
)
outcome = Embed(plot=plot, file_name="promotions_by_gender")()
print(outcome)

Figure Missing

It looks like a considerable majority of the males were promoted whereas for the females only a slight majority were promoted.

plot = plotter.hvplot.bar(x="Decision", y="Count", by="Gender").opts(
    title="Male Vs Female by Decision",
    height=Plot.height,
    width=Plot.width,
    fontscale=Plot.fontscale,
    color=Plot.color_cycle,
)
outcome = Embed(plot=plot, file_name="decision_by_gender")()
print(outcome)

Figure Missing

This plot doesn't make the disparity look quite as extreme as the previous plot did, I think.

About the Variables

I started using them without explicitly stating it, but we have two variables here - Promoted and Not Promoted are values of the decision variable, and Male and Female are values of the gender variable.

Anyway… so more males were chosen for promotion than females, but by what proportion?

print(TABLE(study/study.loc["Total"]))
|        |   Promoted |   Not Promoted |   Total |
|--------+------------+----------------+---------|
| Male   |        0.6 |       0.230769 |     0.5 |
| Female |        0.4 |       0.769231 |     0.5 |
| Total  |          1 |              1 |       1 |

So 60% of the promoted were male and 40% of the promoted were female.

print(f"{.6/.4:.2f}")
1.50

If you were promoted, you were one and a half times as likely to be male as female. Another way to look at it is to ask - what proportion of each gender was promoted?

fractions = study.divide(study["Total"], axis="rows")
print(TABLE(fractions))
|        |   Promoted |   Not Promoted |   Total |
|--------+------------+----------------+---------|
| Male   |   0.875    |       0.125    |       1 |
| Female |   0.583333 |       0.416667 |       1 |
| Total  |   0.729167 |       0.270833 |       1 |

So about 88% of the males were promoted while only 58% of the females were promoted.

POINT_ESTIMATE = fractions.Promoted.loc['Male'] - fractions.Promoted.loc['Female']
print(f"{POINT_ESTIMATE:0.2f}")
0.29

There was a roughly 29 percentage-point difference between the rate of male promotion and the rate of female promotion. The question we have now is - could this difference have happened by random chance, or is it evidence of bias?

The Experiment

We have a point estimate of the difference of 0.29 - is this evidence of bias?

Our Hypotheses

  • \(H_0\): Null Hypothesis. The variables gender and decision are independent and the observed difference was due to chance.
  • \(H_A\): Alternative Hypothesis. The variables gender and decision are not independent and the difference was not due to chance, but rather women were less likely to be promoted than men.

I'm going to pick an arbitrary significance level of 0.05 (a 95% confidence level).

ALPHA = 0.05

The Simulation

The basic method here is that we'll create an "urn" with an equal number of "balls" for men and women (24 of each in this case), randomly select 35 balls representing the number that were promoted, and record the difference between the fraction of males promoted and the fraction of females promoted. To make the math simple I'll run it a number of times that is a power of 10.

  • Some Setup

    To make the counting easier I'll set males to be 1 and females to be 0 (so the number of males promoted is the sum and the females is the remainder).

    NUMBER_OF_MALES = NUMBER_OF_FEMALES = 24
    MALE = 1
    FEMALE = 0
    PROMOTED = 35
    MALES = numpy.ones(NUMBER_OF_MALES)
    FEMALES = numpy.zeros(NUMBER_OF_FEMALES)
    URN = numpy.append(MALES, FEMALES)
    
    expect(len(URN)).to(equal(NUMBER_OF_MALES + NUMBER_OF_FEMALES))
    expect(URN.sum()).to(equal(NUMBER_OF_MALES))
    
  • The Trials
    TRIALS = 1 * 10**7
    males_promoted = [numpy_random.choice(URN, PROMOTED, replace=False).sum()
                      for trial in range(TRIALS)]
    females_promoted = [PROMOTED - males for males in males_promoted]
    proportion_males_promoted = [males/NUMBER_OF_MALES for males in males_promoted]
    proportion_females_promoted = [females/NUMBER_OF_FEMALES for females in females_promoted]
    pairs = zip(proportion_males_promoted, proportion_females_promoted)
    differences = [males - females for males, females in pairs]
    

Looking at the Distribution

data = pandas.DataFrame.from_dict({
    "Point Estimate": differences,
    "Males Promoted": males_promoted,
    "Females Promoted": females_promoted,
    "Proportion Males Promoted": proportion_males_promoted,
    "Proportion Females Promoted": proportion_females_promoted,
})
plot = data.hvplot.hist("Point Estimate").opts(
    title="Distribution of Differences in Gender Promotion",
    width=Plot.width,
    height=Plot.height,
    fontscale=Plot.fontscale,
)

outcome = Embed(plot=plot, file_name="difference_distribution")()
print(outcome)

Figure Missing

The distribution looks mostly normal. As you might guess, it is centered around 0 - cases where roughly the same number of males and females were promoted (although since 35 were promoted each time, the counts can never be exactly equal, so no trial lands exactly on 0). The cases where the difference in proportion is as great or greater than it was in the original study are in the rightmost two bins.

I should note that because the sample size is so small, the differences can only take a handful of discrete values - there's not enough variation to make a smooth curve (thus the gaps in the histogram).
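
A quick check shows why: with 35 promotions split between the two groups, the difference in proportions can only take a handful of evenly spaced values, so a finer-grained histogram necessarily has empty bins.

# with m males promoted the difference is m/24 - (35 - m)/24 = (2m - 35)/24,
# and m can't be below 11 since at most 24 of the 35 promotions can go to women
possible_differences = [(2 * males - 35) / 24 for males in range(11, 25)]
print([f"{difference:0.3f}" for difference in possible_differences])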

But anyway, what proportion of our simulations had as much or more of a difference than that found in the study?

proportion = len(data[data["Point Estimate"] >= POINT_ESTIMATE])/TRIALS
print(f"Proportion: {100 * proportion:0.3f} %")
Proportion: 2.450 %
print(f"Probability of our study's difference in promotion between genders by chance alone: {proportion:0.2f}.")
print(f"Our tolerance was {ALPHA:0.2f}.")
Probability of our study's difference in promotion between genders by chance alone: 0.02.
Our tolerance was 0.05.

So we reject the null hypothesis and conclude that the difference between the number of men and women promoted in the original study is statistically significant and consistent with gender bias.

End

Well, that was the replication (sort of) of this problem from Introductory Statistics with Randomization and Simulation. The point of it was both to review Hypothesis Testing and to see how it can be done using simulation rather than a parametric test.
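
For comparison, a non-simulation route here is Fisher's exact test: the urn sampling conditions on both margins of the table, which is exactly the hypergeometric distribution that Fisher's test sums over, so the exact one-sided p-value should land close to the simulated 2.45%. Treat this as a hedged sketch - scipy isn't used elsewhere in this post.

# exact counterpart of the permutation test: one-sided Fisher's exact test on the 2x2 counts
from scipy import stats

odds_ratio, p_value = stats.fisher_exact([[21, 3], [14, 10]], alternative="greater")
print(f"One-sided p-value: {p_value:0.4f}")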

In Abstract

  1. Is your problem that you suspect some kind of bias in outcomes for two groups?
  2. Get the Point Estimate for the value you want to test
  3. State the Null Hypothesis that what happened could happen by chance and the Alternative Hypothesis that there was bias involved.
  4. Decide on a tolerance level for the probability that it happened by chance.
  5. Set up your urn
  6. Simulate random selections for a large number of trials.
  7. Calculate the proportion of the trials that were greater than the original study's Point Estimate.
  8. Make a conclusion whether the original outcome could have happened by random chance or not.

Pulling Oregon Covid-19 Data

Beginning

This is a look at pulling Oregon Covid-19 data from their weekly update PDFs. There are datasets for Oregon Covid-19 cases published by the Oregon Health Authority but for some reason I can't find the raw data sources matching what they have in their weekly reports so I thought maybe instead of just endlessly searching I'd pull them from the PDFs themselves.

Set Up

Imports

# python
from pathlib import Path

import calendar
import datetime
import os

# pypi
from dateutil.relativedelta import relativedelta
from dotenv import load_dotenv

import requests

The Path

This is where I'm going to save the PDFs - to make it portable I usually keep the path in a file that dotenv loads as an environment variable. This way, if things get moved around I can edit the file to change the path and the code that uses it doesn't have to change.

load_dotenv(Path("~/.local/share/env").expanduser())
FOLDER = Path(os.environ["OREGON_COVID"]).expanduser()
if not FOLDER.is_dir():
    FOLDER.mkdir(parents=True)

Middle

Starting the Dates

I could only find a link to the current PDF on their page (as of August 4, 2020). The URL embeds the date in it so to work backwards (and maybe later forwards) we need an easy way to set it. Here's what the URL looks like.

EXAMPLE_URL = "https://www.oregon.gov/oha/PH/DISEASESCONDITIONS/DISEASESAZ/Emerging%20Respitory%20Infections/COVID-19-Weekly-Report-2020-07-29-FINAL.pdf"

So to work with it I'll find out what day of the week July 29, 2020 was (the date in the URL).

START = datetime.date(2020, 7, 29)
WEEKDAY = START.weekday()
assert WEEKDAY == calendar.WEDNESDAY
print(WEEKDAY)
2

So, it looks like it came out on a Wednesday, which I'm going to assume is the same every week. Now we can find the most recent Wednesday, which will be the start (or end, depending on how you look at it) of our date range.

TODAY = datetime.date.today()
LAST = TODAY + relativedelta(weekday=WEEKDAY) - relativedelta(weeks=1) 
print(LAST)
2020-08-12

Now I'm going to work backwards one week at a time to create the URLs and file-names. Using relativedelta makes it easy-peasy to get prior weeks from a given date. Here's the Wednesday before the most recent one.

one_week = relativedelta(weeks = 1)
print(LAST - one_week)
2020-08-05

Now, I don't actually know when these PDFs started so I'm just going to work backwards until I get an error message from my HTTP request and then stop.

FILE_NAME = "COVID-19-Weekly-Report-{}-FINAL.pdf"
BASE_URL = "https://www.oregon.gov/oha/PH/DISEASESCONDITIONS/DISEASESAZ/Emerging%20Respitory%20Infections/{}"

for week in range(20):
    print(f"Checking back {week} weeks from this past Wednesday")
    date = LAST - (one_week * week)
    filename = FILE_NAME.format(date)
    output_path = FOLDER/filename
    if not output_path.is_file():
        url = BASE_URL.format(filename)
        response = requests.get(url)
        if not response.ok:
            print(f"Bad week: {week}\tDate: {date}")
            break
        print(f"Saving {filename}")
        with output_path.open("wb") as writer:            
            writer.write(response.content)
Checking back 0 weeks from this past Wednesday
Saving COVID-19-Weekly-Report-2020-08-12-FINAL.pdf
Checking back 1 weeks from this past Wednesday
Saving COVID-19-Weekly-Report-2020-08-05-FINAL.pdf
Checking back 2 weeks from this past Wednesday
Saving COVID-19-Weekly-Report-2020-07-29-FINAL.pdf
Checking back 3 weeks from this past Wednesday
Saving COVID-19-Weekly-Report-2020-07-22-FINAL.pdf
Checking back 4 weeks from this past Wednesday
Saving COVID-19-Weekly-Report-2020-07-15-FINAL.pdf
Checking back 5 weeks from this past Wednesday
Saving COVID-19-Weekly-Report-2020-07-08-FINAL.pdf
Checking back 6 weeks from this past Wednesday
Saving COVID-19-Weekly-Report-2020-07-01-FINAL.pdf
Checking back 7 weeks from this past Wednesday
Saving COVID-19-Weekly-Report-2020-06-24-FINAL.pdf
Checking back 8 weeks from this past Wednesday
Saving COVID-19-Weekly-Report-2020-06-17-FINAL.pdf
Checking back 9 weeks from this past Wednesday
Saving COVID-19-Weekly-Report-2020-06-10-FINAL.pdf
Checking back 10 weeks from this past Wednesday
Saving COVID-19-Weekly-Report-2020-06-03-FINAL.pdf
Checking back 11 weeks from this past Wednesday
Saving COVID-19-Weekly-Report-2020-05-27-FINAL.pdf
Checking back 12 weeks from this past Wednesday
Bad week: 12    Date: 2020-05-20

So, there are twelve weekly PDFs going back to May 27, 2020. Which seems a little short, given that Oregon started telling people to stay home in March, but maybe they have the earlier data somewhere else. Anyway, next up is pulling the data out of the PDF files using tabula-py.
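
As a preview of that next step, here's a hedged sketch of what reading one of the downloaded reports with tabula-py might look like. It needs Java installed, and which of the extracted tables actually holds the numbers we want is an assumption to check by inspection.

# not run here: pull every table tabula can find in one weekly report
import tabula

sample = FOLDER/"COVID-19-Weekly-Report-2020-08-12-FINAL.pdf"
tables = tabula.read_pdf(str(sample), pages="all", multiple_tables=True)
print(f"Tables found: {len(tables)}")
print(tables[0].head())  # the first table may not be the one we want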

Data Leakage

Beginning

Most people find target leakage very tricky until they've thought about it for a long time.

So, before trying to think about leakage in the housing price example, we'll go through a few examples in other applications. Things will feel more familiar once you come back to a question about house prices.

1. The Data Science of Shoelaces

Nike has hired you as a data science consultant to help them save money on shoe materials. Your first assignment is to review a model one of their employees built to predict how many shoelaces they'll need each month. The features going into the machine learning model include:

  • The current month (January, February, etc)
  • Advertising expenditures in the previous month
  • Various macroeconomic features (like the unemployment rate) as of the beginning of the current month
  • The amount of leather they ended up using in the current month

The results show the model is almost perfectly accurate if you include the feature about how much leather they used. But it is only moderately accurate if you leave that feature out. You realize this is because the amount of leather they use is a perfect indicator of how many shoes they produce, which in turn tells you how many shoelaces they need.

Do you think the leather used feature constitutes a source of data leakage? If your answer is "it depends," what does it depend on?

The amount of leather used does look like leakage, but it depends on whether the value is known before the shoelace predictions need to be made. If you won't know it in time for the predictions, then it is a leak. If there's a reason you would always know the amount of leather used but somehow not know how many shoes were made, it might not technically be a leak, but it seems odd that it would be available. It could be usable if it were a forecast of the leather that will be needed, but then it seems odd to use that to predict shoelaces.

2. Return of the Shoelaces

You have a new idea. You could use the amount of leather Nike ordered (rather than the amount they actually used) leading up to a given month as a predictor in your shoelace model.

Does this change your answer about whether there is a leakage problem? If you answer "it depends," what does it depend on?

Whether it is a leak depends on whether the leather is always ordered before the shoelace predictions need to be made. If the leather orders always come first, then it wouldn't be a leak, because the feature would be available at prediction time.

3. Getting Rich With Cryptocurrencies?

You saved Nike so much money that they gave you a bonus. Congratulations.

Your friend, who is also a data scientist, says he has built a model that will let you turn your bonus into millions of dollars. Specifically, his model predicts the price of a new cryptocurrency (like Bitcoin, but a newer one) one day ahead of the moment of prediction. His plan is to purchase the cryptocurrency whenever the model says the price of the currency (in dollars) is about to go up.

The most important features in his model are:

  • Current price of the currency
  • Amount of the currency sold in the last 24 hours
  • Change in the currency price in the last 24 hours
  • Change in the currency price in the last 1 hour
  • Number of new tweets in the last 24 hours that mention the currency

The value of the cryptocurrency in dollars has fluctuated up and down by over $100 in the last year, and yet his model's average error is less than $1. He says this is proof his model is accurate, and you should invest with him, buying the currency whenever the model says it is about to go up.

Is he right? If there is a problem with his model, what is it?

The data isn't leaking, but the model isn't useful either. Because tomorrow's price is almost always close to today's price, a model that effectively predicts "about the same as now" can have an average error under a dollar while telling you nothing about whether the price is about to go up or down - which is the only thing his trading plan depends on.
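
To see why a sub-dollar average error isn't impressive here, consider a toy random walk (an assumed price process, not his data): a model that just predicts "tomorrow's price equals today's price" also gets a tiny average error while saying nothing about which direction the price will move.

# toy illustration: a "no-change" prediction on a random walk has a tiny average error
import numpy

rng = numpy.random.default_rng(0)
prices = 100 + numpy.cumsum(rng.normal(0, 1, size=365))  # hypothetical daily prices
persistence_error = numpy.abs(prices[1:] - prices[:-1]).mean()
print(f"Average error of predicting 'no change': ${persistence_error:.2f}")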

4. Preventing Infections

An agency that provides healthcare wants to predict which patients from a rare surgery are at risk of infection, so it can alert the nurses to be especially careful when following up with those patients.

You want to build a model. Each row in the modeling dataset will be a single patient who received the surgery, and the prediction target will be whether they got an infection.

Some surgeons may do the procedure in a manner that raises or lowers the risk of infection. But how can you best incorporate the surgeon information into the model?

You have a clever idea.

  1. Take all surgeries by each surgeon and calculate the infection rate among those surgeons.
  2. For each patient in the data, find out who the surgeon was and plug in that surgeon's average infection rate as a feature.

Does this pose any target leakage issues? Does it pose any train-test contamination issues?

The infection rate would have a target leak if the calculated value includes the patient whose row it is added to.

You would have train-test contamination if you calculated this value using both the train and test set. You would have to calculate it only on the training set to avoid contamination.
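
Here's a minimal sketch of the leak-free version of that feature (the surgeon and infection column names are hypothetical): compute each surgeon's rate from the training rows only, leaving each patient's own outcome out of their own feature.

# leave-one-out target encoding on training data only (toy data, hypothetical columns)
import pandas

train = pandas.DataFrame({"surgeon": ["a", "a", "a", "b", "b"],
                          "infection": [1, 0, 0, 1, 1]})
group = train.groupby("surgeon")["infection"]
totals, counts = group.transform("sum"), group.transform("count")
# subtract the patient's own outcome so it can't leak into its own feature
train["surgeon_rate"] = (totals - train["infection"]) / (counts - 1)
print(train)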

5. Housing Prices

You will build a model to predict housing prices. The model will be deployed on an ongoing basis, to predict the price of a new house when a description is added to a website. Here are four features that could be used as predictors.

  1. Size of the house (in square meters)
  2. Average sales price of homes in the same neighborhood
  3. Latitude and longitude of the house
  4. Whether the house has a basement

You have historic data to train and validate the model.

Which of the features is most likely to be a source of leakage?

Average sales price of homes in the same neighborhood. If the home was sold in the past, then it would contribute to the average.

Leakage is a hard and subtle issue. You should be proud if you picked up on the issues in these examples.

Now you have the tools to make highly accurate models, and pick up on the most difficult practical problems that arise with applying these models to solve real problems.

XGBoost

Beginning

In this exercise, you will leverage what you've learned to tune a machine learning model with cross-validation.

Imports

Python

from argparse import Namespace
from functools import partial
from pathlib import Path

import random

PyPi

from eli5.sklearn import PermutationImportance
from matplotlib import pyplot

from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer
from sklearn.impute import SimpleImputer
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import cross_val_score, train_test_split, RandomizedSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

from xgboost import XGBRegressor

from tabulate import tabulate

import eli5
import hvplot.pandas
import numpy
import pandas
import shap

Others

from graeae import CountPercentage, EmbedHoloviews, EnvironmentLoader, Timer

Set Up

Table

TABLE = partial(tabulate, tablefmt="orgtbl", showindex=False, headers="keys")

Plottting

SLUG = "xgboost"
OUTPUT_PATH = Path("../../files/posts/tutorials/")/SLUG
if not OUTPUT_PATH.is_dir():
    OUTPUT_PATH.mkdir()
Embed = partial(EmbedHoloviews, folder_path=OUTPUT_PATH)
Plot = Namespace(
    height=800,
    width=1000,
)

The Timer

TIMER = Timer()

Environment

ENVIRONMENT = EnvironmentLoader()

The Data

DATA_PATH = Path(ENVIRONMENT["HOUSE-PRICES-IOWA"]).expanduser()
train_data = pandas.read_csv(
    DATA_PATH/"train.csv", index_col="Id")

test_data = pandas.read_csv(
    DATA_PATH/"test.csv", index_col="Id"
)
X_test = test_data

Some Constants

Data = Namespace(
    target="SalePrice",
    train_size=0.8,
    test_size=0.2,
    random_seed=0,
)

Setup The Data

Split up the target and features.

assert not train_data[Data.target].hasnans
y = train_data[Data.target]
X = train_data.drop([Data.target], axis="columns")
X_train, X_validate, y_train, y_validate = train_test_split(
    X, y,
    train_size=Data.train_size, test_size=Data.test_size,
    random_state=Data.random_seed)

Selecting columns

"Cardinality" means the number of unique values in a column. Select categorical columns with relatively low cardinality (convenient but arbitrary).

low_cardinality_cols = [
    cname for cname in X_train.columns if X_train[cname].nunique() < 10 and
    X_train[cname].dtype == "object"]

Select numeric columns.

numeric_columns = [column for column in X_train.columns
                   if X_train[column].dtype in ['int64', 'float64']]

Keep selected columns only

keep_columns = low_cardinality_cols + numeric_columns
X_train = X_train[keep_columns].copy()
X_validate = X_validate[keep_columns].copy()
X_test = X_test[keep_columns].copy()

One-Hot Encoding

One-hot encode the data (to shorten the code, we use pandas).

X_train = pandas.get_dummies(X_train)
X_validate = pandas.get_dummies(X_validate)
X_test = pandas.get_dummies(X_test)
X_train, X_validate = X_train.align(X_validate, join='left', axis=1)
X_train, X_test = X_train.align(X_test, join='left', axis=1)

Step 1: Build model

In this step, you'll build and train your first model with gradient boosting.

  • Begin by setting my_model_1 to an XGBoost model. Use the XGBRegressor class, and set the random seed to 0 (random_state=0). Leave all other parameters as default.
  • Then, fit the model to the training data in X_train and y_train.
model = XGBRegressor(random_state=Data.random_seed)
model.fit(X_train, y_train)
predictions_1 = model.predict(X_validate)

Finally, use the mean_absolute_error() function to calculate the mean absolute error (MAE) corresponding to the predictions for the validation set. Recall that the labels for the validation data are stored in y_valid.

mae_1 = mean_absolute_error(predictions_1, y_validate)
print(f"Mean Absolute Error: {mae_1}")
Mean Absolute Error: 17662.736729452055

Step 2: Improve the model

Now that you've trained a default model as baseline, it's time to tinker with the parameters, to see if you can get better performance.

  • Begin by setting my_model_2 to an XGBoost model, using the XGBRegressor class. Use what you learned in the previous tutorial to figure out how to change the default parameters (like n_estimators and learning_rate) to get better results.
  • Then, fit the model to the training data in X_train and y_train.
  • Set predictions_2 to the model's predictions for the validation data. Recall that the validation features are stored in X_valid.
  • Finally, use the mean_absolute_error() function to calculate the mean absolute error (MAE) corresponding to the predictions on the validation set. Recall that the labels for the validation data are stored in y_valid.
estimators = list(range(50, 200, 10))
max_depth = list(range(10, 100, 10)) + [None]
learning_rate = 0.05 * numpy.array(range(1, 10))

grid = dict(n_estimators=estimators,
            max_depth=max_depth)
            #learning_rate=learning_rate)

model = XGBRegressor(random_state=Data.random_seed, learning_rate=0.05)
search = RandomizedSearchCV(estimator=model,
                            param_distributions=grid,
                            n_iter=40,
                            scoring="neg_mean_absolute_error",
                            n_jobs=-1,
                            random_state=1)

X_cv = pandas.concat([X_train, X_validate])
y_cv = pandas.concat([y_train, y_validate])
with TIMER:
    search.fit(X_cv, y_cv)
first_model = search.best_estimator_
print(f"CV Training MAE: {-search.best_score_:0.2f}")
print(search.best_params_)
2020-03-01 21:34:37,418 graeae.timers.timer start: Started: 2020-03-01 21:34:37.418449
2020-03-01 21:36:24,107 graeae.timers.timer end: Ended: 2020-03-01 21:36:24.107671
2020-03-01 21:36:24,109 graeae.timers.timer end: Elapsed: 0:01:46.689222
CV Training MAE: 16048.16
{'n_estimators': 160, 'max_depth': None}
outcome = pandas.DataFrame({"Score": search.cv_results_["mean_test_score"]})
early_stopping_model = XGBRegressor(random_state=Data.random_seed,
                                    learning_rate=0.05,
                                    early_stopping_rounds=5,
                                    n_estimators=500)
early_stopping_model.fit(X_train, y_train, eval_set=[(X_validate, y_validate)],
                         verbose=False)
print(mean_absolute_error(early_stopping_model.predict(X_validate),
                          y_validate))
16728.27523009418
params = model.get_params()
print(f"Trees: {params['n_estimators']}")
Trees: 100

Step 3: Break the model

In this step, you will create a model that performs worse than the original model in Step 1. This will help you to develop your intuition for how to set parameters. You might even find that you accidentally get better performance, which is ultimately a nice problem to have and a valuable learning experience!

  • Begin by setting my_model_3 to an XGBoost model, using the XGBRegressor class. Use what you learned in the previous tutorial to figure out how to change the default parameters (like n_estimators and learning_rate) to design a model to get high MAE.
  • Then, fit the model to the training data in X_train and y_train.
  • Set predictions_3 to the model's predictions for the validation data. Recall that the validation features are stored in X_valid.
  • Finally, use the mean_absolute_error() function to calculate the mean absolute error (MAE) corresponding to the predictions on the validation set. Recall that the labels for the validation data are stored in y_valid.
parameters = random.choice(search.cv_results_["params"])
print(parameters)
model = XGBRegressor(random_state=Data.random_seed, **parameters)
with TIMER:
    model.fit(X_train, y_train)
print(f"MAE: {mean_absolute_error(model.predict(X_validate), y_validate)}")
2020-02-29 20:55:54,573 graeae.timers.timer start: Started: 2020-02-29 20:55:54.573008
{'n_estimators': 60, 'max_depth': 20, 'learning_rate': 0.7000000000000001}
2020-02-29 20:55:55,019 graeae.timers.timer end: Ended: 2020-02-29 20:55:55.019335
2020-02-29 20:55:55,020 graeae.timers.timer end: Elapsed: 0:00:00.446327
MAE: 25077.38005672089

Numeric Values Only

imputer = IterativeImputer(random_state=Data.random_seed)

final_x_train = pandas.DataFrame(imputer.fit_transform(X_train),
                                 columns=X_train.columns)
early_stopping_model_2 = XGBRegressor(random_state=Data.random_seed,
                                      learning_rate=0.05,
                                      early_stopping_rounds=5,
                                      n_estimators=500)
early_stopping_model_2.fit(final_x_train, y_train, eval_set=[(X_validate, y_validate)],
                         verbose=False)
print(mean_absolute_error(early_stopping_model_2.predict(X_validate),
                          y_validate))
16516.399373929795

End

Make a Submission using the XGB early-stopping model

predictions = early_stopping_model.predict(X_test)
output = pandas.DataFrame({'Id': X_test.index,
                           'SalePrice': predictions})
output.to_csv(DATA_PATH/'submission.csv', index=False)

This got a score of 14777.96266 on the kaggle submissions page.

Random Search CV

predictions = search.predict(X_test)
output = pandas.DataFrame({'Id': X_test.index,
                           'SalePrice': predictions})
output.to_csv(DATA_PATH/'submission.csv', index=False)

This one got a score of 14976.55345, so the early stopping model is the best one so far… It had fewer trees than the model that the RandomSearch CV ended up with, maybe the Random Search overfit the data.

Early Stopping with Imputation

predictions = early_stopping_model_2.predict(X_test)
output = pandas.DataFrame({'Id': X_test.index,
                           'SalePrice': predictions})
output.to_csv(DATA_PATH/'submission.csv', index=False)

This gets a score of 14965.20801 so it looks like the XGBoost model without imputation is the best one.

Cross Validation

Beginning

In this exercise, you will leverage what you've learned to tune a machine learning model with cross-validation.

Imports

Python

from argparse import Namespace
from functools import partial
from pathlib import Path

PyPi

from eli5.sklearn import PermutationImportance
from matplotlib import pyplot

from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer
from sklearn.impute import SimpleImputer
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

from tabulate import tabulate

import eli5
import hvplot.pandas
import pandas
import shap

Others

from graeae import CountPercentage, EmbedHoloviews, EnvironmentLoader, Timer

Set Up

Table

TABLE = partial(tabulate, tablefmt="orgtbl", showindex=False, headers="keys")

Plottting

SLUG = "cross-validation"
OUTPUT_PATH = Path("../../files/posts/tutorials/")/SLUG
if not OUTPUT_PATH.is_dir():
    OUTPUT_PATH.mkdir()
Embed = partial(EmbedHoloviews, folder_path=OUTPUT_PATH)
Plot = Namespace(
    height=800,
    width=1000,
)

The Timer

TIMER = Timer()

Environment

ENVIRONMENT = EnvironmentLoader()

The Data

DATA_PATH = Path(ENVIRONMENT["HOUSE-PRICES-IOWA"]).expanduser()
train_data = pandas.read_csv(
    DATA_PATH/"train.csv", index_col="Id")

test_data = pandas.read_csv(
    DATA_PATH/"test.csv", index_col="Id"
)
X_test = test_data

Some Constants

Data = Namespace(
    target="SalePrice",
    train_size=0.8,
    test_size=0.2,
    random_seed=0,
)

Setup The Data

Split up the target and features.

assert not train_data[Data.target].hasnans
y = train_data[Data.target]
X = train_data.drop([Data.target], axis="columns")

Drop Categorical Columns

To make it simpler (since we're only looking at cross-validation, and having them didn't seem to help) we're going to drop the categorical columns.

numeric_columns = [column for column in X.columns if not X[column].dtype == "object"]
rows_0, columns_0 = X.shape
X = X[numeric_columns].copy()
row, columns = X.shape
print(f"Keeping {columns} columns, dropped ({columns_0 - columns})")
Keeping 36 columns, dropped (43)

Middle

Some Pipelines

So far, you've learned how to build pipelines with scikit-learn. For instance, the pipeline below will use SimpleImputer() to replace missing values in the data, before using RandomForestRegressor() to train a random forest model to make predictions. We set the number of trees in the random forest model with the n_estimators parameter, and setting random_state ensures reproducibility.

pipeline = Pipeline(steps=[
    ('preprocessor', SimpleImputer()),
    ('model', RandomForestRegressor(n_estimators=50, random_state=Data.random_seed))
])

You have also learned how to use pipelines in cross-validation. The code below uses the cross_val_score() function to obtain the mean absolute error (MAE), averaged across five different folds. Recall we set the number of folds with the cv parameter.

# Multiply by -1 since sklearn calculates *negative* MAE
scores = -1 * cross_val_score(pipeline, X, y,
                              cv=5,
                              scoring='neg_mean_absolute_error')

print("Average MAE score:", scores.mean())
Average MAE score: 18276.410356164386

Step 1: Write a useful function

In this exercise, you'll use cross-validation to select parameters for a machine learning model.

Begin by writing a function get_score() that reports the average (over three cross-validation folds) MAE of a machine learning pipeline that uses:

  • the data in X and y to create folds,
  • SimpleImputer() (with all parameters left as default) to replace missing values, and
  • RandomForestRegressor() (with random_state=0) to fit a random forest model.

The n_estimators parameter supplied to get_score() is used when setting the number of trees in the random forest model.

def get_score(n_estimators):
    """Return the average MAE over 3 CV folds of random forest model.

    Args:
     n_estimators: the number of trees in the forest
    """
    pipeline = Pipeline(steps=[
        ('preprocessor', SimpleImputer()),
        ('model', RandomForestRegressor(n_estimators=n_estimators,
                                        random_state=Data.random_seed))
    ])
    scores = -1 * cross_val_score(pipeline, X, y,
                                  cv=3,
                                  scoring='neg_mean_absolute_error')
    return scores.mean()

Step 2: Test different parameter values

Now, you will use the function that you defined in Step 1 to evaluate the model performance corresponding to eight different values for the number of trees in the random forest: 50, 100, 150, …, 300, 350, 400. Store your results in a Python dictionary results, where results[i] is the average MAE returned by get_score(i).

results = {trees: get_score(trees) for trees in range(50, 450, 50)}

results_frame = pandas.DataFrame.from_dict({"Trees": list(results.keys()), "MAE": list(results.values())})

Step 3: Find the best parameter value

plot = results_frame.hvplot(x="Trees", y="MAE").opts(
    title="Cross-Validation Mean Absolute Error",
    width=Plot.width,
    height=Plot.height)

source = Embed(plot=plot, file_name="mean_absolute_error")()
print(source)

Figure Missing

200 appears to be the best number of trees for our forest.
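
Rather than eyeballing the plot, the best value can also be pulled straight out of the results dictionary.

# the key (tree count) with the smallest average MAE
n_estimators_best = min(results, key=results.get)
print(f"Best number of trees: {n_estimators_best}")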

Pipelines

Beginning

In this exercise, you will use pipelines to improve the efficiency of your machine learning code.

Imports

Python

from argparse import Namespace
from functools import partial
from pathlib import Path

PyPi

from eli5.sklearn import PermutationImportance
from matplotlib import pyplot

from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer
from sklearn.impute import SimpleImputer
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

from tabulate import tabulate

import eli5
import pandas
import shap

Others

from graeae import CountPercentage, EmbedHoloviews, EnvironmentLoader, Timer

Set Up

Table

TABLE = partial(tabulate, tablefmt="orgtbl", showindex=False, headers="keys")

Plottting

SLUG = "kaggle-intermediate-machine-learning-pipelines"
OUTPUT_PATH = Path("../../files/posts/tutorials/")/SLUG
if not OUTPUT_PATH.is_dir():
    OUTPUT_PATH.mkdir()
Embed = partial(EmbedHoloviews, folder_path=OUTPUT_PATH)
Plot = Namespace(
    height=800,
    width=1000,
)

The Timer

TIMER = Timer()

Environment

ENVIRONMENT = EnvironmentLoader()

The Data

DATA_PATH = Path(ENVIRONMENT["HOUSE-PRICES-IOWA"]).expanduser()
train_data = pandas.read_csv(
    DATA_PATH/"train.csv", index_col="Id")

test_data = pandas.read_csv(
    DATA_PATH/"test.csv", index_col="Id"
)
X_test = test_data

Some Constants

Data = Namespace(
    target="SalePrice",
    train_size=0.8,
    test_size=0.2,
    random_seed=0,
)

Setup The Data

Split up the target and features.

assert not train_data[Data.target].hasnans
y = train_data[Data.target]
X = train_data.drop([Data.target], axis="columns")
X_train, X_validate, y_train, y_validate = train_test_split(
    X, y,
    train_size=Data.train_size, test_size=Data.test_size,
    random_state=Data.random_seed)

Trim the columns

# Select categorical columns with relatively low cardinality (convenient but arbitrary)
categorical_columns = [column for column in X_train.columns if
                       X_train[column].nunique() < 10 and 
                       X_train[column].dtype == object]

# Select numerical columns
numerical_columns = [column for column in X_train.columns if 
                     X_train[column].dtype in ['int64', 'float64']]

# Keep selected columns only
columns = categorical_columns + numerical_columns
X_train = X_train[columns].copy()
X_validate = X_validate[columns].copy()
X_test = X_test[columns].copy()

Middle

Preprocess Data and Train the Model

The missing numeric values will be filled in with a SimpleImputer. When the strategy is set to "constant" it fills missing values with a single value (0 by default).

numerical_transformer = SimpleImputer(strategy='constant')

Now the categorical data transformer. We'll fill in missing values in each categorical column with that column's most frequent value and then do one-hot encoding.

categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])

Now we can bundle them together into a single transformer.

preprocessor = ColumnTransformer(
    transformers=[
        ('num', numerical_transformer, numerical_columns),
        ('cat', categorical_transformer, categorical_columns)
    ])

Define The Model

model = RandomForestRegressor(n_estimators=100, random_state=0)

Build the Pipeline

pipeline = Pipeline(steps=[('preprocessor', preprocessor),
                      ('model', model)
                     ])

Fit the Model

# Preprocessing of training data, fit model 
pipeline.fit(X_train, y_train)

Score the Model

preds = pipeline.predict(X_validate)

print(f"MAE: {mean_absolute_error(y_validate, preds):,}")
MAE: 17,861.780102739725

Improving the Performance

Now, it's your turn! In the code cell below, define your own preprocessing steps and random forest model. Fill in values for the following variables:

  • numerical_transformer
  • categorical_transformer
  • model
numerical_transformer = IterativeImputer(random_state=Data.random_seed)

I'll use the same categorical imputer.

categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])

And now we bundle them together.

preprocessor = ColumnTransformer(
    transformers=[
        ('numeric', numerical_transformer, numerical_columns),
        ('categorical', categorical_transformer, categorical_columns)
    ])

Now build and train the model.

model = RandomForestRegressor(n_estimators=50, max_depth=60, random_state=0)
pipeline = Pipeline(steps=[('preprocessor', preprocessor),
                           ('model', model)
                           ])
pipeline.fit(X_train, y_train)
predictions = pipeline.predict(X_validate)

print(f"MAE: {mean_absolute_error(y_validate, predictions):,}")
MAE: 17,556.51404109589

So we improved slightly, but we're still not doing as well as with the numeric only dataset.

SHAP

with TIMER:
    training = X_train[numerical_columns + categorical_columns]
    data = preprocessor.fit_transform(training)
    columns = (numerical_columns
               + list(preprocessor.named_transformers_["categorical"]["onehot"].get_feature_names()))
    data = pandas.DataFrame(
        data,
        columns=columns)
    explainer = shap.TreeExplainer(model)
    shap_values = explainer.shap_values(data)
2020-03-01 19:25:58,515 graeae.timers.timer start: Started: 2020-03-01 19:25:58.514631
Setting feature_perturbation = "tree_path_dependent" because no background data was given.
2020-03-01 19:26:16,103 graeae.timers.timer end: Ended: 2020-03-01 19:26:16.103820
2020-03-01 19:26:16,104 graeae.timers.timer end: Elapsed: 0:00:17.589189
shap.summary_plot(shap_values, data)
figure = pyplot.gcf()
output = "shap_summary.png"

figure.savefig(OUTPUT_PATH/output)
print(f"[[file:{output}]]")

[[file:shap_summary.png]]

shap.initjs()
html = shap.force_plot(explainer.expected_value, shap_values, data)
output_file = "force_plot.html"
output = OUTPUT_PATH/output_file
with output.open("w") as writer:
    shap.save_html(writer, html, False)

print(f"""
#+begin_export html
: <object type="text/html" data="{output_file}" style="width:100%" height=800>
:   <p>Figure Missing</p>
: </object>
#+end_export
""")

Figure Missing

End

Categorical Values

Beginning

Now it's your turn to test your new knowledge of handling categorical variables. You'll probably find it makes a big difference.

Imports

Python

from argparse import Namespace
from datetime import datetime
from functools import partial
from pathlib import Path

PyPi

from eli5.sklearn import PermutationImportance
from matplotlib import pyplot
from pdpbox import pdp

from sklearn.ensemble import RandomForestRegressor
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import cross_val_score, train_test_split, RandomizedSearchCV
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

from tabulate import tabulate

import eli5
import hvplot.pandas
import pandas
import shap

Others

from graeae import CountPercentage, EmbedHoloviews, EnvironmentLoader, Timer

Set Up

Table

TABLE = partial(tabulate, tablefmt="orgtbl", showindex=False, headers="keys")

Plottting

SLUG = "categorical-values"
OUTPUT_PATH = Path("../../files/posts/tutorials/")/SLUG
Embed = partial(EmbedHoloviews, folder_path=OUTPUT_PATH)
Plot = Namespace(
    height=800,
    width=1000,
)

The Timer

TIMER = Timer()

Environment

ENVIRONMENT = EnvironmentLoader()

The Data

DATA_PATH = Path(ENVIRONMENT["HOUSE-PRICES-IOWA"]).expanduser()
train_data = pandas.read_csv(
    DATA_PATH/"train.csv", index_col="Id")

test_data = pandas.read_csv(
    DATA_PATH/"test.csv", index_col="Id"
)

Some Constants

Data = Namespace(
    target="SalePrice",
    train_size=0.8,
    test_size=0.2,
    random_seed=0,
)

Middle

Setup The Data

Split up the target and features.

assert not train_data[Data.target].hasnans
y = train_data[Data.target]
X = train_data.drop([Data.target], axis="columns")

We know that there's missing data, but since this is about handling categorical data, not missing data, we'll just drop the columns that have missing values.

print(X.shape)
X = X[[column for column in X.columns if not X[column].hasnans]]
test_data = test_data[X.columns]
print(X.shape)
(1460, 79)
(1460, 60)

So we lost 19 columns - more than I was expecting.

Now do the train-test split.

X_train, X_validate, y_train, y_validate = train_test_split(
    X, y,
    train_size=Data.train_size, test_size=Data.test_size,
    random_state=Data.random_seed)

Notice that the dataset contains both numerical and categorical variables. You'll need to encode the categorical data before training a model.

Score Dataset

This is the same function used in the missing-values tutorial. It's used to compare different models' Mean Absolute Error (MAE) as we make changes.

def score_dataset(X_train, X_valid, y_train, y_valid):
    model = RandomForestRegressor(n_estimators=100, random_state=0)
    model.fit(X_train, y_train)
    preds = model.predict(X_valid)
    return mean_absolute_error(y_valid, preds)

Step 1: Drop Categorical Columns

The first approach is to just drop all the non-numeric columns.

columns = [column for column in X_train.columns if X_train[column].dtype != object]
drop_X_train = X_train[columns]
drop_X_validate = X_validate[columns]

print("MAE from Approach 1 (Drop categorical variables):")
print(f"{score_dataset(drop_X_train, drop_X_validate, y_train, y_validate):,}")

Using all the numeric columns does better than we did with our initial subset of columns (20,928.5), but not as good as we did with imputed values (16,656.3).

Step 2: Label encoding

Before jumping into label encoding, we'll investigate the dataset. Specifically, we'll look at the 'Condition2' column. The code cell below prints the unique entries in both the training and validation sets.

train_counter = CountPercentage(X_train.Condition2, dropna=False)
validate_counter = CountPercentage(X_validate.Condition2, dropna=False)
train_counter()
| Value  | Count | Percent (%) |
|--------+-------+-------------|
| Norm   |  1160 |       99.32 |
| Feedr  |     4 |        0.34 |
| PosA   |     1 |        0.09 |
| Artery |     1 |        0.09 |
| RRAe   |     1 |        0.09 |
| PosN   |     1 |        0.09 |
validate_counter()
| Value  | Count | Percent (%) |
|--------+-------+-------------|
| Norm   |   285 |       97.60 |
| Feedr  |     2 |        0.68 |
| RRNn   |     2 |        0.68 |
| RRAn   |     1 |        0.34 |
| Artery |     1 |        0.34 |
| PosN   |     1 |        0.34 |

It looks like the validation data has values that aren't in the training data (and vice versa), e.g. RRNn, so a label encoder fit on the training set will fail when it tries to transform the validation set.

This is a common problem that you'll encounter with real-world data, and there are many approaches to fixing this issue. For instance, you can write a custom label encoder to deal with new categories. The simplest approach, however, is to drop the problematic categorical columns.
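Before we take the drop route, here's a minimal sketch of what a tolerant custom encoder could look like: it maps categories it never saw during fitting to a -1 sentinel. The class name and the sentinel are my own inventions for this illustration; nothing later in the post uses them.

class TolerantLabelEncoder:
    """Label-encodes one column, mapping unseen categories to -1."""
    def fit(self, values):
        # give each training category its own integer code
        self.mapping = {value: code
                        for code, value in enumerate(sorted(set(values)))}
        return self

    def transform(self, values):
        # categories not seen during fit get the sentinel code -1
        return [self.mapping.get(value, -1) for value in values]

encoder = TolerantLabelEncoder().fit(X_train.Condition2)
encoded_validation = encoder.transform(X_validate.Condition2)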

Run the code cell below to save the problematic columns to a Python list, bad_label_columns. Likewise, columns that can be safely label encoded are stored in good_label_columns.

# All categorical columns
object_columns = [column for column in X_train.columns if X_train[column].dtype == "object"]

# Columns that can be safely label encoded
good_label_columns = [column for column in object_columns if 
                      set(X_train[column]) == set(X_validate[column])]

# Problematic columns that will be dropped from the dataset
bad_label_columns = list(set(object_columns)-set(good_label_columns))

print('Categorical columns that will be label encoded:')
for column in  good_label_columns:
    print(f" - {column}")

print('\nCategorical columns that will be dropped from the dataset:')
for column in bad_label_columns:
    print(f" - {column}")

Categorical columns that will be label encoded:

  • MSZoning
  • Street
  • LotShape
  • LandContour
  • LotConfig
  • BldgType
  • HouseStyle
  • ExterQual
  • CentralAir
  • KitchenQual
  • PavedDrive
  • SaleCondition

Categorical columns that will be dropped from the dataset:

  • Condition1
  • RoofMatl
  • HeatingQC
  • ExterCond
  • RoofStyle
  • SaleType
  • Foundation
  • Condition2
  • Exterior2nd
  • Neighborhood
  • Heating
  • LandSlope
  • Utilities
  • Functional
  • Exterior1st

Drop the Bad Columns

label_X_train = X_train.drop(bad_label_columns, axis="columns")
label_X_validate = X_validate.drop(bad_label_columns, axis="columns")

Encode the Categorical Values

We're going to use sklearn's LabelEncoder.

Note: Sklearn's documentation says that LabelEncoder is meant only for categorical target data (the labels), not for input features the way we're using it here. Later on we're going to use one-hot encoding, which is what sklearn recommends for inputs, since label encoding assigns arbitrary integer codes that a model can mistake for meaningfully ordered values.

It's going to create integer values for each of the unique values in each column.

for column in good_label_columns:
    encoder = LabelEncoder()
    # fit on the training column, then re-use the fitted encoder on the validation column
    label_X_train.loc[:, column] = encoder.fit_transform(label_X_train[column])
    label_X_validate.loc[:, column] = encoder.transform(label_X_validate[column])
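If you're curious what one of these mappings looks like, the encoder left over after the loop is the one fitted on the last entry in good_label_columns (SaleCondition); this just prints its category-to-code dictionary (output not shown):

print(dict(zip(encoder.classes_, encoder.transform(encoder.classes_))))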

Now check how it did.

print("MAE from Approach 2 (Label Encoding):") 
print(f"{score_dataset(label_X_train, label_X_validate, y_train, y_validate):,}")
MAE from Approach 2 (Label Encoding):
17,575.291883561644

So it does a little better than the previous approach of just dropping all the categorical data, but not as well as it did when we imputed the missing numeric values.
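As an aside on the note above about LabelEncoder being meant for targets: sklearn also has an OrdinalEncoder that plays the same role for input features and handles all the columns at once. This is just a sketch of how it could have been used here (I haven't run it, so there's no score to report):

from sklearn.preprocessing import OrdinalEncoder

ordinal_X_train = X_train.drop(bad_label_columns, axis="columns")
ordinal_X_validate = X_validate.drop(bad_label_columns, axis="columns")

# fit on the training columns, then re-use the fitted encoder on validation
ordinal = OrdinalEncoder()
ordinal_X_train[good_label_columns] = ordinal.fit_transform(
    ordinal_X_train[good_label_columns])
ordinal_X_validate[good_label_columns] = ordinal.transform(
    ordinal_X_validate[good_label_columns])

print(f"{score_dataset(ordinal_X_train, ordinal_X_validate, y_train, y_validate):,}")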

Step 3: Investigating cardinality

So far, you've tried two different approaches to dealing with categorical variables. And, you've seen that encoding categorical data yields better results than removing columns from the dataset.

Soon, you'll try one-hot encoding. Before then, there's one additional topic we need to cover. Begin by running the next code cell without changes.

Get the number of unique entries in each column with categorical data.

object_nunique = [X_train[column].nunique() for column in object_columns]

# Print the number of unique entries per column, in descending order
cardinality = pandas.DataFrame(dict(Column=object_columns,
                                    Cardinality=object_nunique)
                     ).sort_values(by="Cardinality", ascending=False)
print(TABLE(cardinality))
| Column        | Cardinality |
|---------------+-------------|
| Neighborhood  |          25 |
| Exterior2nd   |          16 |
| Exterior1st   |          15 |
| SaleType      |           9 |
| Condition1    |           9 |
| HouseStyle    |           8 |
| RoofMatl      |           7 |
| Functional    |           6 |
| Heating       |           6 |
| Foundation    |           6 |
| RoofStyle     |           6 |
| SaleCondition |           6 |
| Condition2    |           6 |
| BldgType      |           5 |
| ExterCond     |           5 |
| LotConfig     |           5 |
| HeatingQC     |           5 |
| MSZoning      |           5 |
| ExterQual     |           4 |
| KitchenQual   |           4 |
| LandContour   |           4 |
| LotShape      |           4 |
| LandSlope     |           3 |
| PavedDrive    |           3 |
| Street        |           2 |
| Utilities     |           2 |
| CentralAir    |           2 |

The output above shows, for each column with categorical data, the number of unique values in the column. For instance, the 'Street' column in the training data has two unique values: 'Grvl' and 'Pave', corresponding to a gravel road and a paved road, respectively.

We refer to the number of unique entries of a categorical variable as the cardinality of that categorical variable. For instance, the 'Street' variable has cardinality 2.

Questions

How many categorical variables in the training data have cardinality greater than 10?

print(len(cardinality[cardinality.Cardinality > 10]))
3

How many columns are needed to one-hot encode the 'Neighborhood' variable in the training data?

print(cardinality[cardinality.Column=="Neighborhood"].Cardinality.iloc[0])
25

For large datasets with many rows, one-hot encoding can greatly expand the size of the dataset. For this reason, we typically will only one-hot encode columns with relatively low cardinality. Then, high cardinality columns can either be dropped from the dataset, or we can use label encoding.

As an example, consider a dataset with 10,000 rows, and containing one categorical column with 100 unique entries.

  • If this column is replaced with the corresponding one-hot encoding, how many entries are added to the dataset?
  • If we instead replace the column with the label encoding, how many entries are added?

print(10000 * 100 - 10000)
990000

One-hot encoding adds 99 columns (the 100 new indicator columns minus the one it replaces), or 990,000 entries; label encoding just swaps the column for a single column of integer codes, so it adds nothing.
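To sanity-check that arithmetic, here's a throwaway toy example (the column and category names are made up) using pandas.get_dummies:

# toy data: 10,000 rows, one categorical column with 100 unique values
toy = pandas.DataFrame({"category": [f"level_{n % 100}" for n in range(10000)]})
one_hot = pandas.get_dummies(toy, columns=["category"])

print(one_hot.shape)            # (10000, 100)
print(one_hot.size - toy.size)  # 990000 more entries than the original column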

Step 4: One-hot encoding

In this step, you'll experiment with one-hot encoding. But, instead of encoding all of the categorical variables in the dataset, you'll only create a one-hot encoding for columns with cardinality less than 10.

Run the code cell below without changes to set low_cardinality_columns to the columns that will be one-hot encoded. Likewise, high_cardinality_columns holds the categorical columns that will be dropped from the dataset.

low_cardinality_columns = cardinality[cardinality.Cardinality < 10].Column
high_cardinality_columns = cardinality[~cardinality.Column.isin(low_cardinality_columns)].Column
print("Categorical columns that will be one-hot encoded:")
for column in low_cardinality_columns:
    print(f" - {column}")

Categorical columns that will be one-hot encoded:

  • SaleType
  • Condition1
  • HouseStyle
  • RoofMatl
  • Functional
  • Heating
  • Foundation
  • RoofStyle
  • SaleCondition
  • Condition2
  • BldgType
  • ExterCond
  • LotConfig
  • HeatingQC
  • MSZoning
  • ExterQual
  • KitchenQual
  • LandContour
  • LotShape
  • LandSlope
  • PavedDrive
  • Street
  • Utilities
  • CentralAir
print('Categorical columns that will be dropped from the dataset:')
for column in high_cardinality_columns:
    print(f" - {column}")

Categorical columns that will be dropped from the dataset:

  • Neighborhood
  • Exterior2nd
  • Exterior1st

Use the next code cell to one-hot encode the data in X_train and X_validate. Set the preprocessed DataFrames to OH_X_train and OH_X_validate, respectively.

  • The full list of categorical columns in the dataset can be found in the Python list object_columns.
  • You should only one-hot encode the categorical columns in low_cardinality_columns. All other categorical columns should be dropped from the dataset.
OH_train = X_train[low_cardinality_columns]
OH_validate = X_validate[low_cardinality_columns]

encoder = OneHotEncoder(sparse=False, handle_unknown="ignore")

# fit the encoder on the training columns, then re-use it on the validation columns
OH_train = pandas.DataFrame(encoder.fit_transform(OH_train))
OH_validate = pandas.DataFrame(encoder.transform(OH_validate))

OH_train.index = X_train.index
OH_validate.index = X_validate.index

X_train_numeric = X_train.drop(object_columns, axis="columns")
X_validate_numeric = X_validate.drop(object_columns, axis="columns")

OH_X_train = pandas.concat([X_train_numeric, OH_train], axis="columns")
OH_X_validate = pandas.concat([X_validate_numeric, OH_validate], axis="columns")
print("MAE from Approach 3 (One-Hot Encoding):") 
print(f"{score_dataset(OH_X_train, OH_X_validate, y_train, y_validate):,}")
MAE from Approach 3 (One-Hot Encoding):
17,429.93404109589

So we've improved slightly, but still haven't done as well as the all-numeric data with imputed values.

Step 5: Generate test predictions and submit your results

After you complete Step 4, if you'd like to use what you've learned to submit your results to the leaderboard, you'll need to preprocess the test data before generating predictions.

To get the imputation working again we need to re-add the numeric columns with missing values (the categorical columns that have missing values stay dropped). I'm also going to encode the entire dataset before splitting, so that everything gets encoded, rather than ignoring the values in the validation set that aren't in the training set.

X_2 = train_data.drop([Data.target], axis="columns")
objects = [column for column in X_2.columns if X_2[column].dtype==object]
missing = [column for column in objects if X_2[column].hasnans]

X_2 = X_2.drop(missing, axis="columns")
OH_X = X_2.drop(high_cardinality_columns, axis="columns").reset_index(drop=True)
for column in low_cardinality_columns:
    encoder = OneHotEncoder(sparse=False)
    encoded = encoder.fit_transform(
        OH_X[column].to_numpy().reshape(-1, 1)
    )
    reencoded = pandas.DataFrame(encoded, columns=encoder.get_feature_names())
    OH_X = pandas.concat([OH_X, reencoded], axis="columns").drop(
        column, axis="columns")

imputer = IterativeImputer(random_state=Data.random_seed)
OH_X = pandas.DataFrame(imputer.fit_transform(OH_X), columns=OH_X.columns)
OH_X_train, OH_X_validate, y_train, y_validate = train_test_split(
    OH_X, y,
    train_size=Data.train_size, test_size=Data.test_size,
    random_state=Data.random_seed)

model = RandomForestRegressor(n_estimators=100, random_state=Data.random_seed)
model.fit(OH_X_train, y_train)

preds_valid = model.predict(OH_X_validate)
error = mean_absolute_error(y_validate, preds_valid)
print(f"MAE: {error:0.2f}")

It seems to have gotten worse… but maybe that's because we tuned the hyperparameters to the numeric-only model.

Hyperparameter Tuning

estimators = list(range(50, 200, 10))
max_depth = list(range(10, 100, 10)) + [None]

grid = dict(n_estimators=estimators,
            max_depth=max_depth)

model = RandomForestRegressor()
search = RandomizedSearchCV(estimator=model,
                            param_distributions=grid,
                            n_iter=40,
                            n_jobs=-1,
                            random_state=1)

search.fit(OH_X_train, y_train)
model = search.best_estimator_
print(f"CV Training R^2: {search.best_score_:0.2f}")
print(f"Training R^2: {model.score(OH_X_train, y_train): 0.2f}")
print(f"Validation R^2: {model.score(OH_X_validate, y_validate):0.2f}")
predictions = model.predict(OH_X_validate)
print(f"Mean Absolute Error: {mean_absolute_error(y_validate, predictions)}")
print(search.best_params_)
CV Training R^2: 0.86
Training R^2:  0.98
Validation R^2: 0.84
Mean Absolute Error: 17615.31526418787
{'n_estimators': 140, 'max_depth': 60}

So it can get a little better, but it doesn't do as well as with just the numeric features. Maybe we don't have enough data to make it work.

permutor = PermutationImportance(model, random_state=Data.random_seed).fit(
    OH_X_validate, y_validate)
ipython_html = eli5.show_weights(
    permutor,
    feature_names=OH_X_validate.columns.tolist())
table = pandas.read_html(ipython_html.data)[0]
print(TABLE(table))
| Weight           | Feature      |
|------------------+--------------|
| 0.4825  ± 0.1218 | OverallQual  |
| 0.1019  ± 0.0350 | GrLivArea    |
| 0.0199  ± 0.0082 | TotalBsmtSF  |
| 0.0171  ± 0.0067 | BsmtFinSF1   |
| 0.0131  ± 0.0062 | 1stFlrSF     |
| 0.0096  ± 0.0078 | GarageCars   |
| 0.0084  ± 0.0011 | 2ndFlrSF     |
| 0.0074  ± 0.0026 | LotArea      |
| 0.0051  ± 0.0022 | YearRemodAdd |
| 0.0050  ± 0.0034 | GarageArea   |
| 0.0048  ± 0.0047 | BedroomAbvGr |
| 0.0036  ± 0.0009 | LotFrontage  |
| 0.0031  ± 0.0021 | YearBuilt    |
| 0.0031  ± 0.0014 | OverallCond  |
| 0.0029  ± 0.0012 | WoodDeckSF   |
| 0.0021  ± 0.0021 | MasVnrArea   |
| 0.0014  ± 0.0016 | OpenPorchSF  |
| 0.0012  ± 0.0009 | FullBath     |
| 0.0009  ± 0.0008 | x0_RM        |
| 0.0008  ± 0.0036 | GarageYrBlt  |
| … 142 more …     | … 142 more … |

It looks like the most significant categorical features are LandContour (Bnk and Lvl), either Condition1 or Condition2 (Norm), and ExterCond (TA) (the encoder's auto-generated x0_ feature names make it hard to tell which source column a dummy came from). From a quick look, they don't seem to contribute a whole lot to the model.
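Step 5 promised test predictions, but we never actually wrote out a submission file. As a hedged sketch of that last step: assuming the test set has been run through the same dropping, one-hot encoding, and imputation pipeline as the training data (call the result OH_X_test, a name that doesn't exist above), it might look something like this:

# OH_X_test is hypothetical: the test data after the same preprocessing as OH_X
test_predictions = model.predict(OH_X_test)

submission = pandas.DataFrame({
    "Id": test_data.index,
    "SalePrice": test_predictions,
})
submission.to_csv("submission.csv", index=False)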

End

This was a brief look at handling categorical data.