Frederick S, Novemsky N, Wang J, Dhar R, Nowlis S. Opportunity cost neglect. Journal of Consumer Research. 2009 Dec 1;36(4):553-61.

Abstract

And

To properly consider the opportunity costs of a purchase, consumers must actively generate the alternatives that it would displace.

But

The current research suggests that consumers often fail to do so. Even under conditions promoting cognitive effort, various cues to consider opportunity costs reduce purchase rates and increase the choice share of more affordable options. Sensitivity to such cues varies with chronic dispositional differences in spending attitudes.

Therefore

We discuss the implications of these results for the marketing strategies of economy and premium brands.

We want to answer the question of whether reminding college students that spending money now deprives them of money in the future will affect their spending behavior.

Imports

# python
from argparse import Namespace
from functools import partial

# pypi
from expects import (
    equal,
    expect,
)
from numpy.random import default_rng
from tabulate import tabulate

import holoviews
import hvplot.pandas
import numpy
import pandas

# my stuff
from graeae import EmbedHoloviews, Timer

\(H_0:\) Null Hypothesis. Reminding students that they can save money for future purchases won't impact their spending decisions.

\(H_A:\) Alternative Hypothesis. Reminding students that they can save money for future purchases will reduce their spending.

I'll use a 5% significance level (95% confidence).

ALPHA = 0.05

The Study Data

150 students were given the following statement:

Imagine that you have been saving some extra money on the side to make some purchases and on your most recent visit to the video store you come across a special sale on a new video. This video is one with your favorite actor or actress, and your favorite type of movie (such as comedy, drama, thriller, etc.). This particular video that you are considering is one you have been thinking about buying for a long time. It is available for a special sale price of $14.99.

What would you do in this situation? Please circle one of the options below.

Video store? Are we going to have to invent a time machine to run this study? No, we'll just swipe the data published by Frederick et al. in 2009. They conducted their study at Arizona State University, which perhaps still had some DVD stores around back then. The study was building on prior work that showed that people focus exclusively on explicitly presented details and ignore the facts that the explicit statements imply but don't state. (Also, I just looked it up and there's quite a few video stores around me (here in Portland) I guess I just haven't been to one in a while).

The options that the subjects were given to circle depended on which group they belonged to. The control group (75 students) were given these options:

A. Buy this entertaining video.
B. Not buy this entertaining video.

The treatment group (also 75 students) were given these options:

A. Buy this entertaining video.
B. Not buy this entertaining video. Keep the $14.99 for other purchases.

# put the groups into a column
plotter = data.reset_index()

# get rid of the totals
del(plotter["Total"])
plotter = plotter.iloc[:-1]

# move the outcome headers into a column
plotter = plotter.melt(id_vars=["index"],
                       value_vars=[Outcome.buy, Outcome.dont_buy])
plotter = plotter.rename(columns=dict(index="Group",
                                      variable="Outcome",
                                      value="Count"))
plot = plotter.hvplot.bar(x="Group", y="Count", by="Outcome").opts(
    title="Buy Or Don't Buy DVD",
    width=Plot.width,
    height=Plot.height,
    color=Plot.color_cycle,
    fontscale=Plot.fontscale,
)
outcome = Embed(plot=plot, file_name="buy_dont_buy")()

print(outcome)

It looks like there was a substantial difference: not only did the majority of the treatment group opt not to buy the DVD, but in the control group a sizable majority did buy it.

Looking at the proportions it looks like quite a bit more didn't buy the DVD, but let's run the experiment and see.

Point Estimate of Effect

The thing we are interested in is whether the wording of the second option to not buy the DVD made a difference, so our statistic of interest is the difference in proportions of the control and treatment group participants who didn't buy the DVD.

\begin{align} \hat{p}_{control} &= \frac{\textrm{Control group that wouldn't buy}}{\textrm{Size of control group}}\\ \hat{p}_{treatment} &= \frac{\textrm{Treatment group that wouldn't buy}}{\textrm{Size of treatment group}}\\ \hat{p} &= \hat{p}_{treatment} - \hat{p}_{control} \end{align}

About twenty percent more of the treatment group said they would abstain from the purchase than the control group.
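The arithmetic behind the point estimate can be sketched directly. The per-group counts here are hypothetical, chosen only to match the roughly twenty-percent difference described above; the real counts come from the study data loaded earlier.

```python
# Hypothetical counts (the real ones come from the study data),
# chosen to match the roughly twenty-percent difference described above.
control_total = treatment_total = 75
control_dont_buy = 15      # assumed: 20% of the control group
treatment_dont_buy = 30    # assumed: 40% of the treatment group

p_control = control_dont_buy / control_total
p_treatment = treatment_dont_buy / treatment_total
point_estimate = p_treatment - p_control
print(f"Point estimate: {point_estimate:0.2f}")
```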

Simulating Random Chance

In a previous post looking at Gender Discrimination Inference I split the Urn into a 50-50 split of males and females to see if there was gender bias in choosing whether to promote them. In that case we didn't have a control group, but here we do, so in this case it's going to work a little differently.

We are asking here whether our treatment had an effect. If it didn't, then we would expect that the distribution of "buy" and "don't buy" that we saw represents the distribution of the underlying population of ASU students, so we need to fill our urn to match the totals in the "Buy" and "Don't Buy" columns and then randomly split it into two equal groups. If the difference we saw was the result of random chance, then we would expect the difference between the two simulated groups to be near zero most of the time. This doesn't seem as intuitive a way to set it up as the previous method, but we'll see how it goes.
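A minimal sketch of that urn procedure, using the standard library's random module rather than the numpy generator the post imports, and the same hypothetical don't-buy totals as above (45 of the 150 students):

```python
import random

def simulate(buyers: int, non_buyers: int, trials: int = 10_000,
             seed: int = 2020) -> list:
    """Simulate the urn experiment described above.

    The urn holds one ball per subject (1 = don't buy, 0 = buy); each
    trial shuffles the urn, deals it into two equal groups, and records
    the difference in the groups' "don't buy" proportions.
    """
    rng = random.Random(seed)
    urn = [1] * non_buyers + [0] * buyers
    half = len(urn) // 2
    differences = []
    for _ in range(trials):
        rng.shuffle(urn)
        control, treatment = urn[:half], urn[half:]
        differences.append(sum(treatment) / half - sum(control) / half)
    return differences

# Hypothetical totals: 45 "don't buy" across the two groups of 75 each.
differences = simulate(buyers=105, non_buyers=45)
```

Under the null hypothesis the two halves are interchangeable, so the simulated differences should pile up around zero.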

The distribution appears to be reasonably normal, if a bit peaked in the middle.

matched = len(simulation[simulation["Point Estimate"] >= POINT_ESTIMATE])/len(simulation)
print(f"Percent of trials >= Point Estimate of Study: {100 * matched:0.3f} %")

Percent of trials >= Point Estimate of Study: 0.817 %

According to our simulation, less than one percent of the time we would see a difference like the study found by random chance alone.

Test Our Hypothesis

print(f"Reject the Null Hypothesis: {matched < ALPHA}")

Reject the Null Hypothesis: True

Since the proportion of trials that matched or exceeded our point estimate (about 0.008) is below our significance level of 0.05, we reject the Null Hypothesis and conclude that telling students about the opportunity cost of buying a DVD does have an effect in discouraging them from buying it.

End

So here we have a walkthrough of inference using simulation and an experiment with Control and Treatment groups. Although the conclusion reached is that reminding students of the money they would have in the future if they didn't spend it is "causal", since this was an experiment, I'm not 100% convinced that asking students what they would do in a hypothetical scenario tells us what they would actually do.

In the Abstract

Do you have control and treatment groups?

Create an urn where the ones and zeros are equal to the totals for each of the outcomes

Sample from the urn a simulated control group and treatment group

Find the difference between the proportion of the control group and the treatment group that match the "success" outcome

Calculate the fraction of the differences that are equal to or greater than the differences in the study

Check if the fraction in the simulation meets your confidence level

# python
from argparse import Namespace
from functools import partial

# pypi
from expects import (
    equal,
    expect,
)
from numpy.random import default_rng
from tabulate import tabulate

import holoviews
import hvplot.pandas
import numpy
import pandas

# my stuff
from graeae import EmbedHoloviews, Timer

This starts with a study where bank managers were each given a personnel file and asked to decide if they would promote the person represented by the file to branch manager. The files were identical except that half of them were filled out as male and half as female. The researchers wanted to know if these managers were biased in favor of men. The table below shows their results.

Let's look at the values (without the totals) as a plot.

plotter = study.reset_index()
del(plotter["Total"])
plotter = plotter.iloc[:-1]
plotter = plotter.melt(id_vars=["index"],
                       value_vars=["Promoted", "Not Promoted"])
plotter = plotter.rename(columns=dict(index="Gender",
                                      value="Count",
                                      variable="Decision"))
plot = plotter.hvplot.bar(x="Gender", y="Count", by="Decision").opts(
    title="Outcome by Gender",
    height=Plot.height,
    width=Plot.width,
    fontscale=Plot.fontscale,
    color=Plot.color_cycle,
)
outcome = Embed(plot=plot, file_name="promotions_by_gender")()

print(outcome)

It looks like a considerable majority of the males were promoted whereas for the females only a slight majority were promoted.

plot = plotter.hvplot.bar(x="Decision", y="Count", by="Gender").opts(
    title="Male Vs Female by Decision",
    height=Plot.height,
    width=Plot.width,
    fontscale=Plot.fontscale,
    color=Plot.color_cycle,
)
outcome = Embed(plot=plot, file_name="decision_by_gender")()

print(outcome)

This plot doesn't make the disparity look quite as extreme as the previous plot did, I think.

About the Variables

I started using them without explicitly stating it, but we have two variables here: Promoted and Not Promoted are values of the decision variable, and Male and Female are values of the gender variable.

Anyway… so more males were chosen for promotion than females, but by what proportion?

print(TABLE(study/study.loc["Total"]))

|        | Promoted | Not Promoted | Total |
|--------|----------|--------------|-------|
| Male   | 0.6      | 0.230769     | 0.5   |
| Female | 0.4      | 0.769231     | 0.5   |
| Total  | 1        | 1            | 1     |

So 60% of the promoted were male and 40% of the promoted were female.

print(f"{.6/.4:.2f}")

1.50

If you were promoted, you were one and a half times as likely to be male as female. Another way to look at it is to ask: what proportion of each gender was promoted?

There was a 30% difference between the rate of male promotion and the rate of female promotion. The question we have now is - could this difference have happened by random chance or is this evidence of bias?.
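The promotion-rate difference can be worked out from counts implied by the figures elsewhere in this post (24 files of each gender, 35 people promoted in total, and a point estimate of 0.29 together imply 21 men and 14 women promoted):

```python
MALES = FEMALES = 24           # identical files, half marked male, half female
promoted_males = 21            # implied by 35 total promoted and the 0.29 difference
promoted_females = 14

p_male = promoted_males / MALES        # 0.875
p_female = promoted_females / FEMALES  # about 0.583
print(f"Difference in promotion rates: {p_male - p_female:0.2f}")
```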

The Experiment

We have a point estimate of the difference of 0.29 - is this evidence of bias?

Our Hypotheses

\(H_0\): Null Hypothesis. The variables gender and decision are independent and the observed difference was due to chance.

\(H_A\): Alternative Hypothesis. The variables gender and decision are not independent and the difference was not due to chance, but rather women were less likely to be promoted than men.

I'm going to pick a (somewhat arbitrary) confidence level of 0.95, giving a significance level of 0.05.

ALPHA = 0.05

The Simulation

The basic method here is we'll create an "urn" with an equal number of "balls" for men and women (24 of each in this case), then randomly select 35 balls, representing the number that were promoted, and find the difference between the fraction of males promoted and the fraction of females promoted. To make the math simple I'll run it a number of times that is a power of 10.

Some Setup

To make the counting easier I'll set males to be 1 and females to be 0 (so the number of males promoted is the sum and the females is the remainder).
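A standard-library sketch of the simulation just described (the post itself uses numpy's default_rng; the seed here is arbitrary):

```python
import random

TRIALS = 10_000
rng = random.Random(2020)
urn = [1] * 24 + [0] * 24           # 1 = male, 0 = female
point_estimates = []
for _ in range(TRIALS):
    promoted = rng.sample(urn, 35)  # draw the 35 promotions without replacement
    males_promoted = sum(promoted)  # males are 1s, so the sum counts them
    females_promoted = 35 - males_promoted
    point_estimates.append(males_promoted / 24 - females_promoted / 24)

# fraction of trials at least as extreme as the study's 0.29 difference
proportion = sum(1 for p in point_estimates if p >= 0.29) / TRIALS
```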

plot = data.hvplot.hist("Point Estimate").opts(
    title="Distribution of Differences in Gender Promotion",
    width=Plot.width,
    height=Plot.height,
    fontscale=Plot.fontscale,
)
outcome = Embed(plot=plot, file_name="difference_distribution")()

print(outcome)

The distribution looks mostly normal. As you might guess, the distribution is centered around 0, the cases where roughly the same number of males and females were promoted (although, since an odd number of people, 35, are promoted in every trial, the split is never even, so no trial is exactly 0). The cases where the difference in proportion is as great as or greater than it was in the original study are in the rightmost two bins.

I should note that because the sample size is so small, there's a weird distribution of the points - there's not enough variation in the differences to make a smooth curve (thus the gaps in the histogram).

But anyway, what proportion of our simulations had as much or more of a difference than that found in the study?

print(f"Probability of our study's difference in promotion between genders by chance alone: {proportion:0.2f}.")
print(f"Our tolerance was {ALPHA:0.2f}.")

Probability of our study's difference in promotion between genders by chance alone: 0.02.
Our tolerance was 0.05.

So we reject the null hypothesis and conclude that the difference between the promotion rates for men and women in the original study is statistically significant, consistent with gender bias.

End

Well, that was the replication (sort of) of this problem from Introductory Statistics with Randomization and Simulation. The point of it was to both review Hypothesis Testing and see how it can be done using simulation rather than a parametric test.

In Abstract

Is your problem that you suspect some kind of bias in outcomes for two groups?

Get the Point Estimate for the value you want to test

State the Null Hypothesis that what happened could happen by chance and the Alternative Hypothesis that there was bias involved.

Decide on a tolerance level for the probability that it happened by chance.

Set up your urn

Simulate random selections for a large number of trials.

Calculate the proportion of the trials that were greater than the original study's Point Estimate.

Make a conclusion whether the original outcome could have happened by random chance or not.

This is a look at pulling Oregon Covid-19 data from their weekly update PDFs. There are datasets for Oregon Covid-19 cases published by the Oregon Health Authority but for some reason I can't find the raw data sources matching what they have in their weekly reports so I thought maybe instead of just endlessly searching I'd pull them from the PDFs themselves.

This is where I'm going to save the PDFs. To make it portable I usually keep the path in a file that dotenv loads as an environment variable. This way, if things get moved around, I can edit the file to change the path but the code that uses it doesn't have to change.

I could only find a link to the current PDF on their page (as of August 4, 2020). The URL embeds the date in it so to work backwards (and maybe later forwards) we need an easy way to set it. Here's what the URL looks like.

So, it looks like it came out on a Wednesday, which I'm going to assume is going to be the same every week. Now we can find the most recent Wednesday, which will be the start (or end, depending on how you look at it) of our date range.

Now I'm going to work backwards one week at a time to create the URLS and file-names. Using the relativedelta makes it easy-peasey to get prior weeks from a given date. Here's the Wednesday before the most recent one.
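The same two steps (find the most recent Wednesday, then step back a week) can be sketched with just the standard library instead of relativedelta; the August 14, 2020 date below is only an example anchor:

```python
import datetime

def most_recent_wednesday(today: datetime.date) -> datetime.date:
    """Walk back from `today` to the most recent Wednesday (inclusive)."""
    # Monday is weekday 0, so Wednesday is weekday 2.
    return today - datetime.timedelta(days=(today.weekday() - 2) % 7)

ONE_WEEK = datetime.timedelta(weeks=1)
LAST = most_recent_wednesday(datetime.date(2020, 8, 14))  # a Friday
PREVIOUS = LAST - ONE_WEEK
print(LAST, PREVIOUS)
```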

Now, I don't actually know when these PDFs started so I'm just going to work backwards until I get an error message from my HTTP request and then stop.

FILE_NAME = "COVID-19-Weekly-Report-{}-FINAL.pdf"
BASE_URL = ("https://www.oregon.gov/oha/PH/DISEASESCONDITIONS/DISEASESAZ/"
            "Emerging%20Respitory%20Infections/{}")

for week in range(20):
    print(f"Checking back {week} weeks from this past Wednesday")
    date = LAST - (one_week * week)
    filename = FILE_NAME.format(date)
    output_path = FOLDER/filename
    if not output_path.is_file():
        url = BASE_URL.format(filename)
        response = requests.get(url)
        if not response.ok:
            print(f"Bad week: {week}\tDate: {date}")
            break
        print(f"Saving {filename}")
        with output_path.open("wb") as writer:
            writer.write(response.content)

Checking back 0 weeks from this past Wednesday
Saving COVID-19-Weekly-Report-2020-08-12-FINAL.pdf
Checking back 1 weeks from this past Wednesday
Saving COVID-19-Weekly-Report-2020-08-05-FINAL.pdf
Checking back 2 weeks from this past Wednesday
Saving COVID-19-Weekly-Report-2020-07-29-FINAL.pdf
Checking back 3 weeks from this past Wednesday
Saving COVID-19-Weekly-Report-2020-07-22-FINAL.pdf
Checking back 4 weeks from this past Wednesday
Saving COVID-19-Weekly-Report-2020-07-15-FINAL.pdf
Checking back 5 weeks from this past Wednesday
Saving COVID-19-Weekly-Report-2020-07-08-FINAL.pdf
Checking back 6 weeks from this past Wednesday
Saving COVID-19-Weekly-Report-2020-07-01-FINAL.pdf
Checking back 7 weeks from this past Wednesday
Saving COVID-19-Weekly-Report-2020-06-24-FINAL.pdf
Checking back 8 weeks from this past Wednesday
Saving COVID-19-Weekly-Report-2020-06-17-FINAL.pdf
Checking back 9 weeks from this past Wednesday
Saving COVID-19-Weekly-Report-2020-06-10-FINAL.pdf
Checking back 10 weeks from this past Wednesday
Saving COVID-19-Weekly-Report-2020-06-03-FINAL.pdf
Checking back 11 weeks from this past Wednesday
Saving COVID-19-Weekly-Report-2020-05-27-FINAL.pdf
Checking back 12 weeks from this past Wednesday
Bad week: 12 Date: 2020-05-20

So, there are twelve weeks of PDFs going back to May 27, 2020. Which seems a little short, given that Oregon started telling people to stay home in March, but maybe they have the data somewhere else. Anyway, next up is pulling the data from the PDFs from the files using tabula-py.

Most people find target leakage very tricky until they've thought about it for a long time.

So, before trying to think about leakage in the housing price example, we'll go through a few examples in other applications. Things will feel more familiar once you come back to a question about house prices.

1. The Data Science of Shoelaces

Nike has hired you as a data science consultant to help them save money on shoe materials. Your first assignment is to review a model one of their employees built to predict how many shoelaces they'll need each month. The features going into the machine learning model include:

The current month (January, February, etc)

Advertising expenditures in the previous month

Various macroeconomic features (like the unemployment rate) as of the beginning of the current month

The amount of leather they ended up using in the current month

The results show the model is almost perfectly accurate if you include the feature about how much leather they used. But it is only moderately accurate if you leave that feature out. You realize this is because the amount of leather they use is a perfect indicator of how many shoes they produce, which in turn tells you how many shoelaces they need.

Do you think the leather used feature constitutes a source of data leakage? If your answer is "it depends," what does it depend on?

leather_used does seem like leakage, but it depends on whether the value is known before the shoelace predictions are made. If you won't know it in time for the predictions, then it is a leak. If, for some reason, you always knew the amount of leather used but not how many shoes were made, it might not technically be a leak, though it's hard to see how that situation would arise.

2. Return of the Shoelaces

You have a new idea. You could use the amount of leather Nike ordered (rather than the amount they actually used) leading up to a given month as a predictor in your shoelace model.

Does this change your answer about whether there is a leakage problem? If you answer "it depends," what does it depend on?

Whether it is a leak will depend on whether the leather is always ordered before the shoelaces or not. If they are always ordered before shoelaces then it wouldn't be a leak.

3. Getting Rich With Cryptocurrencies?

You saved Nike so much money that they gave you a bonus. Congratulations.

Your friend, who is also a data scientist, says he has built a model that will let you turn your bonus into millions of dollars. Specifically, his model predicts the price of a new cryptocurrency (like Bitcoin, but a newer one) one day ahead of the moment of prediction. His plan is to purchase the cryptocurrency whenever the model says the price of the currency (in dollars) is about to go up.

The most important features in his model are:

Current price of the currency

Amount of the currency sold in the last 24 hours

Change in the currency price in the last 24 hours

Change in the currency price in the last 1 hour

Number of new tweets in the last 24 hours that mention the currency

The value of the cryptocurrency in dollars has fluctuated up and down by over $100 in the last year, and yet his model's average error is less than $1. He says this is proof his model is accurate, and you should invest with him, buying the currency whenever the model says it is about to go up.

Is he right? If there is a problem with his model, what is it?

There's no data leakage here: all of these features are measured before the moment of prediction, so they would be available at prediction time. The problem lies elsewhere, in what his accuracy numbers actually mean.

4. Preventing Infections

An agency that provides healthcare wants to predict which patients from a rare surgery are at risk of infection, so it can alert the nurses to be especially careful when following up with those patients.

You want to build a model. Each row in the modeling dataset will be a single patient who received the surgery, and the prediction target will be whether they got an infection.

Some surgeons may do the procedure in a manner that raises or lowers the risk of infection. But how can you best incorporate the surgeon information into the model?

You have a clever idea.

Take all surgeries by each surgeon and calculate that surgeon's infection rate.

For each patient in the data, find out who the surgeon was and plug in that surgeon's average infection rate as a feature.

Does this pose any target leakage issues? Does it pose any train-test contamination issues?

The infection rate would have a target leak if the calculated value includes the patient whose row it is added to.

You would have train-test contamination if you calculated this value using both the train and test set. You would have to calculate it only on the training set to avoid contamination.
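One way to avoid the first leak is a leave-one-out version of the surgeon feature: compute each surgeon's rate while excluding the current patient's own outcome. The records below are hypothetical, and the sketch assumes every surgeon has at least two patients:

```python
# Hypothetical records: (surgeon, infected) pairs for illustration only.
records = [
    ("a", 1), ("a", 0), ("a", 1),
    ("b", 0), ("b", 0),
]

# Per-surgeon totals: (patient count, infection count).
totals = {}
for surgeon, infected in records:
    count, infections = totals.get(surgeon, (0, 0))
    totals[surgeon] = (count + 1, infections + infected)

# Leave-one-out rate: subtract the current patient's outcome so the
# feature never encodes that patient's own target (the leak).
features = []
for surgeon, infected in records:
    count, infections = totals[surgeon]
    features.append((infections - infected) / (count - 1))

print(features)
```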

5. Housing Prices

You will build a model to predict housing prices. The model will be deployed on an ongoing basis, to predict the price of a new house when a description is added to a website. Here are four features that could be used as predictors.

Size of the house (in square meters)

Average sales price of homes in the same neighborhood

Latitude and longitude of the house

Whether the house has a basement

You have historic data to train and validate the model.

Which of the features is most likely to be a source of leakage?

Average sales price of homes in the same neighborhood. If the home was sold in the past, then it would contribute to the average.

Leakage is a hard and subtle issue. You should be proud if you picked up on the issues in these examples.

Now you have the tools to make highly accurate models, and pick up on the most difficult practical problems that arise with applying these models to solve real problems.

In this step, you'll build and train your first model with gradient boosting.

Begin by setting my_model_1 to an XGBoost model. Use the XGBRegressor class, and set the random seed to 0 (random_state=0). Leave all other parameters as default.

Then, fit the model to the training data in X_train and y_train.

model=XGBRegressor(random_state=Data.random_seed)

model.fit(X_train,y_train)

predictions_1=model.predict(X_validate)

Finally, use the mean_absolute_error() function to calculate the mean absolute error (MAE) corresponding to the predictions for the validation set. Recall that the labels for the validation data are stored in y_valid.
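The metric itself is simple enough to spell out by hand; this is the quantity sklearn's mean_absolute_error computes (the numbers here are made up for illustration):

```python
def mean_absolute_error_by_hand(y_true, y_pred):
    """Mean of the absolute errors: what sklearn's mean_absolute_error returns."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

# (10 + 10 + 30) / 3
print(mean_absolute_error_by_hand([100, 200, 300], [110, 190, 330]))
```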

Now that you've trained a default model as a baseline, it's time to tinker with the parameters to see if you can get better performance.

Begin by setting my_model_2 to an XGBoost model, using the XGBRegressor class. Use what you learned in the previous tutorial to figure out how to change the default parameters (like n_estimators and learning_rate) to get better results.

Then, fit the model to the training data in X_train and y_train.

Set predictions_2 to the model's predictions for the validation data. Recall that the validation features are stored in X_valid.

Finally, use the mean_absolute_error() function to calculate the mean absolute error (MAE) corresponding to the predictions on the validation set. Recall that the labels for the validation data are stored in y_valid.

estimators = list(range(50, 200, 10))
max_depth = list(range(10, 100, 10)) + [None]
learning_rate = 0.05 * numpy.array(range(1, 10))
grid = dict(n_estimators=estimators,
            max_depth=max_depth)  # learning_rate=learning_rate
model = XGBRegressor(random_state=Data.random_seed, learning_rate=0.05)
search = RandomizedSearchCV(estimator=model,
                            param_distributions=grid,
                            n_iter=40,
                            scoring="neg_mean_absolute_error",
                            n_jobs=-1,
                            random_state=1)
X_cv = pandas.concat([X_train, X_validate])
y_cv = pandas.concat([y_train, y_validate])
with TIMER:
    search.fit(X_cv, y_cv)
first_model = search.best_estimator_
print(f"CV Training MAE: {-search.best_score_:0.2f}")
print(search.best_params_)

In this step, you will create a model that performs worse than the original model in Step 1. This will help you to develop your intuition for how to set parameters. You might even find that you accidentally get better performance, which is ultimately a nice problem to have and a valuable learning experience!

Begin by setting my_model_3 to an XGBoost model, using the XGBRegressor class. Use what you learned in the previous tutorial to figure out how to change the default parameters (like n_estimators and learning_rate) to design a model to get high MAE.

Then, fit the model to the training data in X_train and y_train.

Set predictions_3 to the model's predictions for the validation data. Recall that the validation features are stored in X_valid.

Finally, use the mean_absolute_error() function to calculate the mean absolute error (MAE) corresponding to the predictions on the validation set. Recall that the labels for the validation data are stored in y_valid.

This one got a score of 14976.55345, so the early-stopping model is the best one so far… It had fewer trees than the model that RandomizedSearchCV ended up with; maybe the random search overfit the data.

So far, you've learned how to build pipelines with scikit-learn. For instance, the pipeline below will use SimpleImputer() to replace missing values in the data, before using RandomForestRegressor() to train a random forest model to make predictions. We set the number of trees in the random forest model with the n_estimators parameter, and setting random_state ensures reproducibility.

You have also learned how to use pipelines in cross-validation. The code below uses the cross_val_score() function to obtain the mean absolute error (MAE), averaged across five different folds. Recall we set the number of folds with the cv parameter.

# Multiply by -1 since sklearn calculates *negative* MAE
scores = -1 * cross_val_score(pipeline, X, y,
                              cv=5,
                              scoring='neg_mean_absolute_error')
print("Average MAE score:", scores.mean())

Average MAE score: 18276.410356164386

Step 1: Write a useful function

In this exercise, you'll use cross-validation to select parameters for a machine learning model.

Begin by writing a function get_score() that reports the average (over three cross-validation folds) MAE of a machine learning pipeline that uses:

the data in X and y to create folds,

SimpleImputer() (with all parameters left as default) to replace missing values, and

RandomForestRegressor() (with random_state=0) to fit a random forest model.

The n_estimators parameter supplied to get_score() is used when setting the number of trees in the random forest model.

def get_score(n_estimators):
    """Return the average MAE over 3 CV folds of a random forest model.

    Args:
     n_estimators: the number of trees in the forest
    """
    pipeline = Pipeline(steps=[
        ('preprocessor', SimpleImputer()),
        ('model', RandomForestRegressor(n_estimators=n_estimators,
                                        random_state=Data.random_seed))])
    scores = -1 * cross_val_score(pipeline, X, y,
                                  cv=3,
                                  scoring='neg_mean_absolute_error')
    return scores.mean()

Step 2: Test different parameter values

Now, you will use the function that you defined in Step 1 to evaluate the model performance corresponding to eight different values for the number of trees in the random forest: 50, 100, 150, …, 300, 350, 400. Store your results in a Python dictionary results, where results[i] is the average MAE returned by get_score(i).
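The sweep itself is a one-liner. Since the real get_score needs the course data, this self-contained sketch uses a stand-in scoring function with a hypothetical MAE curve (a minimum at 200 trees, matching what the plot below shows):

```python
# Stand-in for the pipeline-based get_score from Step 1: a hypothetical
# MAE curve with its minimum at 200 trees, for illustration only.
def get_score(n_estimators: int) -> float:
    return 18000 + (n_estimators - 200) ** 2 / 100

# Evaluate 50, 100, ..., 400 trees and keep the MAE for each.
results = {i: get_score(i) for i in range(50, 450, 50)}
best = min(results, key=results.get)
print(f"Best tree count: {best} (MAE {results[best]:0.1f})")
```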

plot = results_frame.hvplot(x="Trees", y="MAE").opts(
    title="Cross-Validation Mean Absolute Error",
    width=Plot.width,
    height=Plot.height)
source = Embed(plot=plot, file_name="mean_absolute_error")()

print(source)


200 appears to be the best number of trees for our forest.

# Select categorical columns with relatively low cardinality (convenient but arbitrary)
categorical_columns = [column for column in X_train.columns
                       if X_train[column].nunique() < 10
                       and X_train[column].dtype == object]

# Select numerical columns
numerical_columns = [column for column in X_train.columns
                     if X_train[column].dtype in ['int64', 'float64']]

# Keep selected columns only
columns = categorical_columns + numerical_columns
X_train = X_train[columns].copy()
X_validate = X_validate[columns].copy()
X_test = X_test[columns].copy()

Middle

Preprocess Data and Train the Model

The missing numeric values will be filled in with a simple imputer. When the strategy is set to constant then it will fill missing values with a single value (which is 0 by default).

We know that there's missing data, but since this is about handling categorical data, not missing data, we'll just drop the columns that have missing values.


Notice that the dataset contains both numerical and categorical variables. You'll need to encode the categorical data before training a model.

Score Dataset

This is the same function used in the missing-values tutorial. It's used to compare different models' Mean Absolute Error (MAE) as we make changes.

The first approach is to just drop all the non-numeric columns.

columns = [column for column in X_train.columns
           if X_train[column].dtype != object]
drop_X_train = X_train[columns]
drop_X_validate = X_validate[columns]
print("MAE from Approach 1 (Drop categorical variables):")
print(f"{score_dataset(drop_X_train, drop_X_validate, y_train, y_validate):,}")

Using all the numeric columns does better than we did with our initial subset of columns (20,928.5), but not as well as we did with imputed values (16,656.3).

Step 2: Label encoding

Before jumping into label encoding, we'll investigate the dataset. Specifically, we'll look at the 'Condition2' column. The code cell below prints the unique entries in both the training and validation sets.

It looks like the validation data has values that aren't in the training data (and vice versa), e.g. RRNn, so encoding the training set won't work with the validation set.

This is a common problem that you'll encounter with real-world data, and there are many approaches to fixing this issue. For instance, you can write a custom label encoder to deal with new categories. The simplest approach, however, is to drop the problematic categorical columns.
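One way such a custom encoder could work is to map categories unseen during training to a sentinel code. This is a sketch of the idea, not sklearn's API; the class name and example categories are made up:

```python
class SafeLabelEncoder:
    """Label-encode strings, mapping categories unseen in training to -1."""

    def fit(self, values):
        # One integer code per distinct training category, in sorted order.
        self.mapping = {value: code
                        for code, value in enumerate(sorted(set(values)))}
        return self

    def transform(self, values):
        # Unseen categories fall back to the sentinel -1.
        return [self.mapping.get(value, -1) for value in values]

encoder = SafeLabelEncoder().fit(["Feedr", "Norm", "PosN"])
print(encoder.transform(["Norm", "RRNn", "Feedr"]))
```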

Run the code cell below to save the problematic columns to a Python list bad_label_cols. Likewise, columns that can be safely label encoded are stored in good_label_cols.

# All categorical columns
object_columns = [column for column in X_train.columns
                  if X_train[column].dtype == "object"]

# Columns that can be safely label encoded
good_label_columns = [column for column in object_columns
                      if set(X_train[column]) == set(X_validate[column])]

# Problematic columns that will be dropped from the dataset
bad_label_columns = list(set(object_columns) - set(good_label_columns))

print('Categorical columns that will be label encoded:')
for column in good_label_columns:
    print(f" - {column}")

print('\nCategorical columns that will be dropped from the dataset:')
for column in bad_label_columns:
    print(f" - {column}")

Categorical columns that will be label encoded:
 - MSZoning
 - Street
 - LotShape
 - LandContour
 - LotConfig
 - BldgType
 - HouseStyle
 - ExterQual
 - CentralAir
 - KitchenQual
 - PavedDrive
 - SaleCondition

Categorical columns that will be dropped from the dataset:

Note: Sklearn's documentation says that this is meant only for categorical target data (the labels), not the input data like we're doing here. Later on we're going to use one-hot-encoding, which is what sklearn recommends (the LabelEncoder method implies that the numbers are values, not just numeric codes for strings).

Label encoding assigns a distinct integer to each of the unique values in each column.
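As a minimal, self-contained sketch of that step (the toy data and the variable names X_train, X_validate, and good_label_columns here are invented stand-ins for the notebook's real variables; the encoder is fit on the training column only, then applied to both sets):

```python
import pandas
from sklearn.preprocessing import LabelEncoder

# invented stand-ins for the real training and validation frames
X_train = pandas.DataFrame({"Street": ["Grvl", "Pave", "Pave"]})
X_validate = pandas.DataFrame({"Street": ["Pave", "Grvl", "Grvl"]})
good_label_columns = ["Street"]

label_X_train = X_train.copy()
label_X_validate = X_validate.copy()

encoder = LabelEncoder()
for column in good_label_columns:
    # fit on the training column, then transform both sets with the same codes
    label_X_train[column] = encoder.fit_transform(X_train[column])
    label_X_validate[column] = encoder.transform(X_validate[column])

print(label_X_train["Street"].tolist())     # [0, 1, 1]
print(label_X_validate["Street"].tolist())  # [1, 0, 0]
```

Because the encoder is fit only on the training column, a validation value it never saw would raise an error — which is exactly why the bad_label_columns were dropped first.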

print("MAE from Approach 2 (Label Encoding):")
print(f"{score_dataset(label_X_train, label_X_validate, y_train, y_validate):,}")

MAE from Approach 2 (Label Encoding):
17,575.291883561644

So label encoding does a little better than the previous approach of dropping all the categorical columns, but not as well as imputing the missing numeric values did.

Step 3: Investigating cardinality

So far, you've tried two different approaches to dealing with categorical variables. And, you've seen that encoding categorical data yields better results than removing columns from the dataset.

Soon, you'll try one-hot encoding. Before then, there's one additional topic we need to cover. Begin by running the next code cell without changes.

Get number of unique entries in each column with categorical data

object_nunique = [X_train[column].nunique() for column in object_columns]

# Print the number of unique entries by column, in descending order
cardinality = pandas.DataFrame(
    dict(Column=object_columns, Cardinality=object_nunique)
).sort_values(by="Cardinality", ascending=False)
print(TABLE(cardinality))

| Column        | Cardinality |
|---------------|-------------|
| Neighborhood  | 25          |
| Exterior2nd   | 16          |
| Exterior1st   | 15          |
| SaleType      | 9           |
| Condition1    | 9           |
| HouseStyle    | 8           |
| RoofMatl      | 7           |
| Functional    | 6           |
| Heating       | 6           |
| Foundation    | 6           |
| RoofStyle     | 6           |
| SaleCondition | 6           |
| Condition2    | 6           |
| BldgType      | 5           |
| ExterCond     | 5           |
| LotConfig     | 5           |
| HeatingQC     | 5           |
| MSZoning      | 5           |
| ExterQual     | 4           |
| KitchenQual   | 4           |
| LandContour   | 4           |
| LotShape      | 4           |
| LandSlope     | 3           |
| PavedDrive    | 3           |
| Street        | 2           |
| Utilities     | 2           |
| CentralAir    | 2           |

The output above shows, for each column with categorical data, the number of unique values in the column. For instance, the 'Street' column in the training data has two unique values: 'Grvl' and 'Pave', corresponding to a gravel road and a paved road, respectively.

We refer to the number of unique entries of a categorical variable as the cardinality of that categorical variable. For instance, the 'Street' variable has cardinality 2.

Questions

How many categorical variables in the training data have cardinality greater than 10?

For large datasets with many rows, one-hot encoding can greatly expand the size of the dataset. For this reason, we typically will only one-hot encode columns with relatively low cardinality. Then, high cardinality columns can either be dropped from the dataset, or we can use label encoding.

As an example, consider a dataset with 10,000 rows, and containing one categorical column with 100 unique entries.

If this column is replaced with the corresponding one-hot encoding, how many entries are added to the dataset?

If we instead replace the column with the label encoding, how many entries are added?

print(10000 * 100 - 10000)

990000
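To make the arithmetic concrete, here is a small invented example (6 rows, one categorical column with 3 unique values) that counts the entries each approach adds, using pandas.get_dummies for the one-hot version and category codes for the label version:

```python
import pandas

# invented toy data: 6 rows, one categorical column with 3 unique values
frame = pandas.DataFrame({"color": ["red", "blue", "green",
                                    "red", "blue", "red"]})

# one-hot: rows * unique_values entries, minus the rows of the
# original column that gets removed
one_hot = pandas.get_dummies(frame["color"])
added_by_one_hot = one_hot.size - len(frame)
print(added_by_one_hot)  # 6 * 3 - 6 = 12

# label encoding replaces the column with a single column of codes,
# so it adds no entries at all
labels = frame["color"].astype("category").cat.codes
added_by_labels = labels.size - len(frame)
print(added_by_labels)  # 0
```

Scaled up to the question's 10,000 rows and 100 unique entries, the same formula gives the 990,000 entries printed above for one-hot encoding, and 0 for label encoding.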

Step 4: One-hot encoding

In this step, you'll experiment with one-hot encoding. But, instead of encoding all of the categorical variables in the dataset, you'll only create a one-hot encoding for columns with cardinality less than 10.

Run the code cell below without changes to set low_cardinality_cols to a Python list containing the columns that will be one-hot encoded. Likewise, high_cardinality_cols contains a list of categorical columns that will be dropped from the dataset.
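The selection code itself isn't reproduced above; here is a minimal sketch of it under stated assumptions — the data is invented, the names low_cardinality_cols and high_cardinality_cols match the text, and the cardinality threshold of 10 comes from the step description:

```python
import pandas

# invented stand-in: one low-cardinality column, one with 12 unique values
X_train = pandas.DataFrame({
    "Street": ["Grvl", "Pave"] * 6,
    "Neighborhood": [f"N{index}" for index in range(12)],
})

object_columns = [column for column in X_train.columns
                  if X_train[column].dtype == "object"]

# columns that will be one-hot encoded
low_cardinality_cols = [column for column in object_columns
                        if X_train[column].nunique() < 10]

# high-cardinality columns that will be dropped from the dataset
high_cardinality_cols = list(set(object_columns) - set(low_cardinality_cols))

print(low_cardinality_cols)   # ['Street']
print(high_cardinality_cols)  # ['Neighborhood']
```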

print("MAE from Approach 3 (One-Hot Encoding):")
print(f"{score_dataset(OH_X_train, OH_X_validate, y_train, y_validate):,}")

MAE from Approach 3 (One-Hot Encoding):
17,429.93404109589

So we've improved slightly again, but still not as much as the all-numeric approach with imputed values.

Step 5: Generate test predictions and submit your results

After you complete Step 4, if you'd like to use what you've learned to submit your results to the leaderboard, you'll need to preprocess the test data before generating predictions.

To get the imputation working again, we need to re-add the columns with missing values. I'm also going to encode the entire dataset before splitting, so that everything is encoded rather than ignoring the values in the validation set that aren't in the training set.
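A sketch of that "encode before splitting" idea, with pandas.get_dummies standing in for the actual encoding and the data invented for illustration: because the whole frame is encoded first, the train and validation halves are guaranteed to end up with identical columns.

```python
import pandas

# invented toy data: one categorical column, one numeric column
data = pandas.DataFrame({
    "LandContour": ["Bnk", "Lvl", "Lvl", "Bnk"],
    "LotArea": [8450, 9600, 11250, 9550],
})

# encode the full dataset first so every category gets a column
encoded = pandas.get_dummies(data)

# then split; both halves share the same column set by construction
train, validate = encoded.iloc[:3], encoded.iloc[3:]
print(list(train.columns) == list(validate.columns))  # True
```

The trade-off is a mild form of leakage: the encoder has seen the validation categories, which is fine for a column layout but wouldn't be appropriate for anything fit to the target.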

It looks like the most significant categorical features are LandContour (Bnk and Lvl), either Condition1 or Condition2 (Norm), and ExterCond (TA). From a quick look, though, they don't seem to contribute a whole lot to the model.

End

This was a brief look at handling categorical data.