Linear Regression and Binary Data
Beginning
I'm going to look at binary (two-class) categorical data and linear regression. In particular, I'm going to use the Credit Card Balance data set (James et al., 2013) to look at how to interpret the linear model once we encode gender as a numeric variable.
Set Up
Imports
First, some importing.
# python
from functools import partial
from pathlib import Path
import math
import os
import pickle
# pypi
from dotenv import load_dotenv
from expects import (
be_true,
expect
)
import hvplot.pandas
import pandas
import statsmodels.api as statsmodels
import statsmodels.formula.api as statsmodels_formula
# my stuff
from graeae import EmbedHoloviews, CountPercentage
The Environment
This loads the paths to the files from a .env file into the environment.
load_dotenv(".env")
Plotting
This just sets up some convenience values for plotting.
SLUG = "linear-regression-and-binary-data"
Embed = partial(EmbedHoloviews,
folder_path=f"files/posts/{SLUG}")
with Path(os.environ["PLOT_CONSTANTS"]).expanduser().open("rb") as reader:
    Plot = pickle.load(reader)
The Data
I downloaded the data from the ISL Data Set page previously, so I'll load it here.
data = pandas.read_csv(Path(os.environ["CREDIT"]).expanduser(), index_col=0)
I passed in the index_col argument because the first column is an index column with no header, so it just looks goofy if you don't. There are several columns, but I only want Gender and Balance (the credit card balance).
data = data[["Gender", "Balance"]]
Middle
The Data
Now that we have the data we can take a quick look at what's there.
counter = CountPercentage(data.Gender, value_label="Gender")
print(counter())
Gender | Count | Percent (%) |
---|---|---|
Female | 207 | 51.75 |
Male | 193 | 48.25 |
Our two classes are "Female" and "Male" and they are roughly, but not quite, equal in number. Now I'll look at the balance.
plot = data.hvplot.kde(y="Balance", by="Gender", color=Plot.color_cycle).opts(
width=Plot.width,
height=Plot.height,
fontscale=Plot.font_scale,
title="Credit Card Balance Distribution by Gender"
)
output = Embed(plot=plot, file_name="balance_distributions")()
print(output)
It looks like there are two populations for each gender. The larger one for both genders is centered near 0 and then both genders have a secondary population that carries a balance.
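The peak near zero suggests that a lot of the customers carry no balance at all. As a quick side check (this isn't part of the original walkthrough), we can count the rows with a balance of exactly zero:
zero_balance = data[data.Balance == 0]
print(zero_balance.Gender.value_counts())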
Encode the Gender
Since the Gender data is categorical we need to create a dummy variable to encode it. So what I'm going to do is encode males as 0 and females as 1 (the choice is arbitrary; we could swap the numbers and the model would still work, only the interpretation of the coefficients would flip).
gender = dict(
Male=0,
Female=1,
)
data["gender"] = data.Gender.map(gender)
print(data.gender.value_counts())
1    207
0    193
Name: gender, dtype: int64
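As an aside, pandas can build the same kind of indicator column with get_dummies; here's a minimal sketch (I used the explicit map above because it makes the 0/1 assignment obvious):
dummies = pandas.get_dummies(data.Gender)
print(dummies.head())
# the "Female" indicator matches the manual encoding
print((dummies.Female.astype(int) == data.gender).all())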
Fit the Regression
Now I'll fit the model with statsmodels' formula API, which uses R-style formulas (the base statsmodels.api interface takes arrays or DataFrames directly instead).
model = statsmodels_formula.ols("Balance ~ gender", data=data).fit()
print(model.summary())
                            OLS Regression Results
==============================================================================
Dep. Variable:                Balance   R-squared:                       0.000
Model:                            OLS   Adj. R-squared:                 -0.002
Method:                 Least Squares   F-statistic:                    0.1836
Date:                Sun, 02 Aug 2020   Prob (F-statistic):              0.669
Time:                        16:55:01   Log-Likelihood:                -3019.3
No. Observations:                 400   AIC:                             6043.
Df Residuals:                     398   BIC:                             6051.
Df Model:                           1
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept    509.8031     33.128     15.389      0.000     444.675     574.931
gender        19.7331     46.051      0.429      0.669     -70.801     110.267
==============================================================================
Omnibus:                       28.438   Durbin-Watson:                   1.940
Prob(Omnibus):                  0.000   Jarque-Bera (JB):               27.346
Skew:                           0.583   Prob(JB):                     1.15e-06
Kurtosis:                       2.471   Cond. No.                         2.66
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
The model is something like this:
\[ balance = \beta_0 + \beta_1 \times gender \]
Since we encoded Male as 0 and Female as 1, when the gender is Male the second term drops out and all you have is \(\beta_0\), while for females you have the full equation. How do you interpret the \(\beta\)s?
- \(\beta_0\) is the average balance that males carry
- \(\beta_0 + \beta_1\) is the average balance that females carry
- \(\beta_1\) is the difference between the average balances
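Plugging the fitted coefficients from the summary above into the equation gives:
\[ \widehat{balance} = 509.8031 + 19.7331 \times gender \]
so the model predicts an average balance of about 509.80 when gender is 0 (male) and about 529.54 when gender is 1 (female).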
We can check this against the raw data by comparing the group means to the coef entries in the summary table that I printed: the Intercept is \(\beta_0\) and gender is \(\beta_1\).
male_mean = data[data.Gender=="Male"].Balance.mean()
female_mean = data[data.Gender=="Female"].Balance.mean()
print(f"Average Male Balance: {male_mean:.7}")
print(f"Average Female Balance: {female_mean:0.7}")
print(f"Average difference: {female_mean - male_mean:0.7}")
expect(math.isclose(male_mean, model.params.Intercept)).to(be_true)
expect(math.isclose(female_mean - male_mean, model.params.gender)).to(be_true)
Average Male Balance: 509.8031
Average Female Balance: 529.5362
Average difference: 19.73312
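Now I'll plot the model's predictions against the actual balances to see how the fitted values line up with the data.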
data = data.sort_values(by="Balance")
data["prediction"] = model.predict(data.gender)
scatter = data.hvplot.scatter(x="Balance", y="prediction", by="Gender",
color=Plot.color_cycle).opts(
width=Plot.width,
height=Plot.height,
title="Gender Model",
fontscale=Plot.font_scale,
)
output = Embed(plot=scatter, file_name="Gender Model")()
print(output)
I didn't set up a formal hypothesis test, but if you look at the p-value for gender (the P>|t| column in the summary) you can see that it's much larger than 0.05, or whatever significance level you would normally choose, so gender probably isn't significant: given the spread in the data, the average balance for the two genders is essentially the same. This post is just about interpreting the coefficients, though, not deciding the validity of this particular model.
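If you want the p-values programmatically instead of reading them off the summary table, the fitted results object exposes them as the pvalues attribute; a small sketch using the model fitted above:
# p-values for the Intercept and the gender coefficient
print(model.pvalues)
print(f"p-value for gender: {model.pvalues.gender:0.3f}")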
End
So there you have it. If you have the special case of binary categorical data you can convert the category to a dummy variable and then fit a linear regression to it. If you encode the values as 0 and 1 then the intercept will give you the average output value for the category encoded as 0, and the slope will give you the difference between the average outputs for the two categories. If you use a different dummy encoding the meanings will change slightly, although you will still be inferring things about the averages. So why is this interesting, beyond predicting the mean for each category?
Logistic regression also relies on dummy variables to encode categorical data, and this post shows the preliminary steps that help us:
- encode the dummy variables
- build a linear model using statsmodels
- view summary information about the model
I didn't emphasize it, but the p-value for the F-statistic might be valuable when deciding whether the categorical variable is different enough across groups to use as a feature.
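For reference, the F-statistic and its p-value are also available directly on the fitted results object (again just a small sketch using the model from above):
# the overall F-test for the regression
print(f"F-statistic: {model.fvalue:0.4f}")
print(f"Prob (F-statistic): {model.f_pvalue:0.3f}")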