Linear Regression and Binary Data


I'm going to look at binary classification (two classes) and linear regression. In particular I'm going to use the Credit Card Balance data set (James et al., 2013) to look at how to interpret the linear model once we encode gender.

Set Up


First, some importing.

# python
from functools import partial
from pathlib import Path

import math
import os
import pickle

# pypi
from dotenv import load_dotenv
from expects import (
    be_true,
    expect,
)
import hvplot.pandas
import pandas
import statsmodels.api as statsmodels
import statsmodels.formula.api as statsmodels_formula

# my stuff
from graeae import EmbedHoloviews, CountPercentage

The Environment

This loads the path to files.

load_dotenv()

This just sets up some convenience values for plotting.

SLUG = "linear-regression-and-binary-data"
Embed = partial(EmbedHoloviews,
                folder_path=SLUG)  # assumption: the original arguments were truncated

with Path(os.environ["PLOT_CONSTANTS"]).expanduser().open("rb") as reader:
    Plot = pickle.load(reader)

The Data

I downloaded the data from the ISL Data Set page previously so I'll load it here.

data = pandas.read_csv(Path(os.environ["CREDIT"]).expanduser(), index_col=0)

I passed in the index_col argument because the first column is an index column with no header, so it just looks goofy if you don't. There are several columns but I only want Gender and Balance (the credit card balance).

data = data[["Gender", "Balance"]]


Inspecting the Data

Now that we have the data we can take a quick look at what's there.

counter = CountPercentage(data.Gender, value_label="Gender")

Gender    Count    Percent (%)
Female      207          51.75
Male        193          48.25

Our two classes are "Female" and "Male" and they are roughly, but not quite, equal in number. Now I'll look at the balance.

plot = data.hvplot.kde(y="Balance", by="Gender", color=Plot.color_cycle).opts(
    title="Credit Card Balance Distribution by Gender")

output = Embed(plot=plot, file_name="balance_distributions")()

Figure Missing

It looks like there are two populations for each gender. The larger one for both genders is centered near 0 and then both genders have a secondary population that carries a balance.

Encode the Gender

Since the Gender data is categorical we need to create a dummy variable to encode it. So what I'm going to do is encode males as 0 and females as 1 (because of the nature of binary encoding, we could swap the numbers and it would still work).

gender = dict(Male=0, Female=1)
data["gender"] = data.Gender.map(gender)
print(data.gender.value_counts())

1    207
0    193
Name: gender, dtype: int64

Fit the Regression

Now I'll fit the model with statsmodels, which uses r-style arguments (I think it supports regular python arguments too).

model = statsmodels_formula.ols("Balance ~ gender", data=data).fit()
print(model.summary())

                            OLS Regression Results                            
Dep. Variable:                Balance   R-squared:                       0.000
Model:                            OLS   Adj. R-squared:                 -0.002
Method:                 Least Squares   F-statistic:                    0.1836
Date:                Sun, 02 Aug 2020   Prob (F-statistic):              0.669
Time:                        16:55:01   Log-Likelihood:                -3019.3
No. Observations:                 400   AIC:                             6043.
Df Residuals:                     398   BIC:                             6051.
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
Intercept    509.8031     33.128     15.389      0.000     444.675     574.931
gender        19.7331     46.051      0.429      0.669     -70.801     110.267
Omnibus:                       28.438   Durbin-Watson:                   1.940
Prob(Omnibus):                  0.000   Jarque-Bera (JB):               27.346
Skew:                           0.583   Prob(JB):                     1.15e-06
Kurtosis:                       2.471   Cond. No.                         2.66

[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

The model is something like this:

\[ balance = \beta_0 + \beta_1 \times gender \]

Since we encoded Male as 0 and Female as 1, when the gender is Male the second term drops out and all you have is \(\beta_0\), while for females you have the full equation. How do you interpret the \(\beta\)s?

  • \(\beta_0\) is the average balance that males carry
  • \(\beta_0 + \beta_1\) is the average balance that females carry
  • \(\beta_1\) is the difference between the average balances

We can check this by comparing the coef entries in the summary table that I printed. The Intercept is \(\beta_0\) and gender is \(\beta_1\).

male_mean = data[data.Gender=="Male"].Balance.mean()
female_mean = data[data.Gender=="Female"].Balance.mean()
print(f"Average Male Balance: {male_mean:.7}")
print(f"Average Female Balance: {female_mean:0.7}")
print(f"Average difference: {female_mean - male_mean:0.7}")

expect(math.isclose(male_mean, model.params.Intercept)).to(be_true)
expect(math.isclose(female_mean - male_mean, model.params.gender)).to(be_true)
Average Male Balance: 509.8031
Average Female Balance: 529.5362
Average difference: 19.73312
Now I'll plot the model's predictions.

data = data.sort_values(by="Balance")
data["prediction"] = model.predict(data.gender)
scatter = data.hvplot.scatter(x="Balance", y="prediction", by="Gender",
                              title="Gender Model")

output = Embed(plot=scatter, file_name="Gender Model")()

Figure Missing

I didn't set up a formal hypothesis test, but if you look at the p-value for gender (the P>|t| column in the summary) you can see that it's much larger than 0.05 or whatever significance level you would normally choose, so gender probably isn't significant and, given the deviation, the average balance for the two genders is really the same. But this post is about interpreting the coefficients, not deciding the validity of this particular model.


So there you have it. If you have the specialized case of binary categorical data, you can convert the category to a dummy variable and fit a linear regression to it. If you encode the values as 0 and 1, the y-intercept gives you the average output value for the category encoded as 0, and the slope gives you the difference between the two categories' average outputs. If you use different dummy values the meanings will change slightly, although you will still be inferring things about the averages. Why is this interesting - predicting the mean for each category?

Logistic regression also relies on dummy variables for categorical encodings, and this post walks through the preliminary steps:

  • encode the dummy variables
  • build a linear model using statsmodels
  • view summary information about the model

I didn't emphasize it, but the p-value for the F-statistic might be valuable when deciding whether the categorical data is different enough to use as a feature.
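As a sketch of that check - this uses synthetic data in place of the Credit set (so the numbers are made up), but the fvalue and f_pvalue attributes on the fitted statsmodels result are the real API:

```python
# Sketch: reading the F-statistic and its p-value from a fitted OLS model.
# The data here is synthetic; with the Credit data you would use the fitted
# `model` from above instead.
import numpy
import pandas
import statsmodels.formula.api as statsmodels_formula

random = numpy.random.default_rng(seed=2020)
frame = pandas.DataFrame(dict(gender=random.integers(0, 2, size=400)))
frame["Balance"] = 500 + 20 * frame.gender + random.normal(0, 450, size=400)

model = statsmodels_formula.ols("Balance ~ gender", data=frame).fit()

# f_pvalue is the probability of an F-statistic this large if none of the
# coefficients (here just `gender`) actually matter
print(f"F-statistic: {model.fvalue:0.4f}")
print(f"p-value: {model.f_pvalue:0.4f}")
significant = model.f_pvalue < 0.05
```

A small f_pvalue would suggest the category is worth keeping as a feature; with only one predictor it carries the same information as the gender coefficient's t-test.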

Step One: Get the Behavior


The first step of Clicker Training is "Get your rabbit to do what you want it to do", which seems a little ambiguous when you see it on the list, but it breaks down into four basic categories.

Capture Something She Does on Her Own

If you see your rabbit do something that she naturally does, like sit up or come to you at certain times, then click and reward it so you can build it up as something she'll do on cue.

Lure With Food

If your rabbit will take food from your hand you can teach her to go places by having her follow your hand with food in it, although the next category - Follow a Target - is a preferable way to do this.

Follow a Target

Create a target (like a ping-pong ball on the end of a stick) and teach your rabbit to follow it.

  • Start by putting it near her and clicking when she touches it with her nose or mouth
  • Extend it slowly by moving it away from her and rewarding her when she follows it


The target training follows the shaping pattern, wherein you start by rewarding anything that sort of looks like what you want her to do - like just looking at the target when you show it to her, then progressively extending the behavior needed until it's the one that you want. For example:

  1. look at the target
  2. touch the target
  3. follow the target a little
  4. follow the target further
  5. follow the target into her carrier

Twitter Sentiment Classification Using Distant Supervision


  • Go A, Bhayani R, Huang L. Twitter sentiment classification using distant supervision. CS224N project report, Stanford. 2009 Dec;1(12):2009.


This was a project report that looked at using emoticons to create a labeled data set for tweets.

About the Data

The authors noted that tweets are different from many other sources used for sentiment analysis - things like movie reviews - in that:

  • they are character limited (140 characters at the time of the paper, it has since doubled)
  • there is a huge amount of data to pull - and it is continuously being generated
  • there is an unusual amount of slang and non-normal spelling
  • it isn't subject specific - you can filter using the API, but Twitter itself isn't a single-subject service

Using Emoticons as Labels

The use of emoticons to decide whether a tweet is positive or negative has the benefit of automatically creating a labeled dataset, but since the emoticons are used as the labels, they have to be removed from the training set, taking away one of the more useful signals of a tweet's sentiment.

Getting and Cleaning the Data

They pulled 100 tweets from the API every 2 minutes until they had 800,000 positive and 800,000 negative tweets (after removing some tweets in pre-processing). The API lets you query by emoticon, so they used ":)" to grab positive tweets (the API matches any known equivalent emoticon) and ":(" for negative tweets. They removed re-tweets and duplicates, as well as any tweet that had both positive and negative emoticons in it. They then replaced usernames with the token USERNAME and URLs with URL, and limited the number of consecutively repeated characters to 2.
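The token-replacement steps can be sketched with regular expressions. The paper doesn't give the exact patterns the authors used, so the regexes below are my guesses at the behavior described:

```python
# Sketch of the Go et al. pre-processing; the regexes are assumptions,
# not the paper's actual patterns.
import re

def preprocess(tweet: str) -> str:
    """Normalize a tweet roughly the way the paper describes."""
    # replace @usernames with a single USERNAME token
    tweet = re.sub(r"@\w+", "USERNAME", tweet)
    # replace URLs with a URL token
    tweet = re.sub(r"https?://\S+", "URL", tweet)
    # limit runs of 3+ repeated characters to 2 (e.g. "huuuungry" -> "huungry")
    tweet = re.sub(r"(.)\1{2,}", r"\1\1", tweet)
    return tweet

print(preprocess("@jack I'm huuuungry http://example.com"))
# -> USERNAME I'm huungry URL
```

Collapsing repeated characters means "hungry", "huungry", and "huuuuungry" map to at most two spellings, which shrinks the feature space without losing the emphasis entirely.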

Opinion Mining and Sentiment Analysis


An opinion is a subjective statement about what a person thinks or believes about something.

There are three basic parts needed to understand an opinion:

  1. The opinion holder (person)
  2. The opinion target (something)
  3. The opinion content

In addition, to make it meaningful, you can add:

  • The Context of the opinion
  • The Sentiment of the opinion

(Zhai & Massung, 2016)

Sentiment Analysis is a part of Natural Language Processing that looks to transform freeform text into structured data. Opinion is an expression of belief, while sentiment is an expression of feeling. What is sometimes called Sentiment Analysis - the measurement of the positive or negative sentiment of a text - is really Polarity Classification, a subset of Sentiment Analysis (Pozzi, Fersini, et al., 2017).
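The parts listed above suggest what the structured data might look like. A minimal sketch - the class and field names are my own, not from either book:

```python
# A hypothetical record for the parts of an opinion: holder, target,
# and content are required; context and sentiment make it meaningful.
from dataclasses import dataclass
from typing import Optional

@dataclass
class Opinion:
    holder: str                      # who holds the opinion
    target: str                      # what the opinion is about
    content: str                     # the opinion itself
    context: Optional[str] = None    # e.g. when or where it was expressed
    sentiment: Optional[str] = None  # e.g. "positive" or "negative"

review = Opinion(holder="a moviegoer",
                 target="the movie",
                 content="the plot dragged",
                 sentiment="negative")
```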

Memorizing Binaries

This is one way that you can memorize binary states - not just binary numbers, since any set of binary states can be encoded as binary numbers.

  1. Encode as bits (e.g. white=0, black=1)
  2. Arrange the bits into 3x3 grids - 3 bits per row, 3 rows per grid

    \[ 000\\ 110\\ 101\\ \]

  3. Convert each row to decimal numbers.

    \[ 000 = 0\\ 110 = 6\\ 101 = 5\\ \]

  4. Convert the decimal to a word-image using a system. In the case of the Major System you might get 065 = sgl = seagull.
  5. Repeat until you have all the bits covered
  6. Put the images into a Memory Palace, two images per location.
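Steps 1 through 3 above can be sketched in a few lines - split the bit string into 3-bit rows and convert each row to a decimal digit (the Major System word lookup in step 4 is left out):

```python
def grid_digits(bits: str, row_length: int = 3) -> str:
    """Split a bit string into rows and convert each row to a decimal digit."""
    rows = [bits[start: start + row_length]
            for start in range(0, len(bits), row_length)]
    # int(row, 2) reads each row as a base-2 number
    return "".join(str(int(row, 2)) for row in rows)

# the 3x3 grid from the example above: 000 / 110 / 101
print(grid_digits("000110101"))  # -> 065
```

Each 3-bit row maps to a digit from 0 to 7, so a full 3x3 grid becomes a three-digit number ready for the word-image step.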