Linear Regression and Binary Data


I'm going to look at binary classification (two classes) and linear regression. In particular I'm going to use the Credit Card Balance data set (James et al., 2013) to look at how to interpret the linear model once we encode gender.

Set Up


First, some importing.

# python
from functools import partial
from pathlib import Path

import math
import os
import pickle

# pypi
from dotenv import load_dotenv
from expects import (
    be_true,
    expect,
)
import hvplot.pandas
import pandas
import statsmodels.api as statsmodels
import statsmodels.formula.api as statsmodels_formula

# my stuff
from graeae import EmbedHoloviews, CountPercentage

The Environment

This loads the path to files.

load_dotenv()

This just sets up some convenience values for plotting.

SLUG = "linear-regression-and-binary-data"
Embed = partial(EmbedHoloviews,
                folder_path=SLUG)  # assumption: the original arguments were truncated

with Path(os.environ["PLOT_CONSTANTS"]).expanduser().open("rb") as reader:
    Plot = pickle.load(reader)

The Data

I downloaded the data from the ISL Data Set page previously so I'll load it here.

data = pandas.read_csv(Path(os.environ["CREDIT"]).expanduser(), index_col=0)

I passed in the index_col argument because the first column is an index column with no header, so it just looks goofy if you don't. There are several columns but I only want Gender and Balance (the credit card balance).

data = data[["Gender", "Balance"]]


Inspecting the Data

Now that we have the data we can take a quick look at what's there.

counter = CountPercentage(data.Gender, value_label="Gender")

Gender    Count    Percent (%)
Female      207          51.75
Male        193          48.25

Our two classes are "Female" and "Male" and they are roughly, but not quite, equal in number. Now I'll look at the balance.

plot = data.hvplot.kde(y="Balance", by="Gender", color=Plot.color_cycle).opts(
    title="Credit Card Balance Distribution by Gender")

output = Embed(plot=plot, file_name="balance_distributions")()

Figure Missing

It looks like there are two populations for each gender. The larger one for both genders is centered near 0 and then both genders have a secondary population that carries a balance.

Encode the Gender

Since the Gender data is categorical we need to create a dummy variable to encode it. So what I'm going to do is encode males as 0 and females as 1 (because of the nature of binary encoding, we could swap the numbers and it would still work).

gender = dict(Male=0, Female=1)
data["gender"] = data.Gender.map(gender)
print(data.gender.value_counts())

1    207
0    193
Name: gender, dtype: int64

Fit the Regression

Now I'll fit the model with statsmodels, which uses r-style arguments (I think it supports regular python arguments too).

model = statsmodels_formula.ols("Balance ~ gender", data=data).fit()
print(model.summary())

                            OLS Regression Results                            
Dep. Variable:                Balance   R-squared:                       0.000
Model:                            OLS   Adj. R-squared:                 -0.002
Method:                 Least Squares   F-statistic:                    0.1836
Date:                Sun, 02 Aug 2020   Prob (F-statistic):              0.669
Time:                        16:55:01   Log-Likelihood:                -3019.3
No. Observations:                 400   AIC:                             6043.
Df Residuals:                     398   BIC:                             6051.
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
Intercept    509.8031     33.128     15.389      0.000     444.675     574.931
gender        19.7331     46.051      0.429      0.669     -70.801     110.267
Omnibus:                       28.438   Durbin-Watson:                   1.940
Prob(Omnibus):                  0.000   Jarque-Bera (JB):               27.346
Skew:                           0.583   Prob(JB):                     1.15e-06
Kurtosis:                       2.471   Cond. No.                         2.66

[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

The model is something like this:

\[ balance = \beta_0 + \beta_1 \times gender \]

Since we encoded Male as 0 and Female as 1, when the gender is Male the second term drops out and all you have is \(\beta_0\), while for females you have the full equation. How do you interpret the \(\beta\)s?

  • \(\beta_0\) is the average balance that males carry
  • \(\beta_0 + \beta_1\) is the average balance that females carry
  • \(\beta_1\) is the difference between the average balances

We can check this by comparing the coef entries in the summary table that I printed. The Intercept is \(\beta_0\) and gender is \(\beta_1\).

male_mean = data[data.Gender=="Male"].Balance.mean()
female_mean = data[data.Gender=="Female"].Balance.mean()
print(f"Average Male Balance: {male_mean:.7}")
print(f"Average Female Balance: {female_mean:0.7}")
print(f"Average difference: {female_mean - male_mean:0.7}")

expect(math.isclose(male_mean, model.params.Intercept)).to(be_true)
expect(math.isclose(female_mean - male_mean, model.params.gender)).to(be_true)
Average Male Balance: 509.8031
Average Female Balance: 529.5362
Average difference: 19.73312
Now I'll plot the model's predictions.

data = data.sort_values(by="Balance")
data["prediction"] = model.predict(data.gender)
scatter = data.hvplot.scatter(x="Balance", y="prediction", by="Gender",
                              title="Gender Model")

output = Embed(plot=scatter, file_name="Gender Model")()

Figure Missing

I didn't set up a formal hypothesis test, but if you look at the p-value for gender (the P>|t| column in the summary) you can see that it's much larger than 0.05 or whatever significance level you would normally choose, so gender probably isn't significant and, given the deviation, the average balance for the two genders is really the same. But this post is about interpreting the coefficients, not deciding the validity of this particular model.


So there you have it. If you have the specialized case of binary categorical data, you can convert the category to a dummy variable and fit a linear regression to it. If you encode the values as 0 and 1, the y-intercept gives you the average output value for the category encoded as 0, and the slope gives you the difference between the two categories' average outputs. If you use different dummy values the meanings will change slightly, although you will still be inferring things about the averages. Why is this interesting - predicting the mean for each category?

Logistic regression also relies on dummy variables for categorical encodings, and this post walks through the preliminary steps:

  • encode the dummy variables
  • build a linear model using statsmodels
  • view summary information about the model

I didn't emphasize it, but the p-value for the F-statistic might be valuable when deciding whether the categorical data is different enough to use as a feature.
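As a sketch of that check - this uses synthetic data in place of the Credit set (so the numbers are made up), but the fvalue and f_pvalue attributes on the fitted statsmodels result are the real API:

```python
# Sketch: reading the F-statistic and its p-value from a fitted OLS model.
# The data here is synthetic; with the Credit data you would use the fitted
# `model` from above instead.
import numpy
import pandas
import statsmodels.formula.api as statsmodels_formula

random = numpy.random.default_rng(seed=2020)
frame = pandas.DataFrame(dict(gender=random.integers(0, 2, size=400)))
frame["Balance"] = 500 + 20 * frame.gender + random.normal(0, 450, size=400)

model = statsmodels_formula.ols("Balance ~ gender", data=frame).fit()

# f_pvalue is the probability of an F-statistic this large if none of the
# coefficients (here just `gender`) actually matter
print(f"F-statistic: {model.fvalue:0.4f}")
print(f"p-value: {model.f_pvalue:0.4f}")
significant = model.f_pvalue < 0.05
```

A small f_pvalue would suggest the category is worth keeping as a feature; with only one predictor it carries the same information as the gender coefficient's t-test.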

Step One: Get the Behavior


The first step of Clicker Training is "Get your rabbit to do what you want it to do", which seems a little ambiguous when you see it on the list, but it breaks down into four basic categories.

Capture Something She Does on Her Own

If you see your rabbit do something that she naturally does, like sit up or come to you at certain times, then click and reward it so you can build it up as something she'll do on cue.

Lure With Food

If your rabbit will take food from your hand you can teach her to go places by having her follow your hand with food in it, although the next category - Follow a Target - is a preferable way to do this.

Follow a Target

Create a target (like a ping-pong ball on the end of a stick) and teach your rabbit to follow it.

  • Start by putting it near her and clicking when she touches it with her nose or mouth
  • Extend it slowly by moving it away from her and rewarding her when she follows it


The target training follows the shaping pattern, wherein you start by rewarding anything that sort of looks like what you want her to do - like just looking at the target when you show it to her, then progressively extending the behavior needed until it's the one that you want. For example:

  1. look at the target
  2. touch the target
  3. follow the target a little
  4. follow the target further
  5. follow the target into her carrier

Twitter Sentiment Classification Using Distant Supervision


  • Go A, Bhayani R, Huang L. Twitter sentiment classification using distant supervision. CS224N project report, Stanford. 2009 Dec;1(12):2009.


This was a project report that looked at using emoticons to create a labeled data set for tweets.

About the Data

The authors noted that tweets are different from many other sources used for sentiment analysis - things like movie reviews - in that:

  • they are character limited (140 characters at the time of the paper, it has since doubled)
  • there is a huge amount of data to pull - and it is continuously being generated
  • there is an unusual amount of slang and non-normal spelling
  • it isn't subject specific - you can filter using the API, but Twitter itself isn't a single-subject service

Using Emoticons as Labels

The use of emoticons to decide whether a tweet is positive or negative has the benefit of automatically creating a labeled dataset, but since the emoticons are used as the labels, they have to be removed from the training set, taking away one of the more useful signals of a tweet's sentiment.

Getting and Cleaning the Data

They pulled 100 tweets from the API every 2 minutes until they had 800,000 positive and 800,000 negative tweets (after removing some tweets in pre-processing). The API lets you query by emoticon, so they used ":)" to grab positive tweets (the API matches any known equivalent emoticon) and ":(" for negative tweets. They removed re-tweets and duplicates, as well as any tweet that had both positive and negative emoticons in it. They then replaced usernames with the token USERNAME and URLs with URL, and limited the number of consecutively repeated characters to 2.
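The token-replacement steps can be sketched with regular expressions. The paper doesn't give the exact patterns the authors used, so the regexes below are my guesses at the behavior described:

```python
# Sketch of the Go et al. pre-processing; the regexes are assumptions,
# not the paper's actual patterns.
import re

def preprocess(tweet: str) -> str:
    """Normalize a tweet roughly the way the paper describes."""
    # replace @usernames with a single USERNAME token
    tweet = re.sub(r"@\w+", "USERNAME", tweet)
    # replace URLs with a URL token
    tweet = re.sub(r"https?://\S+", "URL", tweet)
    # limit runs of 3+ repeated characters to 2 (e.g. "huuuungry" -> "huungry")
    tweet = re.sub(r"(.)\1{2,}", r"\1\1", tweet)
    return tweet

print(preprocess("@jack I'm huuuungry http://example.com"))
# -> USERNAME I'm huungry URL
```

Collapsing repeated characters means "hungry", "huungry", and "huuuuungry" map to at most two spellings, which shrinks the feature space without losing the emphasis entirely.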

Opinion Mining and Sentiment Analysis


An opinion is a subjective statement about what a person thinks or believes about something.

There are three basic parts needed to understand an opinion:

  1. The opinion holder (person)
  2. The opinion target (something)
  3. The opinion content

In addition, to make it meaningful, you can add:

  • The Context of the opinion
  • The Sentiment of the opinion

(Zhai & Massung, 2016)

Sentiment Analysis is a part of Natural Language Processing that looks to transform freeform text into structured data. Opinion is an expression of belief, while sentiment is an expression of feeling. What is sometimes called Sentiment Analysis - the measurement of the positive or negative sentiment of a text - is really Polarity Classification, a subset of Sentiment Analysis (Pozzi, Fersini, et al., 2017).
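The parts listed above suggest what the structured data might look like. A minimal sketch - the class and field names are my own, not from either book:

```python
# A hypothetical record for the parts of an opinion: holder, target,
# and content are required; context and sentiment make it meaningful.
from dataclasses import dataclass
from typing import Optional

@dataclass
class Opinion:
    holder: str                      # who holds the opinion
    target: str                      # what the opinion is about
    content: str                     # the opinion itself
    context: Optional[str] = None    # e.g. when or where it was expressed
    sentiment: Optional[str] = None  # e.g. "positive" or "negative"

review = Opinion(holder="a moviegoer",
                 target="the movie",
                 content="the plot dragged",
                 sentiment="negative")
```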

Memorizing Binaries

This is one way that you can memorize binary states - not just binary numbers, since any set of binary states can be encoded as binary numbers.

  1. Encode as bits (e.g. white=0, black=1)
  2. Arrange the bits into 3x3 grids - 3 bits per row, 3 rows per grid

    \[ 000\\ 110\\ 101\\ \]

  3. Convert each row to decimal numbers.

    \[ 000 = 0\\ 110 = 6\\ 101 = 5\\ \]

  4. Convert the decimal to a word-image using a system. In the case of the Major System you might get 065 = sgl = seagull.
  5. Repeat until you have all the bits covered
  6. Put the images into a Memory Palace, two images per location.
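Steps 1 through 3 above can be sketched in a few lines - split the bit string into 3-bit rows and convert each row to a decimal digit (the Major System word lookup in step 4 is left out):

```python
def grid_digits(bits: str, row_length: int = 3) -> str:
    """Split a bit string into rows and convert each row to a decimal digit."""
    rows = [bits[start: start + row_length]
            for start in range(0, len(bits), row_length)]
    # int(row, 2) reads each row as a base-2 number
    return "".join(str(int(row, 2)) for row in rows)

# the 3x3 grid from the example above: 000 / 110 / 101
print(grid_digits("000110101"))  # -> 065
```

Each 3-bit row maps to a digit from 0 to 7, so a full 3x3 grid becomes a three-digit number ready for the word-image step.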