James, G., Witten, D., Hastie, T., and Tibshirani, R. Credit Card Balance Data [Data file]. USC; Marshall School of Business; Los Angeles, California. [cited 2020 Aug 1]. Available from: https://faculty.marshall.usc.edu/gareth-james/ISL/Credit.csv
I'm going to look at binary classification (two classes) and linear regression. In particular I'm going to use the Credit Card Balance data set (James et al., 2013) to look at how to interpret the linear model once we encode gender.
Set Up
Imports
First, some importing.
# python
from functools import partial
from pathlib import Path
import math
import os
import pickle

# pypi
from dotenv import load_dotenv
from expects import (
    be_true,
    expect
)
import hvplot.pandas
import pandas
import statsmodels.api as statsmodels
import statsmodels.formula.api as statsmodels_formula

# my stuff
from graeae import EmbedHoloviews, CountPercentage
The Environment
This loads the path to files.
load_dotenv(".env")
Plotting
This just sets up some convenience values for plotting.
I passed in the index_col argument because the first column is an index column with no header, so it just looks goofy if you don't. There are several columns but I only want Gender and Balance (the credit card balance).
data = data[["Gender", "Balance"]]
Middle
The Data
Now that we have the data we can take a quick look at what's there.
Our two classes are "Female" and "Male" and they are roughly, but not quite, equal in number. Now I'll look at the balance.
plot = data.hvplot.kde(y="Balance", by="Gender", color=Plot.color_cycle).opts(
    width=Plot.width,
    height=Plot.height,
    fontscale=Plot.font_scale,
    title="Credit Card Balance Distribution by Gender",
)
output = Embed(plot=plot, file_name="balance_distributions")()
print(output)
It looks like there are two populations for each gender. The larger one for both genders is centered near 0 and then both genders have a secondary population that carries a balance.
Encode the Gender
Since the Gender data is categorical we need to create a dummy variable to encode it. So what I'm going to do is encode males as 0 and females as 1 (because of the nature of binary encoding, we could swap the numbers and it would still work).
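As a rough sketch of the encoding and the fit, here's a made-up four-row frame standing in for the real data. The names toy and toy_model and the balance values are assumptions for illustration only; the formula interface is the statsmodels.formula.api module imported above.

```python
import pandas
import statsmodels.formula.api as statsmodels_formula

# toy stand-in for the Credit data (made-up values, not the real data set)
toy = pandas.DataFrame({
    "Gender": ["Male", "Female", "Male", "Female"],
    "Balance": [100.0, 250.0, 300.0, 450.0],
})

# dummy variable: Male -> 0, Female -> 1
toy["gender"] = (toy.Gender == "Female").astype(int)

# ordinary least squares fit of Balance on the dummy variable
toy_model = statsmodels_formula.ols("Balance ~ gender", data=toy).fit()

# the fitted coefficients are indexed by "Intercept" and "gender"
print(toy_model.params)
```

Swapping the comparison to `toy.Gender == "Male"` would flip the encoding; the fit would still work, but the intercept would then describe the other group.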
Since we encoded Male as 0 and Female as 1, when the gender is Male the second term drops out and all you have is \(\beta_0\), while for females you have the full equation. How do you interpret the \(\beta\)s?
\(\beta_0\) is the average balance that males carry
\(\beta_0 + \beta_1\) is the average balance that females carry
\(\beta_1\) is the difference between the average balances
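Written out, the model behind these three interpretations looks like this. The symbols \(y_i\) (the balance for person \(i\)), \(x_i\) (the dummy variable) and \(\epsilon_i\) (the error term) are notation I'm assuming here, following James et al.:

\[
y_i = \beta_0 + \beta_1 x_i + \epsilon_i =
\begin{cases}
\beta_0 + \beta_1 + \epsilon_i & \text{if person } i \text{ is female } (x_i = 1)\\
\beta_0 + \epsilon_i & \text{if person } i \text{ is male } (x_i = 0)
\end{cases}
\]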
We can check this by comparing the coef entries in the summary table that I printed to the averages calculated directly from the data. The Intercept is \(\beta_0\) and gender is \(\beta_1\).
male_mean = data[data.Gender == "Male"].Balance.mean()
female_mean = data[data.Gender == "Female"].Balance.mean()
print(f"Average Male Balance: {male_mean:.7}")
print(f"Average Female Balance: {female_mean:0.7}")
print(f"Average difference: {female_mean - male_mean:0.7}")

# the intercept should match the male average and the slope the difference
expect(math.isclose(male_mean, model.params.Intercept)).to(be_true)
expect(math.isclose(female_mean - male_mean, model.params.gender)).to(be_true)
Average Male Balance: 509.8031
Average Female Balance: 529.5362
Average difference: 19.73312
I didn't set up a hypothesis test, but if you look at the p-value for gender (the P>|t| column in the summary) you can see that it's much larger than 0.05, or whatever level you would normally choose. So we can't reject the hypothesis that \(\beta_1\) is zero: given the spread in the data, the difference in average balance between the genders could easily be due to chance. But this is just about interpreting the coefficients, not deciding the validity of this particular model.
End
So there you have it. If you have the specialized case of binary categorical data you can convert the category to dummy variables and then fit a linear regression to it. If you encode the values as 0 and 1 then the y-intercept will give you the average output value for the category set to 0 and the slope will give you the difference between the average outputs for the categories. If you use different dummy variables the meanings will change slightly, although you will still be inferring things about the averages. Why is this interesting - predicting the mean for each category?
Logistic regression also relies on dummy variables for categorical encodings and this shows a preliminary step that helps us:
encode the dummy variables
build a linear model using statsmodels
view summary information about the model
I didn't emphasize it, but the p-value for the F-statistic might be valuable when deciding whether the categories are different enough for the categorical data to be worth using as a feature.
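To make the earlier remark about different dummy variables concrete, here's a sketch of one alternative coding (my choice for illustration, it isn't used anywhere above): encode females as \(+1\) and males as \(-1\). Writing \(\mu_F\) and \(\mu_M\) for the two average balances (notation assumed here), the model gives

\[
\mu_F = \beta_0 + \beta_1, \qquad \mu_M = \beta_0 - \beta_1
\quad\Longrightarrow\quad
\beta_0 = \frac{\mu_F + \mu_M}{2}, \qquad \beta_1 = \frac{\mu_F - \mu_M}{2}
\]

So the intercept becomes the midpoint of the two group averages and the slope becomes half the difference between them, while the predicted average for each group stays the same.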
James G, Witten D, Hastie T, Tibshirani R, editors. An introduction to statistical learning: with applications in R. New York: Springer; 2013. 426 p. (Springer texts in statistics).
Notes
Book site (includes link to PDF download of the book)
The first step of Clicker Training is "Get your rabbit to do what you want it to do", which seems a little ambiguous when you see it on the list, but it breaks down into four basic categories.
Capture Something She Does on Her Own
If you see your rabbit do something that she naturally does, like sit up or come to you at certain times, then click and reward it so you can build it up as something she'll do on cue.
Lure With Food
If your rabbit will take food from your hand you can teach her to go places by having her follow your hand with food in it, although the next category - Follow a Target - is a preferable way to do this.
Follow a Target
Create a target (like a ping-pong ball on the end of a stick) and teach your rabbit to follow it.
Start by putting it near her and clicking when she touches it with her nose or mouth
Extend it slowly by moving it away from her and rewarding her when she follows it
Shaping
The target training follows the shaping pattern, wherein you start by rewarding anything that sort of looks like what you want her to do - like just looking at the target when you show it to her, then progressively extending the behavior needed until it's the one that you want. For example:
Go A, Bhayani R, Huang L. Twitter sentiment classification using distant supervision. CS224N project report, Stanford. 2009 Dec;1(12):2009.
Notes
This was a project report that looked at using emoticons to create a labeled data set for tweets.
About the Data
The authors noted that tweets are different from many other sources used for sentiment analysis - things like movie reviews - in that:
they are character limited (140 characters at the time of the paper, it has since doubled)
there is a huge amount of data to pull - and it is continuously being generated
there is an unusual amount of slang and non-normal spelling
it isn't subject specific - you can filter using the API, but twitter itself isn't a single-subject service
Using Emoticons as Labels
Using emoticons to decide if a tweet is positive or negative has the benefit of automatically creating a labeled dataset, but since the emoticons are used as the labels the authors had to remove them from the training set, which takes away one of the more useful ways of identifying the tweet sentiment.
Getting and Cleaning the Data
They pulled 100 tweets from the API every 2 minutes until they had 800,000 positive and 800,000 negative tweets (after removing some tweets in pre-processing). The API lets you query by emoticon so they used ":)" to grab positive tweets (the API matches any known equivalent emoticon) and ":(" for negative tweets. They removed re-tweets and duplicates as well as any tweet that had both positive and negative emoticons in it. They then replaced usernames with the token USERNAME and URLs with URL and limited the number of consecutively repeated characters to 2.
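The token replacement and character squeezing could be sketched with regular expressions like this. The report doesn't give the exact patterns, so the three expressions and the function name clean_tweet are my assumptions, not the authors' code.

```python
import re

# assumed patterns: the report doesn't spell out the exact expressions
URL = re.compile(r"https?://\S+|www\.\S+")
USERNAME = re.compile(r"@\w+")
REPEATED = re.compile(r"(.)\1{2,}")


def clean_tweet(tweet: str) -> str:
    """Replace URLs and usernames with tokens and squeeze repeated characters."""
    # URLs go first so a "www" prefix isn't squeezed before it's matched
    tweet = URL.sub("URL", tweet)
    tweet = USERNAME.sub("USERNAME", tweet)
    # three or more of the same character in a row become exactly two
    return REPEATED.sub(r"\1\1", tweet)


print(clean_tweet("@bob I looooove this!!!! http://example.com/page"))
```

Squeezing to two characters rather than one keeps "loove" distinct from "love", so the emphasis in the original tweet still leaves a trace in the vocabulary.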
Zhai C, Massung S. Text data management and analysis: a practical introduction to information retrieval and text mining. First edition. New York: Association for Computing Machinery; 2016. 510 p. (ACM books).
Sentiment Analysis is a part of Natural Language Processing that looks to transform freeform text into structured data. Opinion is an expression of belief, while sentiment is an expression of feeling. What is sometimes called Sentiment Analysis - the measurement of the positive or negative sentiment of a text - is really Polarity Classification, a subset of Sentiment Analysis (Pozzi, Fersini, et al., 2017).