# python
from functools import partial
from pathlib import Path
import os
import pickle
# pypi
from dotenv import load_dotenv
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
import holoviews
import hvplot.pandas
import numpy
import pandas
import statsmodels.api as statsmodels
import statsmodels.formula.api as statsmodels_formula
# my stuff
from graeae import CountPercentage, EmbedHoloviews
load_dotenv(Path("~/.local/share/env").expanduser(), override=True)
default = pandas.read_csv(Path(os.environ["ISL_DEFAULT"]).expanduser())
SLUG = "linear-regression-binary-classification"
Embed = partial(EmbedHoloviews,
folder_path=f"files/posts/{SLUG}")
with Path(os.environ["TWITTER_PLOT"]).expanduser().open("rb") as reader:
Plot = pickle.load(reader)
The ISLR Default data set is a simulated set that provides information to help you build models to predict whether a loan will default or not. Let's look at the data.
default.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 4 columns):
 #   Column   Non-Null Count  Dtype
---  ------   --------------  -----
 0   default  10000 non-null  int64
 1   student  10000 non-null  int64
 2   balance  10000 non-null  float64
 3   income   10000 non-null  float64
dtypes: float64(2), int64(2)
memory usage: 312.6 KB
So you can see that it has 10,000 rows, all non-null, with two float and two integer columns. What might not be obvious from that summary is that default and student are actually categorical columns. Furthermore, according to the dataset's description they are encoded as "No" and "Yes", which doesn't square with integer dtypes - presumably the R factors were converted to their integer level codes on the way through rpy2 (see the download section below). Let's take a look at the default column (the target).
counter = CountPercentage(default.default, value_label="Defaulted")
counter()
Defaulted | Count | Percent (%) |
---|---|---|
1 | 9,667 | 96.67 |
2 | 333 | 3.33 |
So it's a pretty imbalanced set - you could just predict the target is 1 and you would be right around 97% of the time. I'm going to assume that 1 means "No" (didn't default) and 2 means "Yes" (defaulted). For this exercise we're only going to look at the default and balance columns, so I'll pull those out - balance is the amount of money still owed on the credit cards after making a monthly payment. I'm also going to change the default values to be 0 and 1 instead of 1 and 2.
default = default[["default", "balance"]]
default.loc[:, "default"] = default.default - 1
CountPercentage(default.default, value_label="Defaulted")()
Defaulted | Count | Percent (%) |
---|---|---|
0 | 9,667 | 96.67 |
1 | 333 | 3.33 |
What this is about is figuring out how you can use linear regression to make binary categorical predictions. The way we're going to do this is to create probabilities that a given balance will default and then fit a line to them. One problem with this is that the balances are real numbers, so most values appear only once and the per-balance probabilities would all be 0 or 1 - we don't end up with anything meaningful if we use them as-is.
counts = default.balance.value_counts()
print(counts[counts > 1])
0.0    499
Name: balance, dtype: int64
The only value that appears more than once is 0. I'm going to try and get something more meaningful by rounding. But how should we round?
plot = default.hvplot.kde(y="balance").opts(
title="Credit Card Balance Distribution",
width=Plot.width,
height=Plot.height,
color=Plot.tan,
fontscale=Plot.font_scale,
)
output = Embed(plot=plot, file_name="balance_distribution")()
print(output)
So, I kind of cheated and worked with the whole dataset, but anyway, looking at it you can see that there are two populations (possibly those that default and those that don't) and it peters out around 2,500, so rounding by 1,000s might not make sense. I'll try 100s and see if that's good enough.
default.loc[:, "balance"] = (default.balance/100).astype(int) * 100
counter = CountPercentage(default.balance, value_label="Credit Balance")
counter()
Credit Balance | Count | Percent (%) |
---|---|---|
700 | 783 | 7.83 |
900 | 779 | 7.79 |
800 | 770 | 7.70 |
0 | 758 | 7.58 |
600 | 755 | 7.55 |
1,000 | 722 | 7.22 |
500 | 668 | 6.68 |
1,100 | 616 | 6.16 |
400 | 586 | 5.86 |
1,200 | 555 | 5.55 |
300 | 530 | 5.30 |
1,300 | 471 | 4.71 |
200 | 423 | 4.23 |
1,400 | 357 | 3.57 |
100 | 316 | 3.16 |
1,500 | 277 | 2.77 |
1,600 | 189 | 1.89 |
1,700 | 157 | 1.57 |
1,800 | 118 | 1.18 |
1,900 | 64 | 0.64 |
2,000 | 47 | 0.47 |
2,100 | 28 | 0.28 |
2,200 | 14 | 0.14 |
2,300 | 10 | 0.10 |
2,400 | 4 | 0.04 |
2,500 | 2 | 0.02 |
2,600 | 1 | 0.01 |
Well, it's harder to tell but that might be reasonable.
grouped = default.groupby(["balance", "default"]).agg({"balance": "count"}).rename(
columns={"balance": "count"}).reset_index()
plot = grouped.hvplot.bar(x="balance", y="count", by="default").opts(
title="Credit Balance Counts Rounded To 100ths",
width=Plot.width,
height=Plot.height,
fontscale=Plot.font_scale,
color=Plot.color_cycle,
xrotation=90,
)
outcome = Embed(plot=plot, file_name="balance_bar")()
print(outcome)
That x-axis is a little funky, but the blue bars are the counts of those that didn't default and the red bars are those that did. The defaults seem to peak at a balance of 1,800, with a spread around it.
So, to get the probabilities, what we want is the count of accounts that defaulted at each balance divided by the total number of accounts with that balance.
defaulted = grouped[grouped.default==1]
didnt_default = grouped[grouped.default==0]
joined = didnt_default.set_index("balance").join(
defaulted.set_index("balance"),
lsuffix="_didnt_default",
rsuffix="_defaulted").reset_index().fillna(0)
joined["total"] = joined.count_didnt_default + joined.count_defaulted
joined["probability_of_default"] = joined.count_defaulted/joined.total
joined["predict_yes"] = joined.probability_of_default > 0.5
plot = joined.hvplot.scatter(x="balance",
y="probability_of_default",
by="predict_yes",
color=Plot.color_cycle,).opts(
title="Probability of Default By Balance",
width=Plot.width,
height=Plot.height,
fontscale=Plot.font_scale,
xrotation=90,
)
outcome = Embed(plot=plot, file_name="probability_vs_balance")()
print(outcome)
So, one of the problems we have here is that by converting the values to probabilities we lost a lot of data (well, not lost, just aggregated it away).
print(len(joined))
24
But this isn't a real model, it's just a look at what happens if you fit a line to the probabilities, so I'll just ignore that little fact.
model = LinearRegression().fit(joined.balance.values.reshape(-1, 1),
                               joined.probability_of_default)
inputs = numpy.linspace(joined.balance.min(), joined.balance.max())
predictions = model.predict(inputs.reshape(-1, 1))
line = pandas.DataFrame.from_dict(dict(balance=inputs, probability=predictions))
crossover_point = (line.probability[line.probability > 0.5] - 0.5).abs().idxmin()
crossover = line.probability.iloc[crossover_point]
vertical_crossover = line.balance.iloc[crossover_point]
line_plot = line.hvplot(x="balance", y="probability")
hline = holoviews.HLine(crossover)
vline = holoviews.VLine(vertical_crossover)
scatter = joined.hvplot.scatter(x="balance",
y="probability_of_default",
by="predict_yes",
color=Plot.color_cycle,)
plot = (line_plot * scatter * hline * vline).opts(
title="Probability of Default By Balance",
width=Plot.width,
height=Plot.height,
fontscale=Plot.font_scale,
xrotation=90,
)
outcome = Embed(plot=plot, file_name="probability_vs_balance_with_model")()
print(outcome)
Looking at the plot - the horizontal line is the point at which the model switches from predicting that the loan won't default to predicting that it will. Once again, I'm plotting on the training data so the performance shouldn't be taken at face value, but what it does show is that it would be possible to create a model using linear regression that predicts whether someone would default on their loan. One problem, though, would be interpreting the model - the y-intercept comes in at around -0.2 - what does it mean when the probability is negative? What happens when it's greater than one?
I'm not really sure what to make of this. I did it as an exercise because the ISLR book had it and I thought it might make an interesting thing to replicate - although my plot doesn't look at all like theirs, and they don't document what they did, so I'm not really sure that I'm doing anything remotely close to it. One thing to note is that, given the binary nature of our data, even though we fit a linear model, if we assume that any balance with a predicted probability greater than 0.5 is going to default, then we could re-state the model as:
if balance > 2065:
    defaulted = 1
else:
    defaulted = 0
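So where does that 2065 come from? With a fitted line of the form probability = intercept + slope * balance, the prediction flips at the balance where the line crosses 0.5, so you can solve for it directly (a quick check using the model fitted above - the exact value depends on the fit, so treat 2065 as approximate):

slope = model.coef_[0]
intercept = model.intercept_
# the line crosses 0.5 at balance = (0.5 - intercept)/slope
print((0.5 - intercept) / slope)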
Anyway, this isn't something you would normally do (using linear regression to classify data, that is), so it's time to move on.
# python
from pathlib import Path
import os
# pypi
from dotenv import load_dotenv
import numpy
import pandas
import rpy2.robjects as robjects
import rpy2.robjects.packages as rpackages
base = rpackages.importr("base")
utils = rpackages.importr("utils")
utils.chooseCRANmirror(ind=1)
utils.install_packages("ISLR")
islr = rpackages.importr("ISLR")
default = rpackages.data(islr).fetch("Default")["Default"]
default = pandas.DataFrame.from_dict({key: numpy.asarray(default.rx2(key))
for key in default.names})
default.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 4 columns):
 #   Column   Non-Null Count  Dtype
---  ------   --------------  -----
 0   default  10000 non-null  int32
 1   student  10000 non-null  int32
 2   balance  10000 non-null  float64
 3   income   10000 non-null  float64
dtypes: float64(2), int32(2)
memory usage: 234.5 KB
The reason for converting the rpy2 object to a pandas DataFrame was to make saving it easier. The rpy2 DataFrame does have a method called to_csvfile, but I've never used it before so this seemed safer.
env_path = Path("~/.local/share/env").expanduser()
load_dotenv(env_path, override=True)
path = Path(os.environ["ISL_DEFAULT"]).expanduser()
default.to_csv(path, index=False)
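For reference, the rpy2 route would presumably have looked something like this (an untested sketch - to_csvfile is the method mentioned above, and I'm assuming it just takes the output path):

# untested: write the CSV straight from the rpy2 DataFrame
default_r = rpackages.data(islr).fetch("Default")["Default"]
default_r.to_csvfile(str(path))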
This is a simulated dataset created for An Introduction to Statistical Learning. Unlike the ISL Credit Data Set, this one isn't listed for download on the book's page, which is why I downloaded it through R - although it does appear to be in the gzipped tar file on the site, it's in the rda format, a binary format that sounds sort of like R's version of the pickle format.
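If you did have the .rda file locally, it looks like you could load it through rpy2 as well (a sketch, assuming a file named Default.rda - I haven't tried this):

import rpy2.robjects as robjects

# "load" is R's built-in loader for .rda files; it puts the objects
# into the R global environment rather than returning them
robjects.r["load"]("Default.rda")
default_r = robjects.globalenv["Default"]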
Pandas has a read_html function that will automatically convert any HTML tables it finds into DataFrames. If you need to do a little cleaning up of the tables first, or are doing some exploration of the HTML, it can be useful to work with the page in Beautiful Soup beforehand. And if the table is being rendered on the page with Javascript, it can be useful to have Selenium grab the page for you so that the table actually gets rendered.
Start selenium (headless) and grab the page.
from selenium import webdriver

options = webdriver.FirefoxOptions()
options.headless = True
browser = webdriver.Firefox(firefox_options=options)
# URL is the address of the page with the table (defined elsewhere)
browser.get(URL)
Give it to Beautiful Soup.
from bs4 import BeautifulSoup
soup = BeautifulSoup(browser.page_source)
Do stuff with the soup (not shown).
Optionally grab the table you want - in this case I want the last one.
soup_table = soup.find_all("table")[-1]
But if there's only one, or you just want the first one, you could use find instead.
soup_table = soup.find("table")
Pass it to pandas.
tables = pandas.read_html(str(soup_table))
tables is a list of DataFrames - one for each table that pandas found, even if there's only one - so now it might be useful to get the one you want.
table = tables[0]
From what I understand from the documentation, pandas is using Beautiful Soup to parse the HTML, so if the tables come out okay and you don't need to mess around with the HTML tree beforehand you can just skip the soup.
import pandas
table = pandas.read_html(browser.page_source)[0]
If you want to replace parts of the strings in a pandas Series you can pass a plain replacement string, but if you want to keep part of the original sub-string that you're replacing, you can call the replace method with a function that un-packs the match object instead.
As an example, to add parentheses around strings that look like "Also see penguin", you can define a regular expression with a named group in it.
PATTERN = r"Also see (?P<name>\w+)"
Then define a function that expects objects that match the pattern and creates a string using the named-group.
import re

def see_also(match: re.Match) -> str:
"""Add parentheses to Also see
Args:
match: regular expression match object
Returns:
string with the name from the object and parentheses added
"""
return f"(Also see {match['name']})"
Then pass the pattern and the function to the replace method, with regex=True, since a callable replacement requires regex mode.
table["Animal"] = table.Animal.str.replace(PATTERN, see_also, regex=True)
If you edit the HTML tree in Beautiful Soup you should call the smooth method to smooth out any funkiness that got introduced.
from bs4 import BeautifulSoup
soup = BeautifulSoup(html)
# tag_ids_to_destroy holds the ids of the tags to remove (defined elsewhere)
for tag_id in tag_ids_to_destroy:
    tag = soup.find(id=tag_id)
    tag.decompose()
soup.smooth()
To replace a tag in Beautiful Soup, find the element, then call its replace_with method, passing in either a string or a tag.
from bs4 import BeautifulSoup
soup = BeautifulSoup(html)
element = soup.find(id=tag_id)
element.replace_with("text")
tag = soup.new_tag("b")
tag.string = "be bold"
element.replace_with(tag)
There are various ways to edit tags with Beautiful Soup. The decompose method both removes the tag from the tree and destroys the object (along with all its descendants in the HTML tree).
from bs4 import BeautifulSoup
soup = BeautifulSoup(html)
element = soup.find(id=tag_id)
element.decompose()
If you want to destroy tags around the one you found (along with the tag and its descendants in the tree), then call decompose on its parent or its parent's parent (and so on) - or figure out how to find the enclosing tag directly, whichever is easier.
element.parent.parent.decompose()
After you edit the tree it might be a good idea to call smooth to clean up a bit.
soup.smooth()
To run the Selenium Firefox webdriver without starting the GUI you need to set the headless option in the FirefoxOptions.
from selenium import webdriver
options = webdriver.FirefoxOptions()
options.headless = True
browser = webdriver.Firefox(firefox_options=options)
Note: there's a method called set_headless for this class which seems to work, but the documentation says that it's deprecated, so it's probably better to do it this way.
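Also note that the firefox_options keyword itself has been deprecated in newer Selenium releases in favor of options, so depending on your version you might need:

from selenium import webdriver

options = webdriver.FirefoxOptions()
options.headless = True
browser = webdriver.Firefox(options=options)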