Linear Regression Binary Classification

Beginning

Set Up

Imports

# python
from functools import partial
from pathlib import Path

import os
import pickle

# pypi
from dotenv import load_dotenv
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

import holoviews
import hvplot.pandas
import numpy
import pandas
import statsmodels.api as statsmodels
import statsmodels.formula.api as statsmodels_formula

# my stuff
from graeae import CountPercentage, EmbedHoloviews

The Data

load_dotenv(Path("~/.local/share/env").expanduser(), override=True)
default = pandas.read_csv(Path(os.environ["ISL_DEFAULT"]).expanduser())

The Plotting

SLUG = "linear-regression-binary-classification"
Embed = partial(EmbedHoloviews,
                folder_path=f"files/posts/{SLUG}")

with Path(os.environ["TWITTER_PLOT"]).expanduser().open("rb") as reader:
    Plot = pickle.load(reader)

Middle

Looking At the Data

The ISLR Default data set is a simulated set that provides information to help you build models to predict whether a customer will default on their credit card debt. Let's look at the data.

default.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 4 columns):
 #   Column   Non-Null Count  Dtype  
---  ------   --------------  -----  
 0   default  10000 non-null  int64  
 1   student  10000 non-null  int64  
 2   balance  10000 non-null  float64
 3   income   10000 non-null  float64
dtypes: float64(2), int64(2)
memory usage: 312.6 KB

So you can see that it has 10,000 rows, all non-null, with two float and two integer columns. What might not be obvious from this summary is that default and student are actually categorical columns. Furthermore, according to the dataset's documentation they hold "No" and "Yes" values, which doesn't match the integer dtypes shown here. Let's take a look at the default column (the target).

counter = CountPercentage(default.default, value_label="Defaulted")
counter()
|   Defaulted |   Count |   Percent (%) |
|-------------+---------+---------------|
|           1 |   9,667 |         96.67 |
|           2 |     333 |          3.33 |

So it's a pretty imbalanced set: you could just predict that the target is 1 and you would be right around 97% of the time. I'm going to assume that 1 means "No" (didn't default) and 2 means "Yes" (defaulted). For this exercise we're only going to look at the default and balance columns, so I'll pull those out - balance is the amount of money still owed on the credit card after making a monthly payment. I'm also going to change the default values to be 0 and 1 instead of 1 and 2.

# copy to avoid pandas' SettingWithCopyWarning when re-assigning below
default = default[["default", "balance"]].copy()
default.loc[:, "default"] = default.default - 1

CountPercentage(default.default, value_label="Defaulted")()
|   Defaulted |   Count |   Percent (%) |
|-------------+---------+---------------|
|           0 |   9,667 |         96.67 |
|           1 |     333 |          3.33 |

Probabilities

What this is about is figuring out how you can use linear regression to make binary categorical predictions. The way we're going to do this is to estimate, for each balance, the probability that it defaults, and then fit a line to those probabilities. One problem with this is that balance is a set of real numbers, so we don't end up with anything meaningful if we use the values as they are.

counts = default.balance.value_counts()
print(counts[counts > 1])
0.0    499
Name: balance, dtype: int64

The only value that appears more than once is 0. I'm going to try to get something more meaningful by rounding. But how should we round?

plot = default.hvplot.kde(y="balance").opts(
    title="Credit Card Balance Distribution",
    width=Plot.width,
    height=Plot.height,
    color=Plot.tan,
    fontscale=Plot.font_scale,
)

output = Embed(plot=plot, file_name="balance_distribution")()
print(output)
Figure Missing

So, I kind of cheated and worked with the whole dataset, but anyway, looking at it you can see that there are two populations (possibly those that default and those that don't) and that the distribution peters out around 2,500, so binning by 1,000 increments might not make sense. I'll try 100s and see if that's good enough.

default.loc[:, "balance"] = (default.balance/100).astype(int) * 100
counter = CountPercentage(default.balance, value_label="Credit Balance")
counter()
|   Credit Balance |   Count |   Percent (%) |
|------------------+---------+---------------|
|              700 |     783 |          7.83 |
|              900 |     779 |          7.79 |
|              800 |     770 |          7.70 |
|                0 |     758 |          7.58 |
|              600 |     755 |          7.55 |
|            1,000 |     722 |          7.22 |
|              500 |     668 |          6.68 |
|            1,100 |     616 |          6.16 |
|              400 |     586 |          5.86 |
|            1,200 |     555 |          5.55 |
|              300 |     530 |          5.30 |
|            1,300 |     471 |          4.71 |
|              200 |     423 |          4.23 |
|            1,400 |     357 |          3.57 |
|              100 |     316 |          3.16 |
|            1,500 |     277 |          2.77 |
|            1,600 |     189 |          1.89 |
|            1,700 |     157 |          1.57 |
|            1,800 |     118 |          1.18 |
|            1,900 |      64 |          0.64 |
|            2,000 |      47 |          0.47 |
|            2,100 |      28 |          0.28 |
|            2,200 |      14 |          0.14 |
|            2,300 |      10 |          0.10 |
|            2,400 |       4 |          0.04 |
|            2,500 |       2 |          0.02 |
|            2,600 |       1 |          0.01 |
Well, with this many bins it's harder to tell, but that looks like it might be reasonable.

grouped = default.groupby(["balance", "default"]).agg({"balance": "count"}).rename(
    columns={"balance": "count"}).reset_index()
plot = grouped.hvplot.bar(x="balance", y="count", by="default").opts(
    title="Credit Balance Counts Rounded To 100ths",
    width=Plot.width,
    height=Plot.height,
    fontscale=Plot.font_scale,
    color=Plot.color_cycle,
    xrotation=90,
)

outcome = Embed(plot=plot, file_name="balance_bar")()
print(outcome)

Figure Missing

That x-axis is a little funky, but the blue bars are the counts of those that didn't default and the red bars are those that did. The defaults seem to peak at a balance of 1,800, with a spread around it.

Converting To Probabilities

So to get the probabilities what we want is the count of accounts that defaulted for each balance divided by the total number of accounts with a given balance.

defaulted = grouped[grouped.default==1]
didnt_default = grouped[grouped.default==0]
joined = didnt_default.set_index("balance").join(
    defaulted.set_index("balance"),
    lsuffix="_didnt_default",
    rsuffix="_defaulted").reset_index().fillna(0)
joined["total"] = joined.count_didnt_default + joined.count_defaulted
joined["probability_of_default"] = joined.count_defaulted/joined.total
joined["predict_yes"] = joined.probability_of_default > 0.5
plot = joined.hvplot.scatter(x="balance",
                             y="probability_of_default",
                             by="predict_yes",
                             color=Plot.color_cycle,).opts(
    title="Probability of Default By Balance",
    width=Plot.width,
    height=Plot.height,
    fontscale=Plot.font_scale,
    xrotation=90,    
)

outcome = Embed(plot=plot, file_name="probability_vs_balance")()
print(outcome)

Figure Missing
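
As an aside, since default is now a 0/1 column, the same probabilities fall out of a one-line groupby - the mean of a 0/1 column is the fraction of ones - which would have saved the join above:

probability_check = default.groupby("balance").default.mean()
print(probability_check.head())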

Fit a Regression

So, one of the problems we have here is that by converting the values to probabilities we lost a lot of data (well, not lost, just aggregated it away).

print(len(joined))
24

But this isn't a real model; it's just a look at what happens if you fit a line to the probabilities, so I'll just ignore that little fact.

model = LinearRegression().fit(joined.balance.values.reshape(-1, 1),
                               joined.probability_of_default)

# evaluate the fitted line across the range of balances
inputs = numpy.linspace(joined.balance.min(), joined.balance.max())
predictions = model.predict(inputs.reshape(-1, 1))

line = pandas.DataFrame.from_dict(dict(balance=inputs, probability=predictions))

# find the point on the line closest to where the probability crosses 0.5
crossover_point = (line.probability[line.probability > 0.5] - 0.5).abs().idxmin()
crossover = line.probability.iloc[crossover_point]
vertical_crossover = line.balance.iloc[crossover_point]
line_plot = line.hvplot(x="balance", y="probability")

hline = holoviews.HLine(crossover)
vline = holoviews.VLine(vertical_crossover)
scatter = joined.hvplot.scatter(x="balance",
                                y="probability_of_default",
                                by="predict_yes",
                                color=Plot.color_cycle,)
plot = (line_plot * scatter * hline * vline).opts(
    title="Probability of Default By Balance",
    width=Plot.width,
    height=Plot.height,
    fontscale=Plot.font_scale,
    xrotation=90,    
)

outcome = Embed(plot=plot, file_name="probability_vs_balance_with_model")()
print(outcome)

Figure Missing

Looking at the plot - the horizontal line is the point at which the model switches from predicting that an account won't default to predicting that it will. Once again, I'm plotting on the training data, so the performance shouldn't be taken at face value, but what it does show is that it would be possible to create a model using linear regression that predicts whether someone will default. One problem, though, would be interpreting the model - the y-intercept comes in at about -0.2 - what does it mean when a probability is negative? What happens when it's greater than one?
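
To make that concrete, you can inspect the fitted coefficients and predict outside the range of the data (the exact numbers depend on the fit above, and 5,000 is just an arbitrary large balance chosen to force the issue):

print(f"Intercept: {model.intercept_:.3f}")
print(f"Slope: {model.coef_[0]:.6f}")

# a fitted line has no respect for probability's [0, 1] bounds
print(model.predict(numpy.array([[0], [5000]])))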

End

I'm not really sure what to make of this. I did it as an exercise because the ISLR book had it and I thought it might be an interesting thing to replicate - although my plot doesn't look at all like theirs, and they don't document what they did, so I'm not really sure that I'm doing anything remotely close to what they did. One thing to note is that, given the binary nature of our data, even though we fit a linear model, if we assume that any balance giving a probability greater than 0.5 is going to default, then we could re-state the model as:

if balance > 2065:
    defaulted = 1
else:
    defaulted = 0
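
That 2,065 threshold falls straight out of the line's coefficients - it's the balance where the fitted probability crosses 0.5 (a quick sketch using the model fit above; the exact value will vary with the fit):

# solve intercept + slope * balance = 0.5 for balance
threshold = (0.5 - model.intercept_) / model.coef_[0]
print(f"Decision threshold: {threshold:,.0f}")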

Anyway, using linear regression to classify data isn't something you would normally do, so it's time to move on.

ISL Default Dataset

Set Up

# python
from pathlib import Path
import os

# pypi
from dotenv import load_dotenv

import numpy
import pandas
import rpy2.robjects as robjects
import rpy2.robjects.packages as rpackages

The Dataset

Load It

base = rpackages.importr("base")
utils = rpackages.importr("utils")

utils.chooseCRANmirror(ind=1)
utils.install_packages("ISLR")
islr = rpackages.importr("ISLR")
default = rpackages.data(islr).fetch("Default")["Default"]
default = pandas.DataFrame.from_dict({key: numpy.asarray(default.rx2(key))
                                      for key in default.names})

Summarize It

default.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 4 columns):
 #   Column   Non-Null Count  Dtype  
---  ------   --------------  -----  
 0   default  10000 non-null  int32  
 1   student  10000 non-null  int32  
 2   balance  10000 non-null  float64
 3   income   10000 non-null  float64
dtypes: float64(2), int32(2)
memory usage: 234.5 KB

Save It

The reason for converting the rpy2 object to a pandas DataFrame was to make saving it easier. The rpy2 DataFrame does have a method called to_csvfile, but I've never used it before, so this seemed safer.
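
For reference, the untried route would look something like this - a sketch only, since I haven't verified to_csvfile's output or its keyword arguments, and the output path here is just an example:

# untried alternative (sketch): write the CSV straight from the rpy2 DataFrame
islr_default = rpackages.data(islr).fetch("Default")["Default"]
islr_default.to_csvfile("/tmp/default.csv", row_names=False)

And the pandas route actually used: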

env_path = Path("~/.local/share/env").expanduser()
load_dotenv(env_path, override=True)
path = Path(os.environ["ISL_DEFAULT"]).expanduser()
default.to_csv(path, index=False)

Citation

  • James, G., Witten, D., Hastie, T., and Tibshirani, R. ISLR: Data for an Introduction to Statistical Learning with Applications in R. 2017. [R package version 1.2]. CRAN. [cited 2020 Aug 7]. https://cran.r-project.org/package=ISLR

Comment

This is a simulated dataset created for An Introduction To Statistical Learning. Unlike the ISL Credit Data Set, this one isn't listed for download on the book's page, which is why I downloaded it through R. It does look like it's available in the gzipped tar file on the site, but there it's in the rda format - a binary format that sounds sort of like R's version of a pickle.
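
If you did want to go the rda route, loading one through rpy2 would look something like this (an untested sketch - the file name is hypothetical, and the loaded object lands in R's global environment under its original name):

# load the .rda into R's global environment, then pull the object out by name
robjects.r["load"]("Default.rda")
default_rda = robjects.globalenv["Default"]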

Selenium To Soup To Pandas

Pandas has a read_html function that will automatically convert any HTML tables it finds into DataFrames. But if you need to do a little cleaning up of the tables first, or are doing some exploration of the HTML, it can be useful to work with the page in Beautiful Soup first - and if the table is rendered on the page with Javascript, it can be useful to have Selenium grab the page for you so that the table exists by the time you parse it.

Start selenium (headless) and grab the page.

from selenium import webdriver

options = webdriver.FirefoxOptions()
options.headless = True
browser = webdriver.Firefox(options=options)

browser.get(URL)

Give it to Beautiful Soup.

from bs4 import BeautifulSoup

soup = BeautifulSoup(browser.page_source, "html.parser")

Do stuff with the soup (not shown).

Optionally grab the table you want - in this case I want the last one.

soup_table = soup.find_all("table")[-1]

But if there's only one, or you just want the first one, you could use find instead.

soup_table = soup.find("table")

Pass it to pandas.

tables = pandas.read_html(str(soup_table))

tables is a list of DataFrames - one for each table that pandas found, even if there's only one - so now it might be useful to get the one you want.

table = tables[0]

From what I understand from the documentation, pandas is using Beautiful Soup to parse the HTML, so if the tables come out okay and you don't need to mess around with the HTML tree beforehand you can just skip the soup.

import pandas

table = pandas.read_html(browser.page_source)[0]

Pandas String Replacement With a Callback

If you want to replace parts of strings in a Pandas Series you can pass a plain string to the replace method, but if you want to keep part of the original sub-string that you are replacing, you can instead pass in a function that un-packs the match object.

As an example, to add parentheses around strings that look like "Also see penguin" you can define a regular expression that has a named group in it.

PATTERN = r"Also see (?P<name>\w+)"

Then define a function that expects objects that match the pattern and creates a string using the named-group.

import re

def see_also(match: re.Match) -> str:
    """Add parentheses to Also see

    Args:
     match: regular expression match object

    Returns:
     string with the name from the object and parentheses added
    """
    return f"(Also see {match['name']})"

Then pass in the pattern and function to the replace method.

table["Animal"] = table.Animal.str.replace(PATTERN, see_also)

Destroying Tags With Beautiful Soup

There are various ways to edit tags with Beautiful Soup. The decompose method both removes the tag from the tree and destroys the object (along with all of its descendants in the HTML tree).

from bs4 import BeautifulSoup
soup = BeautifulSoup(html, "html.parser")

element = soup.find(id=tag_id)
element.decompose()

If you want to destroy tags around the one you found (along with the tag and its descendants in the tree) then call decompose on its parent or parent's parent (and so on). Or figure out how to find the parent directly instead - whichever is easier.

element.parent.parent.decompose()

After you edit the tree it might be a good idea to call smooth to clean up a bit.

soup.smooth()

Selenium Headless

To run the Selenium Firefox webdriver without starting the GUI you need to set the headless option in the FirefoxOptions.

from selenium import webdriver

options = webdriver.FirefoxOptions()
options.headless = True
browser = webdriver.Firefox(options=options)

Note: There's a method on this class called set_headless which seems to work, but the documentation says that it is deprecated, so it's probably better to do it this way.