Linear Models

1 Linear Regression

1.1 Imports

import matplotlib.pyplot as pyplot
import pandas
import seaborn
from sklearn.linear_model import (
    Lasso,
    LinearRegression,
    Ridge,
    )
from sklearn.datasets import load_boston
from sklearn.model_selection import train_test_split
%matplotlib inline
seaborn.set_style("whitegrid")

1.2 The Data

This is the same data I used for k-nearest neighbors regression.

boston = load_boston()
print("Boston data-shape: {0}".format(boston.data.shape))
Boston data-shape: (506, 13)
X_train, X_test, y_train, y_test = train_test_split(boston.data, boston.target)
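
Since train_test_split shuffles the rows randomly, the coefficients and scores shown below will change from run to run. As a minimal sketch (not how the numbers below were produced), the split can be made repeatable by passing a fixed random_state; the value 0 here is arbitrary.

# assumption: any fixed random_state would do; this replaces the unseeded split above
X_train, X_test, y_train, y_test = train_test_split(
    boston.data, boston.target, random_state=0)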

1.3 The Model

model = LinearRegression()
model.fit(X_train, y_train)
print("coefficients: {0}".format(model.coef_))
print("intercept: {0}".format(model.intercept_))
coefficients: [ -5.29188465e-02   3.27516047e-02   5.15495287e-02   1.96191849e+00
  -1.70355026e+01   4.26984342e+00  -4.66261395e-03  -1.24731581e+00
   2.40316945e-01  -1.12757320e-02  -9.67653044e-01   1.07129222e-02
  -4.58665079e-01]
intercept: 31.315219281412134
pollution_index = 4
names = ["Crime", "Large Lots", "Non-Retail Businesses",
         "Charles River adjacent", "Nitric Oxide", "Rooms", "Old Homes",
         "Distance to Employment", "Access to Highways", "Tax Rate",
         "Pupil-Teacher Ratio", "Blacks", "Lower Status"]
pandas.Series(model.coef_, index=names)
Crime                     -0.052919
Large Lots                 0.032752
Non-Retail Businesses      0.051550
Charles River adjacent     1.961918
Nitric Oxide             -17.035503
Rooms                      4.269843
Old Homes                 -0.004663
Distance to Employment    -1.247316
Access to Highways         0.240317
Tax Rate                  -0.011276
Pupil-Teacher Ratio       -0.967653
Blacks                     0.010713
Lower Status              -0.458665
dtype: float64

The price of homes in Boston has negative coefficients for Crime, Nitric Oxide (pollution), Distance to Employment centers, Tax Rate, Pupil-Teacher Ratio, and the Lower Status of the residents, with pollution having the largest coefficient magnitude overall (positive or negative). The most positive coefficients were for the number of rooms the house had and whether the house was adjacent to the Charles River.
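
To check which coefficients dominate, here is a sketch that re-creates the Series from above and sorts it by absolute value, so the largest positive or negative influence comes first (model and names are the variables already defined).

coefficients = pandas.Series(model.coef_, index=names)
# order the coefficients by magnitude, keeping their signs
print(coefficients.reindex(coefficients.abs().sort_values(ascending=False).index))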

print("Training r2: {:.2f}".format(model.score(X_train, y_train)))
print("Testing r2: {0:.2f}".format(model.score(X_test, y_test)))
Training r2: 0.74
Testing r2: 0.73

The training and testing scores were nearly identical, suggesting that this model generalizes well.
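
A single train/test split can be lucky, so one way to double-check the generalization claim would be cross-validation. Here is a hedged sketch using scikit-learn's cross_val_score; the choice of five folds is arbitrary.

from sklearn.model_selection import cross_val_score

# five-fold cross-validated r2 scores on the full data set
scores = cross_val_score(LinearRegression(), boston.data, boston.target, cv=5)
print("Mean cross-validated r2: {0:.2f}".format(scores.mean()))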

training = pandas.DataFrame(X_train, columns=names)
seaborn.pairplot(training)
boston_pair_plots.png

2 Ridge Regression

This model uses L2 regularization to shrink the size of the coefficients.

ridge = Ridge()
ridge.fit(X_train, y_train)
print("Training r2: {0:.2f}".format(ridge.score(X_train, y_train)))
print("Testing r2: {:.2f}".format(ridge.score(X_test, y_test)))
Training r2: 0.74
Testing r2: 0.72

This time the testing score was a little worse than without ridge regression.
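
The default alpha is 1; here is a sketch of trying a few other regularization strengths to see whether the test score recovers (the candidate values are arbitrary).

for alpha in (0.1, 1, 10, 100):
    ridge_alpha = Ridge(alpha=alpha).fit(X_train, y_train)
    print("alpha={0}: testing r2={1:.2f}".format(
        alpha, ridge_alpha.score(X_test, y_test)))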

pandas.Series(ridge.coef_, index=names)
Crime                    -0.048337
Large Lots                0.032897
Non-Retail Businesses     0.016831
Charles River adjacent    1.789245
Nitric Oxide             -8.860668
Rooms                     4.270665
Old Homes                -0.011137
Distance to Employment   -1.125192
Access to Highways        0.224993
Tax Rate                 -0.012211
Pupil-Teacher Ratio      -0.891977
Blacks                    0.010977
Lower Status             -0.471429
dtype: float64

Once again pollution and the number of rooms a home had were the biggest influences on the price of the home.
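
One way to see the L2 shrinkage directly is to compare the overall size of the two coefficient vectors. This sketch uses numpy (imported here since it wasn't in the original imports).

import numpy

# the L2 penalty should make the ridge coefficient vector shorter overall
print("OLS coefficient norm:   {0:.2f}".format(numpy.linalg.norm(model.coef_)))
print("Ridge coefficient norm: {0:.2f}".format(numpy.linalg.norm(ridge.coef_)))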

3 Lasso Regression

This model uses L1 regularization to remove the variables that don't influence the outcome.

lasso = Lasso()
lasso.fit(X_train, y_train)
print("Training r2: {0:.2f}".format(lasso.score(X_train, y_train)))
print("Testing r2: {0:.2f}".format(lasso.score(X_test, y_test)))
Training r2: 0.67
Testing r2: 0.64

The Lasso did worse than both the Ridge and ordinary-least-squares models.

coefficients = pandas.Series(lasso.coef_, index=names)
coefficients[coefficients==0]
Non-Retail Businesses    -0.0
Charles River adjacent    0.0
Nitric Oxide             -0.0
dtype: float64

The Lasso removed Non-Retail Businesses, Charles River adjacency, and pollution, even though the other models decided that pollution was the most important factor.

We can try to do better by using a less aggressive alpha value.

lasso = Lasso(alpha=0.01)
lasso.fit(X_train, y_train)
print("Training r2: {0:.2f}".format(lasso.score(X_train, y_train)))
print("Testing r2: {0:.2f}".format(lasso.score(X_test, y_test)))
coefficients = pandas.Series(lasso.coef_, index=names)
print(coefficients[coefficients==0])
Training r2: 0.74
Testing r2: 0.73
Series([], dtype: float64)

Tuning the alpha can make it perform slightly better than the Ridge regression, but in this case making it aggressive enough to get rid of a column ("Nitric Oxide") makes it perform slightly worse than Ridge regression.
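
Here is a sketch of sweeping alpha to see where coefficients start dropping out and what that costs in test score; the grid of values is arbitrary.

for alpha in (0.001, 0.01, 0.1, 1):
    lasso_alpha = Lasso(alpha=alpha).fit(X_train, y_train)
    # count how many coefficients this alpha has driven to exactly zero
    zeroed = (lasso_alpha.coef_ == 0).sum()
    print("alpha={0}: testing r2={1:.2f}, zeroed coefficients={2}".format(
        alpha, lasso_alpha.score(X_test, y_test), zeroed))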

training = pandas.DataFrame(X_train, columns=names)
training["price"] = y_train
seaborn.regplot(x="Nitric Oxide", y="price", data=training)
pyplot.xlabel("Nitric Oxide")
pyplot.ylabel("House Price")
pyplot.title("Pollution vs House Price")
pollution_vs_price.png

It appears that there is a linear relationship (although there appear to be some outliers).
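
As a rough check on that impression, the correlation between the pollution column and the price can be computed directly from the training frame built above (a sketch; regplot's fitted line already implies this relationship).

# Pearson correlation between nitric-oxide concentration and house price
print("Correlation: {0:.2f}".format(
    training["Nitric Oxide"].corr(training["price"])))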