KNN Regression
Introduction
This post looks at using K-Nearest Neighbors for regression. First I'll look at a synthetic dataset and then at a dataset that was created to study the effect of pollution on housing prices in Boston.
Imports
from numba import jit
import numpy
import matplotlib.pyplot as pyplot
import seaborn
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor
%matplotlib inline
seaborn.set_style("whitegrid")
The Model
def get_r_squared(max_neighbors=10, samples=100):
    train_score = []
    test_score = []
    models = []
    inputs, values = make_regression(n_samples=samples)
    X_train, X_test, y_train, y_test = train_test_split(inputs, values)
    for neighbors in range(1, max_neighbors+1):
        model = KNeighborsRegressor(n_neighbors=neighbors, n_jobs=4)
        model.fit(X_train, y_train)
        train_score.append(model.score(X_train, y_train))
        test_score.append(model.score(X_test, y_test))
        models.append(model)
    return train_score, test_score, models
def plot_r_squared(neighbors=20, samples=100):
    train_score, test_score, models = get_r_squared(neighbors, samples)
    neighbors = range(1, neighbors+1)
    pyplot.plot(neighbors, train_score, label="Training $r^2$")
    pyplot.plot(neighbors, test_score, label="Testing $r^2$")
    pyplot.xlabel("Neighbors")
    pyplot.ylabel("$r^2$")
    pyplot.title("KNN Synthetic Data")
    pyplot.legend()
    return train_score, test_score, models

# keep the returned scores so they can be inspected after plotting
train_score, test_score, models = plot_r_squared()
I originally set the maximum to 10 neighbors, which made it look like 9 was the peak, but expanding the range shows that the peak was at 15. Even at its best the \(r^2\) score was fairly low. The make_regression function also appears to produce more run-to-run variance than I had expected: when I ran it earlier the testing score never exceeded the training score and the best k was 12. The best score itself was the same, though.
print("Max r2: {:.2f}".format(max(test_score)))
Max r2: 0.47
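The run-to-run differences described above come from the data and the train/test split being regenerated randomly on every call. As a minimal sketch (the `seeded_test_scores` name and the seed value are my own, not part of the original code), passing `random_state` to both `make_regression` and `train_test_split` should make repeated runs return the same scores:

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor

SEED = 0  # hypothetical seed; any fixed integer works


def seeded_test_scores(max_neighbors=20, samples=100, seed=SEED):
    """Return the testing r^2 for k = 1..max_neighbors using fixed seeds."""
    inputs, values = make_regression(n_samples=samples, random_state=seed)
    X_train, X_test, y_train, y_test = train_test_split(
        inputs, values, random_state=seed)
    scores = []
    for neighbors in range(1, max_neighbors + 1):
        model = KNeighborsRegressor(n_neighbors=neighbors)
        model.fit(X_train, y_train)
        scores.append(model.score(X_test, y_test))
    return scores


first = seeded_test_scores()
second = seeded_test_scores()
# identical seeds give identical data, splits and scores
print(first == second)
```

With the seeds fixed, differences between plots can be attributed to the parameters being changed rather than to a new random draw.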
The default for the make_regression function is to create 100 samples (which I mimicked by passing in 100 explicitly). By statistics standards this is a reasonable dataset (I believe 20 samples was the minimum for a long time) but it is very small by machine learning standards. Will it do better with a larger sample size?
plot_r_squared(samples=1000)
It didn't, but maybe because I didn't increase the number of neighbors.
plot_r_squared(neighbors=100, samples=1000)
No, that didn't help, and after looking at the plot above again I realized that the score was already getting worse toward the end, so I shouldn't have expected more neighbors to help. So why does it do worse with more data?
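One thing that may be relevant to that question is the shape of the generated data. By default `make_regression` creates 100 features of which only 10 are informative, so KNN computes its distances over many columns that have nothing to do with the target. This is only a sketch to inspect those defaults (the seed is arbitrary), not a confirmed explanation of the scores above:

```python
import numpy
from sklearn.datasets import make_regression

# coef=True also returns the coefficients of the underlying linear model
inputs, values, coefficients = make_regression(
    n_samples=1000, coef=True, random_state=0)

print("Input shape: {0}".format(inputs.shape))
# only the informative features have a non-zero coefficient
print("Informative features: {0}".format(numpy.count_nonzero(coefficients)))
```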
train, test, models = plot_r_squared(samples=10000, neighbors=100)
Having even more data seems to have reduced how much the testing score drops as the number of neighbors increases. Maybe there's an ideal ratio of neighbors to data points that I'm missing, and using more neighbors requires more data.
@jit
def find_first(array, match):
    """find the index of the first match

    Expects a 1-dimensional array or list

    Args:
     array (numpy.array): thing to search
     match: thing to match

    Returns:
     int: index of the first match found (or None)
    """
    for index in range(len(array)):
        if array[index] == match:
            return index
    return
best = max(test)
print("Best Test r2: {:.2f}".format(best))
test = numpy.array(test)
index = find_first(test, best)
print("Best Neighbors: {0}".format(index + 1))
Best Test r2: 0.39
Best Neighbors: 18
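The same lookup can also be done without a hand-written loop, since `numpy.argmax` returns the index of the first maximum in an array. A small sketch, using made-up scores in place of the `test` list above:

```python
import numpy

# hypothetical scores standing in for the `test` list above
scores = numpy.array([0.21, 0.35, 0.39, 0.39, 0.30])

best_index = numpy.argmax(scores)  # index of the first maximum
print("Best Test r2: {:.2f}".format(scores[best_index]))
# neighbors start at 1, so add one to the zero-based index
print("Best Neighbors: {0}".format(best_index + 1))
```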
Boston
This dataset was created to see if there was a correlation between pollution and the price of houses in the Boston area.
Imports
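import matplotlib.pyplot as pyplot
import seaborn
# load_boston was deprecated in scikit-learn 1.0 and removed in 1.2,
# so this import needs an older version of scikit-learn
from sklearn.datasets import load_boston
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor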
%matplotlib inline
seaborn.set_style("whitegrid")
The Data
boston = load_boston()
print("Boston data-shape: {0}".format(boston.data.shape))
Boston data-shape: (506, 13)
Boston House Prices dataset
Notes
Data Set Characteristics:
Number of Instances: 506
Number of Attributes: 13 numeric/categorical predictive
Median Value (attribute 14) is usually the target
Attribute Information (in order):
- CRIM per capita crime rate by town
- ZN proportion of residential land zoned for lots over 25,000 sq.ft.
- INDUS proportion of non-retail business acres per town
- CHAS Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
- NOX nitric oxides concentration (parts per 10 million)
- RM average number of rooms per dwelling
- AGE proportion of owner-occupied units built prior to 1940
- DIS weighted distances to five Boston employment centres
- RAD index of accessibility to radial highways
- TAX full-value property-tax rate per $10,000
- PTRATIO pupil-teacher ratio by town
- B 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town
- LSTAT % lower status of the population
- MEDV Median value of owner-occupied homes in $1000's
Missing Attribute Values: None
Creator: Harrison, D. and Rubinfeld, D.L.
This is a copy of UCI ML housing dataset. http://archive.ics.uci.edu/ml/datasets/Housing
This dataset was taken from the StatLib library which is maintained at Carnegie Mellon University.
The Boston house-price data of Harrison, D. and Rubinfeld, D.L. 'Hedonic prices and the demand for clean air', J. Environ. Economics & Management, vol.5, 81-102, 1978. Used in Belsley, Kuh & Welsch, 'Regression diagnostics ...', Wiley, 1980. N.B. Various transformations are used in the table on pages 244-261 of the latter.
The Boston house-price data has been used in many machine learning papers that address regression problems.
References
- Belsley, Kuh & Welsch, 'Regression diagnostics: Identifying Influential Data and Sources of Collinearity', Wiley, 1980. 244-261.
- Quinlan,R. (1993). Combining Instance-Based and Model-Based Learning. In Proceedings on the Tenth International Conference of Machine Learning, 236-243, University of Massachusetts, Amherst. Morgan Kaufmann.
- many more! (see http://archive.ics.uci.edu/ml/datasets/Housing)
print(boston.keys())
dict_keys(['target', 'feature_names', 'data', 'DESCR'])
This time there's no target_names key because it is a regression problem instead of a classification problem.
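Because the target is a continuous value rather than a class label, a KNN regressor with the default uniform weights predicts the average of the targets of the k nearest training points. A minimal sketch with made-up one-dimensional data (not the Boston data) that compares that average with what `KNeighborsRegressor` predicts:

```python
import numpy
from sklearn.neighbors import KNeighborsRegressor

# toy data made up for illustration: one feature, five training points
X = numpy.array([[1.0], [2.0], [3.0], [10.0], [11.0]])
y = numpy.array([10.0, 20.0, 30.0, 100.0, 110.0])

model = KNeighborsRegressor(n_neighbors=2)
model.fit(X, y)

query = numpy.array([[2.4]])
# the two nearest training points to 2.4 are 2.0 and 3.0
distances, indices = model.kneighbors(query)
manual = y[indices[0]].mean()
predicted = model.predict(query)[0]
print("Manual average: {:.1f}".format(manual))
print("Model prediction: {:.1f}".format(predicted))
```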
X_train, X_test, y_train, y_test = train_test_split(boston.data, boston.target)
Model Performance
def get_r_squared(max_neighbors=10):
    train_score = []
    test_score = []
    models = []
    for neighbors in range(1, max_neighbors+1):
        model = KNeighborsRegressor(n_neighbors=neighbors)
        model.fit(X_train, y_train)
        train_score.append(model.score(X_train, y_train))
        test_score.append(model.score(X_test, y_test))
        models.append(model)
    return train_score, test_score, models
train_score, test_score, models = get_r_squared()
neighbors = range(1, 11)
pyplot.plot(neighbors, train_score, label="Training $r^2$")
pyplot.plot(neighbors, test_score, label="Testing $r^2$")
pyplot.xlabel("Neighbors")
pyplot.ylabel("$r^2$")
pyplot.title("KNN Boston Housing Prices")
pyplot.legend()
The testing score seems to peak at 2 neighbors and then go down from there.
print("Training r2 for 2 neighbors: {:.2f}".format(train_score[1]))
print("Testing r2 for 2 neighbors: {:.2f}".format(test_score[1]))
assert max(test_score) == test_score[1]

Training r2 for 2 neighbors: 0.84
Testing r2 for 2 neighbors: 0.63
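For reference, the value returned by `model.score` is the coefficient of determination, one minus the ratio of the residual sum of squares to the total sum of squares. The sketch below checks that by hand on synthetic stand-in data (not the Boston data; the sample size, feature count and seed are arbitrary choices of mine):

```python
import numpy
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor

# synthetic stand-in data; the parameters and seed are arbitrary
inputs, values = make_regression(n_samples=200, n_features=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    inputs, values, random_state=0)

model = KNeighborsRegressor(n_neighbors=2).fit(X_train, y_train)
predictions = model.predict(X_test)

# r^2 = 1 - (residual sum of squares / total sum of squares)
residual = numpy.sum((y_test - predictions) ** 2)
total = numpy.sum((y_test - y_test.mean()) ** 2)
manual_r2 = 1 - residual / total

print(numpy.isclose(manual_r2, model.score(X_test, y_test)))
```

A score of 1 means the predictions match the targets exactly, while a score of 0 is what always predicting the mean of the test targets would earn.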
In this case K-Nearest Neighbors didn't seem to do as well with regression as it did with classification.