Statistical Analysis and Data Exploration

This section is an exploratory analysis of the Boston Housing data which will introduce the data and some changes that I made, summarize the median-value data, then look at the features to make an initial hypothesis about the value of the client’s home.

The Data

The data was taken from the sklearn.load_boston function (sklearn cites the UCI Machine Learning Repository as their source for the data). The data gives values for various features of different suburbs of Boston as well as the median-value for homes in each suburb. The features were chosen to reflect various aspects believed to influence the price of houses including the structure of the house (age and spaciousness), the quality of the neighborhood, transportation access to employment centers and highways, and pollution.

There are 14 variables in the data set (13 features and the median-value target). Here is the description of the data variables provided by sklearn.

Attribute Information (in order)
Variable Name Description
CRIM per capita crime rate by town
ZN proportion of residential land zoned for lots over 25,000 sq.ft.
INDUS proportion of non-retail business acres per town
CHAS Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
NOX nitric oxides concentration (parts per 10 million)
RM average number of rooms per dwelling
AGE proportion of owner-occupied units built prior to 1940
DIS weighted distances to five Boston employment centers
RAD index of accessibility to radial highways
TAX full-value property-tax rate per $10,000
PTRATIO pupil-teacher ratio by town
B 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town
LSTAT % lower status of the population
MEDV Median value of owner-occupied homes in $1000’s

Note

The data comes from the 1970 U.S. Census and the median-values have not been inflation-adjusted.

Cleaning the Data

There are no missing data points but the odd variable names are sometimes confusing so I’m going to expand them to full variable names.

Variable Aliases
Original Variable New Variable
CRIM crime_rate
ZN large_lots
INDUS industrial
CHAS charles_river
NOX nitric_oxide
RM rooms
AGE old_houses
DIS distances
RAD highway_access
TAX property_taxes
PTRATIO pupil_teacher_ratio
B proportion_blacks
LSTAT lower_status

Median Value

The target variable for this data-set is the median-value of houses within a given suburb. After presenting some summary statistics for the median-value I’ll make some plots to get a sense of the shape of the data.

Boston Housing median-value statistics (in $1000’s)
Item Value
count 506
mean 22.53
std 9.20
min 5.00
25% 17.02
50% 21.20
75% 25.00
max 50.00
IQR 7.975

Outlier Check

Comparing the mean (22.53) and the median (21.2) it looks like the distribution might be right-skewed. This is more obvious looking at distribution plots below, but I’ll also do an outlier check here using the traditional Q1 - 1.5 \times IQR for low outliers and Q3 + 1.5 \times IQR for the higher outliers to see how many there might be.

Outlier Count
Description Value
Low Outlier Limit (LOL) 5.06
LOL - min 0.06
Upper Outlier Limit (UOL) 36.96
max - UOL 13.04
Low Outlier Count 2
High Outlier Count 38

There aren’t an excessive number of outliers - about 8% of the median-values are above the upper outlier limit (UOL) and less than 1% below the lower-outlier limit. The difference between the maximum value of 50 and the UOL is 13.04, however, which is almost as large as the difference between the UOL and the median (15.76) so there might be an undue influence from the upper values if parametric statistics are used.

Plots

The KDE/histogram and box-plot seem to confirm what was shown in the section on outliers, which is that there are some unusually high median-values in the data.

The QQ-Plot shows that the distribution is initially fairly normal but the upper-third seems to come from a different distribution than the lower two-thirds.

Looking at the distribution (histogram and KDE plot) and box-plot the median-values for the homes appear to be right-skewed. The CDF shows that about 90% of the homes are $35,000 or less (the 90th percentile for median-value is 34.8) and that there’s a change in the spread of the data around $25,000. The qq-plot and the other plots show that the median-values aren’t normally distributed.

Possibly Significant Features

To get an idea of how the features are related to the median-value, I’ll plot some linear-regressions.

Looking at the plots, the three features that I think are the most significant are lower_status (LSTAT), nitric_oxide (NOX), and rooms (RM). The lower_status variable is the percent of the population of the town that is of ‘lower status’ which is defined in this case as being an adult with less than a ninth-grade education or a male worker that is classified as a laborer. The nitric_oxide variable represents the annual average parts per million of nitric-oxide measured in the air and is thus a stand-in for pollution. rooms is the average number of rooms per dwelling, representing the spaciousness of houses in the suburb (Harrison and Rubinfeld, 1978).

The Client

As I mentioned previously, the main goal of this project is to create a model to predict the house price for a client. Here are the client’s values.

Client Values
Feature Value
crime_rate 11.95
large_lots 0.0
industrial 18.1
charles_river 0
nitric_oxide 0.659
rooms 5.609
old_houses 90.0
distances 1.385
highway_access 24
property_taxes 680.0
pupil_teacher_ratio 20.2
proportion_blacks 332.09
lower_status 12.13

The Client’s Significant Features

Now a comparison of the client’s values for the three features that I hypothesized might be the most significant along with the values from the data-set.

Client Significant Features
Variable Client Value Boston Q1 Boston Median Boston Q3
lower_status 12.13 6.95 11.36 16.96
nitric_oxide 0.66 0.45 0.54 0.62
rooms 5.61 5.89 6.21 6.62

Comparing the values I guessed would be significant for the client to the median-values for the data set as a whole shows that the client has a higher ratio of lower-status adults, more pollution and fewer rooms than the median suburbs so I would expect that the predicted value will be lower than the median.