California Housing Prices
Introduction
This is an introductory regression problem using California housing data from the 1990 census. There's a description of the original data here, but we're using a slightly altered dataset from GitHub (which appears to be mirrored on Kaggle). The goal is to create a model that predicts the median housing value for a census block group (called a "district" in the dataset) given the other attributes. The original data is also available from sklearn, so I'm going to take advantage of that to get the description and to double-check the model.
Imports
These are the dependencies for this problem.
# python standard library
import os
import tarfile
import warnings
warnings.filterwarnings("ignore", message="numpy.dtype size changed")
from http import HTTPStatus
# from pypi
import matplotlib
import pandas
import requests
import seaborn
from sklearn.datasets import fetch_california_housing
from tabulate import tabulate
Constants
These are convenience holders for strings and other constants so they don't get scattered all over the place.
class Data:
    source_slug = "../data/california-housing-prices/"
    target_slug = "../data_temp/california-housing-prices/"
    url = "https://github.com/ageron/handson-ml/raw/master/datasets/housing/housing.tgz"
    source = source_slug + "housing.tgz"
    target = target_slug + "housing.csv"
    chunk_size = 128
The Data
We'll grab the data from GitHub, extract it (it's a gzipped tarfile), then make a pandas DataFrame from it. I'll also download the sklearn version.
Downloading the sklearn dataset
sklearn_housing_bunch = fetch_california_housing(data_home="~/data/sklearn_datasets/")
Downloading Cal. housing from https://ndownloader.figshare.com/files/5976036 to /home/brunhilde/data/sklearn_datasets/
print(sklearn_housing_bunch.DESCR)
California housing dataset.

The original database is available from StatLib: http://lib.stat.cmu.edu/datasets/

The data contains 20,640 observations on 9 variables. This dataset contains the average house value as target variable and the following input variables (features): average income, housing average age, average rooms, average bedrooms, population, average occupation, latitude, and longitude in that order.

References
----------
Pace, R. Kelley and Ronald Barry, Sparse Spatial Autoregressions, Statistics and Probability Letters, 33 (1997) 291-297.
print(sklearn_housing_bunch.feature_names)
['MedInc', 'HouseAge', 'AveRooms', 'AveBedrms', 'Population', 'AveOccup', 'Latitude', 'Longitude']
Now I'll convert it to a Pandas DataFrame.
sklearn_housing = pandas.DataFrame(sklearn_housing_bunch.data,
                                   columns=sklearn_housing_bunch.feature_names)
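A describe call summarizes the distributions. The table below matches its output; the call itself wasn't in this excerpt, so this line is a reconstruction:

print(sklearn_housing.describe())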
Statistic | MedInc | HouseAge | AveRooms | AveBedrms | Population | AveOccup | Latitude | Longitude |
---|---|---|---|---|---|---|---|---|
count | 20640.000000 | 20640.000000 | 20640.000000 | 20640.000000 | 20640.000000 | 20640.000000 | 20640.000000 | 20640.000000 |
mean | 3.870671 | 28.639486 | 5.429000 | 1.096675 | 1425.476744 | 3.070655 | 35.631861 | -119.569704 |
std | 1.899822 | 12.585558 | 2.474173 | 0.473911 | 1132.462122 | 10.386050 | 2.135952 | 2.003532 |
min | 0.499900 | 1.000000 | 0.846154 | 0.333333 | 3.000000 | 0.692308 | 32.540000 | -124.350000 |
25% | 2.563400 | 18.000000 | 4.440716 | 1.006079 | 787.000000 | 2.429741 | 33.930000 | -121.800000 |
50% | 3.534800 | 29.000000 | 5.229129 | 1.048780 | 1166.000000 | 2.818116 | 34.260000 | -118.490000 |
75% | 4.743250 | 37.000000 | 6.052381 | 1.099526 | 1725.000000 | 3.282261 | 37.710000 | -118.010000 |
max | 15.000100 | 52.000000 | 141.909091 | 34.066667 | 35682.000000 | 1243.333333 | 41.950000 | -114.310000 |
Downloading and uncompressing the data
def get_data():
    """Gets the data from github and uncompresses it"""
    if os.path.exists(Data.target):
        return
    os.makedirs(Data.target_slug, exist_ok=True)
    os.makedirs(Data.source_slug, exist_ok=True)
    response = requests.get(Data.url, stream=True)
    assert response.status_code == HTTPStatus.OK
    with open(Data.source, "wb") as writer:
        for chunk in response.iter_content(chunk_size=Data.chunk_size):
            writer.write(chunk)
    assert os.path.exists(Data.source)
    with tarfile.open(Data.source) as compressed:
        compressed.extractall(Data.target_slug)
    assert os.path.exists(Data.target)
    return
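Now I'll call it and check what was downloaded. The directory listing below came from something like this (a sketch; the original cell wasn't shown):

get_data()
print("Contents of {}:".format(Data.target_slug))
for name in os.listdir(Data.target_slug):
    print(" - {}".format(name))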
Contents of ../data_temp/california-housing-prices/:
- housing.csv
Building the dataframe
housing = pandas.read_csv(Data.target)
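The trailing None in the output below suggests it came from print(housing.info()); this call is a reconstruction:

print(housing.info())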
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20640 entries, 0 to 20639
Data columns (total 10 columns):
longitude             20640 non-null float64
latitude              20640 non-null float64
housing_median_age    20640 non-null float64
total_rooms           20640 non-null float64
total_bedrooms        20433 non-null float64
population            20640 non-null float64
households            20640 non-null float64
median_income         20640 non-null float64
median_house_value    20640 non-null float64
ocean_proximity       20640 non-null object
dtypes: float64(9), object(1)
memory usage: 1.6+ MB
None
Comparison to Sklearn
The dataset seems to differ somewhat from the sklearn description: where our CSV has total_rooms, sklearn has AveRooms, for instance. Is this just a difference in naming?
print(sklearn_housing.AveRooms.head())
0    6.984127
1    6.238137
2    8.288136
3    5.817352
4    6.281853
Name: AveRooms, dtype: float64
print(housing.total_rooms.head())
0     880.0
1    7099.0
2    1467.0
3    1274.0
4    1627.0
Name: total_rooms, dtype: float64
So they are different. Let's see if we can derive the sklearn values from the original dataset.
print((housing.total_rooms/housing.households).head())
0    6.984127
1    6.238137
2    8.288136
3    5.817352
4    6.281853
dtype: float64
It looks like the sklearn values are (in some cases) calculated from the original data: AveRooms is total_rooms divided by households. It makes sense to change some of these (comparing total room counts across districts only makes sense if every district has the same number of households, for instance), but it would have been better if the changes, and the reasons for them, had been documented.
Inspecting the Data
If you look at the total_bedrooms count you'll see that it has only 20,433 non-null values, while the rest of the columns have 20,640. These values were removed to allow experimenting with missing data; the original dataset collected for the census had all the values.
Column | Has Missing Values |
---|---|
longitude | False |
latitude | False |
housing_median_age | False |
total_rooms | False |
total_bedrooms | True |
population | False |
households | False |
median_income | False |
median_house_value | False |
ocean_proximity | False |
It looks like total_bedrooms
is the only column where there's missing data.
Rows | Columns |
---|---|
20640 | 10 |
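Both tables above can be reproduced along these lines (a sketch using the already-imported tabulate; the github table format is an assumption):

# which columns contain any missing (null) values?
missing = housing.isnull().any()
print(tabulate(zip(missing.index, missing.values),
               headers=["Column", "Has Missing Values"],
               tablefmt="github"))

# the overall shape of the data frame
rows, columns = housing.shape
print(tabulate([[rows, columns]], headers=["Rows", "Columns"],
               tablefmt="github"))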
I'll print the median for each column except the last (since it's non-numeric).
Column | Median |
---|---|
longitude | -118.49 |
latitude | 34.26 |
housing_median_age | 29.00 |
total_rooms | 2127.00 |
total_bedrooms | 435.00 |
population | 1166.00 |
households | 409.00 |
median_income | 3.53 |
median_house_value | 179700.00 |
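The medians could be computed like this (a sketch; the rounding and table format are assumptions):

# median of each numeric column, skipping ocean_proximity
medians = housing.median(numeric_only=True).round(2)
print(tabulate(zip(medians.index, medians.values),
               headers=["Column", "Median"], tablefmt="github"))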
Looking at the median_income row you can see that it isn't income in dollars (more on that below).

Here's the description of the ocean_proximity variable.
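The statistics in the next table match pandas' describe for a non-numeric column, so it was presumably produced like this (a reconstruction; it also defines the ocean_proximity_description variable used further down):

ocean_proximity_description = housing.ocean_proximity.describe()
print(ocean_proximity_description)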
Statistic | Value |
---|---|
count | 20640 |
unique | 5 |
top | <1H OCEAN |
freq | 9136 |
It looks like the most common house location is less than an hour from the ocean.
print(
    "{:.2f}".format(
        ocean_proximity_description.loc["freq"]
        / ocean_proximity_description.loc["count"]))
0.44
That makes up about forty-four percent of all the houses. Here are all the ocean_proximity values.
Proximity | Count | Percentage |
---|---|---|
<1H OCEAN | 9136 | 44.2636 |
INLAND | 6551 | 31.7393 |
NEAR OCEAN | 2658 | 12.8779 |
NEAR BAY | 2290 | 11.0950 |
ISLAND | 5 | 0.0242 |
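A sketch of how that table could be built (the exact formatting is an assumption):

# counts and percentages for each ocean_proximity category
counts = housing.ocean_proximity.value_counts()
percentages = 100 * counts / counts.sum()
print(tabulate(zip(counts.index, counts.values, percentages.values),
               headers=["Proximity", "Count", "Percentage"],
               tablefmt="github"))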
If you look at a plot of the median income you can see that it runs from roughly 0.5 to 15. It turns out that the incomes were re-scaled and capped to the 0.5 to 15 range. The median age and median house value were also capped, possibly affecting our price predictions.
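This excerpt doesn't include the plot, but a histogram along these lines shows the scaling and capping (a sketch; it assumes a working matplotlib backend):

from matplotlib import pyplot

# median_income is in scaled units (roughly 0.5 to 15), not dollars
housing.median_income.hist(bins=50)
pyplot.xlabel("median_income (scaled)")
pyplot.ylabel("number of districts")
pyplot.title("Median Income")
pyplot.show()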
References
- Géron, Aurélien. Hands-on Machine Learning with Scikit-Learn and TensorFlow: Concepts, Tools, and Techniques to Build Intelligent Systems. First edition. Beijing Boston Farnham: O’Reilly, 2017.