California Housing Prices


This is an introductory regression problem that uses California housing data from the 1990 census. There's a description of the original data here, but we're using a slightly altered dataset that's on github (and appears to be mirrored on kaggle). The problem here is to create a model that will predict the median housing value for a census block group (called "district" in the dataset) given the other attributes. The original data is also available from sklearn so I'm going to take advantage of that to get the description and do a double-check of the model.


These are the dependencies for this problem.

# python standard library
import os
import tarfile
import warnings
warnings.filterwarnings("ignore", message="numpy.dtype size changed")
from http import HTTPStatus

# from pypi
import matplotlib
import pandas
import requests
import seaborn
from sklearn.datasets import fetch_california_housing
from tabulate import tabulate


These are convenience holders for strings and other constants so they don't get scattered all over the place.

class Data:
    source_slug = "../data/california-housing-prices/"
    target_slug = "../data_temp/california-housing-prices/"
    url = ""
    source = source_slug + "housing.tgz"
    target = target_slug + "housing.csv"
    chunk_size = 128

The Data

We'll grab the data from github, extract it (it's a tgz compressed tarfile), then make a pandas data frame from it. I'll also download the sklearn version.

Downloading the sklearn dataset

sklearn_housing_bunch = fetch_california_housing("~/data/sklearn_datasets/")
Downloading Cal. housing from to /home/brunhilde/data/sklearn_datasets/
California housing dataset.

The original database is available from StatLib

The data contains 20,640 observations on 9 variables.

This dataset contains the average house value as target variable
and the following input variables (features): average income,
housing average age, average rooms, average bedrooms, population,
average occupation, latitude, and longitude in that order.


Pace, R. Kelley and Ronald Barry, Sparse Spatial Autoregressions,
Statistics and Probability Letters, 33 (1997) 291-297.


['MedInc', 'HouseAge', 'AveRooms', 'AveBedrms', 'Population', 'AveOccup', 'Latitude', 'Longitude']

Now I'll convert it to a Pandas DataFrame.

sklearn_housing = pandas.DataFrame(,
             MedInc      HouseAge      AveRooms     AveBedrms    Population  \
count  20640.000000  20640.000000  20640.000000  20640.000000  20640.000000   
mean       3.870671     28.639486      5.429000      1.096675   1425.476744   
std        1.899822     12.585558      2.474173      0.473911   1132.462122   
min        0.499900      1.000000      0.846154      0.333333      3.000000   
25%        2.563400     18.000000      4.440716      1.006079    787.000000   
50%        3.534800     29.000000      5.229129      1.048780   1166.000000   
75%        4.743250     37.000000      6.052381      1.099526   1725.000000   
max       15.000100     52.000000    141.909091     34.066667  35682.000000   

           AveOccup      Latitude     Longitude  
count  20640.000000  20640.000000  20640.000000  
mean       3.070655     35.631861   -119.569704  
std       10.386050      2.135952      2.003532  
min        0.692308     32.540000   -124.350000  
25%        2.429741     33.930000   -121.800000  
50%        2.818116     34.260000   -118.490000  
75%        3.282261     37.710000   -118.010000  
max     1243.333333     41.950000   -114.310000  

Downloading and uncompressing the data

def get_data():
    """Gets the data from github and uncompresses it"""
    if os.path.exists(

    os.makedirs(Data.target_slug, exist_ok=True)
    os.makedirs(Data.source_slug, exist_ok=True)
    response = requests.get(Data.url, stream=True)
    assert response.status_code == HTTPStatus.OK
    with open(Data.source, "wb") as writer:
	for chunk in response.iter_content(chunk_size=Data.chunk_size):
    assert os.path.exists(Data.source)
    compressed =
    assert os.path.exists(

Contents of ../data_temp/california-housing-prices/:

  • housing.csv

Building the dataframe

housing = pandas.read_csv(
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20640 entries, 0 to 20639
Data columns (total 10 columns):
longitude             20640 non-null float64
latitude              20640 non-null float64
housing_median_age    20640 non-null float64
total_rooms           20640 non-null float64
total_bedrooms        20433 non-null float64
population            20640 non-null float64
households            20640 non-null float64
median_income         20640 non-null float64
median_house_value    20640 non-null float64
ocean_proximity       20640 non-null object
dtypes: float64(9), object(1)
memory usage: 1.6+ MB

Comparison to Sklearn

The dataset seems to differ somewhat from the sklearn description. Instead of total_rooms they have AveRooms, for instance. Is this just a problem of names?


0 6.984127 1 6.238137 2 8.288136 3 5.817352 4 6.281853 Name: AveRooms, dtype: float64


0 880.0 1 7099.0 2 1467.0 3 1274.0 4 1627.0 Name: total_rooms, dtype: float64

So they are different. Let's see if you can get the sklearn values from the original data set.


0 6.984127 1 6.238137 2 8.288136 3 5.817352 4 6.281853 dtype: float64

It looks like the sklearn values are (in some cases) calculated values derived from the original. It makes sense that they changed some of the things (total number of rooms only makes sense if there is the same number of households in each district, for instance), but it would have been better if they documented the changes they made and why they changed it.

Inspecting the Data

If you look at the total_bedrooms count you'll see that it only has 20,433 non-null values, while the rest of the columns have 20,640 values. These were removed to allow experimenting with missing data. The original dataset that was collected for the census had all the values.

Column Has Missing Values
longitude False
latitude False
housing_median_age False
total_rooms False
total_bedrooms True
population False
households False
median_income False
median_house_value False
ocean_proximity False

It looks like total_bedrooms is the only column where there's missing data.

Rows Columns
20640 10

I'll print the median for each column except the last (since it's non-numeric).

longitude latitude housing_median_age total_rooms
-118.49 34.26 29.00 2127.00
total_bedrooms population households median_income median_house_value
435.00 1166.00 409.00 3.53 179700.00

Here's the description for the ocean_proximity variable Looking at the median_income you can see that it isn't income in dollars.

Statistic Value
count 20640
unique 5
top <1H OCEAN
freq 9136

It looks like the most common house location is less than an hour from the ocean.



Which makes up about forty-four percent of all the houses. Here are all the ocean_proximity values.


Proximity Count Percentage
<1H OCEAN 9136 44.2636
INLAND 6551 31.7393
NEAR OCEAN 2658 12.8779
NEAR BAY 2290 11.095
ISLAND 5 0.0242248


If you look at the median income plot you can see that it goes from 0 to 15. It turns out that the incomes were re-scaled and limited to the 0.5 to 15 range. The median age and value were also capped, possibly affecting our price predictions.


  • Géron, Aurélien. Hands-on Machine Learning with Scikit-Learn and TensorFlow: Concepts, Tools, and Techniques to Build Intelligent Systems. First edition. Beijing Boston Farnham: O’Reilly, 2017.