Evaluating Model Performance

Here I’ll discuss how the data was split for training and testing, the performance metric I chose, the algorithm used for modeling, and how the model’s hyper-parameters were chosen.

Splitting the Data

First, a function named shuffle_split_data was created as a thin wrapper around sklearn’s train_test_split function. The only difference is the ordering of the returned datasets: instead of both X sets followed by both y sets, it returns both training sets followed by both testing sets. A 70% training / 30% testing split was used.
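
A minimal sketch of what such a wrapper might look like (the function name shuffle_split_data and the return order come from the description above; the exact body and the random_state value are assumptions):

```python
from sklearn.model_selection import train_test_split

def shuffle_split_data(X, y):
    """Wrap train_test_split, reordering its output from
    (X_train, X_test, y_train, y_test) to
    (X_train, y_train, X_test, y_test)."""
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.30, random_state=42)  # 70% train / 30% test
    return X_train, y_train, X_test, y_test
```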

We split the data into training and testing subsets so that we can assess the model on data it was not trained on. Evaluating on held-out data reduces the likelihood of overfitting to the training data and gives a better indication of how well the model will generalize to unseen data.

Choosing a Performance Metric

There are several possible regression metrics, and I chose Mean Squared Error (MSE) as the most appropriate performance metric for predicting housing prices. Since we are predicting a numeric value, this is a regression problem, so metrics such as Mean Absolute Error, Median Absolute Error, Explained Variance Score, or r2_score could also be used. However, I wanted a metric based directly on the model’s errors, and because MSE squares each error it penalizes larger errors more heavily, which I felt made it preferable.

The Mean Squared Error is an average of the squared differences between predicted values and the actual values.

MSE(y, \hat{y}) = \frac{1}{n}\sum_{i=0}^{n-1} (y_i - \hat{y}_i)^2
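
For illustration, the metric can be computed either directly from the formula or with sklearn’s mean_squared_error; the toy values below are made up, not results from the model:

```python
import numpy as np
from sklearn.metrics import mean_squared_error

y_true = np.array([24.0, 21.6, 34.7])  # actual values (hypothetical)
y_pred = np.array([25.1, 20.9, 33.0])  # predicted values (hypothetical)

# Direct application of the formula above
mse_manual = np.mean((y_true - y_pred) ** 2)

# sklearn's built-in implementation gives the same result
mse_sklearn = mean_squared_error(y_true, y_pred)

print(mse_manual, mse_sklearn)  # both = 1.53
```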

DecisionTreeRegressor

The model was built using sklearn’s DecisionTreeRegressor, a non-parametric, tree-based learner based on the Classification and Regression Trees (CART) algorithm.
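
As a rough sketch of how the pieces fit together, here is a DecisionTreeRegressor fit and scored with the chosen metric. The synthetic data, max_depth value, and random_state are placeholders for illustration, not the actual dataset or tuned hyper-parameters:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error

# Synthetic stand-in data for the housing features and prices
rng = np.random.RandomState(0)
X = rng.rand(200, 3)
y = X @ np.array([3.0, -2.0, 1.0]) + rng.randn(200) * 0.1

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=0)  # same 70/30 split as above

# max_depth=4 is an arbitrary placeholder, not the tuned value
regressor = DecisionTreeRegressor(max_depth=4, random_state=0)
regressor.fit(X_train, y_train)

# Evaluate on the held-out test set using MSE
print(mean_squared_error(y_test, regressor.predict(X_test)))
```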