Model Prediction

To find the ‘best’ model, I ran the fit_model function 1,000 times and recorded the best_params_ (max-depth) and best_score_ (negative MSE) from each trial.

Parameter Counts

Max-Depth    Count
4            315
5            190
7            166
6            136
8            111
9            82

Median Scores

Max-Depth    Median Score (negative MSE)
4            -34.44
5            -32.54
6            -32.55
7            -32.83
8            -32.80
9            -32.94
10           -33.54

Max Scores

Max-Depth    Max Score (negative MSE)
4            -34.35
5            -30.46
6            -30.67
7            -30.88
8            -30.93
9            -30.79
10           -31.32
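
Summaries like the ones above can be produced from the per-trial results with a simple tally. The following is a minimal sketch, assuming each trial is stored as a (max_depth, score) pair. The `trials` list holds made-up values for illustration, not the actual 1,000 results, and `summarize` is a hypothetical helper, not part of the original analysis.

```python
from collections import Counter, defaultdict
from statistics import median

# Hypothetical trial results: (best max_depth, best negative-MSE score).
trials = [(4, -34.4), (5, -32.5), (5, -30.5), (6, -32.6), (4, -34.5), (5, -33.0)]


def summarize(trials):
    """Return the count, median score, and max score for each max_depth."""
    counts = Counter(depth for depth, _ in trials)
    scores = defaultdict(list)
    for depth, score in trials:
        scores[depth].append(score)
    medians = {depth: median(values) for depth, values in scores.items()}
    maxima = {depth: max(values) for depth, values in scores.items()}
    return counts, medians, maxima


counts, medians, maxima = summarize(trials)

# Print one row per max_depth, most frequent first.
for depth, count in counts.most_common():
    print(depth, count, medians[depth], maxima[depth])
```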

Note

Since GridSearchCV maximizes the output of the scoring function, but the goal in this case was to minimize the MSE, the scores are negated MSE values: the higher the score, the lower the MSE.
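
To make the sign convention concrete, here is a small sketch using the median scores reported above. It only shows that picking the highest negated score and picking the lowest MSE select the same max-depth; the variable names are my own.

```python
# Median scores (negative MSE) by max_depth, copied from the Median Scores table.
median_scores = {
    4: -34.44, 5: -32.54, 6: -32.55, 7: -32.83,
    8: -32.80, 9: -32.94, 10: -33.54,
}

# Negating a score recovers the MSE.
mse = {depth: -score for depth, score in median_scores.items()}

best_by_score = max(median_scores, key=median_scores.get)  # highest score
best_by_mse = min(mse, key=mse.get)                        # lowest MSE

print(best_by_score, best_by_mse, mse[best_by_mse])
```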

While a max-depth of 4 was the most common best-parameter, a max-depth of 5 was the median max-depth, had the highest median score, and had the highest overall score, so I take the optimal max_depth parameter to be 5. This is in line with what I had guessed, based on the Complexity Performance plot.
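
The claim about the median max-depth can be checked directly from the parameter counts. This is a minimal sketch that expands the reported counts into one entry per trial; the variable names are illustrative.

```python
from statistics import median

# How often each max_depth was the best parameter (from the Parameter Counts table).
counts = {4: 315, 5: 190, 7: 166, 6: 136, 8: 111, 9: 82}

# Expand the counts into one max_depth value per trial.
depths = [depth for depth, n in counts.items() for _ in range(n)]

total_trials = len(depths)                   # 1000
median_depth = median(depths)                # 5.0
most_common = max(counts, key=counts.get)    # 4

print(total_trials, median_depth, most_common)
```

With 315 trials at depth 4 and 190 at depth 5, the 500th and 501st sorted values both fall at depth 5, which is why the median differs from the most common value.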

Predicting the Client’s Price

Using the model that had the lowest MSE (30.46) out of the 1,000 generated, I then made a prediction for the price of the client’s house.

Predicted Price

Predicted value of client’s home           $20,967.76
Difference between median and predicted    $232.24

My three chosen features (lower_status, nitric_oxide, and rooms) seemed to indicate that the client’s house might be a lower-valued house, and the predicted value was about $232 less than the median of the median-values, so the model predicts that the client has a below-median-value house.

Confidence Interval

Although this isn’t an inferential analysis, I’ll calculate the 95% Confidence Interval for the median-value so that I’ll have a range to compare the prediction to. Since the data isn’t symmetric, I’ll use a bootstrapped confidence interval of the median (bias-corrected and accelerated, or BCa) instead of one based on the standard error.
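
To illustrate the bootstrap idea, here is a minimal sketch of a percentile bootstrap for the median. Note that this is the simpler percentile method, not the BCa variant used for the interval reported here, and the sample is synthetic right-skewed data standing in for the median-values. `bootstrap_median_ci` is a hypothetical helper. For BCa itself, SciPy's `scipy.stats.bootstrap` offers a `method="BCa"` option, though I make no claim about which tool produced the reported interval.

```python
import random
from statistics import median


def bootstrap_median_ci(values, n_resamples=2000, confidence=0.95, seed=0):
    """Percentile bootstrap confidence interval for the median (not BCa)."""
    rng = random.Random(seed)
    n = len(values)
    # Resample with replacement and record the median of each resample.
    medians = sorted(
        median(rng.choices(values, k=n)) for _ in range(n_resamples)
    )
    alpha = 1.0 - confidence
    lower_index = int((alpha / 2) * n_resamples)
    upper_index = int((1 - alpha / 2) * n_resamples) - 1
    return medians[lower_index], medians[upper_index]


# Synthetic right-skewed sample (an assumption, not the housing data).
data_rng = random.Random(1)
sample = [data_rng.lognormvariate(3.0, 0.4) for _ in range(200)]

low, high = bootstrap_median_ci(sample)
print(low, median(sample), high)
```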

95% CI (in thousands of dollars): [20.40, 21.75]

Our prediction for the client’s house falls within a 95% confidence interval for the median, so although I predicted that it would be below the median, there’s insufficient evidence to conclude that it differs from the median house price.
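
The comparison mixes units (the prices are in dollars, the interval in thousands of dollars), so a quick check may help. This sketch uses only the figures reported above; the median is inferred as the predicted price plus the reported difference.

```python
# Values from the report: prices in dollars, the CI in thousands of dollars.
predicted = 20967.76
difference = 232.24
ci_low, ci_high = 20.40, 21.75

# Median implied by the predicted price and the reported difference.
median_value = round(predicted + difference, 2)

# Convert the prediction to thousands before comparing it to the interval.
predicted_thousands = predicted / 1000

within_ci = ci_low <= predicted_thousands <= ci_high
below_median = predicted < median_value

print(median_value, within_ci, below_median)
```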

Assessing the Model

This model seems reasonable for the given data (Boston suburbs in 1970), but I would be hesitant to use it to predict the value of a specific house, given that the data are aggregate values for entire suburbs, not values for individual houses. Separating out the upper-class houses might also give a better model for certain clients, given the right-skew of the data. Also, the median MSE for the best model was ~32, and since the values are in thousands of dollars, taking the square root gives an ‘average’ error of about $5,700, which seems fairly high given the low median-values of the houses. The model gives a useful ball-park estimate, but I’d have to qualify the certainty of predictions for future clients, noting also the age of the data and not extrapolating much beyond 1970.
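
The error estimate can be reproduced with a short calculation. This is a sketch assuming the target values are in thousands of dollars (so the MSE is in squared thousands) and using the median of $21,200 implied by the predicted price and the reported difference.

```python
import math

median_mse = 32.0        # approximate median MSE ("~32"), in squared thousands of dollars
median_value = 21200.0   # implied median: predicted price plus the reported difference

# The root-mean-squared error is in thousands of dollars; convert to dollars.
rmse_dollars = math.sqrt(median_mse) * 1000

# Express the error as a fraction of the median value.
relative_error = rmse_dollars / median_value

print(round(rmse_dollars), round(relative_error, 2))
```

An error of roughly a quarter of the median value is what makes the estimate a ball-park figure rather than a precise valuation.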