Model Prediction

To find the ‘best’ model, I ran the fit_model function 1,000 times and recorded the best_params_ (max_depth) and best_score_ (negative MSE) from each trial.
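The bookkeeping behind the tables that follow can be sketched roughly as below. This is a minimal sketch, not the code used for this analysis: the `trials` list is made-up illustrative data standing in for the (best max-depth, best score) pairs collected from each run, and `summarize_trials` is a hypothetical helper name.

```python
from collections import Counter, defaultdict
from statistics import median


def summarize_trials(trials):
    """Summarize (max_depth, score) pairs collected from repeated fits.

    Returns three dicts keyed by max_depth: how often each depth was chosen,
    the median score for each depth, and the maximum score for each depth.
    """
    counts = Counter(depth for depth, _ in trials)
    scores_by_depth = defaultdict(list)
    for depth, score in trials:
        scores_by_depth[depth].append(score)
    medians = {d: median(s) for d, s in scores_by_depth.items()}
    maxima = {d: max(s) for d, s in scores_by_depth.items()}
    return dict(counts), medians, maxima


# Made-up example data: each pair is (best max_depth, best negative-MSE score).
trials = [(4, -34.5), (5, -32.6), (5, -30.9), (4, -34.4), (6, -32.5), (5, -33.0)]
counts, medians, maxima = summarize_trials(trials)
print(counts)
print(medians)
print(maxima)
```

With real results, each pair would come from one call to fit_model, and the three dictionaries correspond to the count, median-score, and max-score tables.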

Parameter Counts
Max-Depth	Count
4	315
5	190
7	166
6	136
8	111
9	82

Median Scores
Max-Depth	Median Score
4	-34.44
5	-32.54
6	-32.55
7	-32.83
8	-32.80
9	-32.94
10	-33.54

Max Scores
Max-Depth	Max Score
4	-34.35
5	-30.46
6	-30.67
7	-30.88
8	-30.93
9	-30.79
10	-31.32

Note

GridSearchCV tries to maximize the output of the scoring function, but the goal here was to minimize the MSE, so the scores are the negated MSE: the higher the score, the lower the MSE.

While a max-depth of 4 was most often the best parameter (315 of the 1,000 trials), a max-depth of 5 was the median best max-depth, had the highest median score (-32.54), and produced the highest score overall (-30.46), so I will take 5 as the optimal max_depth. This is in line with what I had guessed from the Complexity Performance plot.
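This choice can be checked directly from the three tables above. The following is a small sketch using the reported values; the variable names are my own, chosen for illustration, and only the numbers come from the tables.

```python
from statistics import median_low

# Values copied from the Parameter Counts, Median Scores, and Max Scores tables.
counts = {4: 315, 5: 190, 7: 166, 6: 136, 8: 111, 9: 82}
median_scores = {4: -34.44, 5: -32.54, 6: -32.55, 7: -32.83,
                 8: -32.80, 9: -32.94, 10: -33.54}
max_scores = {4: -34.35, 5: -30.46, 6: -30.67, 7: -30.88,
              8: -30.93, 9: -30.79, 10: -31.32}

# Expand the counts into one entry per trial to find the median best max-depth.
best_depths = [depth for depth, n in counts.items() for _ in range(n)]
median_depth = median_low(best_depths)

# Scores are negated MSEs, so the largest score is the best one.
most_common_depth = max(counts, key=counts.get)
best_median_depth = max(median_scores, key=median_scores.get)
best_overall_depth = max(max_scores, key=max_scores.get)

print(len(best_depths), most_common_depth, median_depth,
      best_median_depth, best_overall_depth)
```

The counts sum to 1,000, which also shows that a max-depth of 10 was never the best parameter in any trial.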

Predicting the Client’s Price

Using the model that had the lowest MSE (30.46) out of the 1,000 generated, I then made a prediction for the price of the client’s house.

Predicted Price
Predicted value of client’s home	$20,967.76
Difference between median and predicted	$232.24

My three chosen features (lower_status, nitric_oxide, and rooms) seemed to indicate that the client’s house might be a lower-valued house, and the predicted value was about $232 less than the median of the suburbs’ median values ($21,200), so the model predicts that the client has a slightly below-median-value house.

Confidence Interval

Although this isn’t an inferential analysis, I’ll calculate the 95% confidence interval for the median value so that I’ll have a range to compare the prediction to. Since the data isn’t symmetric, I’ll use a bootstrapped confidence interval (bias-corrected and accelerated, or BCa) of the median instead of one based on the standard error.

95% CI [20.40, 21.75] (in thousands of dollars)

Our prediction for the client’s house falls within a 95% confidence interval for the median, so although I predicted that it would be below the median, there’s insufficient evidence to conclude that it differs from the median house price.
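To illustrate the idea of a bootstrapped interval, here is a minimal sketch. It uses the simpler percentile bootstrap, not the BCa variant used for the interval above, and it runs on made-up values rather than the housing data; `percentile_bootstrap_ci` and its parameters are hypothetical names chosen for illustration.

```python
import random
from statistics import median


def percentile_bootstrap_ci(data, n_resamples=2000, confidence=0.95, seed=0):
    """Percentile-bootstrap confidence interval for the median of `data`."""
    rng = random.Random(seed)
    n = len(data)
    # Resample with replacement and record the median of each resample.
    medians = sorted(
        median(rng.choices(data, k=n)) for _ in range(n_resamples)
    )
    alpha = 1.0 - confidence
    lower_index = round((alpha / 2) * n_resamples)
    upper_index = round((1 - alpha / 2) * n_resamples) - 1
    return medians[lower_index], medians[upper_index]


# Made-up, right-skewed sample (in $1000s), standing in for the real data.
sample = [14.2, 16.5, 17.8, 18.9, 19.4, 20.1, 20.6, 21.2, 21.7, 22.0,
          22.9, 23.8, 24.7, 26.6, 29.1, 33.4, 36.2, 43.8, 50.0]
low, high = percentile_bootstrap_ci(sample)
print(f"95% CI for the median: [{low:.2f}, {high:.2f}]")

# The check made in the text: is the prediction inside the reported interval?
reported_low, reported_high = 20.40, 21.75  # the BCa interval reported above
prediction = 20.96776                        # $20,967.76 expressed in $1000s
print(reported_low <= prediction <= reported_high)
```

The BCa method additionally adjusts the percentile cut-offs for bias and skew in the bootstrap distribution, so its endpoints will generally differ somewhat from those of the percentile method.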

Assessing the Model

This model seems reasonable for the given data (Boston suburbs in 1970), but I would be hesitant to use it to predict the value of a specific house, since the data are aggregate values for entire suburbs, not values for individual houses. Given the right-skew of the data, separating out the upper-class houses might also give a better model for certain clients. In addition, the median MSE for the best max-depth was ~32, and taking the square root gives a typical (root-mean-squared) error of about $5,700, which seems fairly high given the low median values of the houses. The model gives a useful ballpark estimate, but I would have to qualify the certainty of its predictions for future clients, noting the age of the data and not extrapolating much beyond 1970.
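The error estimate above is simple arithmetic on the reported scores, sketched below. It assumes, as the prediction table implies, that prices are expressed in thousands of dollars and that the median value is $21,200 (the predicted value plus the reported difference).

```python
import math

median_score = -32.54   # median negative-MSE score at max_depth 5
mse = -median_score     # undo the sign flip to recover the MSE
rmse = math.sqrt(mse)   # in $1000s, the same units as the target

# Assumption: median value = predicted value + reported difference, in $1000s.
median_value = 20.96776 + 0.23224

print(f"RMSE: ${rmse * 1000:,.0f}")
print(f"RMSE relative to the median value: {rmse / median_value:.0%}")
```

Under these assumptions, the typical error is roughly a quarter of the median value, which is why I would treat the prediction as a ballpark figure.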