To find the ‘best’ model I ran the fit_model function 1,000 times and took the best_params_ (max-depth) and best_score_ (negative MSE) for each trial.
Max-Depth | Count |
---|---|
4 | 315 |
5 | 190 |
7 | 166 |
6 | 136 |
8 | 111 |
9 | 82 |
Max-Depth | Median Score |
---|---|
4 | -34.44 |
5 | -32.54 |
6 | -32.55 |
7 | -32.83 |
8 | -32.80 |
9 | -32.94 |
10 | -33.54 |
Max-Depth | Max Score |
---|---|
4 | -34.35 |
5 | -30.46 |
6 | -30.67 |
7 | -30.88 |
8 | -30.93 |
9 | -30.79 |
10 | -31.32 |
Note
Since the GridSearchCV normally tries to maximize the output of the scoring-function, but the goal in this case was to minimize it, the scores are negations of the MSE, thus the higher the score, the lower the MSE.
While a max-depth of 4 was the most common best-parameter, the max-depth of 5 was the median max-depth, had the highest median score, and had the highest overall score, so I will say that the optimal max_depth parameter is 5. This is in line with what I had guessed, based on the Complexity Performance plot.
Using the model that had the lowest MSE (30.46) out of the 1,000 generated, I then made a prediction for the price of the client’s house.
Predicted value of client’s home | $20,967.76 |
Difference between median and predicted | $232.24 |
My three chosen features (lower_status, nitric_oxide, and rooms) seemed to indicate that the client’s house might be a lower-valued house, and the predicted value was about $232 less than the median median-value, so it appears that our model predicts that the client has a below-median-value house.
Although this isn’t an inferential analysis, I’ll calculate the 95% Confidence Interval for the median-value so that I’ll have a range to compare the prediction to. Since the data isn’t symmetric I’ll use a bootstrapped confidence interval (bias-corrected and accelerated (BCA))of the median instead of one based on the standard error.
95% CI [20.40, 21.75]
Our prediction for the client’s house falls within a 95% confidence interval for the median, so although I predicted that it would be below the median, there’s insufficient evidence to conclude that it differs from the median house price.
I think that this model seems reasonable for the given data (Boston Suburbs in 1970), but I think that I might be hesitant to predict the value for a specific house using it, given that we are using aggregate-values for entire suburbs, not values for individual houses. I would also think that separating out the upper-class houses would give a better model for certain clients, given the right-skew of the data. Also, the median MSE for the best model was ~32 so taking the square root of this gives an ‘average’ error of about $5,700, which seems fairly high, given the low median-values for the houses. I think that the model gives a useful ball-park-figure estimate, but I think I’d have to qualify the certainty of prediction for future clients, noting also the age of the data and not extrapolating much beyond 1970.