6.1 Boston Housing Sol
1 in 1st edition)
Answer to 6.1.a:
1. The data should be partitioned into training and validation sets because we need two sets of data: one to build the model that describes the relationship between the predictor variables and the predicted variable, and another to validate the model's predictive accuracy.
2. The training data set is used to build the model. The algorithm discovers the model using this data set.
3. The validation data set is used to validate the model. The model (built using the training data set) is used to make predictions for the validation data, which were not used to fit the model. This gives an unbiased estimate of how well the model performs. We compute measures of error, which reflect the prediction accuracy.
Refer to the Data_Partition1 Excel sheet in 6.1_Boston_Housing.
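The partitioning step described above can be sketched in a few lines of Python. This is only an illustration, not the procedure used to create the Data_Partition1 sheet: the 60/40 split, the seed value and the function name partition are assumptions. The only dataset fact used is that Boston Housing has 506 records.

```python
import random

def partition(records, train_fraction=0.6, seed=12345):
    """Randomly split records into (training, validation) lists.

    train_fraction and seed are illustrative assumptions, not values
    taken from the solution workbook.
    """
    rng = random.Random(seed)      # fixed seed so the split is reproducible
    shuffled = list(records)       # copy, so the caller's data is untouched
    rng.shuffle(shuffled)
    n_train = round(len(shuffled) * train_fraction)
    return shuffled[:n_train], shuffled[n_train:]

# The Boston Housing data has 506 records; row ids stand in for the rows.
row_ids = list(range(506))
train, valid = partition(row_ids)
print(len(train), len(valid))      # prints: 304 202
```

Every record lands in exactly one of the two sets, so the validation records play no part in fitting the model.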
Regression equation is MEDV = -23.6071014 - 0.2611129 * CRIM + 2.88669062 * CHAS + 7.50815392 * RM
[Correction in book: average number of rooms = 6. Drop the question "What is the prediction error?"]
Answer to 6.1.c: Regression equation is MEDV = -23.6071014 - 0.2611129 * CRIM + 2.88669062 * CHAS + 7.50815392 * RM
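To turn the equation into a prediction, plug in the predictor values for the tract. The sketch below does this in Python. The source only states that the average number of rooms is 6. The crime rate of 0.1 and CHAS = 0 (tract not bounding the Charles River) are assumed inputs for illustration, and predict_medv is a hypothetical helper name.

```python
# Coefficients copied from the regression equation above.
INTERCEPT = -23.6071014
COEFFICIENTS = {"CRIM": -0.2611129, "CHAS": 2.88669062, "RM": 7.50815392}

def predict_medv(crim, chas, rm):
    """Evaluate the fitted equation for one tract (MEDV is in $1000s)."""
    return (INTERCEPT
            + COEFFICIENTS["CRIM"] * crim
            + COEFFICIENTS["CHAS"] * chas
            + COEFFICIENTS["RM"] * rm)

# Assumed inputs: CRIM = 0.1 and CHAS = 0 are illustrative; RM = 6 is
# the corrected value given in the note above.
tract_prediction = predict_medv(crim=0.1, chas=0, rm=6)
print(round(tract_prediction, 4))   # prints: 21.4157
```

Under these assumed inputs the predicted median value is about 21.42, that is, roughly $21,400.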
Answer to 6.1.d: Several variables measure the level of industrialization and are expected to be positively correlated with one another: INDUS (proportion of non-retail business acres per town), NOX (nitric oxides concentration, a pollutant) and TAX (tax rate). Areas that have a high proportion of non-retail businesses tend to have higher taxes and more pollution.
Answer to 6.1.d.ii: Refer to the CorrelationTable sheet in 6.1_Boston_Housing.xls. Highly correlated pairs are as follows:
1) NOX and INDUS: Correlation coefficient = 0.76365
2) TAX and INDUS: Correlation coefficient = 0.72076
3) AGE and NOX: Correlation coefficient = 0.73147
4) DIS and NOX: Correlation coefficient = -0.76923
5) DIS and AGE: Correlation coefficient = -0.74788
6) TAX and RAD: Correlation coefficient = 0.91022
According to the correlation table, some variables add little information beyond the variables they are highly correlated with, so we might be able to remove them. We might remove INDUS, AGE and TAX.
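Scanning a correlation table for highly correlated pairs can also be done programmatically. The following is a minimal sketch in plain Python, not the procedure behind the CorrelationTable sheet. The 0.7 cutoff, the function names and the toy columns X1, X2, X3 are all assumptions made for illustration. The toy values are not Boston Housing data.

```python
from itertools import combinations
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def highly_correlated_pairs(columns, threshold=0.7):
    """Return (name_a, name_b, r) for every pair with |r| >= threshold,
    strongest correlation first. The threshold is an assumed cutoff."""
    pairs = []
    for (name_a, a), (name_b, b) in combinations(columns.items(), 2):
        r = pearson(a, b)
        if abs(r) >= threshold:
            pairs.append((name_a, name_b, r))
    return sorted(pairs, key=lambda pair: -abs(pair[2]))

# Toy columns (hypothetical values, NOT taken from the Boston data).
toy_columns = {
    "X1": [1.0, 2.0, 3.0, 4.0, 5.0],
    "X2": [2.1, 3.9, 6.2, 8.1, 9.8],   # nearly linear in X1
    "X3": [5.0, 1.0, 4.0, 2.0, 3.0],   # only weakly related to X1 and X2
}
flagged = highly_correlated_pairs(toy_columns)
for name_a, name_b, r in flagged:
    print(f"{name_a} and {name_b}: r = {r:.5f}")
```

Using the absolute value of r means that strong negative correlations, such as the DIS pairs listed above, are flagged as well.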
Answer to 6.1.d.iii: Refer to the MLR_Output2 sheet in 6.1_Boston_Housing.xls for exhaustive search. To find the top three models, the criteria used are as follows:
a) Find the 3 highest values of adjusted R-squared.
b) Find the 3 values of Cp such that the Cp value is close to # coefficients = # variables + 1.
Summary of the top three models from which to choose the best model:
Model 1 is designed using the following 8 variables: CRIM, ZN, NOX, RM, DIS, PTRATIO, B, LSTAT.
Model 2 is designed using the following 9 variables: CRIM, ZN, NOX, RM, DIS, RAD, PTRATIO, B, LSTAT.
Model 3 is designed using the following 10 variables: CRIM, ZN, CHAS, NOX, RM, DIS, RAD, PTRATIO, B, LSTAT.
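For reference, the two selection criteria can be written as small functions. This sketch uses the standard textbook definitions of adjusted R-squared and Mallows' Cp. It is not the exhaustive-search code behind the MLR_Output2 sheet, and the function names and the numbers in the example are hypothetical.

```python
def adjusted_r2(r2, n, p):
    """Adjusted R-squared for a model with p predictors fitted to n records."""
    return 1.0 - (1.0 - r2) * (n - 1) / (n - p - 1)

def mallows_cp(sse_subset, mse_full, n, p):
    """Mallows' Cp for a subset model with p predictors.

    sse_subset: sum of squared errors of the subset model
    mse_full:   SSE of the full model divided by its residual degrees
                of freedom (the estimate of the error variance)
    A good subset model has Cp close to p + 1, the number of coefficients.
    """
    return sse_subset / mse_full - n + 2 * (p + 1)

# Hypothetical numbers, for illustration only (not from the workbook).
n_records = 304          # assumed training-set size
k_full = 13              # predictors in the full model
sse_full = 6000.0        # made-up SSE of the full model
mse_full = sse_full / (n_records - k_full - 1)

# Sanity check: for the full model itself, Cp equals k_full + 1.
cp_full = mallows_cp(sse_full, mse_full, n_records, k_full)
print(round(cp_full, 6))
```

Criterion b) compares each subset's Cp with its own number of coefficients, so an 8-variable model is judged against 9, a 9-variable model against 10, and so on.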
Note that as we add more variables, the error goes down (but the lower error comes at the cost of a more complex model).
                              Model I       Model II      Model III
Total sum of squared errors   4441.41523    4384.03216    4240.39099
RMS error                     4.68905       4.65866177    4.5817065
Average error                 -0.3508979    -0.2195216    -0.223719429
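The three error measures are linked by simple formulas, sketched below in Python. Two things here are assumptions rather than statements from the source: the validation-set size of 202 records (inferred because it reproduces the reported RMS errors from the reported sums of squared errors) and the residual sign convention, actual minus predicted. The function name validation_errors is hypothetical.

```python
from math import sqrt

def validation_errors(actual, predicted):
    """Total SSE, RMS error and average error for one model.

    Residuals are taken as actual - predicted (an assumed convention).
    """
    residuals = [a - p for a, p in zip(actual, predicted)]
    n = len(residuals)
    sse = sum(e * e for e in residuals)
    return {"sse": sse, "rmse": sqrt(sse / n), "avg_error": sum(residuals) / n}

# Consistency check on the reported figures.
N_VALIDATION = 202   # assumption: inferred from the reported SSE and RMSE
reported_sse = {
    "Model I": 4441.41523,
    "Model II": 4384.03216,
    "Model III": 4240.39099,
}
implied_rmse = {name: sqrt(sse / N_VALIDATION)
                for name, sse in reported_sse.items()}
for name, rmse in implied_rmse.items():
    print(f"{name}: implied RMS error = {rmse:.5f}")
```

Because RMS error is a monotone function of total SSE for a fixed validation set, the two measures always rank the models the same way. Average error can rank them differently, since positive and negative residuals cancel.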
[Figures: Lift charts (validation dataset). x-axis: # cases; y-axis: Cumulative MEDV. Series: Cumulative MEDV when sorted using predicted values; Cumulative MEDV using average.]
From the above table we see that Model III has the smallest values for total sum of squared errors and RMSE. Model II has the smallest value for average error.
[Figures: Decile-wise lift charts for Model I, Model II and Model III. x-axis: Deciles; y-axis: Decile mean / Global mean.]
Decile mean / Global mean, with cumulative values

          Model I                 Model II                Model III
Decile    Lift       Cumulative   Lift       Cumulative   Lift       Cumulative
1         1.906763   1.90676253   1.86835    1.86834999   1.955631   1.95563057
2         1.416264   3.32302638   1.45604    3.32439013   1.36876    3.32439013
3         1.083507   4.40653299   1.086007   4.41039698   1.086007   4.41039698
4         1.013273   5.41980601   1.046685   5.4570821    1.051913   5.46230984
5         0.957359   6.37716491   0.93713    6.3942119    0.941221   6.40353092
6         0.927356   7.30452111   0.909173   7.30338464   0.897808   7.301339
7         0.835984   8.14050545   0.804618   8.10800252   0.82462    8.12595868
8         0.737112   8.87761725   0.803027   8.91102935   0.781434   8.90739265
9         0.694835   9.57245252   0.654832   9.56586101   0.658468   9.56586101
10        0.479589   10.0520411   0.485581   10.0514419   0.485581   10.0514419
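A decile-wise lift of this kind can be computed as sketched below: rank the validation cases by predicted value, cut them into ten groups, and divide each group's mean actual MEDV by the global mean. This is an illustrative reading of "Decile mean / Global mean", not the code that generated the charts. The function names are hypothetical. The Model I lift values are copied from the decile figures above and used only to check the cumulative column.

```python
def decile_lift(actual, predicted, n_bins=10):
    """Mean actual value per bin (ranked by prediction, highest first)
    divided by the global mean. Needs at least n_bins cases."""
    n = len(actual)
    order = sorted(range(n), key=lambda i: predicted[i], reverse=True)
    global_mean = sum(actual) / n
    lifts = []
    for b in range(n_bins):
        start, end = b * n // n_bins, (b + 1) * n // n_bins
        group = [actual[i] for i in order[start:end]]
        lifts.append(sum(group) / len(group) / global_mean)
    return lifts

def cumulative(values):
    """Running totals of a sequence."""
    totals, running = [], 0.0
    for v in values:
        running += v
        totals.append(running)
    return totals

# Toy check (hypothetical data): a perfect ranking of the values 1..20.
toy_actual = [float(v) for v in range(1, 21)]
toy_lifts = decile_lift(toy_actual, toy_actual)

# Model I lift values as reported in the decile figures above.
MODEL_I_LIFT = [1.906763, 1.416264, 1.083507, 1.013273, 0.957359,
                0.927356, 0.835984, 0.737112, 0.694835, 0.479589]
model_i_cumulative = cumulative(MODEL_I_LIFT)
print(round(model_i_cumulative[-1], 4))
```

A model that ranks cases well has lift above 1 in the first deciles and below 1 in the last ones, which is the pattern all three models show.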
In the first decile, Model III has the highest decile mean relative to the global mean (1.956, compared with 1.907 for Model I and 1.868 for Model II). In summary, Model III, with the variables CRIM, ZN, CHAS, NOX, RM, DIS, RAD, PTRATIO, B, LSTAT, is the best model for predicting Boston Housing prices.