6.1 Boston Housing Sol


Model Answers for Chapter 6: Multiple Linear Regression

Problem 6.1 (5.1 in 1st edition)

Answer to 6.1.a:
1. The data should be partitioned into training and validation sets because we need two sets of data: one to build the model that describes the relationship between the predictor variables and the predicted variable, and another to validate the model's predictive accuracy.
2. The training data set is used to build the model. The algorithm discovers the model using this data set.
3. The validation data set is used to validate the model. The model (built using the training data set) is used to make predictions for the validation data, which were not used to fit the model. In this way we get an unbiased estimate of how well the model performs. We compute measures of error, which reflect the prediction accuracy.

Refer to the Data_Partition1 Excel sheet in 6.1_Boston_Housing.
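The partitioning step described above can be sketched in a few lines of Python. This is only an illustration, not the XLMiner procedure: the 60%/40% split, the seed, and the function name `partition` are assumptions, and the list `records` is a stand-in for the 506 Boston Housing rows.

```python
import random

def partition(rows, train_fraction=0.6, seed=12345):
    """Randomly split rows into a training set and a validation set."""
    indices = list(range(len(rows)))
    # A seeded generator makes the random split reproducible.
    random.Random(seed).shuffle(indices)
    cut = int(round(train_fraction * len(rows)))
    train = [rows[i] for i in indices[:cut]]
    valid = [rows[i] for i in indices[cut:]]
    return train, valid

# Hypothetical stand-in for the 506 records (assumed 60/40 split).
records = list(range(506))
train, valid = partition(records)
print(len(train), len(valid))
```

Every record lands in exactly one of the two sets, so the validation records play no part in fitting the model.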

Refer to MLR_Output1 excel sheet in 6.1_Boston_Housing.

Answer to 6.1.b: Output of XLMiner: The Regression Model

Input variables   Coefficient    Std. Error    p-value       SS
Constant term     -23.6071014    3.41045761    0             159255.8125
CRIM              -0.2611129     0.04066138    0             3756.925537
CHAS              2.88669062     1.46451461    0.04963245    767.8793335
RM                7.50815392     0.53549951    0             7997.099121

Regression equation is MEDV = -23.6071014 +(-0.2611129 * CRIM) +(2.88669062 * CHAS) +(7.50815392 * RM)
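As a rough check on the output, each coefficient can be divided by its standard error to obtain a t statistic. The sketch below uses only the coefficients and standard errors reported above; the dictionary names are illustrative, and the p-values themselves are not recomputed here because that would require the residual degrees of freedom, which the output does not list.

```python
# Coefficients and standard errors as reported in the XLMiner output.
coefficients = {
    "Constant term": -23.6071014,
    "CRIM": -0.2611129,
    "CHAS": 2.88669062,
    "RM": 7.50815392,
}
std_errors = {
    "Constant term": 3.41045761,
    "CRIM": 0.04066138,
    "CHAS": 1.46451461,
    "RM": 0.53549951,
}

# t statistic = coefficient / standard error
t_stats = {name: coefficients[name] / std_errors[name] for name in coefficients}

for name, t in t_stats.items():
    print(f"{name:15s} t = {t:8.3f}")
```

CHAS has by far the smallest t statistic in absolute value (about 1.97), which agrees with it being the only predictor whose reported p-value is not 0.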

[Correction in book: average number of rooms = 6. Drop the question "What is the prediction error?"]

Answer to 6.1.c: Regression equation is MEDV = -23.6071014 + (-0.2611129 * CRIM) + (2.88669062 * CHAS) + (7.50815392 * RM)

For CRIM = 0.1, CHAS = 0 and RM = 6:

MEDV = -23.6071014 + (-0.2611129 * 0.1) + (2.88669062 * 0) + (7.50815392 * 6)
MEDV = 21.41571

MEDV is recorded in thousands of dollars, so the predicted median house price is $21,415.71.
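The arithmetic in 6.1.c can be reproduced with a short script. The coefficients are those reported in 6.1.b; the function name `predict_medv` is purely illustrative.

```python
# Coefficients from the fitted regression model in 6.1.b.
INTERCEPT = -23.6071014
COEF_CRIM = -0.2611129
COEF_CHAS = 2.88669062
COEF_RM = 7.50815392

def predict_medv(crim, chas, rm):
    """Predicted MEDV (in $1000s) from the three-predictor model."""
    return INTERCEPT + COEF_CRIM * crim + COEF_CHAS * chas + COEF_RM * rm

# Tract with crime rate 0.1, not bounding the Charles River, 6 rooms on average.
medv = predict_medv(crim=0.1, chas=0, rm=6)
print(f"MEDV = {medv:.5f}")                     # MEDV = 21.41571
print(f"Median price = ${medv * 1000:,.2f}")    # Median price = $21,415.71
```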

Answer to 6.1.d.i: Several variables measure the level of industrialization and are expected to be positively correlated with one another: INDUS (proportion of non-retail business acres per town), NOX (nitric oxides concentration, a pollutant), and TAX (tax rate). Areas that have a high proportion of non-retail businesses tend to have higher taxes and more pollution.

Answer to 6.1.d.ii: Refer to the CorrelationTable sheet in 6.1_Boston_Housing.xls. Highly correlated pairs are as follows:
1) NOX and INDUS: Correlation coefficient = 0.76365
2) TAX and INDUS: Correlation coefficient = 0.72076
3) AGE and NOX: Correlation coefficient = 0.73147
4) DIS and NOX: Correlation coefficient = -0.76923
5) DIS and AGE: Correlation coefficient = -0.74788
6) TAX and RAD: Correlation coefficient = 0.91022

According to the correlation table, some variables add little information beyond what is already carried by the variables they are highly correlated with, so we might be able to remove them. We might remove INDUS, AGE and TAX.
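A scan for highly correlated pairs like the one above can be sketched as follows. The cutoff of 0.7 in absolute value is an assumption (the answer does not state which threshold was used), the function names are illustrative, and the three short columns are made-up toy values, not the Boston Housing data.

```python
from itertools import combinations
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def highly_correlated_pairs(columns, threshold=0.7):
    """Return (name_a, name_b, r) for every pair with |r| >= threshold."""
    pairs = []
    for a, b in combinations(columns, 2):
        r = pearson(columns[a], columns[b])
        if abs(r) >= threshold:
            pairs.append((a, b, r))
    return pairs

# Toy values for illustration only (NOT the real data set).
toy = {
    "NOX": [0.40, 0.45, 0.50, 0.60, 0.70],
    "INDUS": [2.0, 4.0, 7.0, 12.0, 18.0],
    "CHAS": [0, 1, 0, 1, 0],
}
pairs = highly_correlated_pairs(toy)
for a, b, r in pairs:
    print(f"{a} and {b}: r = {r:.3f}")
```

Using the absolute value matters here: strongly negative pairs such as DIS and NOX are just as redundant as strongly positive ones.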

Answer to 6.1.d.iii: Refer to the MLR_Output2 sheet in 6.1_Boston_Housing.xls for the exhaustive search. To find the top three models, the criteria used are as follows:
a) Find the 3 highest values of adjusted R-squared.
b) Find the 3 values of Cp such that the Cp value is near # coefficients = # variables + 1.

Summary of top three models to choose the best model:
Model 1 is designed using the following 8 variables: CRIM, ZN, NOX, RM, DIS, PTRATIO, B, LSTAT.
Model 2 is designed using the following 9 variables: CRIM, ZN, NOX, RM, DIS, RAD, PTRATIO, B, LSTAT.
Model 3 is designed using the following 10 variables: CRIM, ZN, CHAS, NOX, RM, DIS, RAD, PTRATIO, B, LSTAT.

Note that as we add more variables, the error goes down (but the lower error comes at the cost of a more complex model).
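The two selection criteria can be written out explicitly. The sketch below assumes the standard textbook definitions of adjusted R-squared and Mallows' Cp; whether XLMiner computes them in exactly this form is not confirmed by the output, and the numbers in the example are hypothetical, chosen only to show that the full model always has Cp equal to its number of coefficients.

```python
def adjusted_r2(r2, n, p):
    """Adjusted R-squared for a model with p predictors fitted on n records."""
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)

def mallows_cp(sse_subset, sigma2_full, n, p):
    """Mallows' Cp for a subset model with p predictors (p + 1 coefficients).

    sigma2_full is the residual variance estimate from the full model.
    """
    return sse_subset / sigma2_full - n + 2 * (p + 1)

# Hypothetical numbers for illustration only.
n = 304              # assumed number of training records
p_full = 13          # all 13 predictors
sigma2_full = 20.0   # assumed residual variance of the full model
sse_full = sigma2_full * (n - p_full - 1)

print(mallows_cp(sse_full, sigma2_full, n, p_full))  # 14.0, i.e. p_full + 1
print(round(adjusted_r2(0.75, n, 8), 5))
```

A subset model whose Cp is close to p + 1 shows little evidence of bias from the omitted predictors, which is why criterion (b) compares Cp with the number of coefficients.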

                               Model I        Model II       Model III
Total sum of squared errors    4441.41523     4384.03216     4240.39099
RMS Error                      4.68905        4.65866177     4.5817065
Average Error                  -0.3508979     -0.2195216     -0.223719429
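The three error measures reported for each model are related in a simple way, sketched below. The residual is taken as actual minus predicted, which is an assumption about the sign convention; the function name and the four toy records are illustrative only. The last lines back out the number of validation records implied by the reported totals, using RMS error = sqrt(SSE / n).

```python
from math import sqrt

def error_summary(actual, predicted):
    """Total SSE, RMS error and average error (residual = actual - predicted)."""
    residuals = [a - p for a, p in zip(actual, predicted)]
    n = len(residuals)
    sse = sum(e * e for e in residuals)
    return {
        "total_sse": sse,
        "rms_error": sqrt(sse / n),
        "average_error": sum(residuals) / n,
    }

# Toy records for illustration only (not the validation data).
summary = error_summary([24.0, 21.6, 34.7, 33.4], [25.0, 20.6, 33.7, 35.4])
print(summary)

# Reported (total SSE, RMS error) per model; n = SSE / RMSE^2.
reported = {
    "Model I": (4441.41523, 4.68905),
    "Model II": (4384.03216, 4.65866177),
    "Model III": (4240.39099, 4.5817065),
}
implied_n = {m: sse / rmse ** 2 for m, (sse, rmse) in reported.items()}
for model, n in implied_n.items():
    print(model, round(n))
```

All three models give the same implied record count, as expected when they are scored on the same validation set.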

[Figure: Lift charts (validation dataset) for Models I, II and III. Each chart plots cumulative MEDV when sorted using predicted values against # cases, together with cumulative MEDV using average.]

From the above table we see that Model III has the smallest values for total sum of squared errors and RMS error. Model II has the smallest average error in absolute value.

Decile-wise lift for Models I, II and III (Ratio = decile mean / global mean):

          Model I                   Model II                  Model III
Decile    Ratio       Cumulative    Ratio       Cumulative    Ratio       Cumulative
1         1.906763    1.90676253    1.86835     1.86834999    1.955631    1.95563057
2         1.416264    3.32302638    1.45604     3.32439013    1.36876     3.32439013
3         1.083507    4.40653299    1.086007    4.41039698    1.086007    4.41039698
4         1.013273    5.41980601    1.046685    5.4570821     1.051913    5.46230984
5         0.957359    6.37716491    0.93713     6.3942119     0.941221    6.40353092
6         0.927356    7.30452111    0.909173    7.30338464    0.897808    7.301339
7         0.835984    8.14050545    0.804618    8.10800252    0.82462     8.12595868
8         0.737112    8.87761725    0.803027    8.91102935    0.781434    8.90739265
9         0.694835    9.57245252    0.654832    9.56586101    0.658468    9.56586101
10        0.479589    10.0520411    0.485581    10.0514419    0.485581    10.0514419

In the top decile, the ratio of decile mean to global mean is greatest for Model III (about 1.96, compared with 1.91 for Model I and 1.87 for Model II). In summary, Model III with the variables CRIM, ZN, CHAS, NOX, RM, DIS, RAD, PTRATIO, B, LSTAT is the best model for predicting Boston Housing prices.
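The decile ratios compared above can be computed along the following lines. This is a sketch under stated assumptions, not XLMiner's exact routine: records are sorted by predicted value in descending order and cut into ten bins of (nearly) equal size, and the toy data simply use a perfect ranking of the values 1 to 20.

```python
from itertools import accumulate

def decile_lift(actual, predicted, n_bins=10):
    """Ratio of each bin's mean actual value to the global mean.

    Records are ranked by predicted value, highest first.
    Assumes there are at least n_bins records.
    """
    order = sorted(range(len(actual)), key=lambda i: predicted[i], reverse=True)
    n = len(order)
    global_mean = sum(actual) / n
    ratios = []
    for k in range(n_bins):
        lo, hi = k * n // n_bins, (k + 1) * n // n_bins
        chunk = [actual[i] for i in order[lo:hi]]
        ratios.append((sum(chunk) / len(chunk)) / global_mean)
    return ratios

# Toy data for illustration only: predictions rank the records perfectly.
actual = [float(v) for v in range(1, 21)]
predicted = list(actual)
ratios = decile_lift(actual, predicted)
cumulative = list(accumulate(ratios))

for decile, (r, c) in enumerate(zip(ratios, cumulative), start=1):
    print(f"{decile:2d}  {r:.4f}  {c:.4f}")
```

A model that ranks records well shows ratios well above 1 in the first deciles and well below 1 in the last ones, which is the pattern seen for all three models.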
