BAUDM Assignment Predicting Boston Housing Prices
By:
Suraj Dhende (14PGP015)
Saranya Panicker (14PGP108)
Bharadwaj Sista (14PGP118)
Ashis Tripathy (14PGP121)
Yadvendra Yadav (14PGP123)
Ans a:
The data should be partitioned into training and validation sets because we need two
sets of data: one to build the regression model and the other to test it.
The regression model, fitted on the training set, describes the relationship between the
dependent and the independent variables, whereas running that model on the validation set
determines the accuracy of the model that has been built on the training data set.
The validation data is used to validate or test the model. In this process, the model
(built using the training data set) is used to make predictions on the validation data,
that is, data that were not used to fit the model. In this way we get an unbiased estimate of
how well the model performs. We compute measures of error which reflect the
prediction accuracy.
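The partitioning idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the data are synthetic (not the Boston data), the model has a single predictor, and the helper names `fit_simple_ols` and `rmse` are hypothetical. The 60/40 split mirrors the "approximately 60% of the cases" used for training in the SPSS output.

```python
import random


def fit_simple_ols(xs, ys):
    """Return (intercept, slope) of the least-squares line for one predictor."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def rmse(xs, ys, intercept, slope):
    """Root mean squared prediction error of the fitted line on (xs, ys)."""
    squared = [(y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys)]
    return (sum(squared) / len(squared)) ** 0.5


rng = random.Random(0)

# Synthetic data (an assumption for illustration): y = 2 + 3x + noise
rows = []
for _ in range(200):
    x = rng.uniform(0, 10)
    rows.append((x, 2 + 3 * x + rng.gauss(0, 1)))

# Shuffle, then keep 60% for training and 40% for validation
rng.shuffle(rows)
cut = len(rows) * 60 // 100
train, valid = rows[:cut], rows[cut:]

# Fit on the training set only
intercept, slope = fit_simple_ols([x for x, _ in train], [y for _, y in train])

# Error on data the model has not seen estimates its predictive accuracy
train_rmse = rmse([x for x, _ in train], [y for _, y in train], intercept, slope)
valid_rmse = rmse([x for x, _ in valid], [y for _, y in valid], intercept, slope)
print(f"slope={slope:.3f} train RMSE={train_rmse:.3f} validation RMSE={valid_rmse:.3f}")
```

The validation RMSE is the figure to report, because the training RMSE is computed on the same cases that were used to choose the coefficients.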
Ans b: The equation is:
MEDV = -34.867 - 0.218*CRIM + 3.824*CHAS + 9.20*RM

Model Summary (approximately 60% of the cases, SAMPLE = 1, selected)
R       R Square   Adjusted R Square   Std. Error of the Estimate
.747    .557       .553                6.161

ANOVA
             Sum of Squares   df    Mean Square   F         Sig.
Regression   14051.438        3     4683.813      123.388   .000
Residual     11160.292        294   37.960
Total        25211.730        297
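The summary statistics for this model follow from the sums of squares and degrees of freedom in the output. The sketch below recomputes them as a consistency check. It uses only the reported figures; the regression degrees of freedom are taken as total minus residual, and the variable names are illustrative.

```python
import math

# Values reported in the SPSS output for the CRIM, CHAS, RM model
ss_regression = 14051.438
ss_residual = 11160.292
ss_total = 25211.730
df_residual = 294
df_total = 297
df_regression = df_total - df_residual  # equals the number of predictors

ms_regression = ss_regression / df_regression
ms_residual = ss_residual / df_residual

r_square = ss_regression / ss_total
adjusted_r_square = 1 - (1 - r_square) * df_total / df_residual
f_statistic = ms_regression / ms_residual
std_error = math.sqrt(ms_residual)

print(f"R Square          = {r_square:.3f}")
print(f"Adjusted R Square = {adjusted_r_square:.3f}")
print(f"F                 = {f_statistic:.3f}")
print(f"Std. Error        = {std_error:.3f}")
```

Each recomputed value agrees with the reported output to within rounding.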
Ans c:
The predicted median house price is -7.288, i.e. about -$7,288, since MEDV is measured in thousands of dollars.
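The prediction comes from substituting values into the equation from Ans b. The inputs are not stated in this answer, so the values used below (CRIM = 0.1, CHAS = 0, RM = 3) are assumptions, chosen only because they reproduce the reported figure to within rounding. The function name `predict_medv` is hypothetical.

```python
def predict_medv(crim, chas, rm):
    """Evaluate the fitted equation from Ans b; the result is in $1000s."""
    return -34.867 - 0.218 * crim + 3.824 * chas + 9.20 * rm


# Assumed inputs (not given in the text): a tract that does not bound the
# Charles River, with crime rate 0.1 and an average of 3 rooms per house
prediction = predict_medv(crim=0.1, chas=0, rm=3)
print(f"Predicted MEDV: {prediction:.4f} (thousand dollars)")
```

A negative price is not meaningful. It would indicate that these inputs lie outside the range where the linear fit is reliable.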
Ans d:
i. There are certain variables that measure the level of development and industrialization.
These variables are likely to be positively correlated. From the correlations we come to know
that INDUS, NOX and TAX are highly correlated. This is because areas that have a high
proportion of non-retail businesses tend to have higher taxes and more pollution.
INDUS indicates the proportion of non-retail business, while NOX indicates the nitric oxide
concentration.
ii. The highly correlated variables are as follows:
1) NOX and INDUS: Correlation coefficient = 0.764
2) TAX and INDUS: Correlation coefficient = 0.688
3) AGE and NOX: Correlation coefficient = 0.724
4) DIS and NOX: Correlation coefficient = -0.765
5) DIS and AGE: Correlation coefficient = -0.745
6) TAX and RAD: Correlation coefficient = 0.891
The variables INDUS, TAX and NOX denote the same thing, namely development and
urbanization. So we can drop the redundant ones, keeping one at a time, to find the best-fit model.
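The screening of correlated pairs listed above can be sketched as follows. The coefficients are the ones reported in the list; the cutoff of 0.7 is an assumption for illustration, not a value taken from the assignment.

```python
from collections import Counter

# Correlation coefficients as reported in the list above
reported = {
    ("NOX", "INDUS"): 0.764,
    ("TAX", "INDUS"): 0.688,
    ("AGE", "NOX"): 0.724,
    ("DIS", "NOX"): -0.765,
    ("DIS", "AGE"): -0.745,
    ("TAX", "RAD"): 0.891,
}

THRESHOLD = 0.7  # assumed cutoff for a "strong" correlation

# Rank pairs by absolute correlation, strongest first
ranked = sorted(reported.items(), key=lambda item: abs(item[1]), reverse=True)
strong = [(pair, r) for pair, r in ranked if abs(r) >= THRESHOLD]

# Count how often each variable appears in a strongly correlated pair
counts = Counter(var for pair, _ in strong for var in pair)

for (a, b), r in strong:
    print(f"{a:6s} {b:6s} r = {r:+.3f}")
print("Appearances in strong pairs:", dict(counts))
```

Variables that appear in several strong pairs are the natural candidates to drop or to keep one at a time when building the competing models.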
iii.
Model 1: We have chosen to keep NOX.

Variables Entered/Removed (a, b)
Model   Variables Entered                               Variables Removed   Method
1       LSTAT, B, PTRATIO, CRIM, ZN, RM, NOX, DIS (c)   .                   Enter
a. Dependent Variable: MEDV
b. Models are based only on cases for which Approximately 60% of the cases (SAMPLE) = 1
c. All requested variables entered.

Model Summary (approximately 60% of the cases, SAMPLE = 1, selected)
R      R Square   Adjusted R Square   Std. Error of the Estimate
.846   .716       .709                4.973
Change Statistics
R Square Change   F Change   df2   Sig. F Change
.716              91.286     289   .000
Model 2: We have chosen to keep INDUS.

Variables Entered/Removed (a, b)
Model   Variables Entered         Variables Removed   Method
1       INDUS, B, PTRATIO, ...    .                   Enter

Model Summary (approximately 60% of the cases, SAMPLE = 1, selected)
Model   R      R Square   Adjusted R Square   Std. Error of the Estimate
1       .837   .701       .693                5.104
Change Statistics
R Square Change   F Change   df2   Sig. F Change
.701              84.833     289   .000
Model 3: We have chosen to keep TAX.

Variables Entered/Removed (a, b)
Model   Variables Entered                               Variables Removed   Method
1       TAX, RM, B, ZN, PTRATIO, CRIM, DIS, LSTAT (c)   .                   Enter

Model Summary (approximately 60% of the cases, SAMPLE = 1, selected)
Model   R      R Square   Adjusted R Square   Std. Error of the Estimate
1       .837   .700       .692                5.114
Change Statistics
R Square Change   F Change   df2   Sig. F Change
.700              84.361     289   .000

We find that the adjusted R square value is highest for the following model:

Model Summary (b, c)
Model   R (SAMPLE = 1, Selected)   R (SAMPLE ~= 1, Unselected)   R Square      Adjusted R Square   Std. Error of the Estimate
1       .909 (a)                   .913                          .8257330890   .821                3.899
a. Predictors: (Constant), CAT. MEDV, CHAS, CRIM, B, DIS, PTRATIO, RM, LSTAT
b. Unless noted otherwise, statistics are based only on cases for which Approximately 60% of the
cases (SAMPLE) = 1.
c. Dependent Variable: MEDV
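The model choice above rests on comparing adjusted R square across the candidate models. A minimal sketch of that comparison, using the adjusted R square values reported in the model summaries, is given below. The dictionary labels are shorthand introduced here for illustration.

```python
# Adjusted R Square values as reported in the four model summaries
adjusted_r_square = {
    "NOX variant": 0.709,
    "INDUS variant": 0.693,
    "TAX variant": 0.692,
    "final model (with CHAS and CAT. MEDV)": 0.821,
}

# Pick the model with the highest adjusted R Square
best_model = max(adjusted_r_square, key=adjusted_r_square.get)

for name, value in sorted(adjusted_r_square.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{value:.3f}  {name}")
print("Selected:", best_model)
```

Adjusted R square is used rather than R square because it penalizes additional predictors, so models with different numbers of variables can be compared on equal terms.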
Coefficients
(B and Std. Error are the unstandardized coefficients, Beta is the standardized coefficient, Tolerance and VIF are the collinearity statistics)
Variable     B                Std. Error   Beta    t        Sig.   Tolerance   VIF
(Constant)   21.8196835416    4.650                4.693    .000
CRIM         -.0916326951     .029         -.088   -3.113   .002   .747        1.339
CHAS         2.5602912363     .963         .068    2.658    .008   .921        1.086
RM           1.5674889438     .534         .112    2.938    .004   .418        2.390
DIS          -.3863479004     .133         -.086   -2.905   .004   .684        1.462
PTRATIO      -.3669881155     .124         -.087   -2.967   .003   .695        1.438
B            .0072449467      .003         .073    2.653    .008   .790        1.266
LSTAT        -.4361544592     .049         -.346   -8.820   .000   .391        2.559
CAT. MEDV    12.6070216351    .825         .512    15.274   .000   .536        1.865
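The collinearity statistics in the Coefficients table are linked by VIF = 1 / Tolerance. The sketch below recomputes VIF from the reported tolerances and checks for values above a cutoff. The cutoff of 5 is a common rule of thumb and an assumption here, not something stated in the assignment.

```python
# (Tolerance, VIF) pairs as reported in the Coefficients table
collinearity = {
    "CRIM": (0.747, 1.339),
    "CHAS": (0.921, 1.086),
    "RM": (0.418, 2.390),
    "DIS": (0.684, 1.462),
    "PTRATIO": (0.695, 1.438),
    "B": (0.790, 1.266),
    "LSTAT": (0.391, 2.559),
    "CAT. MEDV": (0.536, 1.865),
}

VIF_CUTOFF = 5.0  # assumed rule-of-thumb threshold for problematic collinearity

# VIF is the reciprocal of tolerance; small gaps come from 3-decimal rounding
recomputed = {name: 1.0 / tol for name, (tol, _) in collinearity.items()}
flagged = [name for name, (_, vif) in collinearity.items() if vif > VIF_CUTOFF]

for name, (tol, vif) in collinearity.items():
    print(f"{name:10s} tolerance={tol:.3f}  reported VIF={vif:.3f}  1/tolerance={recomputed[name]:.3f}")
print("Predictors above the cutoff:", flagged)
```

Under that assumed cutoff no predictor in the final model is flagged, which is consistent with the correlated development variables (INDUS, NOX, TAX) having been left out.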
Final model: The variables are CRIM, CHAS, RM, DIS, PTRATIO, B, LSTAT and CAT. MEDV.