Housing Price Prediction
August 2019
Fred (Lefan) Cheng
Paul Dingus
Wenjun Ma
Haoyun Zhang
Team Introduction
Please contact us if your company is looking for Data Science or Data Analytics talent.
Data Exploration
The Purpose Is to Predict Housing Prices with the Least Error (RMSE)
We trained five models, each proven effective, to predict prices from information about the houses.
[Exploratory plots: Neighborhood, lot size in square feet; models including the Stack Regressor]
Some variables are significantly skewed and might need to be standardized. Outliers exist, and some variables have a strong linear relationship with price.
Transform the Target Feature by Taking the Log for Normalization
The target feature's notable right skew violates the normality assumption of linear regression.
[Plots: distribution of y before and after the log transform, each against a fitted normal distribution]
It is advantageous to work with a normally distributed output variable (SalePrice).
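A minimal sketch of the transform in numpy; the use of log1p rather than a plain log is an assumption, since the deck only states that the log was taken:

```python
import numpy as np

# Sketch: log1p-transform a right-skewed target such as SalePrice
# (log1p, i.e. log(1 + y), is assumed; it also handles zeros safely)
prices = np.array([100_000, 150_000, 200_000, 755_000], dtype=float)
log_prices = np.log1p(prices)    # compresses the long right tail
restored = np.expm1(log_prices)  # inverse transform, applied to predictions
```

Predictions made on the log scale must be passed back through `expm1` before reporting dollar-scale prices.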
Removing Outliers of Important (Highly Correlated) Features
Filter out outliers in important features that are highly correlated with SalePrice and remove them.
[Scatter plots vs. SalePrice: above-grade living area; size of garage in square feet]
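One way to implement the filter, shown on toy data with made-up thresholds (the deck does not state its exact cutoffs):

```python
import pandas as pd

# Sketch: drop houses with a very large living area but a low sale price,
# the kind of outlier visible in the scatter plots (thresholds are assumed)
df = pd.DataFrame({
    "GrLivArea": [1500, 1800, 4600, 5600],
    "SalePrice": [180_000, 210_000, 160_000, 184_750],
})
clean = df[~((df["GrLivArea"] > 4000) & (df["SalePrice"] < 300_000))]
```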
General View of Missing Values
[Bar chart: ratio of missing values for each feature]
Missing Value Imputation of the Train Dataset

Pseudo-missing values (NA means the house lacks the feature), all covered by imputing with 'No ***':
- No Alley Access
- No Pool
- No Fence
- No Fireplace
- No Garage
- No Basement (37 rows), with two special cases:
  - Id 949 misses only BsmtExposure: impute with the Neighborhood mode
  - Id 333 misses only BsmtFinType2: impute by YearBuilt

Real missing values:
- LotFrontage (259 rows): Neighborhood median
- MasVnrType (8 rows): Neighborhood mode
- MasVnrArea (8 rows): impute by YearBuilt
- Electrical (1 row): impute by YearBuilt with SBrkr
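The Neighborhood-median rule for LotFrontage can be sketched as a groupby-transform on toy data (the exact pandas mechanics are an assumption; the deck only names the rule):

```python
import pandas as pd

# Sketch: fill missing LotFrontage with the median of the same Neighborhood
df = pd.DataFrame({
    "Neighborhood": ["A", "A", "A", "B", "B"],
    "LotFrontage": [60.0, 80.0, None, 50.0, None],
})
df["LotFrontage"] = (
    df.groupby("Neighborhood")["LotFrontage"]
      .transform(lambda s: s.fillna(s.median()))
)
```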
Missing Value Imputation of the Test Dataset

- LotFrontage: Neighborhood median
- MasVnrType: None (BrkFace for one record)
- MasVnrArea: 0
- MSZoning: RM and RL
- Utilities: AllPub
- Exterior: VinylSd
- KitchenQual: TA
- Functional: Mod
- SaleType: WD
- Wrong inputs corrected individually: Ids 2218 and 2219 (No Basement); Ids 2349 and 2577 (No Garage); Id 2127
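For single-column categorical gaps such as KitchenQual or SaleType, a mode fill is the likely mechanic; a sketch on toy data (the mode-based approach is an assumption where the deck only lists the imputed value):

```python
import pandas as pd

# Sketch: fill a categorical gap with the column's most frequent level
test = pd.DataFrame({"KitchenQual": ["TA", "Gd", None, "TA"]})
test["KitchenQual"] = test["KitchenQual"].fillna(test["KitchenQual"].mode()[0])
```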
Feature Engineering
12
Dealing with Different Types of Data
Grouping the data into different categories helps to sort through it and organize it effectively.
Dummifying Data
When dummifying, every category forms its own column. Many categories are not numerous or distinct enough to form meaningful variables, so we group them.
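A sketch of rare-level grouping before dummifying; the count threshold and the "Other" label are assumptions:

```python
import pandas as pd

# Sketch: merge sparse categories into "Other", then one-hot encode
df = pd.DataFrame({"RoofStyle": ["Gable", "Gable", "Hip", "Hip", "Shed"]})
counts = df["RoofStyle"].value_counts()
rare = list(counts[counts < 2].index)              # levels too sparse to keep
df["RoofStyle"] = df["RoofStyle"].replace(rare, "Other")
dummies = pd.get_dummies(df, columns=["RoofStyle"])
```

Here "Shed" appears only once, so it is folded into "Other" before `get_dummies` expands the column.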
Dealing with Skewness
Reducing the skew in our continuous or ordered features will help our modeling. We applied the Box-Cox transformation to particularly skewed data.
For some variables, we found it better to manually apply power transformations to reduce extreme skew. This was done for BsmtCond, BsmtQual, GarageCond, and GarageQual.
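The Box-Cox step can be sketched with scipy's `boxcox1p`; the lambda value here is an assumption, as the deck does not state the one used:

```python
import numpy as np
from scipy.special import boxcox1p
from scipy.stats import skew

# Sketch: a Box-Cox(1 + x) transform shrinks the skew of a long-tailed feature
rng = np.random.default_rng(0)
x = rng.lognormal(mean=0.0, sigma=1.0, size=1000)  # right-skewed sample
before = skew(x)
transformed = boxcox1p(x, 0.15)                    # lambda = 0.15 (assumed)
after = skew(transformed)
```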
Model Fitting
Feature Selection via Lasso Regression
Analyze the Lasso coefficient plot to decide which features should be dropped.
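The same idea can be automated with scikit-learn's LassoCV on synthetic data; this is a sketch of the mechanism, not the deck's plot-driven procedure:

```python
import numpy as np
from sklearn.linear_model import LassoCV

# Sketch: the L1 penalty zeroes out weak features; nonzero coefficients survive
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.1, size=200)  # 2 true signals
model = LassoCV(cv=5).fit(X, y)
selected = np.flatnonzero(model.coef_)   # indices of retained features
```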
Parameter Optimization of Lasso Regression
Comparison of feature selection
Parameter Optimization of Ridge Regression
Comparison after feature selection
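One common way to run the alpha search for Ridge, sketched with GridSearchCV on synthetic data (the grid values are assumptions):

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

# Sketch: cross-validated search over Ridge's regularization strength
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, 2.0, 3.0]) + rng.normal(scale=0.5, size=100)
search = GridSearchCV(
    Ridge(),
    {"alpha": [0.1, 1.0, 10.0]},
    cv=5,
    scoring="neg_root_mean_squared_error",
).fit(X, y)
best_alpha = search.best_params_["alpha"]
```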
Price Prediction by Elastic Net Regression
Model comparison and price prediction
[Plot: comparison against Lasso Regression]
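Elastic Net blends the L1 and L2 penalties; a sketch with ElasticNetCV on synthetic data (the l1_ratio grid is an assumption):

```python
import numpy as np
from sklearn.linear_model import ElasticNetCV

# Sketch: ElasticNetCV tunes both alpha and the L1/L2 mix (l1_ratio)
rng = np.random.default_rng(2)
X = rng.normal(size=(150, 4))
y = 2 * X[:, 0] + X[:, 1] + rng.normal(scale=0.1, size=150)
model = ElasticNetCV(l1_ratio=[0.2, 0.5, 0.9], cv=5).fit(X, y)
```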
Gradient Boosting
Feature Importance via Gradient Boosting
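Feature importance can be read off a fitted gradient-boosting model's `feature_importances_` attribute; a sketch on synthetic data:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

# Sketch: importances sum to 1; sorting them descending gives a ranking
rng = np.random.default_rng(3)
X = rng.normal(size=(300, 3))
y = 5 * X[:, 0] + rng.normal(scale=0.1, size=300)  # feature 0 dominates
gbr = GradientBoostingRegressor(random_state=0).fit(X, y)
ranking = np.argsort(gbr.feature_importances_)[::-1]
```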
Stack Regressor
Stack all models (Ridge, Lasso, ElasticNet, XGBoost, Gradient Boosting) for the final price prediction. The Lasso model carries a weight of 0.30 in the stack.
RMSE = 0.1071
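The stack's final output is a weighted average of the base-model predictions; a sketch with toy numbers, where only the Lasso weight of 0.30 comes from the deck and the other weights are assumptions:

```python
import numpy as np

# Sketch: weighted average of base-model predictions on the log-price scale
preds = {
    "lasso":  np.array([12.0, 11.8]),
    "ridge":  np.array([12.1, 11.9]),
    "gboost": np.array([11.9, 11.7]),
}
weights = {"lasso": 0.30, "ridge": 0.35, "gboost": 0.35}  # lasso weight from deck
stacked = sum(w * preds[name] for name, w in weights.items())
```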
Thank you!