
Kaggle Competition: Predicting Housing Sales Price in Ames, Iowa

August 2019

Fred (Lefan) Cheng
Paul Dingus
Wenjun Ma
Haoyun Zhang
Team Introduction
Please contact us if your company is looking for Data Science or Data Analytics talent.

• Fred (Lefan) Cheng | LinkedIn: https://www.linkedin.com/in/lefancheng/ | Email: [email protected]
• Paul Dingus | LinkedIn: https://www.linkedin.com/in/paul-dingus/ | Email: [email protected]
• Wenjun Ma | LinkedIn: https://www.linkedin.com/in/wenjun-ma-phd/ | Email: [email protected]
• Haoyun Zhang | LinkedIn: https://www.linkedin.com/in/Haoyun-Zhang-UPenn/ | Email: [email protected]
Data Exploration

The Purpose is to Predict Housing Prices with the Least Error (RMSE)
We trained five models, each proven effective, to predict sale prices from house information.

Over 80 input variables (information about houses), including:
• Overall Quality
• Above-Grade Living Area
• Size of garage in square feet
• Original construction date
• Neighborhood
• Lot size in square feet

Output variable:
• Housing Sales Price

Models:
• Lasso Regression
• Ridge Regression
• Elastic Net Regression
• Gradient Boosting
• XGBoost
All five models feed into a Stack Regressor.

RMSE: Root Mean Squared Error
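As a minimal sketch of the evaluation metric, assuming scikit-learn is available (Kaggle scores this competition on the RMSE of the logarithm of the sale price):

```python
import numpy as np
from sklearn.metrics import mean_squared_error

def rmse_log(y_true, y_pred):
    """RMSE on log prices, so cheap and expensive houses weigh equally."""
    return np.sqrt(mean_squared_error(np.log(y_true), np.log(y_pred)))
```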
A glance at the dataset
Exploring the dataset lays the foundation for the pre-processing and modeling that follow.

[Figure: distributions of the input variables, and scatter plots of input variables against sale price]

• Some variables are significantly skewed and might need to be standardized.
• Outliers exist, and some variables have a strong linear relationship with price.
Transform the Target Feature by Taking the Log for Normalization
The target feature is notably right-skewed, which violates the normality assumption of linear regression.

[Figure: distribution of y against a normal curve, before and after the log transformation]

It is advantageous to work with a normally distributed output variable (Sales Price).
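A minimal sketch of the transformation, assuming the standard Kaggle train.csv file and the Ames SalePrice column name:

```python
import numpy as np
import pandas as pd

train = pd.read_csv("train.csv")

# log1p pulls in the long right tail toward a normal shape;
# np.expm1 inverts it, mapping predictions back to dollars at the end.
y = np.log1p(train["SalePrice"])
```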
Removing Outliers from Important (Highly Correlated) Features
We selected the features most strongly correlated with Sales Price, including Overall Quality, Above-Grade Living Area, and size of garage in square feet, and removed their outliers.

[Figure: scatter plots of each selected feature against Sales Price, before and after removing outliers]
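As an illustration only (the slide does not state the exact cutoffs), outlier removal on a correlated feature such as above-grade living area can look like this, continuing the sketch above:

```python
# Hypothetical thresholds: drop very large houses that sold suspiciously cheap.
mask = (train["GrLivArea"] > 4000) & (train["SalePrice"] < 300000)
train = train[~mask].reset_index(drop=True)
```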
Missing Values

General view of missing values

[Figure: ratio of missing values for each feature]

• Number of missing values: 6,965 (train) + 7,000 (test) = 13,965 in total
• Number of features with missing values: 19 (train), 33 (test), 34 distinct features overall
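A quick way to reproduce these counts, assuming the standard train.csv and test.csv files:

```python
import pandas as pd

train = pd.read_csv("train.csv")
test = pd.read_csv("test.csv")

for name, df in [("train", train), ("test", test)]:
    miss = df.isnull().sum()
    miss = miss[miss > 0].sort_values(ascending=False)
    print(f"{name}: {miss.sum()} missing values across {len(miss)} features")
    print((miss / len(df)).round(3))  # ratio of missing values per feature
```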
Missing value imputation of train dataset

Pseudo missing values: NA simply means the house lacks the amenity (No Alley Access, No Pool, No Fence, No Fireplace, No Garage). All are covered by imputing with 'No ***'.

Real missing values, imputed within groups of similar houses:
• BsmtExposure and BsmtFinType2 (37 houses with no basement; Id 949 misses Exposure, Id 333 misses FinType2): mode, grouped by Neighborhood and YearBuilt
• LotFrontage (259 missing): median, grouped by Neighborhood
• MasVnrType and MasVnrArea (8 missing each): mode, grouped by Neighborhood and YearBuilt
• Electrical (1 missing): 'SBrkr', by YearBuilt
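A minimal sketch of both imputation styles; the column list for the pseudo-missing case is illustrative (the garage and basement amenities each span several columns in the real data):

```python
# Pseudo-missing: NA means "no such amenity", so fill with an explicit label.
for col in ["Alley", "PoolQC", "Fence", "FireplaceQu", "GarageType"]:
    train[col] = train[col].fillna("No_" + col)

# Real missing: borrow the typical value from comparable houses.
train["LotFrontage"] = train.groupby("Neighborhood")["LotFrontage"] \
    .transform(lambda s: s.fillna(s.median()))
```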
Missing value imputation of test dataset

Wrong inputs:
• Ids 2218, 2219, and 2349: no basement
• Ids 2127 and 2577: no garage

Remaining imputations:
• LotFrontage: median
• MasVnrType: 'None' and 'BrkFace'
• MasVnrArea: 0
• MSZoning: 'RM' and 'RL'
• Utilities: 'AllPub'
• Exterior: 'VinylSd'
• KitchenQual: 'TA'
• Functional: 'Mod'
• SaleType: 'WD'
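A sketch of the constant fills from the list above; mapping "Exterior" onto the Exterior1st and Exterior2nd columns is our assumption about the Ames column names:

```python
# Constant fills for the test set, taken from the slide.
fills = {"MasVnrArea": 0, "Utilities": "AllPub", "KitchenQual": "TA",
         "Functional": "Mod", "SaleType": "WD",
         "Exterior1st": "VinylSd", "Exterior2nd": "VinylSd"}  # assumed names
test = test.fillna(value=fills)
```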
Feature Engineering

Dealing with different types of data
Grouping features into different categories helps to sort through and organize them effectively.

• Continuous (e.g. LotFrontage, LotArea, MasVnrArea, BsmtFinSF1): simply ensure that the variable is numeric: column.astype('float64')
• Ordinal categorical (e.g. OverallQual, OverallCond, ExterQual, BsmtCond): manually encode the variables: ['Po', 'Fa', 'Av', 'Gd', 'Ex'] → [2, 4, 6, 8, 10]
• Nominal categorical (e.g. MSSubClass, MSZoning, LotConfig): dummify the variables: pd.get_dummies(column)
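Putting the three cases together in one short sketch; the ordinal mapping follows the slide, and real columns may use additional codes that would need entries:

```python
import pandas as pd

df = pd.read_csv("train.csv")

# Continuous: make sure the dtype is numeric.
df["LotFrontage"] = df["LotFrontage"].astype("float64")

# Ordinal categorical: quality codes carry an order, so map them to numbers.
quality_map = {"Po": 2, "Fa": 4, "Av": 6, "Gd": 8, "Ex": 10}  # per the slide;
# real columns also use codes such as 'TA', which would need an entry here.
df["ExterQual"] = df["ExterQual"].map(quality_map)

# Nominal categorical: no order, so one dummy column per category.
df = pd.get_dummies(df, columns=["MSZoning", "LotConfig"])
```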
Dummifying Data

When dummifying, every category will form its own column. Many categories are not numerous enough, or not different enough, to form meaningful variables, so we group them (see the next slide):
Dummifying Data

• RoofMatl: ['Membran', 'ClyTile', 'Metal', 'Roll', 'WdShngl', 'WdShake'] → 'Others'
• PoolQC: ['Ex', 'Gd', 'Fa'] → 'Have_Pool'
• Condition2: ['RRAn', 'RRAe'] → 'Norm'; ['RRNn', 'Artery', 'Feedr'] → 'Other'; ['PosA', 'PosN'] → 'Pos'
• Heating: ['Wall', 'OthW', 'Floor'] → 'Other'
• MiscFeature: ['TenC', 'Othr'] → 'Other'
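A sketch of the grouping with pandas replace, using the mappings listed above:

```python
# Collapse sparse categories into broader buckets before dummifying.
train["RoofMatl"] = train["RoofMatl"].replace(
    ["Membran", "ClyTile", "Metal", "Roll", "WdShngl", "WdShake"], "Others")
train["PoolQC"] = train["PoolQC"].replace(["Ex", "Gd", "Fa"], "Have_Pool")
train["Heating"] = train["Heating"].replace(["Wall", "OthW", "Floor"], "Other")
train["MiscFeature"] = train["MiscFeature"].replace(["TenC", "Othr"], "Other")
```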
Dealing with Skewness

Reducing the skew in our continuous and ordered features will help our modeling. We applied the Box-Cox transformation to particularly skewed data, guided by a simple check:

• If skew > threshold, calculate the skewness after a log transform.
• If the log transform reduces the skew, apply it and repeat.

For some variables, we found it was better to manually apply power transformations to reduce extreme skew. This was done for BsmtCond, BsmtQual, GarageCond, and GarageQual.
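A minimal sketch of the log-transform variant of that check, assuming scipy and non-negative numeric features; the threshold value is an assumption, since the slide only says "threshold":

```python
import numpy as np
from scipy.stats import skew

SKEW_THRESHOLD = 0.75  # assumed value, not stated on the slide

numeric_cols = train.select_dtypes(include=[np.number]).columns
for col in numeric_cols:
    before = skew(train[col].dropna())
    if abs(before) > SKEW_THRESHOLD:
        after = skew(np.log1p(train[col].dropna()))
        if abs(after) < abs(before):  # keep the transform only if it helps
            train[col] = np.log1p(train[col])
```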
Dealing with Skewness

[Figure: distributions of skewed data before and after transformation]
Model Fitting

Feature Selection via Lasso Regression
Analyze the lasso regression plots to decide which features should be dropped.

[Figure: RMSE vs. λ and coefficients vs. λ, with the dropped features marked]
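A sketch of lasso-based selection with scikit-learn, assuming X is the processed feature matrix (as a DataFrame) and y the log sale price:

```python
import numpy as np
from sklearn.linear_model import LassoCV

# Cross-validation picks the penalty strength; features whose
# coefficients shrink exactly to zero are the ones to drop.
lasso = LassoCV(cv=5).fit(X, y)
dropped = X.columns[np.isclose(lasso.coef_, 0.0)]
X_selected = X.drop(columns=dropped)
```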
Parameter Optimization of Lasso Regression
Comparison of feature selection.

[Figure: Lasso parameter optimization, before and after feature selection]
Parameter Optimization of Ridge Regression
Comparison after feature selection.

[Figure: Ridge parameter optimization, before and after feature selection]
Price Prediction by Elastic Net Regression
Model comparison and price prediction.

[Figure: comparison of the Lasso, Ridge, and Elastic Net models, and the resulting price predictions]
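A sketch of the elastic net fit, which blends the lasso (L1) and ridge (L2) penalties, continuing with X_selected and y from the lasso step; the candidate l1_ratio grid is an assumption:

```python
from sklearn.linear_model import ElasticNetCV

# alpha and l1_ratio are tuned jointly by cross-validation.
enet = ElasticNetCV(l1_ratio=[0.1, 0.5, 0.9], cv=5).fit(X_selected, y)
print("alpha:", enet.alpha_, "l1_ratio:", enet.l1_ratio_)
```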
Gradient Boosting
Feature importance via gradient boosting.

[Figure: feature importance scores, with the features dropped by Lasso marked]
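A sketch of extracting the importance scores with scikit-learn's GradientBoostingRegressor, with X and y as above:

```python
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor

gbr = GradientBoostingRegressor(random_state=0).fit(X, y)
importance = pd.Series(gbr.feature_importances_, index=X.columns)
print(importance.sort_values(ascending=False).head(20))  # top features
```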
Stack Regressor
Stack all models (Ridge, Lasso, Elastic Net, XGBoost, Gradient Boosting) for price prediction.

The training data feeds five base models, whose predictions are combined by a meta model with the following weights:

• Lasso (RMSE = 0.1071) → Prediction 1, weight 0.30
• Ridge (RMSE = 0.1091) → Prediction 2, weight 0.10
• Elastic Net (RMSE = 0.1072) → Prediction 3, weight 0.25
• Gradient Boosting (RMSE = 0.1080) → Prediction 4, weight 0.25
• XGBoost (RMSE = 0.1122) → Prediction 5, weight 0.10
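A sketch of the weighted blend with the weights from the diagram; the model variables are assumed to be already fit on the log-price target, and xgb_model stands in for the XGBoost regressor:

```python
import numpy as np

# Weights from the slide; they sum to 1.0.
blend = [(lasso, 0.30), (ridge, 0.10), (enet, 0.25),
         (gbr, 0.25), (xgb_model, 0.10)]

log_pred = sum(w * m.predict(X_test) for m, w in blend)
sale_price = np.expm1(log_pred)  # undo the log1p target transform
```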
Thank you!
