
Kaggle Competition: Predicting Housing Sales Price in Ames, Iowa

August 2019

Fred (Lefan) Cheng
Paul Dingus
Wenjun Ma
Haoyun Zhang
Team Introduction
Please contact us if your company is looking for Data Science or Data Analytics talent.

• Fred (Lefan) Cheng | LinkedIn: https://www.linkedin.com/in/lefancheng/ | Email: [email protected]
• Paul Dingus | LinkedIn: https://www.linkedin.com/in/paul-dingus/ | Email: [email protected]
• Wenjun Ma | LinkedIn: https://www.linkedin.com/in/wenjun-ma-phd/ | Email: [email protected]
• Haoyun Zhang | LinkedIn: https://www.linkedin.com/in/Haoyun-Zhang-UPenn/ | Email: [email protected]
Data Exploration

The Purpose is to Predict Housing Prices with the Least Error (RMSE)
We trained five models, each proven effective, to predict sale prices from house information.

Over 80 input variables (information about houses), including:
• Overall Quality
• Above-Grade Living Area
• Size of garage in square feet
• Original construction date
• Neighborhood
• Lot size in square feet

Output variable:
• Housing Sales Price

Models:
• Lasso Regression
• Ridge Regression
• Elastic Net Regression
• Gradient Boosting
• XGBoost
All five models feed into a Stack Regressor.

RMSE: Root Mean Squared Error
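As a minimal sketch of the evaluation metric, assuming scikit-learn is available (Kaggle scores this competition on the RMSE of the logarithm of the sale price):

```python
import numpy as np
from sklearn.metrics import mean_squared_error

def rmse_log(y_true, y_pred):
    """RMSE on log prices, so cheap and expensive houses weigh equally."""
    return np.sqrt(mean_squared_error(np.log(y_true), np.log(y_pred)))
```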
A glance at the dataset
Exploring the dataset lays the foundation for the pre-processing and modeling that follow.

[Figure: distributions of the input variables, and scatter plots of input variables against sale price]

• Some variables are significantly skewed and might need to be standardized.
• Outliers exist, and some variables have a strong linear relationship with price.
Transform the Target Feature by Taking the Log for Normalization
The target feature is notably right-skewed, which violates the normality assumption of linear regression.

[Figure: distribution of y against a normal curve, before and after the log transformation]

It is advantageous to work with a normally distributed output variable (Sales Price).
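A minimal sketch of the transformation, assuming the standard Kaggle train.csv file and the Ames SalePrice column name:

```python
import numpy as np
import pandas as pd

train = pd.read_csv("train.csv")

# log1p pulls in the long right tail toward a normal shape;
# np.expm1 inverts it, mapping predictions back to dollars at the end.
y = np.log1p(train["SalePrice"])
```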
Removing Outliers from Important (Highly Correlated) Features
We selected the features most strongly correlated with Sales Price, including Overall Quality, Above-Grade Living Area, and size of garage in square feet, and removed their outliers.

[Figure: scatter plots of each selected feature against Sales Price, before and after removing outliers]
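As an illustration only (the slide does not state the exact cutoffs), outlier removal on a correlated feature such as above-grade living area can look like this, continuing the sketch above:

```python
# Hypothetical thresholds: drop very large houses that sold suspiciously cheap.
mask = (train["GrLivArea"] > 4000) & (train["SalePrice"] < 300000)
train = train[~mask].reset_index(drop=True)
```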
Missing Values

General view of missing values

[Figure: ratio of missing values for each feature]

• Number of missing values: 6,965 (train) + 7,000 (test) = 13,965 in total
• Number of features with missing values: 19 (train), 33 (test), 34 distinct features overall
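A quick way to reproduce these counts, assuming the standard train.csv and test.csv files:

```python
import pandas as pd

train = pd.read_csv("train.csv")
test = pd.read_csv("test.csv")

for name, df in [("train", train), ("test", test)]:
    miss = df.isnull().sum()
    miss = miss[miss > 0].sort_values(ascending=False)
    print(f"{name}: {miss.sum()} missing values across {len(miss)} features")
    print((miss / len(df)).round(3))  # ratio of missing values per feature
```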
Missing value imputation of train dataset

Pseudo missing values: NA simply means the house lacks the amenity (No Alley Access, No Pool, No Fence, No Fireplace, No Garage). All are covered by imputing with 'No ***'.

Real missing values, imputed within groups of similar houses:
• BsmtExposure and BsmtFinType2 (37 houses with no basement; Id 949 misses Exposure, Id 333 misses FinType2): mode, grouped by Neighborhood and YearBuilt
• LotFrontage (259 missing): median, grouped by Neighborhood
• MasVnrType and MasVnrArea (8 missing each): mode, grouped by Neighborhood and YearBuilt
• Electrical (1 missing): 'SBrkr', by YearBuilt
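A minimal sketch of both imputation styles; the column list for the pseudo-missing case is illustrative (the garage and basement amenities each span several columns in the real data):

```python
# Pseudo-missing: NA means "no such amenity", so fill with an explicit label.
for col in ["Alley", "PoolQC", "Fence", "FireplaceQu", "GarageType"]:
    train[col] = train[col].fillna("No_" + col)

# Real missing: borrow the typical value from comparable houses.
train["LotFrontage"] = train.groupby("Neighborhood")["LotFrontage"] \
    .transform(lambda s: s.fillna(s.median()))
```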
Missing value imputation of test dataset

Wrong inputs:
• Ids 2218, 2219, and 2349: no basement
• Ids 2127 and 2577: no garage

Remaining imputations:
• LotFrontage: median
• MasVnrType: 'None' and 'BrkFace'
• MasVnrArea: 0
• MSZoning: 'RM' and 'RL'
• Utilities: 'AllPub'
• Exterior: 'VinylSd'
• KitchenQual: 'TA'
• Functional: 'Mod'
• SaleType: 'WD'
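A sketch of the constant fills from the list above; mapping "Exterior" onto the Exterior1st and Exterior2nd columns is our assumption about the Ames column names:

```python
# Constant fills for the test set, taken from the slide.
fills = {"MasVnrArea": 0, "Utilities": "AllPub", "KitchenQual": "TA",
         "Functional": "Mod", "SaleType": "WD",
         "Exterior1st": "VinylSd", "Exterior2nd": "VinylSd"}  # assumed names
test = test.fillna(value=fills)
```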
Feature Engineering

Dealing with different types of data
Grouping features into different categories helps to sort through and organize them effectively.

• Continuous (e.g. LotFrontage, LotArea, MasVnrArea, BsmtFinSF1): simply ensure that the variable is numeric: column.astype('float64')
• Ordinal categorical (e.g. OverallQual, OverallCond, ExterQual, BsmtCond): manually encode the variables: ['Po', 'Fa', 'Av', 'Gd', 'Ex'] → [2, 4, 6, 8, 10]
• Nominal categorical (e.g. MSSubClass, MSZoning, LotConfig): dummify the variables: pd.get_dummies(column)
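Putting the three cases together in one short sketch; the ordinal mapping follows the slide, and real columns may use additional codes that would need entries:

```python
import pandas as pd

df = pd.read_csv("train.csv")

# Continuous: make sure the dtype is numeric.
df["LotFrontage"] = df["LotFrontage"].astype("float64")

# Ordinal categorical: quality codes carry an order, so map them to numbers.
quality_map = {"Po": 2, "Fa": 4, "Av": 6, "Gd": 8, "Ex": 10}  # per the slide;
# real columns also use codes such as 'TA', which would need an entry here.
df["ExterQual"] = df["ExterQual"].map(quality_map)

# Nominal categorical: no order, so one dummy column per category.
df = pd.get_dummies(df, columns=["MSZoning", "LotConfig"])
```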
Dummifying Data

When dummifying, every category will form its own column. Many categories are not numerous enough, or not different enough, to form meaningful variables, so we group them (see the next slide):
Dummifying Data

• RoofMatl: ['Membran', 'ClyTile', 'Metal', 'Roll', 'WdShngl', 'WdShake'] → 'Others'
• PoolQC: ['Ex', 'Gd', 'Fa'] → 'Have_Pool'
• Condition2: ['RRAn', 'RRAe'] → 'Norm'; ['RRNn', 'Artery', 'Feedr'] → 'Other'; ['PosA', 'PosN'] → 'Pos'
• Heating: ['Wall', 'OthW', 'Floor'] → 'Other'
• MiscFeature: ['TenC', 'Othr'] → 'Other'
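A sketch of the grouping with pandas replace, using the mappings listed above:

```python
# Collapse sparse categories into broader buckets before dummifying.
train["RoofMatl"] = train["RoofMatl"].replace(
    ["Membran", "ClyTile", "Metal", "Roll", "WdShngl", "WdShake"], "Others")
train["PoolQC"] = train["PoolQC"].replace(["Ex", "Gd", "Fa"], "Have_Pool")
train["Heating"] = train["Heating"].replace(["Wall", "OthW", "Floor"], "Other")
train["MiscFeature"] = train["MiscFeature"].replace(["TenC", "Othr"], "Other")
```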
Dealing with Skewness

Reducing the skew in our continuous and ordered features will help our modeling. We applied the Box-Cox transformation to particularly skewed data, guided by a simple check:

• If skew > threshold, calculate the skewness after a log transform.
• If the log transform reduces the skew, apply it and repeat.

For some variables, we found it was better to manually apply power transformations to reduce extreme skew. This was done for BsmtCond, BsmtQual, GarageCond, and GarageQual.
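A minimal sketch of the log-transform variant of that check, assuming scipy and non-negative numeric features; the threshold value is an assumption, since the slide only says "threshold":

```python
import numpy as np
from scipy.stats import skew

SKEW_THRESHOLD = 0.75  # assumed value, not stated on the slide

numeric_cols = train.select_dtypes(include=[np.number]).columns
for col in numeric_cols:
    before = skew(train[col].dropna())
    if abs(before) > SKEW_THRESHOLD:
        after = skew(np.log1p(train[col].dropna()))
        if abs(after) < abs(before):  # keep the transform only if it helps
            train[col] = np.log1p(train[col])
```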
Dealing with Skewness

[Figure: distributions of skewed data before and after transformation]
Model Fitting

Feature Selection via Lasso Regression
Analyze the lasso regression plots to decide which features should be dropped.

[Figure: RMSE vs. λ and coefficients vs. λ, with the dropped features marked]
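A sketch of lasso-based selection with scikit-learn, assuming X is the processed feature matrix (as a DataFrame) and y the log sale price:

```python
import numpy as np
from sklearn.linear_model import LassoCV

# Cross-validation picks the penalty strength; features whose
# coefficients shrink exactly to zero are the ones to drop.
lasso = LassoCV(cv=5).fit(X, y)
dropped = X.columns[np.isclose(lasso.coef_, 0.0)]
X_selected = X.drop(columns=dropped)
```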
Parameter Optimization of Lasso Regression
Comparison of feature selection.

[Figure: Lasso parameter optimization, before and after feature selection]
Parameter Optimization of Ridge Regression
Comparison after feature selection.

[Figure: Ridge parameter optimization, before and after feature selection]
Price Prediction by Elastic Net Regression
Model comparison and price prediction.

[Figure: comparison of the Lasso, Ridge, and Elastic Net models, and the resulting price predictions]
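A sketch of the elastic net fit, which blends the lasso (L1) and ridge (L2) penalties, continuing with X_selected and y from the lasso step; the candidate l1_ratio grid is an assumption:

```python
from sklearn.linear_model import ElasticNetCV

# alpha and l1_ratio are tuned jointly by cross-validation.
enet = ElasticNetCV(l1_ratio=[0.1, 0.5, 0.9], cv=5).fit(X_selected, y)
print("alpha:", enet.alpha_, "l1_ratio:", enet.l1_ratio_)
```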
Gradient Boosting
Feature importance via gradient boosting.

[Figure: feature importance scores, with the features dropped by Lasso marked]
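A sketch of extracting the importance scores with scikit-learn's GradientBoostingRegressor, with X and y as above:

```python
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor

gbr = GradientBoostingRegressor(random_state=0).fit(X, y)
importance = pd.Series(gbr.feature_importances_, index=X.columns)
print(importance.sort_values(ascending=False).head(20))  # top features
```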
Stack Regressor
Stack all models (Ridge, Lasso, Elastic Net, XGBoost, Gradient Boosting) for price prediction.

The training data feeds five base models, whose predictions are combined by a meta model with the following weights:

• Lasso (RMSE = 0.1071) → Prediction 1, weight 0.30
• Ridge (RMSE = 0.1091) → Prediction 2, weight 0.10
• Elastic Net (RMSE = 0.1072) → Prediction 3, weight 0.25
• Gradient Boosting (RMSE = 0.1080) → Prediction 4, weight 0.25
• XGBoost (RMSE = 0.1122) → Prediction 5, weight 0.10
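A sketch of the weighted blend with the weights from the diagram; the model variables are assumed to be already fit on the log-price target, and xgb_model stands in for the XGBoost regressor:

```python
import numpy as np

# Weights from the slide; they sum to 1.0.
blend = [(lasso, 0.30), (ridge, 0.10), (enet, 0.25),
         (gbr, 0.25), (xgb_model, 0.10)]

log_pred = sum(w * m.predict(X_test) for m, w in blend)
sale_price = np.expm1(log_pred)  # undo the log1p target transform
```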
Thank you!
