Analysis and Prediction of House Prices by Linear Regression Model

The document provides an overview of analyzing and predicting house prices using linear regression. It discusses data analysis and processing, including exploring the data, handling missing values, outlier detection, and feature engineering. Numerical and categorical features are separated. Correlations between features and the target (SalePrice) are calculated. Features with low correlations are removed. The preprocessed data will then be used to build a linear regression model to predict house prices.


Analysis and prediction of house prices by Linear Regression model
Course: Data Analytics with R/Python
Lecturer: Nguyen Phat Dat, MA
Group: MEP Familu
Table of contents
01. Overview of the topic
02. Data analysis and processing
03. Building a linear regression model
04. Evaluation of models and experiments
05. Summary
About my team

01 Nguyễn Trần Ngọc Trâm
02 Nguyễn Thanh Phong
03 Nguyễn Thị Tố Nhi
04 Đỗ Thị Thanh Phương
05 Trương Thị Thanh Thư
06 Vũ Thị Thu Thảo
01. Overview of the topic
Learn about the problem, the project steps, and the research questions.
Problem
Real estate is one of the most expensive markets today, and buying a house is a major decision. Before deciding to buy, customers carefully weigh the factors that affect their choice of house. Investors therefore need to identify which factors influence house prices so they can predict a selling price that is consistent with reality.
Research questions
Question 1: What factors strongly influence house prices?
Question 2: What is the difference between the predicted house price and the actual house price?
Question 3: How can house prices be predicted?
Implementation process
01 Exploratory Data Analysis
02 Feature engineering
03 Modeling
04 Evaluation
02. Data analysis and processing
Our group explores the house price dataset and cleans it before modeling.
The info() command
01. df_train.info(): data types float64(3), int64(35), object(43)
02. df_test.info(): data types float64(11), int64(26), object(43)
Many columns have missing data.
df_test will receive the same data processing as df_train for modeling and prediction.
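A minimal sketch of this first look, assuming the train.csv and test.csv files from the Kaggle House Prices dataset:

```python
import pandas as pd

# Assumed file names from the Kaggle House Prices dataset
df_train = pd.read_csv("train.csv")
df_test = pd.read_csv("test.csv")

df_train.info()  # 1460 rows, 81 columns: float64(3), int64(35), object(43)
df_test.info()   # 1459 rows, 80 columns: float64(11), int64(26), object(43)
```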
2.1 General understanding of the datasets

Training data: 1460 rows, 81 columns
Testing data: 1459 rows, 80 columns

The in / not in command checks which columns appear in one dataset but not the other. This explains the difference in the number of columns between df_train and df_test: the SalePrice target exists only in the training data.
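A quick sketch of that in / not in check:

```python
# Columns present in df_train but not in df_test
missing_cols = [col for col in df_train.columns if col not in df_test.columns]
print(missing_cols)  # ['SalePrice'] -- the target exists only in the training data
```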
Discover how home prices (SalePrice) are distributed
df_train.SalePrice.describe() shows that the average sale price of a house in our dataset is close to $180,000, with most of the values falling within the $130,000 to $215,000 range.
Check the skewness, a measure of the shape of the distribution of values. Then use plt.hist() to plot a histogram of SalePrice.
The distribution has a longer tail on the right: it is positively skewed, and some outliers lie above ~500,000.
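A sketch of the exploration described above (the bin count is an assumption):

```python
import matplotlib.pyplot as plt

print(df_train.SalePrice.describe())       # mean close to $180,000
print("Skew:", df_train.SalePrice.skew())  # positive => long right tail

plt.hist(df_train.SalePrice, bins=50)      # histogram of SalePrice
plt.xlabel("SalePrice")
plt.ylabel("Count")
plt.show()
```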
Use np.log() to transform df_train.SalePrice, compute the skew a second time, and redraw the data.
A skew value close to 0 means we have improved the skewness: the data now looks more like a normal distribution.
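A sketch of the log transform; whether the slides overwrite the column or keep a copy is not shown, so this version uses a separate variable:

```python
import numpy as np
import matplotlib.pyplot as plt

log_price = np.log(df_train.SalePrice)      # compress the long right tail
print("Skew after log:", log_price.skew())  # should now be close to 0

plt.hist(log_price, bins=50)
plt.show()
```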
2.2 Handling missing data
Why do we need to handle missing data, and why is it important?
Handling missing data
The purpose of missing-data processing is to clean the data so that the following steps are more convenient, and at the same time to reduce data distortion. This is an important step because it affects the results of data modeling and prediction.

● Many machine learning algorithms fail if the dataset contains missing values.
● If missing values are not handled properly, we may end up building a biased machine learning model that produces incorrect results.
● Missing data can reduce the precision of the statistical analysis.
Two handling methods
1. Drop data: drop columns with more than 5% missing data.
2. Replace values: impute columns with less than 5% missing data.
Solutions for handling missing data
Firstly (more than 5% missing): drop the column.
Secondly (less than 5% missing, dtype 'object'): replace values with the most frequent value.
Finally (less than 5% missing, other dtypes): fill with the mean of that column.
Sum the null values in the dataset, sort the columns by null count from high to low, and display them for a general view of the null data so it can be handled.

--Find null values in the data set--

Overview of the method

Find the columns whose percentage of null values is above 5% and drop them:

• Columns in the train data table with fewer than 1387 non-null values (1387 is 95% of 1460 rows, i.e. more than 5% missing) will be dropped.
• Columns in the test data table are dropped by the same rule.
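A hedged sketch of the 5% rule; the slides' exact code is not shown:

```python
# Fraction of null values per column, sorted high to low
null_pct = df_train.isnull().sum().sort_values(ascending=False) / len(df_train)
print(null_pct.head(10))

# Drop columns with more than 5% missing data from both tables
cols_to_drop = null_pct[null_pct > 0.05].index
df_train = df_train.drop(columns=cols_to_drop)
df_test = df_test.drop(columns=cols_to_drop, errors="ignore")
```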
Perform the null-data drop
Classify the data, express the rule in the code shown on the slide, then drop the data.
Visualize missing data with charts (bar charts of missing data in the train and test sets).
⇒ Fill in missing data in the columns that are kept:
Object-type columns are filled with the mode().
Columns of other types are filled with the mean().
Then check the dataset.
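A sketch of that fill rule, mode() for object columns and mean() otherwise:

```python
# Impute the remaining missing values column by column
for col in df_train.columns[df_train.isnull().any()]:
    if df_train[col].dtype == "object":
        fill = df_train[col].mode()[0]   # most frequent value
    else:
        fill = df_train[col].mean()      # column average
    df_train[col] = df_train[col].fillna(fill)
    if col in df_test.columns:
        df_test[col] = df_test[col].fillna(fill)
```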
2.3 Handling outliers
An outlier is a data point that is distant from all other observations: a point that lies outside the overall distribution of the dataset.
Features with low correlation to SalePrice
Both "WoodDeckSF" and "OpenPorchSF" have a high number of 0 values with correspondingly high price variation.
⇒ The best thing to do is to drop these columns.
Features with strong correlation to SalePrice
Outliers can affect a regression model by pulling the estimated regression line away from the true population regression line.
⇒ We should remove those observations from our data.
Drop outliers
Removing 6 outlier rows leaves 1454 rows of data; removing the 2 columns leaves 77 columns.
2.4 Split the dataset for deeper processing
2.4.1 Numerical data
1. df_train_num  2. df_test_num
The columns with numerical data types are separated into their own DataFrames for processing.
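A sketch of that split, assuming select_dtypes:

```python
# Keep only numeric columns in separate frames for numerical processing
df_train_num = df_train.select_dtypes(include=["int64", "float64"])
df_test_num = df_test.select_dtypes(include=["int64", "float64"])
```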
What is numerical data?
Numerical data is data in the form of numbers rather than language or description. Often called quantitative data, it is collected in number form and differs from other data types in that it can be analyzed statistically and arithmetically.
What is categorical data?
- Categorical data is a data type stored and identified by the names or labels given to the values.
- Data collected in categorical form is also known as qualitative data. Each value can be grouped and labelled by its matching qualities under exactly one category, which makes the categories mutually exclusive.
Calculate and graph the columns of the dataset
Looking at the dispersion of each numerical feature, we see that:
01. The distributions include both discrete and continuous variables.
02. Some variables show very little variation.
⇒ Solution: remove any variable where 95% of the values are similar or constant.
Import VarianceThreshold, a feature selector that removes all features with low variance.
"KitchenAbvGr" has at least 95% identical values ⇒ remove it from both the train and test datasets.
The result is train (32 columns) and test (31 columns).
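A hedged sketch of the quasi-constant filter. The 0.95 * (1 - 0.95) threshold is the usual recipe for flagging features where at least 95% of values are identical; the slides' exact parameters are not shown.

```python
from sklearn.feature_selection import VarianceThreshold

X = df_train_num.drop(columns=["SalePrice"])
selector = VarianceThreshold(threshold=0.95 * (1 - 0.95))
selector.fit(X)

# Columns the selector rejects; per the slides this catches "KitchenAbvGr"
dropped = X.columns[~selector.get_support()]
df_train_num = df_train_num.drop(columns=dropped)
df_test_num = df_test_num.drop(columns=dropped, errors="ignore")
```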
Show correlation
Create a correlation heatmap that shows the relationship between all numerical features and SalePrice.
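A minimal heatmap sketch (seaborn assumed):

```python
import seaborn as sns
import matplotlib.pyplot as plt

# Correlation of all numerical features, SalePrice included
corr = df_train_num.corr()
plt.figure(figsize=(12, 10))
sns.heatmap(corr, cmap="coolwarm")
plt.show()
```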
Show correlation
We see that 14 of these numeric features have a notable correlation with SalePrice.
Show correlation
- The correlation between "GarageCars" and "GarageArea" is the highest (coefficient 0.89).
- "GarageCars" should therefore be dropped from the train and test datasets.
- At this point, the train numerical features have 14 columns and test has 13.
Looking at the heatmap above:

Variables with an absolute correlation below 0.3 are replaced with 0.

We find 10 features with a strong correlation with SalePrice (correlation > 0.5) and 4 features with a weak correlation (correlation 0.3 to 0.5).

Thus we retain 14 correlated features, which together with SalePrice makes 15.
2.4.2 Categorical data processing

Processing the categorical data in the datasets.

Get columns
The columns with categorical data types are separated into their own DataFrame for processing.

Train: 35 columns
Test: 34 columns
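A sketch of the categorical split, mirroring the numerical one:

```python
# Object-typed columns go into their own frames for categorical processing
df_train_cat = df_train.select_dtypes(include=["object"])
df_test_cat = df_test.select_dtypes(include=["object"])
```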
Visualize data into a chart
Visualize the categorical features to see their correlation relationships.
Drop some columns
Some variables are dominated by a single category.
⇒ Drop the 13 such features in this step.

Results:
Train: 22 columns
Test: 21 columns
Variation of the target variable with each categorical feature
Consider the degree of similarity between variables and visualize the data in charts.
The data has a similar meaning
The sale price distributions for certain categorical variables are similar, which suggests that some categorical variables are co-dependent:
"Exterior1st" and "Exterior2nd"
"ExterQual" and "MasVnrType"
"BsmtQual" and "BsmtExposure"
The Chi-squared test
The Chi-squared test is used to evaluate whether there is a relationship between two qualitative (categorical) variables in a data set.

We perform the Chi-squared test for each pair of variables at the 5% significance level:
Exterior1st - Exterior2nd, ExterQual - MasVnrType, BsmtQual - BsmtExposure
⇒ Dependency in every case (reject H0).
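A sketch of the pairwise test at the 5% level, using scipy.stats.chi2_contingency on a contingency table:

```python
import pandas as pd
from scipy.stats import chi2_contingency

def is_dependent(df, col_a, col_b, alpha=0.05):
    """Chi-squared test of independence; True means reject H0 (dependent)."""
    table = pd.crosstab(df[col_a], df[col_b])
    _, p_value, _, _ = chi2_contingency(table)
    return p_value < alpha

for a, b in [("Exterior1st", "Exterior2nd"),
             ("ExterQual", "MasVnrType"),
             ("BsmtQual", "BsmtExposure")]:
    print(a, "-", b, "dependent:", is_dependent(df_train_cat, a, b))
```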


Drop some columns
Remove one variable from each co-dependent pair (3 columns removed).

Results:
Train: 19 columns
Test: 18 columns
Convert to numeric values
Our team converts the categorical entries into numeric entries using the get_dummies() function.

Convert data
Drop the SalePrice column in train, then apply dummies to train and test to obtain binary columns.
Convert data
Three of the dummy columns from the train dataset are not present in the test dataset.
⇒ Drop these 3 columns.
After all these changes, the shape of both datasets (categorical features only) is the same: 128 columns each.
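A hedged sketch of the encoding and the 3-column alignment; pd.get_dummies is assumed from the slides' "dummies()" reference:

```python
import pandas as pd

# One-hot encode, then keep only the dummy columns common to both sets
train_dummies = pd.get_dummies(df_train_cat)
test_dummies = pd.get_dummies(df_test_cat)

common = train_dummies.columns.intersection(test_dummies.columns)
train_dummies = train_dummies[common]   # drops the 3 train-only columns
test_dummies = test_dummies[common]     # both now 128 columns per the slides
```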
Join the numerical and categorical datasets together
Because the two types of data were processed separately, they must be combined back into a single dataset, using the concat function.

Results:
Train: 142 columns
Test: 141 columns
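A sketch of the join with pd.concat:

```python
import pandas as pd

# Recombine the processed numerical and categorical parts side by side (axis=1)
df_train_new = pd.concat([df_train_num, train_dummies], axis=1)  # 142 columns
df_test_new = pd.concat([df_test_num, test_dummies], axis=1)     # 141 columns
```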
2.5 Find important features with XGBoost
XGBoost (Extreme Gradient Boosting) is one of the most commonly used machine learning methods today.

It supports regression and classification, and is well known for providing strong solutions compared with other machine learning algorithms.

XGBoost is popular for its speed and performance, a parallelizable core algorithm, consistently strong results against other methods, and a wide variety of tuning parameters.
Import the XGBoost library and set up the modules that support feature selection.
Create the dataset: x = df_train_new with the features selected above in all_feats; y = df_train_new's SalePrice.
Split df_train_new into x_train, x_val, y_train, y_val from the source datasets x and y with test_size=0.2.
Create a DataFrame whose rows are the values from ctfm.transform(x_test) and whose columns are all the features in all_feats.
Build xtrain, xval, xtest with the xgb.DMatrix() method.
Plot a graph illustrating feature importance by F score.
Compute get_score() on the trained model (model_data).
From the sorted importances (item_sorted), take the first 20 values: the features most strongly related to SalePrice.
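A hedged sketch of the importance step; the hyperparameters and the slides' helper names (all_feats, ctfm) are not shown, so this version uses assumed defaults:

```python
import xgboost as xgb
from sklearn.model_selection import train_test_split

x = df_train_new.drop(columns=["SalePrice"])
y = df_train_new["SalePrice"]
x_train, x_val, y_train, y_val = train_test_split(x, y, test_size=0.2)

dtrain = xgb.DMatrix(x_train, label=y_train)
booster = xgb.train({"objective": "reg:squarederror"}, dtrain, num_boost_round=100)

# Rank features by F score and keep the 20 most important ones
item_sorted = sorted(booster.get_score(importance_type="weight").items(),
                     key=lambda kv: kv[1], reverse=True)
top20 = [name for name, _ in item_sorted[:20]]
```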
03. Building a linear regression model
Build models based on the selected important features and evaluate their performance.

Output goal of this step: successfully build 3 models (with 20, 15 and 10 variables) and select the one that gives the best results.

Step 1: Import the packages.
Step 2: Provide the data to work with and scale it.
Step 3: Create a regression model and fit it to the existing data.
3.2.1 Why data scaling is needed
The model's weights are small and are updated from the prediction error, so scaling the input X and output Y of the training dataset is important. If the input is not scaled, training can become unstable. In addition, in regression problems an unscaled output Y can lead to exploding gradients that cause the algorithm to fail.
Select features for scaling
Select the necessary columns for data scaling; the SalePrice column used for modeling should be dropped.
Split the data with train_test_split from sklearn.model_selection, then use StandardScaler to scale it and view the returned results.
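A sketch of the split-then-scale step; fitting the scaler on the training split only is standard practice, and the column choices are assumed:

```python
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X = df_train_new[top20]            # features only; SalePrice is dropped
y = df_train_new["SalePrice"]
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # fit on train only
X_val_scaled = scaler.transform(X_val)          # reuse train statistics
```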
Visualize with charts
Plot box-and-whisker plots to see how the data changes (before scaling vs. after scaling). The same scaling is applied for model 2.
Model 1
“This is a model built from the 20 most important features selected from XGBoost.”
3.1.2 Build data for model 1
1. Import the LinearRegression class.
2. Calculate the model's attributes: intercept_ and coef_.
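A minimal sketch of model 1:

```python
from sklearn.linear_model import LinearRegression

model1 = LinearRegression()
model1.fit(X_train_scaled, y_train)

print("intercept_:", model1.intercept_)  # learned bias term
print("coef_:", model1.coef_)            # one weight per feature
```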
3.2 Build data for model 2

3.3 Build data for model 3


04. Evaluation of models and experiments
Model evaluation lets us assess the performance of a model and compare different models, so we can choose the best one to take into the experiment.
4.1 Evaluation methods

Measure | Definition | Formula
MAE | The average magnitude of the errors in a set of predictions | MAE = (1/n) Σ |y_i − ŷ_i|
MSE | The mean of the squared differences between the predicted and actual target values | MSE = (1/n) Σ (y_i − ŷ_i)²
RMSE | The square root of the mean of the squares of all the errors | RMSE = √MSE
R² Score | The proportion of the variance in the dependent variable that is predictable from the independent variable(s) | R² = 1 − Σ(y_i − ŷ_i)² / Σ(y_i − ȳ)²
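A sketch of computing the four measures with sklearn.metrics on the validation split:

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

y_pred = model1.predict(X_val_scaled)
mse = mean_squared_error(y_val, y_pred)
print("MAE: ", mean_absolute_error(y_val, y_pred))
print("MSE: ", mse)
print("RMSE:", np.sqrt(mse))
print("R2:  ", r2_score(y_val, y_pred))
```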
4.2 Evaluation results of each model

Model 1: R² = 0.8486
MAE: 21620.51899676774
MSE: 953134790.9843001
RMSE: 30872.881157810654

Model 2: R² = 0.8479
MAE: 22034.17440555345
MSE: 957066907.7115191
RMSE: 30936.49798719175

Model 3: R² = 0.8013
MAE: 25185.78090877306
MSE: 1250517532.8826165
RMSE: 35362.65732213314
4.3 Experiment on external data

Import external data: our team takes an external dataset to predict its house prices.
Price prediction: the prices of this test dataset are predicted with the selected model 1.
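A sketch of scoring external data with the selected model 1, reusing the fitted scaler:

```python
# The external/test frame is assumed to contain the same top-20 feature columns
X_ext = scaler.transform(df_test_new[top20])
predicted_prices = model1.predict(X_ext)
print(predicted_prices[:5])
```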
05. Summary
The objective of this chapter is to present the research results, the limitations of the topic, its practical significance, and future directions.

5.1 Results
01 Analyze data: learned to analyze the sample data from the dataset on Kaggle.
02 Visualize: visualized the information received to draw insights.
03 Soft skills: trained problem-solving thinking, research skills, teamwork skills, and completing work on time.
04 Build model: built a Linear Regression model to predict price based on the available features.
5.2 Limitations & future works
Limitations:
- Time was wasted at the start of the project.
- Our mathematical knowledge for the analysis was limited.
- The project results did not fully achieve all of the goals initially set out.
Future works:
- Build the same house-pricing prediction model with other methods such as Decision Tree, Ridge, and Lasso.
Thanks for listening!
