Analysis and Prediction of House Prices by Linear Regression Model

The document provides an overview of analyzing and predicting house prices using linear regression. It discusses data analysis and processing, including exploring the data, handling missing values, outlier detection, and feature engineering. Numerical and categorical features are separated. Correlations between features and the target (SalePrice) are calculated. Features with low correlations are removed. The preprocessed data will then be used to build a linear regression model to predict house prices.


Analysis and prediction of house prices by Linear Regression model
Course: Data Analytics with R/Python
Lecturer: Nguyen Phat Dat, MA
Group: MEP Familu
Table of contents
01. Overview of the topic
02. Data analysis and processing
03. Building a linear regression model
04. Evaluation of models and experiments
05. Summary
About my team

01 Nguyễn Trần Ngọc Trâm
02 Nguyễn Thanh Phong
03 Nguyễn Thị Tố Nhi
04 Đỗ Thị Thanh Phương
05 Trương Thị Thanh Thư
06 Vũ Thị Thu Thảo
01. Overview of the topic
Learn about the problem, the project steps, and the research questions.
Problem
Real estate is one of the most expensive markets today, and buying a house is a major decision. Before deciding to buy, customers carefully weigh the factors that affect their choice of house. Investors therefore need to identify which factors influence house prices so they can predict a selling price that is consistent with reality.
Research questions
Question 1: What factors strongly influence house prices?
Question 2: What is the difference between the predicted house price and the actual house price?
Question 3: How can house prices be predicted?
Implementation process
01 Exploratory Data Analysis
02 Feature engineering
03 Modeling
04 Evaluation
02. Data analysis and processing
Our group explores the house price dataset and cleans it before modeling.
The info() command
01. df_train.info(): data types float64(3), int64(35), object(43)
02. df_test.info(): data types float64(11), int64(26), object(43)
Many columns have missing data.
df_test will receive the same data processing as df_train for modeling and prediction.
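A minimal sketch of this first look, assuming the train.csv and test.csv files from the Kaggle House Prices dataset:

```python
import pandas as pd

# Assumed file names from the Kaggle House Prices dataset
df_train = pd.read_csv("train.csv")
df_test = pd.read_csv("test.csv")

df_train.info()  # 1460 rows, 81 columns: float64(3), int64(35), object(43)
df_test.info()   # 1459 rows, 80 columns: float64(11), int64(26), object(43)
```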
2.1 General understanding of the datasets

Training data: 1460 rows, 81 columns
Testing data: 1459 rows, 80 columns

The in / not in command checks which columns appear in one dataset but not the other. This explains the difference in the number of columns between df_train and df_test: the SalePrice target exists only in the training data.
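A quick sketch of that in / not in check:

```python
# Columns present in df_train but not in df_test
missing_cols = [col for col in df_train.columns if col not in df_test.columns]
print(missing_cols)  # ['SalePrice'] -- the target exists only in the training data
```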
Discover how home prices (SalePrice) are distributed
df_train.SalePrice.describe() shows that the average sale price of a house in our dataset is close to $180,000, with most of the values falling within the $130,000 to $215,000 range.
Check the skewness, a measure of the shape of the distribution of values. Then use plt.hist() to plot a histogram of SalePrice.
The distribution has a longer tail on the right: it is positively skewed, and some outliers lie above ~500,000.
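A sketch of the exploration described above (the bin count is an assumption):

```python
import matplotlib.pyplot as plt

print(df_train.SalePrice.describe())       # mean close to $180,000
print("Skew:", df_train.SalePrice.skew())  # positive => long right tail

plt.hist(df_train.SalePrice, bins=50)      # histogram of SalePrice
plt.xlabel("SalePrice")
plt.ylabel("Count")
plt.show()
```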
Use np.log() to transform df_train.SalePrice, compute the skew a second time, and redraw the data.
A skew value close to 0 means we have improved the skewness: the data now looks more like a normal distribution.
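A sketch of the log transform; whether the slides overwrite the column or keep a copy is not shown, so this version uses a separate variable:

```python
import numpy as np
import matplotlib.pyplot as plt

log_price = np.log(df_train.SalePrice)      # compress the long right tail
print("Skew after log:", log_price.skew())  # should now be close to 0

plt.hist(log_price, bins=50)
plt.show()
```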
2.2 Handling missing data
Why do we need to handle missing data, and why is it important?
Handling missing data
The purpose of missing-data processing is to clean the data so that the following steps are more convenient, and at the same time to reduce data distortion. This is an important step because it affects the results of data modeling and prediction.

● Many machine learning algorithms fail if the dataset contains missing values.
● If missing values are not handled properly, we may end up building a biased machine learning model that produces incorrect results.
● Missing data can reduce the precision of the statistical analysis.
Two handling methods
1. Drop data: drop columns with more than 5% missing data.
2. Replace values: impute columns with less than 5% missing data.
Solutions for handling missing data
Firstly (more than 5% missing): drop the column.
Secondly (less than 5% missing, dtype 'object'): replace values with the most frequent value.
Finally (less than 5% missing, other dtypes): fill with the mean of that column.
Sum the null values in the dataset, sort the columns by null count from high to low, and display them for a general view of the null data so it can be handled.

--Find null values in the data set--

Overview of the method

Find the columns whose percentage of null values is above 5% and drop them:

• Columns in the train data table with fewer than 1387 non-null values (1387 is 95% of 1460 rows, i.e. more than 5% missing) will be dropped.
• Columns in the test data table are dropped by the same rule.
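A hedged sketch of the 5% rule; the slides' exact code is not shown:

```python
# Fraction of null values per column, sorted high to low
null_pct = df_train.isnull().sum().sort_values(ascending=False) / len(df_train)
print(null_pct.head(10))

# Drop columns with more than 5% missing data from both tables
cols_to_drop = null_pct[null_pct > 0.05].index
df_train = df_train.drop(columns=cols_to_drop)
df_test = df_test.drop(columns=cols_to_drop, errors="ignore")
```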
Perform the null-data drop
Classify the data, express the rule in the code shown on the slide, then drop the data.
Visualize missing data with charts (bar charts of missing data in the train and test sets).
⇒ Fill in missing data in the columns that are kept:
Object-type columns are filled with the mode().
Columns of other types are filled with the mean().
Then check the dataset.
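A sketch of that fill rule, mode() for object columns and mean() otherwise:

```python
# Impute the remaining missing values column by column
for col in df_train.columns[df_train.isnull().any()]:
    if df_train[col].dtype == "object":
        fill = df_train[col].mode()[0]   # most frequent value
    else:
        fill = df_train[col].mean()      # column average
    df_train[col] = df_train[col].fillna(fill)
    if col in df_test.columns:
        df_test[col] = df_test[col].fillna(fill)
```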
2.3 Handling outliers
An outlier is a data point that is distant from all other observations: a point that lies outside the overall distribution of the dataset.
Features with low correlation to SalePrice
Both "WoodDeckSF" and "OpenPorchSF" have a high number of 0 values with correspondingly high price variation.
⇒ The best thing to do is to drop these columns.
Features with strong correlation to SalePrice
Outliers can affect a regression model by pulling the estimated regression line away from the true population regression line.
⇒ We should remove those observations from our data.
Drop outliers
Removing 6 outlier rows leaves 1454 rows of data; removing the 2 columns leaves 77 columns.
2.4 Split the dataset for deeper processing
2.4.1 Numerical data
1. df_train_num  2. df_test_num
The columns with numerical data types are separated into their own DataFrames for processing.
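A sketch of that split, assuming select_dtypes:

```python
# Keep only numeric columns in separate frames for numerical processing
df_train_num = df_train.select_dtypes(include=["int64", "float64"])
df_test_num = df_test.select_dtypes(include=["int64", "float64"])
```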
What is numerical data?
Numerical data is data in the form of numbers rather than language or description. Often called quantitative data, it is collected in number form and differs from other data types in that it can be analyzed statistically and arithmetically.
What is categorical data?
- Categorical data is a data type stored and identified by the names or labels given to the values.
- Data collected in categorical form is also known as qualitative data. Each value can be grouped and labelled by its matching qualities under exactly one category, which makes the categories mutually exclusive.
Calculate and graph the columns of the dataset
Looking at the dispersion of each numerical feature, we see that:
01. The distributions include both discrete and continuous variables.
02. Some variables show very little variation.
⇒ Solution: remove any variable where 95% of the values are similar or constant.
Import VarianceThreshold, a feature selector that removes all features with low variance.
"KitchenAbvGr" has at least 95% identical values ⇒ remove it from both the train and test datasets.
The result is train (32 columns) and test (31 columns).
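A hedged sketch of the quasi-constant filter. The 0.95 * (1 - 0.95) threshold is the usual recipe for flagging features where at least 95% of values are identical; the slides' exact parameters are not shown.

```python
from sklearn.feature_selection import VarianceThreshold

X = df_train_num.drop(columns=["SalePrice"])
selector = VarianceThreshold(threshold=0.95 * (1 - 0.95))
selector.fit(X)

# Columns the selector rejects; per the slides this catches "KitchenAbvGr"
dropped = X.columns[~selector.get_support()]
df_train_num = df_train_num.drop(columns=dropped)
df_test_num = df_test_num.drop(columns=dropped, errors="ignore")
```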
Show correlation
Create a correlation heatmap that shows the relationship between all numerical features and SalePrice.
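A minimal heatmap sketch (seaborn assumed):

```python
import seaborn as sns
import matplotlib.pyplot as plt

# Correlation of all numerical features, SalePrice included
corr = df_train_num.corr()
plt.figure(figsize=(12, 10))
sns.heatmap(corr, cmap="coolwarm")
plt.show()
```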
Show correlation
We see that 14 of these numeric features have a notable correlation with SalePrice.
Show correlation
- The correlation between "GarageCars" and "GarageArea" is the highest (coefficient 0.89).
- "GarageCars" should therefore be dropped from the train and test datasets.
- At this point, the train numerical features have 14 columns and test has 13.
Looking at the heatmap above:

Variables with an absolute correlation below 0.3 are replaced with 0.

We find 10 features with a strong correlation with SalePrice (correlation > 0.5) and 4 features with a weak correlation (correlation 0.3 to 0.5).

Thus we retain 14 correlated features, which together with SalePrice makes 15.
2.4.2 Categorical data processing

Processing the categorical data in the datasets.

Get columns
The columns with categorical data types are separated into their own DataFrame for processing.

Train: 35 columns
Test: 34 columns
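A sketch of the categorical split, mirroring the numerical one:

```python
# Object-typed columns go into their own frames for categorical processing
df_train_cat = df_train.select_dtypes(include=["object"])
df_test_cat = df_test.select_dtypes(include=["object"])
```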
Visualize data into a chart
Visualize the categorical features to see their correlation relationships.
Drop some columns
Some variables are dominated by a single category.
⇒ Drop the 13 such features in this step.

Results:
Train: 22 columns
Test: 21 columns
Variation of the target variable with each categorical feature
Consider the degree of similarity between variables and visualize the data in charts.
The data has a similar meaning
The sale price distributions for certain categorical variables are similar, which suggests that some categorical variables are co-dependent:
"Exterior1st" and "Exterior2nd"
"ExterQual" and "MasVnrType"
"BsmtQual" and "BsmtExposure"
The Chi-squared test
The Chi-squared test is used to evaluate whether there is a relationship between two qualitative (categorical) variables in a data set.

We perform the Chi-squared test for each pair of variables at the 5% significance level:
Exterior1st - Exterior2nd, ExterQual - MasVnrType, BsmtQual - BsmtExposure
⇒ Dependency in every case (reject H0).
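A sketch of the pairwise test at the 5% level, using scipy.stats.chi2_contingency on a contingency table:

```python
import pandas as pd
from scipy.stats import chi2_contingency

def is_dependent(df, col_a, col_b, alpha=0.05):
    """Chi-squared test of independence; True means reject H0 (dependent)."""
    table = pd.crosstab(df[col_a], df[col_b])
    _, p_value, _, _ = chi2_contingency(table)
    return p_value < alpha

for a, b in [("Exterior1st", "Exterior2nd"),
             ("ExterQual", "MasVnrType"),
             ("BsmtQual", "BsmtExposure")]:
    print(a, "-", b, "dependent:", is_dependent(df_train_cat, a, b))
```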


Drop some columns
Remove one variable from each co-dependent pair (3 columns removed).

Results:
Train: 19 columns
Test: 18 columns
Convert to numeric values
Our team converts the categorical entries into numeric entries using the get_dummies() function.

Convert data
Drop the SalePrice column in train, then apply dummies to train and test to obtain binary columns.
Convert data
Three of the dummy columns from the train dataset are not present in the test dataset.
⇒ Drop these 3 columns.
After all these changes, the shape of both datasets (categorical features only) is the same: 128 columns each.
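A hedged sketch of the encoding and the 3-column alignment; pd.get_dummies is assumed from the slides' "dummies()" reference:

```python
import pandas as pd

# One-hot encode, then keep only the dummy columns common to both sets
train_dummies = pd.get_dummies(df_train_cat)
test_dummies = pd.get_dummies(df_test_cat)

common = train_dummies.columns.intersection(test_dummies.columns)
train_dummies = train_dummies[common]   # drops the 3 train-only columns
test_dummies = test_dummies[common]     # both now 128 columns per the slides
```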
Join the numerical and categorical datasets together
Because the two types of data were processed separately, they must be combined back into a single dataset, using the concat function.

Results:
Train: 142 columns
Test: 141 columns
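A sketch of the join with pd.concat:

```python
import pandas as pd

# Recombine the processed numerical and categorical parts side by side (axis=1)
df_train_new = pd.concat([df_train_num, train_dummies], axis=1)  # 142 columns
df_test_new = pd.concat([df_test_num, test_dummies], axis=1)     # 141 columns
```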
2.5 Find important features with XGBoost
XGBoost (Extreme Gradient Boosting) is one of the most commonly used machine learning methods today.

It supports regression and classification, and is well known for providing strong solutions compared with other machine learning algorithms.

XGBoost is popular for its speed and performance, a parallelizable core algorithm, consistently strong results against other methods, and a wide variety of tuning parameters.
Import the XGBoost library and set up the modules that support feature selection.
Create the dataset: x = df_train_new with the features selected above in all_feats; y = df_train_new's SalePrice.
Split df_train_new into x_train, x_val, y_train, y_val from the source datasets x and y with test_size=0.2.
Create a DataFrame whose rows are the values from ctfm.transform(x_test) and whose columns are all the features in all_feats.
Build xtrain, xval, xtest with the xgb.DMatrix() method.
Plot a graph illustrating feature importance by F score.
Compute get_score() on the trained model (model_data).
From the sorted importances (item_sorted), take the first 20 values: the features most strongly related to SalePrice.
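A hedged sketch of the importance step; the hyperparameters and the slides' helper names (all_feats, ctfm) are not shown, so this version uses assumed defaults:

```python
import xgboost as xgb
from sklearn.model_selection import train_test_split

x = df_train_new.drop(columns=["SalePrice"])
y = df_train_new["SalePrice"]
x_train, x_val, y_train, y_val = train_test_split(x, y, test_size=0.2)

dtrain = xgb.DMatrix(x_train, label=y_train)
booster = xgb.train({"objective": "reg:squarederror"}, dtrain, num_boost_round=100)

# Rank features by F score and keep the 20 most important ones
item_sorted = sorted(booster.get_score(importance_type="weight").items(),
                     key=lambda kv: kv[1], reverse=True)
top20 = [name for name, _ in item_sorted[:20]]
```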
03. Building a linear regression model
Build models based on the selected important features and evaluate their performance.

Output goal of this step: successfully build 3 models (with 20, 15 and 10 variables) and select the one that gives the best results.

Step 1: Import the packages.
Step 2: Provide the data to work with and scale it.
Step 3: Create a regression model and fit it to the existing data.
3.2.1 Why data scaling is needed
The model's weights are small and are updated from the prediction error, so scaling the input X and output Y of the training dataset is important. If the input is not scaled, training can become unstable. In addition, in regression problems an unscaled output Y can lead to exploding gradients that cause the algorithm to fail.
Select features for scaling
Select the necessary columns for data scaling; the SalePrice column used for modeling should be dropped.
Split the data with train_test_split from sklearn.model_selection, then use StandardScaler to scale it and view the returned results.
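A sketch of the split-then-scale step; fitting the scaler on the training split only is standard practice, and the column choices are assumed:

```python
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X = df_train_new[top20]            # features only; SalePrice is dropped
y = df_train_new["SalePrice"]
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # fit on train only
X_val_scaled = scaler.transform(X_val)          # reuse train statistics
```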
Visualize with charts
Plot box-and-whisker plots to see how the data changes (before scaling vs. after scaling). The same scaling is applied for model 2.
Model 1
“This is a model built from the 20 most important features selected from XGBoost.”
3.1.2 Build data for model 1
1. Import the LinearRegression class.
2. Calculate the model's attributes: intercept_ and coef_.
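A minimal sketch of model 1:

```python
from sklearn.linear_model import LinearRegression

model1 = LinearRegression()
model1.fit(X_train_scaled, y_train)

print("intercept_:", model1.intercept_)  # learned bias term
print("coef_:", model1.coef_)            # one weight per feature
```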
3.2 Build data for model 2

3.3 Build data for model 3


04. Evaluation of models and experiments
Model evaluation lets us assess the performance of a model and compare different models, so we can choose the best one to take into the experiment.
4.1 Evaluation methods

Measure | Definition | Formula
MAE | The average magnitude of the errors in a set of predictions | MAE = (1/n) Σ |y_i − ŷ_i|
MSE | The mean of the squared differences between the predicted and actual target values | MSE = (1/n) Σ (y_i − ŷ_i)²
RMSE | The square root of the mean of the squares of all the errors | RMSE = √MSE
R² Score | The proportion of the variance in the dependent variable that is predictable from the independent variable(s) | R² = 1 − Σ(y_i − ŷ_i)² / Σ(y_i − ȳ)²
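A sketch of computing the four measures with sklearn.metrics on the validation split:

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

y_pred = model1.predict(X_val_scaled)
mse = mean_squared_error(y_val, y_pred)
print("MAE: ", mean_absolute_error(y_val, y_pred))
print("MSE: ", mse)
print("RMSE:", np.sqrt(mse))
print("R2:  ", r2_score(y_val, y_pred))
```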
4.2 Evaluation results of each model

Model 1: R² = 0.8486
MAE: 21620.51899676774
MSE: 953134790.9843001
RMSE: 30872.881157810654

Model 2: R² = 0.8479
MAE: 22034.17440555345
MSE: 957066907.7115191
RMSE: 30936.49798719175

Model 3: R² = 0.8013
MAE: 25185.78090877306
MSE: 1250517532.8826165
RMSE: 35362.65732213314
4.3 Experiment on external data

Import external data: our team takes an external dataset to predict its house prices.
Price prediction: the prices of this test dataset are predicted with the selected model 1.
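A sketch of scoring external data with the selected model 1, reusing the fitted scaler:

```python
# The external/test frame is assumed to contain the same top-20 feature columns
X_ext = scaler.transform(df_test_new[top20])
predicted_prices = model1.predict(X_ext)
print(predicted_prices[:5])
```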
05. Summary
The objective of this chapter is to present the research results, the limitations of the topic, its practical significance, and future directions.

5.1 Results
01 Analyze data: learned to analyze the sample data from the dataset on Kaggle.
02 Visualize: visualized the information received to draw insights.
03 Soft skills: trained problem-solving thinking, research skills, teamwork skills, and completing work on time.
04 Build model: built a Linear Regression model to predict price based on the available features.
5.2 Limitations & future works
Limitations:
- Time was wasted at the start of the project.
- Our mathematical knowledge for the analysis was limited.
- The project results did not fully achieve all of the goals initially set out.
Future works:
- Build the same house-pricing prediction model with other methods such as Decision Tree, Ridge, and Lasso.
Thanks for listening!
