Analysis and Prediction of House Prices by Linear Regression Model
01. About my team
02. Data analysis and processing
03. Modeling
04. Evaluation of models and experiments
05. Summary
My group will learn about the house price data set and
clean the data set before entering
info() command
01. df_train.info() 02. df_test.info()
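A minimal sketch of what the info() call reports. The small table below is a hypothetical stand-in for df_train, and every column name except SalePrice is an assumption made for illustration.

```python
import io

import pandas as pd

# Hypothetical stand-in for the training table (the real one is loaded from the data set).
df_train = pd.DataFrame({
    "LotArea": [8450, 9600, 11250],
    "Alley": [None, "Pave", None],
    "SalePrice": [208500, 181500, 223500],
})

# info() lists each column with its non-null count and dtype.
# Writing into a buffer lets us keep the report as a string as well as print it.
buffer = io.StringIO()
df_train.info(buf=buffer)
info_text = buffer.getvalue()
print(info_text)
```

Columns whose non-null count is lower than the number of rows are the ones that contain missing data.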
How do we handle missing data?
Handling missing data
The purpose of missing data processing is to clean the data so
that the following steps are more convenient and, at the same
time, to reduce data distortion. This is an important step
because it affects the results of the data modeling and the
prediction for the problem.
• Drop columns with missing data above 5%
• Replace missing values of data type 'object' with the most frequently appearing value
• Fill missing data of types other than 'object' with the mean of that column
Sum the null values in each column of the dataset and sort
the columns by null count from high to low, to get a general
view of the missing data so that it can be handled
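The counting and sorting step can be sketched as follows. The data frame and its column names are hypothetical; only the pandas calls (isnull, sum, sort_values) are the point.

```python
import numpy as np
import pandas as pd

# Hypothetical table with a different number of missing values per column.
df = pd.DataFrame({
    "a": [1.0, np.nan, np.nan, np.nan],
    "b": [1.0, 2.0, np.nan, 4.0],
    "c": [1.0, 2.0, 3.0, 4.0],
})

# Count nulls per column and sort from high to low.
null_counts = df.isnull().sum().sort_values(ascending=False)

# Keep only the columns that actually have missing values.
null_counts = null_counts[null_counts > 0]
print(null_counts)
```

If matplotlib is installed, null_counts.plot(kind="bar") would turn the same series into a bar chart.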
Find the columns whose percentage of null values is above 5% so that they can be dropped
• Columns in the train data table with fewer than 1387 non-null values (more than 5% missing) will be dropped
• Columns in the test data table with fewer than 1387 non-null values (more than 5% missing) will be dropped
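One way to apply the 5% rule is to compute the share of missing values per column and keep only the columns at or below the limit. This is a sketch, not necessarily the code used in the project: drop_sparse_columns and the sample column names are hypothetical.

```python
import numpy as np
import pandas as pd


def drop_sparse_columns(df, max_missing=0.05):
    """Return df without the columns whose share of nulls exceeds max_missing."""
    missing_share = df.isnull().mean()  # fraction of nulls per column
    keep = missing_share[missing_share <= max_missing].index
    return df[keep]


# Hypothetical 40-row table.
df = pd.DataFrame({
    "many_missing": [np.nan] * 4 + [1.0] * 36,  # 10% missing, dropped
    "few_missing": [np.nan] + [1.0] * 39,       # 2.5% missing, kept
    "complete": list(range(40)),                # 0% missing, kept
})

cleaned = drop_sparse_columns(df)
print(list(cleaned.columns))
```

The same rule can be written with DataFrame.dropna(axis=1, thresh=k), where k is the minimum number of non-null values a column needs in order to be kept.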
Drop the null data
Classify data
Expressed in code
Drop data
Visualize missing data
with a chart
If the column is of another type (not 'object'), its missing values are filled in with the column's mean().
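The two filling rules (most frequent value for 'object' columns, mean for the others) can be sketched as below. fill_missing and the sample columns are hypothetical. The check is written as "numeric or not" rather than "object or not", an assumption that gives the same split for tables holding only text and numeric columns.

```python
import numpy as np
import pandas as pd


def fill_missing(df):
    """Fill text columns with their most frequent value and numeric columns with their mean."""
    filled = df.copy()
    for col in filled.columns:
        if filled[col].isnull().sum() == 0:
            continue  # nothing to fill in this column
        if pd.api.types.is_numeric_dtype(filled[col]):
            filled[col] = filled[col].fillna(filled[col].mean())
        else:
            # mode() ignores nulls; [0] takes the most frequent value
            filled[col] = filled[col].fillna(filled[col].mode()[0])
    return filled


# Hypothetical table with one text column and one numeric column.
df = pd.DataFrame({
    "kind": ["a", "b", None, "a"],
    "area": [10.0, np.nan, 30.0, 20.0],
})

filled = fill_missing(df)
print(filled)
```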
Check dataset
2.3 Handling outliers
- Categorical data is data that can be stored and identified by
the names or labels given to it.
- Data collected in categorical form is also known as qualitative data.
Each value can be grouped and labelled according to its matching qualities,
under only one category. This makes the categories mutually exclusive.
Calculate and graph the columns of the dataset
Looking at the dispersion of each numerical feature, we see that:
01. Distribution types will include discrete and continuous variables in the data set.
02. There will be variables with small changes.
⇒ Solution: Remove any variable where 95% of the values are similar or constant.
Import VarianceThreshold from scikit-learn
(sklearn.feature_selection), a feature
selector that removes all features with
low variance.
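A minimal sketch of the selector on a hypothetical table. The threshold of 0.05 is an assumption chosen for illustration. For a 0/1 feature, "95% of the values identical" corresponds to a variance of 0.95 * 0.05 = 0.0475.

```python
import pandas as pd
from sklearn.feature_selection import VarianceThreshold

# Hypothetical numeric features: one almost constant, one that varies.
X = pd.DataFrame({
    "nearly_constant": [0] * 99 + [1],  # variance 0.01 * 0.99 = 0.0099
    "varied": list(range(100)),
})

# Features whose variance does not exceed the threshold are removed.
selector = VarianceThreshold(threshold=0.05)
selector.fit(X)

# get_support() returns a boolean mask of the features that are kept.
kept = list(X.columns[selector.get_support()])
print(kept)
```

Because variance depends on the scale of each feature, one threshold may not suit columns measured in very different units.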
Find the 10 features with a strong correlation with SalePrice (correlation > 0.5)
and the 4 features with a weak correlation with SalePrice (correlation from 0.3 to 0.5).
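The split into strong and weak correlations can be sketched as follows. The data is synthetic and the feature names are hypothetical. Taking the absolute value of the correlation is an assumption, made so that strong negative correlations also count.

```python
import numpy as np
import pandas as pd

# Synthetic data: one feature drives the price, the other is pure noise.
rng = np.random.default_rng(0)
n = 200
driver = rng.normal(1500, 300, n)
df = pd.DataFrame({
    "feature_driver": driver,
    "feature_noise": rng.normal(0, 1, n),
    "SalePrice": driver * 100 + rng.normal(0, 5000, n),
})

# Correlation of every feature with the target, strongest first.
corr = (
    df.corr()["SalePrice"]
    .drop("SalePrice")
    .abs()
    .sort_values(ascending=False)
)

strong = corr[corr > 0.5]                   # correlation level > 0.5
weak = corr[(corr >= 0.3) & (corr <= 0.5)]  # correlation level from 0.3 to 0.5
print(strong)
print(weak)
```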
Train: 35 columns
Test: 34 columns
Visualize data into a chart
Results
Train: 22 columns
Test: 21 columns
Variation of target variable with
each categorical feature
Remove one variable from each pair of
co-dependent variables (3 columns removed)
Results
Train: 19 columns
Test: 18 columns
Convert to numeric values
Our team will convert categorical entries into numeric entries using
the pandas get_dummies() function
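In pandas the conversion is done with get_dummies(), which creates one indicator column per category. A sketch on hypothetical data follows. The reindex step that gives the test table the same columns as the train table is an assumption, not necessarily how the project aligned the two sets.

```python
import pandas as pd

# Hypothetical categorical tables; the test set lacks the "Grvl" category.
cat_train = pd.DataFrame({"Street": ["Pave", "Grvl", "Pave"]})
cat_test = pd.DataFrame({"Street": ["Pave", "Pave"]})

# One indicator column per category value.
train_dummies = pd.get_dummies(cat_train)
test_dummies = pd.get_dummies(cat_test)

# Give the test table the train table's columns, filling absent categories with 0.
test_dummies = test_dummies.reindex(columns=train_dummies.columns, fill_value=0)

print(list(train_dummies.columns))
print(list(test_dummies.columns))
```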
Convert data
Drop 3 columns
Convert data
After all these changes, both datasets (categorical features only)
have the same shape: 128 columns each
Join numerical and
categorical datasets together
Results
Train: 142 columns
Test: 141 columns
Combined using the concat function
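Joining the numerical and categorical parts side by side with concat can be sketched like this. The two small tables and their column names are hypothetical, apart from SalePrice.

```python
import pandas as pd

# Hypothetical numerical and (already encoded) categorical parts of the train set.
num_train = pd.DataFrame({"area": [50, 80], "SalePrice": [100, 160]})
cat_train = pd.DataFrame({"kind_a": [1, 0], "kind_b": [0, 1]})

# axis=1 places the columns next to each other; rows are matched on the index.
train = pd.concat([num_train, cat_train], axis=1)
print(train.shape)
```

Both parts need the same row index, otherwise concat introduces rows with missing values.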
2.5 Find important features
according to XGBoost
XGBoost (Extreme Gradient Boosting) is one of the most commonly used
machine learning methods today.
1. Import the LinearRegression class
2. Calculate the model's properties: intercept_, coef_
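The two steps can be sketched with scikit-learn as follows. The data is a made-up exact line (y = 3x + 2), so the fitted properties are easy to check. It is not the house price data.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up data lying exactly on the line y = 3x + 2.
X = np.array([[1.0], [2.0], [3.0], [4.0]])
y = 3.0 * X[:, 0] + 2.0

model = LinearRegression()
model.fit(X, y)

# intercept_ is the constant term; coef_ holds one weight per feature.
print(model.intercept_, model.coef_)
```

With several features, coef_ holds one weight per column of X, in the same order as the columns.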
3.1.2 Build data for model 1
3.2 Build data for model 2
0.8486
MAE is: 21620.51899676774
MSE is: 953134790.9843001
RMSE is: 30872.881157810654
0.8479
MAE is: 22034.17440555345
MSE is: 957066907.7115191
RMSE is: 30936.49798719175
0.8013
MAE is: 25185.78090877306
MSE is: 1250517532.8826165
RMSE is: 35362.65732213314
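The three error measures reported above can be computed with scikit-learn as in the sketch below. The values of y_true and y_pred are made up for illustration and are not the project's predictions.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

# Made-up true prices and predictions.
y_true = np.array([100.0, 200.0, 300.0])
y_pred = np.array([110.0, 190.0, 330.0])

mae = mean_absolute_error(y_true, y_pred)  # mean of |error|
mse = mean_squared_error(y_true, y_pred)   # mean of error squared
rmse = np.sqrt(mse)                        # square root of MSE, in price units

print("MAE is:", mae)
print("MSE is:", mse)
print("RMSE is:", rmse)
```

RMSE is the square root of MSE, so it is expressed in the same unit as the price, which makes it easier to compare with MAE.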
4.3 Experiment with external data
01
Analyze data: know how to analyze the sample data from the dataset on Kaggle
Visualize: visualize the information received to draw insight and especially
04 02