
House-price Prediction Based on OLS Linear Regression and

Random Forest
Yige Wang∗
Beijing Jiaotong University, Beijing 100044, China
[email protected]
ABSTRACT
Based on a dataset with 20 house features, this paper builds two forecasting models to predict house prices. It first applies a series of data-processing steps, including missing-value handling, outlier removal, new-variable creation, and correlation analysis. It then builds a first theoretical model using the OLS method. Next, it uses secondary cross-validation to optimize the parameters of a random forest and obtains a comparatively realistic model. Finally, based on the results, the paper analyzes and compares the advantages and disadvantages of the two models.

CCS CONCEPTS
• Applied computing; • Operations research; • Consumer products;

KEYWORDS
OLS linear regression, random forest, house-price forecasting

ACM Reference Format:
Yige Wang. 2021. House-price Prediction Based on OLS Linear Regression and Random Forest. In 2021 2nd Asia Service Sciences and Software Engineering Conference (ASSE '21), February 24–26, 2021, Macau, Macao. ACM, New York, NY, USA, 5 pages. https://doi.org/10.1145/3456126.3456139

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].
ASSE '21, February 24–26, 2021, Macau, Macao
© 2021 Association for Computing Machinery.
ACM ISBN 978-1-4503-8908-2/21/02. . . $15.00
https://doi.org/10.1145/3456126.3456139

1 INTRODUCTION
In the house-price market, 'forecasting' is the fastest-growing research topic, given its importance and popularity among the public and among companies of all scales, its financial benefits, and its low risk. The real estate industry accounts for a large share of GDP and directly influences citizens' consumption level as well as the overall domestic economy. Thus, accurate house-price forecasting not only benefits investors and consumers but also underpins steady economic development. Because of the weaknesses of traditional methods, such as subjectivity and empiricism, scholars proposed the hedonic method to improve accuracy. The hedonic method, combined with OLS regression, and random forest have both been widely applied to house-price prediction [1].

Based on the hedonic model, Thibodeau [2] and Selim [3] studied the impact of environmental characteristics on house prices and built multiple linear regression models to predict them. They found that house area, age, and proximity to transportation, schools, and scenic views are the most important environmental factors in house prices. Bourassa, Cantoni, and Hoesli [4] compared OLS linear regression with three other models to explore how OLS captures the relationship between spatial patterns and house prices; they showed that OLS emphasizes neighborhood factors but ignores spatial structure. Anderson and West [5], Crompton [6], and Dehring and Dunse [7] concluded, using hedonic methods, that house sale prices have positive relationships with a variety of open spaces, such as urban parks, land in conservation easements, agricultural farmland, golf courses, and other greenbelts.

Antipov and Pokryshevskaya [8] used decision trees to evaluate house characteristics and predict house prices. They concluded that random forest is stable in controlling outliers and effective with missing values as well as categorical variables. Hong et al. [9] maintained that the traditional hedonic pricing method is unstable and inaccurate, whereas random forest is competent because it can capture complexity and non-linear relationships in practice. Yoo et al. [10] applied machine-learning regression methods to house-price prediction to select feature variables and compared the results with the traditional OLS method. They found random forest useful for ranking variable importance in the hedonic price equation because the method embeds an in-depth hypothesis about which variables express customers' preferences. Čeh et al. [11] used a GIS explanatory dataset, including structural and age features of houses and neighborhood information, to build a random forest model. They also measured performance with R², MAPE, and COD (coefficient of dispersion), revealing that machine-learning methods for house-price prediction are promising.

OLS linear regression and random forest are among the most popular forecasting models. The former is easier to understand and focuses on explaining the correlations among variables. The latter is a bagging ensemble that requires many arithmetic operations and can handle non-linear relationships and overfitting. Therefore, this research applies both methods to house-price forecasting and compares the differences between them.

2 DATA PREPROCESSING
2.1 Data Access and Description
The house-price dataset, covering the USA and obtained from Kaggle, contains 21613 samples with 20 features, which suits this study in both data size and feature coverage. It includes architectural features of houses (e.g. waterfront, yr_built, number of bedrooms), locations of residential houses (e.g. latitude and longitude), and other features (e.g. price and grade). As there is no null value in the dataset, we skip missing-value processing.


Table 1: Category of 20 Feature Variables

Variable Type         Architectural Features              Geographic Features   Others
Categorical Variable  waterfront, yr_built, yr_renovated  zipcode, lat, long    id, date
Numerical Variable    bedrooms, bathrooms, sqft_living,   (none)                price, view,
                      sqft_lot, floors, sqft_living15,                          condition, grade
                      sqft_lot15

Table 2: The Description of Variables

Variable       Description
sqft_living    Square footage of the apartment interior living space
grade          An index from 1 to 13, where 1-3 falls short in building construction and design,
               7 is an average level of construction and design, and 11-13 is a high level
sqft_above     Square footage of the apartment interior housing space above ground level
sqft_living15  Square footage of the apartment interior living space for the nearest 15 neighbors
date           The date when the house was sold
yr_built       The year when the house was built

Figure 1: Boxplot of Price Before and After Outliers Removal
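Figure 1's before-and-after comparison follows the boxplot rule described in Section 2.2; a minimal pandas sketch on invented prices:

```python
import pandas as pd

# Toy price column (values invented); the last sale is an extreme outlier.
price = pd.Series([200_000, 250_000, 300_000, 350_000, 400_000, 5_000_000])

# Boxplot (Tukey) rule: values outside [QL - 1.5*IQR, QU + 1.5*IQR] are
# treated as outliers, where QL/QU are the lower/upper quartiles.
ql, qu = price.quantile(0.25), price.quantile(0.75)
iqr = qu - ql
inliers = price[(price >= ql - 1.5 * iqr) & (price <= qu + 1.5 * iqr)]

print(len(inliers))  # 5 -> only the 5,000,000 sale is removed
```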

2.2 Outlier Removal
The boxplot is an effective method for detecting outliers, defined as values smaller than QL − 1.5·IQR or larger than QU + 1.5·IQR (QL is the lower quartile, QU the upper quartile, and IQR = QU − QL). We draw a boxplot of price with the outliers highlighted in red, and then remove them.

2.3 New Variable Creation and Variable Transformation
Because no variable indicates the house age or whether the house was renovated, we create a new variable "built_year", expressing the house age as the year sold minus the year built, and a new variable "if_renovated", coded 0 or 1 (0 represents no renovation).

Checking the variables for normality shows that some are significantly skewed to the left or right, which would lead to heteroscedasticity in the regression analysis. For example, "sqft_above" needs a logarithmic transformation because its skewness is 1.068; after the transformation, the skewness declines to 0.1736, approximately meeting the normality requirement. We apply the logarithmic transformation to every variable whose skewness is greater than 1.

In this housing-price dataset, the variables are measured in different units. To eliminate the adverse effect of these differing scales, all variables are standardized.

2.4 Correlation Analysis
Because of the large number of variables, a correlation chart was made for the variables whose correlation with price is greater than 0.3. Price has the strongest correlations with grade and house area, and is also correlated with latitude and the number of bathrooms.

3 METHODS
3.1 OLS Linear Regression
3.1.1 Theory of OLS Linear Regression. The linear regression model assumes the regression function E(Y|X) is linear in the input variables X_1, X_2, ..., X_p; that is, for an input vector X^T = (X_1, X_2, ..., X_p) and the output variable y to be predicted, it takes the form

f(X) = \beta_0 + \sum_{j=1}^{p} X_j \beta_j    (1)

where the \beta_j are unknown parameters to be estimated. Each feature variable X_j can come from:


Figure 2: Distribution of Sqft_basement

Figure 3: Correlation of Variables Greater than 0.3

(1) Quantitative input variables.
(2) Transformations of quantitative input variables, such as logarithmic or square-root transformations.
(3) Polynomial expansions of basis functions, such as X_2 = X_1^2, X_3 = X_1^3.
(4) Sorting or dummy encoding of categorical variables. For example, if X_j is a qualitative input variable with five levels, the levels can be coded 1, 2, ..., 5, or one-hot encoded as vectors of 0s with a single 1.
(5) Cross effects of variables, such as X_3 = X_1 · X_2.

This research sets the house price as the dependent variable Y and the variables after data processing as the independent variables X_j. The scikit-learn package in Python is used to divide the data into an 80% training set and a 20% test set. The least-squares method then gives an optimal solution for the parameter vector \beta = (\beta_0, \beta_1, \ldots, \beta_p), minimizing the loss

L(Y, f(X)) = \sum_{i=1}^{n} (y_i - f(x_i))^2
           = \sum_{i=1}^{n} \Bigl( y_i - \beta_0 - \sum_{j=1}^{p} x_{ij} \beta_j \Bigr)^2
           = (y - X\beta)^T (y - X\beta)    (2)

Because X has full column rank, taking the partial derivatives of L(Y, f(X)) and setting the first-order differential to 0 gives the closed-form solution

\hat{\beta} = (X^T X)^{-1} X^T y    (3)

For an input feature vector x_0, the predicted value is then \hat{f}(x_0) = (1 : x_0)^T \hat{\beta}.

When the number of feature variables is large, the linear regression model often has small bias but large variance, i.e., it overfits. To improve the accuracy of the model, some fit is sacrificed by shrinking the coefficients of weakly explanatory features to 0. This study chooses lasso regression,


Table 3: Outcome of OLS Linear Regression

Dep. Variable:     price          R-squared:           0.687
Model:             OLS            Adj. R-squared:      0.687
Method:            Least Squares  F-statistic:         1770
No. Observations:  16118          Prob (F-statistic):  0
Df Residuals:      16097          Log-Likelihood:      -13460
Df Model:          20             AIC:                 2.70E+04
Covariance Type:   nonrobust      BIC:                 2.71E+04

Table 4: Sort of Feature Importance

coef std err t P>|t| [0.025 0.975]


if_renovated -4.163 0.542 -7.680 0.000 -5.225 -3.101
built_year 3.195 0.442 7.230 0.000 2.329 4.061
grade 0.373 0.007 51.592 0.000 0.359 0.387
lat 0.372 0.005 79.101 0.000 0.363 0.381
ln_sqft_living15 0.168 0.007 23.668 0.000 0.154 0.181
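The OLS workflow behind Tables 3 and 4 (an 80/20 split via scikit-learn, a least-squares fit, and R² checked on the held-out set) can be sketched as below; the data are synthetic, so the scores will not match the paper's 0.687/0.682:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic stand-in: 1000 samples, 5 standardized features, linear signal.
X = rng.normal(size=(1000, 5))
beta = np.array([0.4, 0.3, 0.2, 0.1, 0.05])
y = X @ beta + rng.normal(scale=0.3, size=1000)

# 80% training set / 20% test set, as in Section 3.1.1.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

ols = LinearRegression().fit(X_tr, y_tr)
print(round(ols.score(X_te, y_te), 3))  # out-of-sample R^2
```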

Figure 4: Bagging Structure

which adds a penalty term to the loss function to avoid overfitting:

L(Y, f(X)) = \sum_{i=1}^{n} (y_i - f(x_i))^2 + \lambda \sum_{j=1}^{p} |\beta_j|    (4)

3.1.2 Results of OLS Linear Regression. The training set was modeled and the results are shown in Table 3. The significance level is passed, but R² is only 0.687, so the fit is unsatisfactory, and the accuracy verified on the testing set is only 0.682.

The important features are recognized in Table 4. The result suggests that, under OLS, consumers place more emphasis on house condition and grade than on spatial characteristics.

3.2 Random Forest
3.2.1 Theory of Random Forest Regression. Random forest is an ensemble-learning algorithm belonging to the bagging family. It combines several weak learners and votes on or averages their outputs, so that the overall model attains high accuracy and generalization performance. Its good performance is mainly attributed to "randomness" and "forest": the former makes it resistant to overfitting, the latter makes it more accurate.

Besides the bagging algorithm, random forest applies a unique approach to node splitting. In traditional classification and regression trees, node splitting is conducted using all the predictors, whereas in random forest a random subset of the predictors is considered at each node. When a tree is constructed, roughly one third of the observations are never used by that individual tree; these held-out cases are called out-of-bag (OOB). In the OOB approach, the errors on the data excluded from each regression tree give the random forest an estimate of the relative strength of, and correlation among, the trees [12].

3.2.2 Secondary Cross Validation. In machine learning, generalization error is used to measure the accuracy of a prediction model. When the random forest model is too complex, overfitting may appear, leading to large generalization error. Conversely, when the model is too simple, underfitting occurs, again with large generalization error.
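The out-of-bag mechanism described in Section 3.2.1 is exposed directly by scikit-learn: with oob_score=True, each tree's held-out samples yield a generalization estimate without a separate validation split. A sketch on synthetic nonlinear data:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(1)

# Synthetic nonlinear target, the kind of relationship Section 3.2.1
# credits random forest with capturing.
X = rng.uniform(-2.0, 2.0, size=(800, 3))
y = np.sin(X[:, 0]) + X[:, 1] ** 2 + rng.normal(scale=0.1, size=800)

# Each tree trains on a bootstrap sample; the ~1/3 of observations it
# never sees (out-of-bag) score it without a hold-out split.
rf = RandomForestRegressor(n_estimators=200, oob_score=True, random_state=0)
rf.fit(X, y)
print(round(rf.oob_score_, 3))  # OOB R^2, an estimate of generalization
```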


Table 5: Value Range of Secondary Cross Validation Parameters

Parameter          First Time           Second Time  Final
n_estimators       [1, 200]             [167, 176]   170
min_samples_split  [2, 5, 10]           [3, 9]       5
min_samples_leaf   [1, 2, 4]            [2, 3]       2
max_features       ['auto', 'sqrt']     ['auto']     'auto'
max_depth          [10, 100] or [None]  [76, 84]     84
bootstrap          [True, False]        [True]       True
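The secondary (two-stage) search in Table 5 can be sketched with scikit-learn's GridSearchCV: a coarse pass over wide ranges, then a refined pass around the first winner. The grids and data here are small and illustrative, not the paper's:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(2)
X = rng.normal(size=(300, 4))
y = X[:, 0] ** 2 + X[:, 1] + rng.normal(scale=0.1, size=300)

# Stage 1: coarse grid over wide ranges (cf. the "First Time" column).
coarse = GridSearchCV(
    RandomForestRegressor(random_state=0),
    {"n_estimators": [50, 100, 200], "min_samples_split": [2, 5, 10]},
    cv=3,
)
coarse.fit(X, y)
best_n = coarse.best_params_["n_estimators"]

# Stage 2: refine around the stage-1 winner (cf. the "Second Time" column).
fine = GridSearchCV(
    RandomForestRegressor(random_state=0),
    {"n_estimators": [max(10, best_n - 25), best_n, best_n + 25],
     "min_samples_split": [2, 3, 5]},
    cv=3,
)
fine.fit(X, y)
print(fine.best_params_)  # best combination found by the second pass
```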

Therefore, parameters such as max_depth and min_samples_leaf are adjusted to reduce the depth, branching, and complexity of the trees.

An optimal combination of parameters significantly improves the performance of a random forest model. This study uses 17290 of the 21613 samples as the training set and the remaining 4323 samples to test the model. Training and testing sets are selected randomly according to the municipality to which each residential property belongs. During cross validation, 80% of the original training data are used to train the model and 20% to validate it. Because there are many parameters with large value ranges to test, secondary cross-validation is used to select the best combination of parameters.

3.2.3 Results of Random Forest. Using the optimal parameters to build the random forest model, the final accuracy is 0.8589, which exceeds the OLS model.

Table 6: Sort of Feature Importance

Variable          Importance
lat               0.44
ln_sqft_living    0.32
grade             0.06
long              0.05
ln_sqft_living15  0.03
ln_sqft_lot       0.02
ln_sqft_lot15     0.02
bathrooms         0.01
view              0.01

Figure 5: Sort of Variables Importance using Random Forest

Figure 5 orders the feature variables by importance according to the random forest algorithm. The dotted red line marks a cumulative importance of 0.95. Four of the top five variables are related to geographical characteristics (latitude, sqft_living, longitude, and sqft_living15).

4 CONCLUSION
In terms of important-feature ranking, although the two models share similar features, OLS ranks house age and renovation first, while the random forest focuses on spatial and geographical features. In terms of price prediction, the random forest's fit is clearly more practical than OLS's when facing complex and nonlinear data. Therefore, when the dataset has a large number of observations, or when the samples are complex and noisy, we recommend choosing random forest.

REFERENCES
[1] Shekarian, E., & Fallahpour, A. Predicting house price via gene expression programming. International Journal of Housing Markets and Analysis, 6(3) (2013), 250-268.
[2] Thibodeau, T. G. Marking Single-Family Property Values to Market. Real Estate Economics, 31(1) (2003), 1-22.
[3] Selim, H. Determinants of house prices in Turkey: hedonic regression versus artificial neural network. Expert Systems with Applications, 36(2) (2009), 2843-2852.
[4] Bourassa, S. C., Cantoni, E., & Hoesli, M. Predicting house prices with spatial dependence: A comparison of alternative methods. The Journal of Real Estate Research, 32(2) (2010), 139-160.
[5] Anderson, S. T., & West, S. E. Open space, residential property values, and spatial context. Regional Science and Urban Economics, 36 (2006), 773-789.
[6] Crompton, J. L. The impact of parks on property values: A review of the empirical evidence. Journal of Leisure Research, 33(1) (2001), 1-31.
[7] Dehring, C., & Dunse, N. Housing density and the effect of proximity to public open space in Aberdeen, Scotland. Real Estate Economics, 34(4) (2006), 553-566.
[8] Antipov, E. A., & Pokryshevskaya, E. B. Mass appraisal of residential apartments: An application of Random forest for valuation and a CART-based approach for model diagnostics. Expert Systems with Applications, 39(2) (2012), 1772-1778.
[9] Hong, J., Choi, H., & Kim, W. S. A house price valuation based on the random forest approach: the mass appraisal of residential property in South Korea. International Journal of Strategic Property Management, 24(3) (2020), 140-152.
[10] Yoo, S., Im, J., & Wagner, J. E. Variable selection for hedonic model using machine learning approaches: A case study in Onondaga County, NY. Landscape and Urban Planning, 107(3) (2012), 293-306.
[11] Čeh, M., Kilibarda, M., Lisec, A., & Bajat, B. Estimating the performance of random forest versus multiple regression for predicting prices of the apartments. ISPRS International Journal of Geo-Information, 7(5) (2018), 168.
[12] Breiman, L. Random forests. Machine Learning, 45(1) (2001), 5-32.

