House Price Prediction Using Machine Learning
DOI: 10.54254/2755-2721/53/20241426
Chenxi Li
School of International Education, Guangdong University of Technology, No. 11, Guangzhou, China
Abstract. The real estate industry's role in economic development and social progress reflects the economic well-being of individuals and regions. As people's income levels rise, so does the demand for housing. More accurate house price forecasts therefore help people choose the best strategy when they need to buy a house. This study focuses on house price prediction in King County, Washington, a diverse real estate market. It applies supervised machine learning models, namely linear regression, random forest, neural networks, and XGBoost, to house price forecasting. Random forest and XGBoost are implemented using Scikit-Learn tools, and the feedforward neural network includes a dropout layer to reduce overfitting. The findings reveal that XGBoost achieves the highest accuracy, making it well-suited for precise price predictions. Additionally, the research identifies grade, sqft_living, and latitude as the three features that most strongly influence house prices in the dataset.
Keywords: House price prediction, linear regression, random forest, neural networks, XGBoost.
1. Introduction
The real estate industry is crucial for economic development and societal progress, reflecting the
aspirations of individuals and families and the overall economic health of a region. A. H. Maslow states
in A Theory of Human Motivation: “Undoubtedly these physiological needs are the most pre-potent of
all needs” (1943, p.374) [1]. Among physiological needs, shelter (house), as a necessity, is essential for
people. Hence, it is important for stakeholders such as policymakers, real estate professionals, and homeowners to comprehend the factors that influence housing prices. In this context, the study of house price prediction
has gained significant attention, given its potential to offer insights into the factors that drive housing
market fluctuations and their importance for various stakeholders.
This paper focuses on the task of house price prediction in King County, Washington. King County,
situated in the heart of the Pacific Northwest, represents a diverse and dynamic real estate market,
characterized by a mix of urban, suburban, and rural areas. With its vibrant economy, cultural attractions,
and natural beauty, King County has drawn a diverse population, contributing to the complexity of its
housing market. It is the most populous county in Washington and the 13th most populous county in the US [2]. According to Maslow's hierarchy of needs, the base level of the pyramid is physiological
needs, which include shelter. Thus, there is currently a significant need for residential properties in King
County.
© 2024 The Authors. This is an open access article distributed under the terms of the Creative Commons Attribution License 4.0
(https://creativecommons.org/licenses/by/4.0/).
Proceedings of the 4th International Conference on Signal Processing and Machine Learning
The importance of precise price predictions in the real estate market is profound, as these forecasts
have the potential to significantly influence the decisions of a multitude of stakeholders, including
prospective homebuyers, sellers, real estate agents, investors, and policymakers. An accurate and
reliable predictive model is, therefore, a cornerstone of informed decision-making in the housing
industry.
This research goes beyond previous economic analyses and applies well-founded machine learning models to this problem. Four supervised learning models (linear regression, RF, ANN, and XGBoost) are used to predict the relationship between the different features of a house and its price. The results identify the important features that influence house prices, providing a guide for future house purchases or investments.
The paper is organized as follows: Section 2 covers data processing, including data selection and pre-processing. Section 3 presents the methodology, including the four models used to process the data. Section 4 presents further experiments based on the results produced by the models. Finally, Section 5 concludes.
2. Data Processing
After processing the dataset and its features, the final dataset contains 21,611 examples with 19 features (including price), 6 of which are categorical and 12 of which are numerical. Examples 12 and 19 were removed due to the absence of the sqft_above feature. Additionally, the features id and date were excluded on account of their subjectivity. Each property attribute is described in Table 1.
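These cleaning steps can be sketched with pandas; the function below is an illustrative reconstruction, not the study's code:

```python
import pandas as pd

def clean_kc_data(df: pd.DataFrame) -> pd.DataFrame:
    """Apply the cleaning steps described above to the raw King County data."""
    # Drop examples missing the sqft_above feature.
    df = df.dropna(subset=["sqft_above"])
    # Exclude id and date on account of their subjectivity.
    return df.drop(columns=["id", "date"])
```

Applied to the raw Kaggle file [3], this leaves the 21,611 examples and 19 features used in the study.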
Table 1. List of Attributes (selected important attributes)

Attribute Name | Data Type | Description
bedrooms | int64 | Number of bedrooms
bathrooms | float64 | Number of bathrooms
sqft_living | int64 | Size of the apartment's internal living area in square feet
grade | int64 | A scale from 1 to 13, with 1-3 representing poor building construction and design, 7-11 representing average, and 11-13 representing excellent building construction and design
sqft_above | int64 | The area of interior housing that is above ground, measured in square feet
yr_built | int64 | The year the house was initially built
lat | float64 | Latitude
long | float64 | Longitude
sqft_living15 | int64 | The interior living area of the 15 closest neighbors' dwellings
3. Methodology
Model Prediction: We use the test dataset to make predictions with the trained model, obtaining
predicted house prices.
3.4. XGBoost
XGBoost is an implementation of gradient boosting machines (GBM), known as one of the top-performing supervised learning algorithms. It is applicable to both regression and classification problems [9]. The structure of XGBoost is shown in Figure 6 [10]. XGBoost works as follows:
⚫ Initializing the model:
Initialize a model, typically a decision tree with only one leaf node. The initial leaf node's prediction is set to the average of all target values in the training data:
F0(x) = (1/m) Σ_{i=1}^{m} y_i, where m is the number of training samples (seven tenths of the 21,611 examples in our research).
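This initialization simply predicts the mean of the training prices; a toy numeric illustration (the prices below are made-up stand-ins, not the study's data):

```python
import numpy as np

# Toy target values standing in for the training prices
# (the real training split holds seven tenths of the 21,611 examples).
y_train = np.array([300_000.0, 450_000.0, 600_000.0])

# F0(x): the single initial leaf predicts the mean of all training targets.
F0 = y_train.sum() / len(y_train)
print(F0)  # 450000.0
```

Each subsequent tree is then fitted to the residuals of this running prediction.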
The final model output is used for making predictions in regression problems.
4.1.1. The Result of Linear Regression. Before running the linear regression model, the relationships between the different factors were analyzed (Figure 7). Most of the relationships between price and the
area of different parts of the house are linear, which satisfies the prerequisite for running a linear regression model.
Following the analysis of the relationship between price and area, a simple LR model relating price to the house's living area was built, using gradient descent to minimize the cost function (Figure 8).
Figure 8. Linear regression model for price and living area (Original)
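A minimal sketch of such a gradient-descent fit on a single standardized feature (illustrative; not the study's code):

```python
import numpy as np

def fit_simple_lr(x, y, lr=0.05, epochs=2000):
    """Fit price = w * x + b by batch gradient descent on the MSE cost.

    The feature is standardized first so a single learning rate works
    across feature scales.
    """
    x = (x - x.mean()) / x.std()
    w, b = 0.0, 0.0
    m = len(x)
    for _ in range(epochs):
        err = w * x + b - y            # residuals of the current fit
        w -= lr * (err * x).sum() / m  # dJ/dw for J = (1/2m) * sum(err^2)
        b -= lr * err.sum() / m        # dJ/db
    return w, b
```

On the standardized feature, w approximates the slope of price against living area and b the mean price.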
Subsequently, after evaluating the performance of the simple linear regression model, we proceeded
to develop a multiple linear regression model based on our hypothesis using the Ordinary Least Squares
(OLS) regression test. The resulting model demonstrated a commendable R-squared value of 0.706,
affirming the appropriateness of the linear regression model for this research.
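The multiple-regression step can be sketched with scikit-learn, where `model.score` returns the R-squared statistic reported above; the toy data below are stand-ins for the real feature matrix and prices:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Toy stand-ins: in the study, X holds the house features and y the prices.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([3.0, -1.0, 2.0]) + rng.normal(scale=0.5, size=200)

model = LinearRegression().fit(X, y)
r2 = model.score(X, y)  # R-squared, the statistic reported above (0.706)
```

The fitted coefficients in `model.coef_` play the role of the OLS coefficients discussed later for feature importance.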
4.1.2. The Result of Random Forest and XGBoost. For hyperparameter optimization of the Random Forest and XGBoost models, we used the RandomizedSearchCV function from Python's scikit-learn library to achieve better model performance. The following two graphs (Figure 9 and Figure 10) show the performance improvement of the two algorithms over 100 iterations. Their RMSLE declines significantly and stabilizes at about 80 iterations, at which point the hyperparameters reach their optimal level.
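A sketch of this tuning step for the random forest; the search space and toy data below are illustrative assumptions, not the study's exact settings, and the same call works for an XGBoost estimator:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV

# Toy data standing in for the training split.
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

# Illustrative search space (an assumption, not the study's grid).
param_distributions = {
    "n_estimators": [50, 100, 200],
    "max_depth": [None, 5, 10],
    "min_samples_leaf": [1, 2, 4],
}

search = RandomizedSearchCV(
    RandomForestRegressor(random_state=0),
    param_distributions,
    n_iter=5,        # the study ran about 100 iterations
    cv=3,            # cross-validation folds
    random_state=0,
)
search.fit(X, y)
best = search.best_params_  # the hyperparameters of the best sampled model
```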
Our experiments yield an R-squared value of 0.878 for the random forest algorithm and 0.888 for the XGBoost algorithm.
4.1.3. The Result of the Artificial Neural Network. In the ANN, we use the ReLU activation function because its characteristics help in training deep neural networks: ReLU is simpler and more efficient to compute than other activation functions, and because it does not suffer from the vanishing gradient problem, the network can be trained more easily. The TensorFlow library is used for hyperparameter optimization.
Figure 11 below shows the training of the ANN at different learning rates to find the best one. When the learning rate is 0.1, the MSE quickly reaches a relatively stable value, but it is easy to miss the local minimum and the best parameter combination. When the learning rate is 0.0001, it takes a long time for the MSE to stabilize, and the time cost is high. Therefore, we chose 0.001, with a moderate rate of decline, as the learning rate for the hyperparameter configuration.
The ANN with a 0.001 learning rate gives an R-squared value of 0.846.
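A minimal TensorFlow/Keras sketch of a feedforward network with a dropout layer and the chosen 0.001 learning rate; the layer widths and dropout rate are illustrative assumptions, since the exact architecture is not reported here:

```python
import numpy as np
import tensorflow as tf

# Three-layer feedforward network over the 18 predictor features.
# Layer widths and the 0.2 dropout rate are illustrative assumptions.
model = tf.keras.Sequential([
    tf.keras.Input(shape=(18,)),
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dropout(0.2),   # dropout layer to curb overfitting
    tf.keras.layers.Dense(32, activation="relu"),
    tf.keras.layers.Dense(1),       # single output: predicted price
])

# Adam with the chosen learning rate of 0.001, minimizing MSE.
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.001),
              loss="mse")
```

Training would then call `model.fit(X_train, y_train, ...)` on the 70% training split.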
Table 2 above compares the algorithms used in this research. XGBoost gives the highest accuracy, 88.8 percent, while linear regression is the lowest at 70.6 percent, with the neural network at 84.6 percent and random forest at 87.8 percent. The degree of fit of the four algorithms is also shown below in Figure 12.
Overall, XGBoost and RF are the more complex, nonlinear models; they can better capture complex relationships and nonlinear patterns in the data (the relationship between house prices and features is generally complex and nonlinear). In contrast, linear regression performs poorly when dealing with nonlinear relationships. The ANN sits somewhere in between: it can learn nonlinear relationships, but its performance depends heavily on the choice of network structure and parameter tuning (the neural network in our study is a relatively simple three-layer feedforward network with default parameters, and the amount of data is limited).
Adaptive advantage of ensemble methods: XGBoost and RF are both ensemble learning methods that improve performance by combining multiple weak learners. This makes them more robust and predictive, able to deal effectively with noise and outliers.
Properties of the dataset itself: given the complex nonlinear relationships, interactions, and feature importance patterns in house price data, XGBoost and RF are often better able to fit these properties. The ANN can also learn nonlinear relationships, but it requires more data and tuning to reach its full performance, and the dataset here may not be large enough for the ANN to learn further and improve.
While running the RF, LR, and XGBoost models (the ANN model generally does not provide feature importances), we assessed the importance of the features in our dataset. For linear regression, the coefficients trained by the multiple linear regression were used as feature importances; for RF and XGBoost, the feature_importances_ attribute gives each feature's contribution, which we used to rank the features. The resulting comparison is shown below (Table 3).
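Obtaining these contributions from a fitted model can be sketched as follows; the toy data stand in for the house features, and `xgboost.XGBRegressor` exposes the same `feature_importances_` attribute:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Toy data; in the study, X holds the house features and y the prices.
X, y = make_regression(n_samples=300, n_features=4, random_state=0)

rf = RandomForestRegressor(n_estimators=50, random_state=0).fit(X, y)

# Per-feature contribution scores; they sum to 1 for a random forest.
importances = rf.feature_importances_
ranking = np.argsort(importances)[::-1]  # feature indices, most important first
```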
Table 3. Feature Importance

Feature | Linear regression | Random forest | XGBoost
Bedrooms | 0.06 | 0.0037 | 0.0038
Bathrooms | 0.07 | 0.0071 | 0.0160
Sqft_living | 0 | 0.2731 | 0.2197
Sqft_lot | 0 | 0.0144 | 0.0072
Floors | 0.01 | 0.0020 | 0.0039
Waterfront | 0.97 | 0.0317 | 0.1151
View | 0.09 | 0.0136 | 0.0756
Condition | 0.04 | 0.0031 | 0.0088
Grade | 0.16 | 0.2959 | 0.3139
Sqft_above | 0 | 0.0238 | 0.0173
Table 3. (continued).
5. Conclusion
This article examines four different house price prediction models and compares their accuracy: linear regression, random forest, neural networks, and XGBoost. It also explores the impact of each feature on price.
In the experiments, XGBoost has the highest accuracy and linear regression the lowest. Therefore, when accuracy matters most, the XGBoost method is the more suitable choice for predicting house prices. Although linear regression has the lowest accuracy compared with neural networks, random forest, and XGBoost, its accuracy is relatively close to that of the other three methods. Since linear regression is simple and straightforward, it remains an option when time is limited and the accuracy requirements are not as high. More research is needed on this point.
Regarding feature importance, the three most influential features affecting house prices in our dataset are lat, waterfront, and grade. In other words, for the houses we studied, latitude, waterfront condition, and house grade affect prices the most. This result may hold true for other houses and be helpful in their price predictions, but more research is needed to confirm this.
Future research should focus on comparing the time and space complexity of random forest and linear regression, and on how the features of houses in other areas affect their prices.
References
[1] Maslow, A. H. (1943). A theory of human motivation. Psychological Review, 50 (4), 370-96.
[2] “King County, Washington”, Wikipedia, 31 October 2023, https://en.wikipedia.org/wiki/King_County,_Washington
[3] House Sales in King County, USA,
https://www.kaggle.com/datasets/harlfoxem/housesalesprediction/data
[4] Visualization-on-a-Map https://www.kaggle.com/code/chrisbronner/regression-r2-0-82-and-
map-visualization#3.
[5] N. N. Ghosalkar and S. N. Dhage, "Real Estate Value Prediction Using Linear Regression," 2018
Fourth International Conference on Computing Communication Control and Automation
(ICCUBEA), Pune, India, 2018, pp. 1-5, doi:10.1109/ICCUBEA.2018.8697639
[6] Breiman L. Random Forests. SpringerLink. https://doi.org/10.1023/A:1010933404324 (accessed
September 11, 2019).
[7] Raschka S, Mirjalili V. Python machine learning: Machine learning and deep learning with
Python, scikit-learn, and TensorFlow. 2nd ed. Birmingham: Packt Publishing; 2017.
[8] Cireşan, D.C.; Meier, U.; Gambardella, L.M.; Schmidhuber, J. Deep, big, simple neural nets for
handwritten digit recognition. Neural Comput. 2010, 22, 3207–3220.
[9] T. Chen, C. Guestrin XGBoost: A scalable tree boosting system, Association for Computing
Machinery (2016), pp. 785-794
[10] W. Dong, Y. Huang, B. Lehane, G. Ma, XGBoost algorithm-based prediction of concrete electrical resistivity for structural health monitoring, Automation in Construction, 114 (2020), 103155.