Comparative Study of House Price Prediction Using Machine Learning Research Paper

This document presents a comparative study on house price prediction using various machine learning algorithms, including Linear Regression, Decision Trees, and Support Vector Machines (SVM). The methodology involves data acquisition, preprocessing, and evaluation of model performance using metrics like Mean Squared Error (MSE), Root Mean Squared Error (RMSE), and R-squared. The ultimate goal is to develop an accurate and interpretable model for predicting house prices in the Delhi region based on historical data and relevant property features.

Uploaded by

Renee Winters

Comparative Study of House Price Prediction using Machine Learning

Abstract: House Price Prediction (HPP) is commonly used to estimate changes in housing
prices. Because housing price is strongly correlated with other factors such as location,
area, and population, predicting the price of an individual house requires information
beyond HPP. A considerable number of papers adopt traditional machine learning approaches
to predict housing prices accurately, but they rarely examine the performance of
individual models and tend to neglect the less popular yet more complex models. House
price prediction is an important topic in real estate, and the literature attempts to
derive useful knowledge from historical property market data. In this work, machine
learning techniques are applied to historical property transactions in India to discover
useful models for house buyers and sellers. Our experiments apply support vector machine
(SVM), linear regression, and decision tree models and compare them by Mean Squared
Error (MSE), Root Mean Squared Error (RMSE), and R-squared score to determine which
model yields the lowest error.

Many factors have to be taken into consideration when predicting house prices, and a
useful prediction should reflect customers' budgets as well as their priorities. We
therefore build a housing cost prediction model using machine learning algorithms:
Linear Regression, Decision Tree Regression, and Support Vector Machines (SVM).

I. INTRODUCTION
In the realm of real estate, accurately determining the value of a property is of paramount
importance, whether for buying, selling, or investing. Traditionally, House Price
Prediction (HPP) has been used as a general indicator of price changes or new property values
based on aggregated transaction data. However, relying solely on this rough estimate can be
inefficient when trying to predict the price of a specific house. With the emergence of Big Data,
machine learning has proven to be a powerful tool for predicting house prices with greater
accuracy from the attributes of individual properties. As Artificial Intelligence spreads into
more areas of day-to-day life, the use of advanced, technology-driven systems has grown
considerably. Many studies have already demonstrated the potential of machine learning
approaches [2], [3], [4]. To predict prices at the level of an individual house, we leverage
historical data on house sales along with relevant property information such as location, size,
and number of bedrooms. By employing various machine learning algorithms, our objective is to
build models capable of making intelligent predictions about house prices. Throughout our
research, we explore different machine learning methods, ranging from simple techniques like
Linear Regression to more complex ones like Decision Trees.

The goal of our project is to create a model that is not only effective in predicting house prices
but also easy to comprehend and applicable in real-world scenarios. By comparing the
performance of different machine learning techniques, we aim to identify the most suitable
approach for accurate house price prediction. This journey will encompass an array of machine
learning approaches, each with its own strengths and characteristics. We will navigate through
the simplicity of Linear Regression, the adaptability of Decision Trees, and the sophistication
of Support Vector Machines. The outcome of our efforts will be a
dependable, interpretable, and scalable house price prediction model, ready to be deployed in
practical situations. By comprehensively comparing the strengths and limitations of different
machine learning techniques, we will be equipped to make well-informed decisions and select
the optimal approach for various real estate markets. Our ultimate aim is to simplify the process
of predicting house prices for the average person while achieving the best possible results in
the dynamic real estate industry.[7][8]

II. METHODOLOGY
In this project we use three machine learning algorithms: linear regression, decision tree
regression, and support vector machine regression. 80% of the dataset is used for training
and the remaining 20% is used for testing. The first critical step in this project is data
acquisition. We aim to gather a comprehensive dataset of real estate properties in the Delhi
region, encompassing features such as property location, size, number of rooms, available
amenities, proximity to essential facilities (e.g., schools, hospitals, public
transportation), historical price trends, and any other pertinent variables. This rich dataset
serves as the foundation for our predictive models. Next, we proceed with Feature Engineering,
which involves preprocessing the data and selecting appropriate features from the acquired
dataset.[5] We ensure data quality by handling missing values and applying feature scaling to
normalize the data and avoid biases during model training. Having prepared the data, we select
the machine learning algorithms to apply to our dataset. In this study, we have chosen three
algorithms for experimentation:

1) Linear Regression

2) Decision Tree

3) Support Vector Machine (SVM)
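
The 80/20 split and the three algorithms listed above can be sketched as follows. This is a minimal illustration rather than the authors' implementation: it assumes scikit-learn, and it uses synthetic data (area and bedroom count with a noisy linear price) as a stand-in for the Delhi dataset. The variable names, the random seeds, and the default hyperparameters are all assumptions.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import SVR
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for the housing data (assumption, not the Delhi dataset):
# columns are area in square feet and number of bedrooms.
rng = np.random.default_rng(0)
X = np.column_stack([rng.uniform(400, 3000, 200), rng.integers(1, 6, 200)])
y = 5000 * X[:, 0] + 300000 * X[:, 1] + rng.normal(0, 100000, 200)

# 80% of the rows are used for training, 20% are held out for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# The three algorithms compared in this study, with default settings.
models = {
    "Linear Regression": LinearRegression(),
    "Decision Tree": DecisionTreeRegressor(random_state=42),
    "SVM (SVR)": SVR(),
}
for name, model in models.items():
    model.fit(X_train, y_train)
    # score() returns R-squared on the held-out 20%.
    print(name, "test R^2:", round(model.score(X_test, y_test), 3))
```

Fixing `random_state` keeps the split reproducible, so the three models are compared on exactly the same held-out rows.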

House price prediction is a broad research area that requires knowledge of machine learning.
In general, house prices are determined by a variety of variables, which are commonly grouped
as concept, strength, and placement. We also consider physical conditions, including the
number of rooms, the dimensions of the property, the age of the property, and the size of the
garage and kitchen. [5]

Fig. 2 Flowchart of Process

Fig. 1 Flowchart description for Linear regression


In this project, we use three machine learning algorithms: Linear Regression, Decision Tree,
and Support Vector Machine. Many factors affect the price of a house, including attributes
such as BHK, Bathroom, Locality, Area, and many others.

We then use RMSE, MSE, and R-squared as the performance metrics for all three algorithms
and determine which model gives the most accurate predictions.

III. PROPOSED SYSTEM


The first crucial step in our project is data acquisition: we obtain a dataset of New Delhi
properties from Kaggle. The next important step is data cleaning, where we meticulously
process the dataset to ensure its quality and reliability. We perform tasks such as removing
outliers, which are data points that differ significantly from the others and can skew our
analysis. Additionally, we address missing values so that the dataset is complete and
accurate. We also examine the distribution of the data to verify whether it follows a normal
pattern, which some machine learning algorithms need in order to work optimally. Moreover, we
analyze the correlation between attributes to understand their relationships and potential
impact on our predictive models. After data cleaning, we apply three preprocessing
components: a Robust Scaler, a numerical transformer, and a categorical transformer. These
techniques scale and transform the features appropriately to prevent biases in our machine
learning models. We then explore several machine learning models to find the one best suited
to our house price prediction task. We consider the following models:

1. Linear Regression: This model establishes a linear relationship between the target variable
(house price) and the input features. It provides interpretable results, making it easy to
understand the impact of each feature on the predicted price.

2. Decision Tree: A decision tree is a predictive model that uses a tree-like structure to make
decisions based on input features, recursively splitting the data into branches to arrive at a final
outcome or prediction.

3. Support Vector Regression (SVR): SVR aims to find a hyperplane that best fits the data while
allowing for some margin of error. It is effective in capturing complex relationships between
features and the target variable.

To evaluate the performance of these models, we calculate the Root Mean Square Error
(RMSE), a common metric for regression tasks. The RMSE is the square root of the average
squared difference between the predicted and actual house prices. Lower RMSE values indicate
better model accuracy.

By meticulously exploring and assessing the performance of these machine learning models,
we aim to identify the most accurate and reliable approach for house price prediction. Our
ultimate goal is to create a powerful and interpretable model that can provide valuable insights
for real-world applications in the real estate industry.
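
The preprocessing stage described in this section (a robust scaler, a numerical transformer, and a categorical transformer) could be wired together roughly as below. This sketch assumes scikit-learn's `ColumnTransformer` and `Pipeline`. The paper does not say how its transformers were built, so the median imputation, the one-hot encoding, and the tiny example frame are illustrative assumptions.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, RobustScaler

# Made-up rows using attribute names from the dataset description.
df = pd.DataFrame({
    "Area": [800.0, 1200.0, None, 950.0],
    "BHK": [2, 3, 3, 2],
    "Furnishing": ["Furnished", "Unfurnished", "Semi-Furnished", "Furnished"],
})
numeric_cols = ["Area", "BHK"]
categorical_cols = ["Furnishing"]

# Numerical transformer (assumed design): fill gaps, then scale with
# RobustScaler, which centres on the median and scales by the IQR.
numerical_transformer = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", RobustScaler()),
])

# Categorical transformer (assumed design): one column per category.
categorical_transformer = OneHotEncoder(handle_unknown="ignore")

preprocessor = ColumnTransformer([
    ("num", numerical_transformer, numeric_cols),
    ("cat", categorical_transformer, categorical_cols),
])
features = preprocessor.fit_transform(df)
print(features.shape)  # 2 scaled numeric columns + 3 one-hot columns
```

Because `RobustScaler` relies on the median and interquartile range rather than the mean and standard deviation, a few extreme prices or areas have less influence on the scaled values.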

IV. IMPLEMENTATION
A. Exploring the data

Every record in the dataset describes a property in New Delhi. The data was collected from
Kaggle and covers Delhi property prices in some well-known localities as of 2018.

The attributes present in the dataset are described below.

1. Area: The total size or area of the property in square feet. Larger areas may generally have
higher prices.

2. BHK (Bedrooms, Hall, Kitchen): The number of bedrooms in the property. Properties with
more bedrooms may have higher prices.

3. Bathroom: The number of bathrooms in the property. More bathrooms may increase the
property's value.

4. Furnishing: Indicates whether the property is fully furnished, semi-furnished, or unfurnished.
Furnished properties might command higher prices.

5. Locality: The specific neighbourhood or area where the property is situated. Different
localities can have varying impacts on property prices.

6. Parking: Specifies if the property has parking facilities. Properties with parking may be more
desirable and thus have higher prices.

7. Price: The target variable, representing the actual sale price of the property. This is the value
we want to predict using the other attributes.

8. Status: The current status of the property (e.g., ready to move, under construction). Status
can influence the property's market value.

9. Transaction: Refers to the type of transaction (e.g., new property, resale). Different
transaction types may affect prices differently.

10. Type: Refers to the type of property (e.g., builder property, apartment). Different property
types affect the prices.

11. Per_Sqft: The price per square foot of the property. It gives a measure of how much the
property costs per unit of area and helps in comparisons.

By analysing these attributes and their relationships, machine learning models can predict
house prices more accurately based on the historical data and patterns within the dataset.
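
As a small illustration of exploring these attributes, the sketch below builds a toy table, derives the price per square foot, and checks how each numeric attribute correlates with the price. It assumes pandas and that Per_Sqft is defined as Price divided by Area. The rows are invented for illustration and are not taken from the Kaggle dataset.

```python
import pandas as pd

# Hypothetical rows using the attribute names listed above; values are made up.
df = pd.DataFrame({
    "Area": [800, 1200, 1500, 600],
    "BHK": [2, 3, 3, 1],
    "Bathroom": [2, 2, 3, 1],
    "Price": [6_400_000, 12_000_000, 18_000_000, 3_600_000],
})

# Price per square foot, assuming Per_Sqft = Price / Area.
df["Per_Sqft"] = df["Price"] / df["Area"]

# Pearson correlation of each numeric attribute with the target variable.
price_corr = df.corr()["Price"].sort_values(ascending=False)
print(price_corr)
```

If Per_Sqft is indeed computed from Price, as assumed here, using it as an input feature would reveal the target to the model, so it is worth checking how the column was produced before training on it.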

B. Data pre-processing

Data pre-processing plays a crucial role in preparing the dataset for effective model training
and accurate predictions. Some key steps involved in data pre-processing for house price
prediction are:

1. Handling Missing Values: Addressing missing values in features like area, number of
bedrooms, and bathrooms to avoid bias in the analysis. Missing values can be imputed with
mean, median, or using advanced imputation techniques.

2. Dealing with Outliers: Identifying and handling outliers in the dataset, as extreme values can
negatively impact the model's performance. Outliers in house prices or other attributes are
either removed or transformed to minimize their influence.

3. Encoding Categorical Variables: Converting categorical features like property furnishing,
locality, parking availability, status, and transaction type into numerical representations using
techniques like one-hot encoding or label encoding.

4. Feature Scaling: Scaling numerical features like area, price per square foot, number of
rooms, and bathrooms to bring them to a similar scale. Common scaling methods include
Min-Max scaling or Standardization.

5. Feature Engineering: Creating new relevant features or transforming existing ones, such as
calculating the price per square foot or aggregating amenities to create a composite feature.

6. Handling Skewed Data: Addressing skewed distributions in variables, such as applying log
transformations to house prices or other features with heavy tails.

7. Handling Multi-collinearity: Identifying and addressing high correlations between
independent variables to avoid redundancy and improve model interpretability.

8. Splitting Data: Dividing the dataset into training and test sets to train the model on one subset
and evaluate its performance on another, ensuring a fair assessment of model generalization.

9. Dealing with Imbalanced Data: Handling under-represented ranges of the target variable
(e.g., relatively few high-priced houses) through techniques like oversampling,
undersampling, or using appropriate evaluation metrics.

10. Handling Textual Data: If the dataset contains textual data, preprocessing text features
involves tokenization, stemming, and removing stop words to convert text into meaningful
numerical representations.

By executing these data pre-processing steps thoughtfully, we can enhance the quality of the
dataset and enable machine learning models to make accurate predictions of house prices based
on the provided attributes.
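
Three of the steps above (imputing missing values, removing outliers, and reducing skew) can be illustrated on a single made-up price column. The sketch assumes NumPy and uses the 1.5 × IQR rule for outliers. The paper does not state which outlier rule or imputation strategy it used, so both are assumptions.

```python
import numpy as np

# Made-up prices with one missing value and one extreme value.
price = np.array([50.0, 55.0, 60.0, np.nan, 65.0, 70.0, 900.0])

# 1. Impute the missing value with the median of the observed values.
median = np.nanmedian(price)
price_filled = np.where(np.isnan(price), median, price)

# 2. Keep only values inside the 1.5 * IQR fences (a common outlier rule).
q1, q3 = np.percentile(price_filled, [25, 75])
iqr = q3 - q1
inside = (price_filled >= q1 - 1.5 * iqr) & (price_filled <= q3 + 1.5 * iqr)
price_clean = price_filled[inside]

# 3. Log-transform to compress the long right tail; log1p computes log(1 + x).
price_log = np.log1p(price_clean)
print(price_clean, price_log.round(3))
```

In a full workflow the median and the IQR fences would be computed on the training split only and then applied to the test split, so that no information from the test data influences the preprocessing.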

C. Exploring various machine learning models

1. Linear Regression:

Multiple linear regression is a statistical technique used to model the relationship between a
dependent variable and two or more independent variables. It extends the concept of simple
linear regression, which deals with only one independent variable, to multiple predictors. The
goal of multiple linear regression is to find the best-fitting linear equation that explains the
relationship between the dependent variable and the multiple independent variables.

Mathematically, the multiple linear regression model can be represented as:

Y = β0 + β1X1 + β2X2 + ... + βnXn + ε

Where,

▪ Y is the dependent variable that we want to predict (here, Y is the house price).
▪ X1, X2, ..., Xn are the independent variables from which the predicted value of Y is
computed (here, these are BHK, Bathroom, Furnishing, etc.).

▪ β0 is the y-intercept, representing the value of Y when all independent variables are zero.

▪ β1, β2, ..., βn are the coefficients (regression coefficients) that represent the change in Y
corresponding to a one-unit change in each independent variable, holding other variables
constant.

▪ ε is the error term, representing the difference between the predicted Y value and the actual
Y value.
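
To make the equation concrete, the coefficients β0, β1, ..., βn can be estimated by ordinary least squares. The sketch below assumes NumPy and uses invented data generated exactly from Y = 10 + 3·X1 + 2·X2, so the error term ε is zero and the fit should recover those coefficients. The two predictors are only loosely meant to suggest attributes such as BHK and Bathroom.

```python
import numpy as np

# Two made-up predictors and a target generated exactly from
# Y = 10 + 3*X1 + 2*X2 (no noise, so the error term is zero).
X = np.array([[1, 1], [2, 1], [3, 2], [4, 3], [2, 2]], dtype=float)
y = 10 + 3 * X[:, 0] + 2 * X[:, 1]

# Prepend a column of ones so the first coefficient plays the role of beta_0.
X_design = np.column_stack([np.ones(len(X)), X])

# Ordinary least squares: minimise the sum of squared residuals.
beta, *_ = np.linalg.lstsq(X_design, y, rcond=None)
print(beta.round(6))  # intercept, then one coefficient per predictor
```

Each fitted coefficient can be read as described above: increasing X1 by one unit while holding X2 fixed changes the predicted Y by β1.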

2. Decision Tree:

Decision tree regression is a supervised machine learning technique used for regression tasks.
Unlike classification decision trees, which predict discrete categorical outputs, decision tree
regression predicts continuous numerical values. The model works by recursively partitioning
the data into subsets, fitting a simple model (usually the mean or median of the target values)
to each subset, and then using these values as the predictions for new data points that fall
into the same subset. Decision tree regression is a powerful and intuitive technique for
solving regression problems, but a fully grown tree tends to overfit the training data. To
mitigate overfitting, techniques like pruning, cross-validation, and ensemble methods (e.g.,
Random Forests) can be used to enhance predictive performance and robustness.
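
The partition-then-average behaviour described above can be seen in a depth-one tree. The sketch assumes scikit-learn's `DecisionTreeRegressor`. The six data points are invented so that they form two obvious groups, and `max_depth=1` is chosen only to keep the example readable.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Made-up data with two clear groups: small and large values of the feature.
X = np.array([[1.0], [2.0], [3.0], [10.0], [11.0], [12.0]])
y = np.array([10.0, 12.0, 14.0, 50.0, 52.0, 54.0])

# A depth-1 tree makes a single split and predicts the mean of each side.
tree = DecisionTreeRegressor(max_depth=1, random_state=0)
tree.fit(X, y)

low_pred = tree.predict([[2.5]])[0]    # falls in the group with mean 12
high_pred = tree.predict([[11.5]])[0]  # falls in the group with mean 52
print(low_pred, high_pred)
```

Limiting `max_depth` is one simple way to restrain overfitting. A deeper tree would keep splitting until each leaf holds very few points and would then reproduce the training prices almost exactly.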

3. Support Vector Machine (SVM)

Support Vector Machine (SVM) regression, also called Support Vector Regression (SVR), is a
supervised machine learning algorithm used for regression tasks. It extends the SVM algorithm,
which is primarily used for classification problems. SVR fits a function (a hyperplane in the
feature space) that keeps as many training points as possible within a margin of tolerance
around it while penalizing the points that fall outside that margin. With a suitable kernel,
Support Vector Machine regression is a powerful algorithm for regression tasks involving
complex and non-linear relationships between features and the target variable.
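
The margin of tolerance can be illustrated with scikit-learn's `SVR`, where it is controlled by the `epsilon` parameter. The sketch below is an assumption-laden toy: the data lie exactly on a line, the kernel is linear so the result is easy to inspect, and the values of `C` and `epsilon` are chosen for illustration rather than taken from the paper.

```python
import numpy as np
from sklearn.svm import SVR

# Made-up data lying exactly on the line y = 2x + 1.
X = np.arange(10, dtype=float).reshape(-1, 1)
y = 2.0 * X.ravel() + 1.0

# epsilon sets the width of the tolerance tube around the fitted function;
# C sets how strongly points outside the tube are penalised.
svr = SVR(kernel="linear", C=100.0, epsilon=0.1)
svr.fit(X, y)

pred = svr.predict(X)
max_abs_error = np.max(np.abs(pred - y))
print(round(float(max_abs_error), 3))
```

Errors smaller than `epsilon` cost nothing, so the fitted line only needs to stay within the tube rather than pass through every point. Switching `kernel` to `"rbf"` lets the same model follow non-linear relationships, usually after the features have been scaled.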

D. Performance Metrics

1. Mean Squared Error (MSE): MSE measures the average squared difference between the
predicted values and the actual values in the dataset. MSE is calculated by taking the sum of
the squared differences between predicted and actual values and then dividing it by the number
of data points. A lower MSE indicates better model performance, as it means the model's
predictions are closer to the actual values.

2. Root Mean Squared Error (RMSE): RMSE is the square root of MSE and represents the
standard deviation of the residuals (prediction errors) of the regression model. Because it is
expressed in the same units as the target variable, it is easier to interpret than MSE. A lower
RMSE indicates better predictive accuracy, with a value of 0 indicating a perfect fit of the
model to the data. On the basis of these evaluation metrics, we visualize our results and then
summarize them.

3. R-squared (R2) Coefficient of Determination: R-squared measures the proportion of the
variance in the dependent variable that is explained by the independent variables in the model.
R2 is calculated as the ratio of the explained variance to the total variance, or equivalently
as one minus the ratio of the residual sum of squares to the total sum of squares. On the
training data of a linear model with an intercept it ranges from 0 to 1, where 0 indicates that
the model does not explain any variance and 1 indicates a perfect fit; on held-out data it can
fall below 0 when the model predicts worse than the mean of the actual values. A higher R2
indicates a better fit of the model to the data.
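
The three metrics can be computed directly from their definitions. The following sketch assumes NumPy and four invented actual and predicted values, chosen so that the arithmetic is easy to follow by hand.

```python
import numpy as np

# Made-up actual and predicted values (two predictions are off by 1).
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.0, 5.0, 8.0, 9.0])

# MSE: mean of the squared differences -> (1 + 0 + 1 + 0) / 4.
mse = np.mean((y_true - y_pred) ** 2)

# RMSE: square root of MSE, in the same units as the target.
rmse = np.sqrt(mse)

# R-squared: 1 - (residual sum of squares / total sum of squares).
ss_res = np.sum((y_true - y_pred) ** 2)         # 2
ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # 9 + 1 + 1 + 9 = 20
r2 = 1.0 - ss_res / ss_tot

print(mse, round(float(rmse), 4), r2)
```

Here MSE is 0.5, RMSE is about 0.707, and R2 is 0.9. Lower values of the first two and a higher value of the third all point to the same conclusion about which model fits better.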

V. RESULTS
Here we take a property in the Lajpat Nagar locality as an example.

Fig. 3 Attributes provided as our need


Fig. 4 Results for fig.3
Fig. 5 Comparison Graphs

VI. CONCLUSION
This study compares regression algorithms for predicting house prices in Delhi. Multiple
Linear Regression, Decision Tree, and SVM machine learning algorithms are used to construct
models that predict the potential selling price of a real estate property. When evaluating the
performance of regression models, it is essential to consider multiple evaluation metrics,
including Mean Squared Error (MSE), Root Mean Squared Error (RMSE), and R-squared (R2).
Each metric provides valuable insight into the model's predictive accuracy and goodness of fit.

We evaluated three metrics: for MSE and RMSE a lower value is better, whereas for R2 a
higher value, closer to 1, indicates a better fit. According to the comparison graphs we
computed, Decision Tree Regression is the most accurate of the three methods, with the lowest
error. By incorporating multiple evaluation metrics and validating the model's assumptions, we
can make informed decisions about model selection and ensure the model's reliability in making
accurate predictions in real-world scenarios.

VII. FUTURE OUTCOMES

The comparative study of house price prediction using machine learning techniques will have
significant future outcomes. It will lead to the development of improved predictive models,
empowering real estate stakeholders to make data-driven decisions on property investments
and pricing strategies. The study's interpretability aspect will offer valuable market insights,
helping analysts and policymakers understand factors influencing house prices. Assessing
generalizability will determine model effectiveness in different regions and over time,
benefiting companies operating in multiple real estate markets. Practical implementation
guidance will accelerate machine learning adoption in the real estate industry. Furthermore, the
study's findings can inspire further research in real estate prediction and machine learning,
leading to advancements in hybrid models and feature engineering techniques. By assessing
robustness, risks associated with inaccurate predictions can be mitigated, contributing to stable
real estate markets. Improved house price predictions can have broader economic and social
impacts, potentially reducing housing bubbles and fostering well-informed urban development.
Overall, the study's outcomes will serve as benchmarks and best practices, driving future
interdisciplinary collaboration and transforming house price prediction for more accurate,
efficient, and transparent real estate transactions.
REFERENCES

[1] Dataset from Kaggle https://www.kaggle.com/datasets/neelkamal692/delhi-house-price-prediction

[2] Quang Truong, Minh Nguyen, Hy Dang, Bo Mei (2019). Housing Price Prediction via
Improved Machine Learning Techniques. Procedia Computer Science 174 (2020) 433–442

[3] Alisha Kuvalekar, Shivani Manchewar and Sidhika Mahadik, House Price Forecasting
Using Machine Learning (April 8, 2020). Proceedings of the 3rd International Conference on
Advances in Science & Technology (ICAST) 2020

[4] ANAND G. RAWOOL, DATTATRAY V. ROGYE, SAINATH G. RANE, DR. VINAYK
A. BHARADI (2021). House Price Prediction Using Machine Learning. MAY 2021 | IRE
Journals | Volume 4 Issue 11 | ISSN: 2456-8880

[5] Sumanth Mysore, Abhinay Muthineni, Vaishnavi Nandikandi, Sudersan Behera (2022).
Prediction of House Prices Using Machine Learning. ISSN: 2321-9653; IC Value: 45.98; SJ
Impact Factor: 7.538 Volume 10 Issue VI June 2022.

[6] Nor Hamizah Zulkifley, Shuzlina Abdul Rahman, Nor Hasbiah Ubaidullah, Ismail Ibrahim
(2020). House Price Prediction using a Machine Learning Model: A Survey of
Literature. International Journal of Modern Education and Computer Science. 12. 46-54.
10.5815/ijmecs.2020.06.04.

[7] Raul-Tomas Mora-Garcia, Maria-Francisca Cespedes-Lopez and V. Raul Perez-Sanchez
(2022). Housing Price Prediction Using Machine Learning Algorithms in COVID-19 Times.
Land 2022, 11, 2100. https://doi.org/10.3390/land11112100

[8] Fan C, Cui Z, Zhong X. House Prices Prediction with Machine Learning Algorithms.
Proceedings of the 2018 10th International Conference on Machine Learning and Computing
ICMLC 2018. doi:10.1145/3195106.3195133.

[9] R Manjula Real estate value prediction using multivariate regression models
Materials Science and Engineering Conference Series, volume 263, issue 4.


[10] A Varma House Price Prediction Using Machine Learning And Neural Networks 2018
Second International Conference on Inventive Communication and Computational
Technologies, p. 1936 – 1939

[11] E.Laxmi Lydia, Gogineni Hima Bindu, Aswadhati Sirisham, Pasam Prudhvi Kiran
Electronic Governance of Housing Price using Boston Dataset Implementing through Deep
Learning Mechanism (2019) International Journal of Recent Technology and Engineering
(IJRTE) ISSN: 2277-3878, Volume-7 Issue-6S2, April 2019

[12] A. G. Sarip, M. B. Hafez, and M. N. Daud, “Application of fuzzy regression model for
real estate price prediction,” Malaysian Journal of Computer Science, vol. 29, no. 1, pp. 15–
27, 2016.

[13] Atharva chogle, Priyanka khaire, Akshata gaud, Jinal Jain. House Price Forecasting using
Data Mining Techniques (2017) International Journal of Advanced Research in Computer and
Communication Engineering ISO 3297:2007 Certified Vol. 6, Issue 12, December 2017

[14] Byeonghwa Park, Jae Kwon Bae (2015). Using machine learning algorithms for housing
price prediction, Volume 42, Pages 2928-2934

[15] Cespedes-Lopez, M.F.; Mora-Garcia, R.T.; Perez-Sanchez, R.; Marti-Ciriquian, P. The
Influence of Energy Certification on Housing Sales Prices in the Province of Alicante (Spain).
Appl. Sci. 2020, 10, 7129.
