International Arab Conference on Information Technology (ACIT)

House Price Prediction using Machine Learning Algorithm - The Case of Karachi City, Pakistan
Maida Ahtesham, Narmeen Zakaria Bawany, Kiran Fatima
Research Center for Computing,
Department of Computer Science and Software Engineering,
Jinnah University for Women, Karachi, Pakistan

Abstract— House prices are a significant indicator of the economy, and their value ranges are of great concern to clients and property dealers. Housing prices escalate every year, which has reinforced the need for a strategy or technique that can predict future house prices. Certain factors influence house prices, including physical condition, location, number of bedrooms and others. Traditionally, predictions are made on the basis of these factors. However, such prediction methods require appropriate knowledge of and experience in this domain. Machine learning techniques have been a significant source of advanced opportunities to analyze, predict and visualize housing prices. In this paper, the gradient boosting model XGBoost is utilized to predict housing prices. A publicly available dataset containing 38,961 records of Karachi city is obtained from an Open Real Estate Portal of Pakistan. A lot of work has been done on predicting house prices across many countries; however, very little work has been done on predicting house prices in Pakistan. Our proposed house price prediction model achieves 98% accuracy.

Keywords— Open Real Estate Portal, Gradient Boosting Model XGBoost, Housing Price Prediction, Machine Learning

I. INTRODUCTION

House price prediction refers to evaluating property prices by using various techniques. It serves as a first-hand assistant for people purchasing or selling properties [1]. Despite a large increase in property demand, there is no appropriate mechanism that could help predict future house prices.

Machine learning has been used for image recognition, spam recognition and medical diagnosis for more than a decade. Machine learning based predictions achieve better results when put into practice. Almost every economic domain now benefits from machine learning prediction models. In this research paper, house price prediction is performed using the machine learning technique XGBoost. The dataset has been taken from an Open Real Estate Portal of Pakistan [9]. This is a large dataset, as it comprises housing records of many cities of Pakistan, including Karachi, Islamabad, Lahore, Rawalpindi and Faisalabad. This research paper focuses on predicting housing prices of Karachi using the publicly available dataset. The housing dataset consists of 38,961 records with a distinct set of features. A computational experiment has been performed to develop a prediction model with high accuracy and low mean absolute error (MAE).

The rest of the paper is structured as follows. Section 2 presents a literature review of house price prediction and a background study. Section 3 covers the processing and analysis of the data. Section 4 presents the implementation of the machine learning technique applied for house price prediction, followed by empirical results and discussion in Section 5. The conclusion of the study is outlined in Section 6.

II. BACKGROUND AND RELATED WORK

Machine learning focuses on developing self-learning algorithms that project future activity based on previous data. House price prediction works on a similar principle. This section presents various concepts and existing studies in this domain. Many researchers have worked on house price prediction models. According to Zhao et al. [1], the process of developing an opinion of value is an important tool for evaluating property values when purchasing, selling, insuring, lending on or taxing residential property. They applied deep learning in combination with eXtreme Gradient Boosting (XGBoost) for real estate price prediction by analyzing historical property sale records. The dataset was extracted from an online real estate website. The data was split into 80% as the training set and 20% as the testing set. Each record in the dataset contains address, bedrooms, bathrooms, ensuites, garages, land size, and property image. The XGBoost hybrid model achieved a mean absolute percentage error of 8.70%, compared with 13.01% for the k-NN hybrid model. The experimental evaluation suggests that deep learning combined with XGBoost can help attain better results. According to Satish et al. [2], regression deals with specifying the relationship between a dependent variable (also called the response or outcome) and an independent variable (or predictor). The study aimed to predict future house prices with the help of machine learning algorithms. They compared and explored various prediction methods in order to select one. Lasso regression was selected as their model because of its adaptable and probabilistic methodology; the other machine learning models were XGBoost and Neural System. The study found that Lasso regression, in terms of accuracy, consistently outperforms the others in house price prediction.

Another study by T. D. Phan [3] used machine learning techniques to analyze historical property transactions. The study aimed to extract useful information from recorded data on the property market of Melbourne city, Australia and to discover useful models to anticipate the


Authorized licensed use limited to: University of Prince Edward Island. Downloaded on June 01,2021 at 01:09:59 UTC from IEEE Xplore. Restrictions apply.
estimation of a house given a set of attributes. The dataset consists of 34,857 observations and 21 attributes. The study showed a high disparity between house costs in the most costly and the most moderately priced rural areas of the city of Melbourne. In that paper, different regression models were implemented in order to obtain better results. It was demonstrated that the combination of Stepwise and Support Vector Machine, evaluated on the Mean Squared Error measure, is an efficient approach. It was observed that the regression tree is as good as linear regression, but polynomial regression resulted in lower errors, whereas the neural network did not work efficiently with the dataset. The study of A. Chouthai et al. [4] predicted house prices using different machine learning algorithms to build the prediction model for houses; logistic regression, support vector regression, Lasso regression and decision tree were employed. The study contains data on 100 homes along with their parameters. The dataset was divided with 50% to train the machine and 50% for testing purposes. The study reported accurate results.

A. Sinha [5] employed different machine learning techniques for predicting house prices. The Ordinary Least Squares algorithm was used in the analysis. Various factors were taken into account to predict the price, such as lot size, bedrooms, bathrooms, location, drawing room, material used in the house, interiors, parking area and, mainly, the area in square feet. As the scope of that paper is to predict the house cost, several matrices are used for feature extraction, and these variables are called the feature data set. The study showed that the location of the house, along with the amenities, was highly influential. The study of Z. Peng et al. [6] aimed to predict the price of second-hand houses more accurately. A dataset with 35,417 observations extracted from the Chengdu HOME LINK network was used. The dataset was preprocessed: it was cleaned by removing inconsistent or incomplete data, anomalies and outdated information were corrected, and important characteristics were selected. In this way 27,961 records were obtained. Afterwards, multiple linear regression, decision tree and XGBoost models were used for predicting the housing price score curve, and the appropriate prediction model was selected with considerable preprocessing. The results showed that the accuracy obtained by the XGBoost prediction model was the highest, with a score of 0.9251; the XGBoost model proved to be efficient among the others, having good classification and regression properties and aiding the processing of such unbalanced data.

C. S. Rolli [7] analyzed the real estate property prices of three counties in California with the help of machine learning algorithms. The study predicted selling and demand prices of houses using features such as bedroom count, bathroom count, geographical location, kitchen size, square feet etc. Multiple machine learning algorithms were used, such as Linear Regression, Gradient Boosting, and Random Forest Regression. 90% of the data was used as the training dataset and 10% as the testing dataset. It was concluded that, among all regression techniques, XGBoost achieved the best results.

Linear regression is commonly used for predictive analysis. This predictive analysis helps to determine the effect or impact of change. XGBoost, or eXtreme Gradient Boosting, is a machine learning algorithm used for regression problems and is known for its flexibility, performance and speed. The name XGBoost refers to pushing the limits of computational resources for boosted tree algorithms.

This algorithm has certain features that help in achieving greater efficiency and model performance. The tree boosting system by Tianqi Chen [8] is a highly productive machine learning algorithm that supports parallel and distributed computing, which speeds up learning, and it predicts with high accuracy.

House price prediction helps the developer in forecasting prices in a realistic range, which also helps clients decide when and where to buy a house. Buying a suitable house is getting difficult due to rising prices. This paper aims to address the housing market problem of Karachi city, which is the third mega city of the world. Very limited work has been done on house price prediction in Pakistan. In order to explore the house pricing trends in Karachi, the biggest city of Pakistan, this work utilizes the dataset available at the Open Real Estate Portal of Pakistan [9]. The study presents results on the basis of various train/test ratios, i.e. 60/40, 70/30 and 50/50.

III. DATA ANALYSIS

A. Data Exploration

The real estate property dataset was collected from the property data for Pakistan published on the Open Data Pakistan website [9]. The original dataset contains 168,447 instances and 20 features or variables, as given in Table I. It includes property listings of various cities of Pakistan, i.e. Islamabad, Rawalpindi, Lahore, Faisalabad, and Karachi.

TABLE I. FEATURE DESCRIPTION OF ACTUAL DATASET

| No. | Name | Type | Description |
|-----|------|------|-------------|
| 1 | Property_id | Numerical | Different types of properties i.e. House and flat |
| 2 | Location_id | Numerical | Locations or areas where the property is situated |
| 3 | Page_url | Categorical | Property advertisement link |
| 4 | Property_type | Categorical | House type, Flat, Portion etc |
| 5 | Price | Numerical | House price (Prediction outcome) |
| 6 | Location | Categorical | House location |
| 7 | City | Categorical | City located |
| 8 | Province | Categorical | Where the city (Karachi) is located |
| 9 | Latitude | Categorical | House latitude |
| 10 | Longitude | Categorical | House longitude |
| 11 | Baths | Numerical | Number of bathrooms |
| 12 | Area | Categorical | House Area |
| 13 | Purpose | Categorical | For sale or rent |
| 14 | Bedrooms | Numerical | Number of Bedrooms |
| 15 | Date_added | Numerical | Date of advertising |
| 16 | Agency | Categorical | Advertising Agency |
| 17 | Agent | Categorical | Advertising agent |

| 18 | Area_type | Categorical | Square Foot (1 Square = 0.0036 Marla) |
| 19 | Area Size | Numerical | House area |
| 20 | Area Category | Categorical | House area category |

B. Data Preprocessing

Data preprocessing is applied to the dataset to make sure the data is suitable for use, and it is performed before the models are applied for house price prediction. From the data on various cities of Pakistan, the Karachi records are specifically selected for house price prediction. After selecting the Karachi dataset, missing data is investigated, and rows with missing values are removed. Preprocessing also involves removing features that are less effective. The preprocessed dataset consists of 38,961 records and 14 features. The final dataset obtained after preprocessing is given in Table II.

TABLE II. FEATURE DESCRIPTION OF DATASET AFTER PREPROCESSING

| No. | Name | Type | Description |
|-----|------|------|-------------|
| 1 | Property_id | Numerical | Different types of properties i.e. House and flat |
| 2 | Location_id | Numerical | Locations or areas where the property is situated |
| 3 | Property_type | Categorical | House type, Flat, Portion etc |
| 4 | Price | Numerical | House price (Prediction outcome) |
| 5 | Location | Categorical | House location |
| 6 | City | Categorical | City located |
| 7 | Province | Categorical | Where the city (Karachi) is located |
| 8 | Latitude | Categorical | House latitude |
| 9 | Longitude | Categorical | House longitude |
| 10 | Baths | Numerical | Number of bathrooms |
| 11 | Area | Categorical | House Area |
| 12 | Bedrooms | Numerical | Number of Bedrooms |
| 13 | Area Size | Numerical | House area |
| 14 | Area Category | Categorical | House area category |

IV. IMPLEMENTATION

A. Reading Data

This study is performed using Python machine learning libraries. Dataset files are loaded using the pandas library. The other libraries used are numpy, xgboost, scikit-learn, Matplotlib and Seaborn.

B. Removing Missing Data

In order to fit the data to the model, a pre-assessment of missing data is performed. The dataset contained features with missing or NaN values. Because the number of entries in the dataset was initially high, entries with missing values are removed.

C. Separating Categorical and Numerical Data

In order to prepare the data for training, categorical data is separated from numerical data, after which the categorical data is transformed into numerical data. Features having low correlation with the target variable were removed. The sale price correlation matrix of the dataset, generated using Pearson correlation on 38,960 records, is shown in Fig. 1.

Fig. 1. SalePrice Correlation Matrix

D. Feature Selection

Identifying important features is an essential step. Generalizing a model with less data is difficult, as too few features fail to represent the data well. The features are therefore assessed as less or more important for house price prediction. Less important features are removed using "crcols.remove", and important or partially important features are kept for further processing.

E. Split Dataset into Training and Testing

Splitting the dataset into training and testing sets divides the data into smaller sets for building and validating the model. The training set has known outputs from which the model learns, whereas the testing set is used to test the model's predictions. The prediction model is checked for accuracy and MAE with multiple train/test ratios of 60/40, 50/50 and 70/30.

F. Using XGBRegressor

The XGBoost algorithm is used for house price prediction. This algorithm provides flexibility, speed and performance. This study prefers XGBoost in order to get better accuracy and lower MAE. The model is trained using XGBRegressor and is validated using the validation dataset.
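
The encoding and correlation-based filtering described in Sections C and D can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the authors' code: the paper does not say which encoding or correlation cut-off was used, so the label encoding, the 0.5 threshold, the helper names (`label_encode`, `pearson`, `select_features`) and the toy values are all hypothetical.

```python
import math


def label_encode(values):
    """Map each distinct category to an integer code in order of first appearance."""
    codes = {}
    return [codes.setdefault(value, len(codes)) for value in values]


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length numeric sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant column carries no linear signal
    return cov / math.sqrt(var_x * var_y)


def select_features(columns, target, threshold):
    """Return the names of columns whose absolute correlation with the target meets the threshold."""
    return [name for name, values in columns.items()
            if abs(pearson(values, target)) >= threshold]


# Hypothetical toy listings; the values are illustrative, not taken from the paper's dataset.
price = [100.0, 150.0, 200.0, 250.0, 300.0]
columns = {
    "bedrooms": [1, 2, 3, 4, 5],
    # Assumption: categorical values are turned into integer codes before correlating.
    "property_type": label_encode(["Flat", "Flat", "House", "House", "House"]),
    "listing_day": [3, 1, 2, 1, 3],  # chosen to have no linear relationship with price
}
kept = select_features(columns, price, threshold=0.5)
print(kept)
```

On this toy data the filter keeps `bedrooms` and `property_type` and drops `listing_day`. With pandas, the same matrix would typically come from the data frame's built-in correlation method rather than a hand-written function.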

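The splitting and boosting steps of Sections E and F can be illustrated with a small, dependency-free sketch. It is not XGBoost itself and not the authors' implementation: it assumes a single numeric feature, squared-error loss and one-split trees (stumps), and it omits the regularization, second-order gradient terms and parallelism that XGBoost adds. The helper names and the synthetic data are hypothetical; only the 60/40, 70/30 and 50/50 ratios and the MAE metric come from the paper.

```python
import random


def split_rows(rows, train_fraction, seed=0):
    """Shuffle rows reproducibly, then cut them into a training and a testing part."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


def fit_stump(xs, residuals):
    """Return (threshold, left_value, right_value) of the one-split tree with least squared error."""
    overall = sum(residuals) / len(residuals)
    best = (float("inf"), float("inf"), overall, overall)  # fallback when no split exists
    for threshold in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = (sum((r - left_mean) ** 2 for r in left)
                 + sum((r - right_mean) ** 2 for r in right))
        if error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    return best[1:]


def fit_boosted(xs, ys, rounds=60, learning_rate=0.1):
    """Gradient boosting for squared error: each stump is fitted to the current residuals."""
    base = sum(ys) / len(ys)
    predictions = [base] * len(ys)
    stumps = []
    for _ in range(rounds):
        residuals = [y - p for y, p in zip(ys, predictions)]
        threshold, left_value, right_value = fit_stump(xs, residuals)
        stumps.append((threshold, left_value, right_value))
        predictions = [p + learning_rate * (left_value if x <= threshold else right_value)
                       for x, p in zip(xs, predictions)]
    return base, stumps, learning_rate


def predict(model, x):
    """Sum the base value and the shrunken contribution of every stump."""
    base, stumps, learning_rate = model
    return base + sum(learning_rate * (lv if x <= t else rv) for t, lv, rv in stumps)


def mean_absolute_error(actual, predicted):
    """Average absolute difference between actual and predicted values."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


# Hypothetical synthetic listings: price grows linearly with area (not the paper's data).
rows = [(float(area), 50.0 * area) for area in range(100, 1100, 10)]

results = {}
for train_fraction in (0.6, 0.7, 0.5):  # the 60/40, 70/30 and 50/50 ratios named in the paper
    train, test = split_rows(rows, train_fraction)
    train_x = [x for x, _ in train]
    train_y = [y for _, y in train]
    test_y = [y for _, y in test]
    model = fit_boosted(train_x, train_y)
    boosted_mae = mean_absolute_error(test_y, [predict(model, x) for x, _ in test])
    mean_price = sum(train_y) / len(train_y)
    baseline_mae = mean_absolute_error(test_y, [mean_price] * len(test))
    results[train_fraction] = (len(train), len(test), boosted_mae, baseline_mae)
    print(train_fraction, round(boosted_mae, 2), round(baseline_mae, 2))
```

Comparing the boosted MAE with a predict-the-mean baseline gives a quick sanity check that the model has learned something from the features. In the study itself, training is done with `XGBRegressor` from the xgboost library rather than a hand-written booster.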
V. RESULT AND DISCUSSION

This house price prediction experiment was carried out using the XGBoost algorithm in a Python notebook. XGBoost is an implementation of the gradient boosted decision tree algorithm. It was designed to push the computational limits of boosted tree algorithms. This optimized distributed gradient boosting library was selected for its fast and flexible nature; it is best suited to tabular datasets and to classification and regression models. XGBoost supports parallel processing and is reported by its authors to run more than ten times faster than existing popular solutions on a single machine [8]. The dataset employed in this experiment is taken from the open real estate property dataset of Pakistan, from which the Karachi records were specifically selected.

In order to predict house prices, several techniques are used, such as feature selection. Table II shows the list and details of the features used for house price prediction. This dataset comprises 14 features and 38,961 records. Feature selection is the procedure of manually or automatically selecting the attributes that contribute to the prediction variable. Initially, feature selection is used in the preprocessing stage, where a number of features are omitted on the basis of their weak association with the predicted attribute. Later on, while implementing the XGBoost model, more features were dropped to ensure efficient results. The dataset is tested using various training and testing ratios to obtain multiple model accuracy values and mean absolute errors.

The accuracy and mean absolute error obtained are presented in Table III for the various training and testing ratios.

TABLE III. TRAIN TEST RATIO

| No. | Train/Test Ratio | Learning Rate | Model Accuracy | Mean Absolute Error |
|-----|------------------|---------------|----------------|---------------------|
| 1 | 60/40 | 0.01 | 98% | 22502.0824694 |
| 2 | 50/50 | 0.01 | 98% | 22502.0824694 |
| 3 | 70/30 | 0.01 | 98% | 22502.0824694 |

VI. CONCLUSION

An accurate estimate of house prices is necessary in order to purchase real estate property. The price of a real estate property depends on various factors. Machine learning algorithms are considered efficient techniques for predicting house prices. This paper provides insight into the XGBoost algorithm as a useful algorithm for house price prediction that provides flexible and efficient results. The dataset used for the experiment was obtained from the Open Data Pakistan website. It is a real estate property dataset of Karachi city that contains 38,961 records and features including location, city, property type (Flat, House), number of bedrooms, baths, longitude, latitude, area and price. A preliminary assessment of the dataset is done, including removing NaN values and finding the characteristics that best match the predicted value. Model accuracy and mean absolute error are calculated using the XGBoost algorithm. The results obtained show that, compared to the other models used in predictions, the XGBoost model performed best and provided a high rate of accuracy.

REFERENCES

[1] Y. Zhao, G. Chetty and D. Tran, "Deep Learning with XGBoost for Real Estate Appraisal," in IEEE Symposium Series on Computational Intelligence (SSCI), Xiamen, China, 2019.
[2] G. N. Satish, C. V. Raghavendran, M. D. S. Rao and C. Srinivasulu, "House Price Prediction Using Machine Learning," International Journal of Innovative Technology and Exploring Engineering, vol. 8, no. 9, pp. 717–722, 2019.
[3] T. D. Phan, "Housing price prediction using machine learning algorithms: The case of Melbourne city, Australia," Proceedings - International Conference on Machine Learning and Data Engineering, iCMLDE 2018, pp. 8–13, 2019.
[4] A. Chouthai, M. A. Rangila, S. Amate, P. Adhikari and V. Kukre, "HOUSE PRICE PREDICITION USING MACHINE LEARNING," pp. 4403–4406, 2019.
[5] A. Sinha, "Utilization Of Machine Learning Models In Real Estate House Price Prediction," vol. 4, no. 1, pp. 18–23.
[6] Z. Peng, Q. Huang and Y. Han, "Model Research on Forecast of Second-Hand House Price in Chengdu Based on XGboost Algorithm," 2019 IEEE 11th International Conference on Advanced Infocomm Technology, ICAIT 2019, pp. 168–172, 2019.
[7] C. S. Rolli, "ZILLOW HOME VALUE PREDICTION USING XGBOOST," California State University San Marcos, 2019.
[8] T. Chen and C. Guestrin, "XGBoost: A Scalable Tree Boosting System," in KDD '16: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2016.
[9] "Open Data Pakistan," https://opendata.com.pk/dataset/property-data-for-pakistan
[10] S. Xiong, Q. Sun and A. Zhou, "Improve the House Price Prediction Accuracy with a Stacked Generalization Ensemble Model," in International Conference on Internet of Vehicles, 2019.

[11] Q. Truong, M. Nguyen, H. Dang and B. Mei, "Housing Price Prediction via Improved Machine Learning Techniques," in 2019 International Conference on Identification, Information and Knowledge in the Internet of Things (IIKI2019), 2019.
[12] L. Mrsic, H. Jerkovic and M. Balkovic, "Real Estate Market Price Prediction Framework Based on Public Data Sources with Case Study from Croatia," in Asian Conference on Intelligent Information and Database Systems, 2020.
[13] U. K. Cinar, "Combining Domain Knowledge & Machine Learning: Making Predictions using Boosting Techniques," in ICAAI 2019: Proceedings of the 2019 3rd International Conference on Advances in Artificial Intelligence, 2019.
[14] T. Mohd, N. S. Jamil, N. Johari, L. Abdullah and S. Masrom, "An Overview of Real Estate Modelling Techniques for House Price Prediction," in Charting a Sustainable Future of ASEAN in Business and Social Sciences, 2020.
[15] "Machine Learning Mastery," https://machinelearningmastery.com/gentle-introduction-xgboost-applied-machine-learning/
