House Price Prediction Modeling Using Machine Learning
Dr. S P. Malarvizhi
Associate Professor, Sri Vasavi Engineering College, Andhra Pradesh, India
Email: [email protected]
Abstract—Machine Learning has grown rapidly in this decade, and new applications and algorithms emerge day by day. One such application found in journals is house price prediction. House prices increase every year, which has necessitated the modeling of house price prediction. The models constructed help customers to purchase a house suitable for their needs. The proposed work makes use of attributes or features of the houses such as the number of bedrooms, the age of the house, the travelling facility from the location, the school facility available near the house and the shopping malls available near the house location. House availability based on the desired features of the house and house price prediction are modeled in the proposed work, and the model is constructed for a small town in West Godavari district of Andhra Pradesh. The work involves decision tree classification, decision tree regression and multiple linear regression and is implemented using the Scikit-Learn machine learning tool.

Index Terms—Decision tree, house price prediction, decision tree regression, multiple linear regression.

Copyright © 2020 MECS I.J. Information Engineering and Electronic Business, 2020, 2, 15-20

I. INTRODUCTION

Data mining is the extraction of knowledge or useful patterns from large databases. Classification is one of the data mining functionalities, employed for finding a model for the class attribute as a function of the other attribute values [1].

A Decision Tree is a tool that can be employed for classification and prediction. It has a tree-shaped structure in which every internal node represents a test on an attribute and the branches out of the node denote the test outcomes.

80% of the known dataset can be used as the training set and 20% as the test set. Each record in the dataset consists of X and Y values, where X is a set of attribute values and Y is the class of the record, which is the last attribute in the dataset. A Decision Tree Classifier model is constructed using the training set and tested with the test data to identify the accuracy level of the classifier.

Decision Tree formation, as shown in Fig. 1, employs a divide and conquer strategy that splits the training data into subsets by testing an attribute value. This involves attribute selection measures; the attribute tested first is the one with the highest information gain. The same splitting process is performed recursively on the derived subsets [2]. The splitting of a subset ends when all its tuples belong to the same class attribute value or when no attributes or instances remain. Decision Tree formation does not need any basic domain knowledge. It can handle high-dimensional data as well. Decision Tree Classifiers have good accuracy in classification.

Once the Decision Tree is formed, new instances can be classified easily by tracing the tree from the root to a leaf node. Classification through a Decision Tree does not require much computation. Decision Trees are capable of handling both continuous and categorical attributes.

To avoid generating meaningless and unwanted rules, a Decision Tree should not grow too deep, since that results in overfitting. An overfitted tree is more accurate on the training data and less accurate on the test data. Pre-pruning and post-pruning are the techniques used to reduce the size of the tree and avoid overfitting. In post-pruning, the branches, and hence the depth of the tree, are reduced after the tree has been completely built. In pre-pruning, care is taken to avoid overfitting while the tree is being built.

Decision Trees find major applications in areas such as medicine, weather, finance, entertainment and sports. Decision Trees can also be used for prediction, data manipulation and handling of missing values. As an example, in digital mammography they are used for classifying tumor cells and normal cells [3].

This paper discusses an application of Decision Trees for purchasing a house in a city based on attribute values such as transport facilities, number of bedrooms, and availability of schools, shopping facilities and medical facilities.

Fig. 1. Decision Tree Structure

Patel and Upadhyay [4] have discussed various pruning methods and their features and evaluated pruning effectiveness. They have also measured the accuracy for the glass and diabetes datasets using the WEKA tool, considering various pruning factors. The ID3 algorithm splits attributes based on their entropy. TDIDT is an algorithm that constructs a set of classification rules through the intermediate representation of a decision tree [5,6]. The Weka interface [7] is used for testing data sets by means of a variety of open source machine learning algorithms.

Fan et al. [8] utilized a decision tree approach for finding the resale prices of houses based on their significant characteristics. In that paper, a hedonic-based regression method is employed for identifying the relationship between the prices of the houses and their significant characteristics. Ong et al. [9] and Berry et al. [10] have also used hedonic-based regression for house price prediction based on significant characteristics.

Shinde and Gawande [11] predicted the sale price of houses using various machine learning algorithms such as lasso, SVR, logistic regression and decision tree, and compared their accuracy. Alfiyatin et al. [12] modeled a system for house price prediction using regression and Particle Swarm Optimization (PSO). In that paper, it was shown that the house price prediction accuracy is improved by combining PSO with regression.

Timothy C. Au [13] addressed the absent levels problem in random forests, decision trees and categorical predictors. Using three real data sets, the author illustrated how absent levels affect the performance of the predictors.

III. ATTRIBUTE SELECTION MEASURES

Attribute selection is the process of removing redundant attributes that are considered inappropriate for the data mining task [14]. The ultimate goal of attribute selection algorithms is a desirable set of attributes that produces classification results analogous to those obtained using all the attributes. Measures for selecting the best split attribute are defined in terms of the impurity reduction from the parent node (before splitting) to the child nodes in the tree [15]. A larger reduction of impurity means the selected split attribute is a better one. Many attribute selection measures exist. The more prominent ones, which lead to better results in terms of accuracy, are the following three.

A. Information Gain

Information Gain measures the expected reduction in entropy obtained by partitioning the examples based on an attribute. The ID3 algorithm uses information gain as its attribute selection measure. The entropy of D is given in equation (1).

$$\mathrm{Info}(D) = -\sum_{i=1}^{m} P_i \log_2(P_i) \tag{1}$$

where P_i is the probability that a tuple in D belongs to the class C_i and m is the number of classes.

B. Gain Ratio

Information gain is biased towards tests with many outcomes. For an attribute with a large number of distinct values, the information gain obtained by splitting on that attribute is the highest, yet such a split is useless for classification. ID3 was followed by its successor C4.5, which uses the Gain Ratio of equation (2) as an extension of information gain.

$$\mathrm{GainRatio}(A) = \frac{\mathrm{Gain}(A)}{\mathrm{SplitInfo}(A)} \tag{2}$$

where Gain(A) is the expected reduction in the information requirement caused by knowing the value of attribute A, as shown in equation (3). SplitInfo(A), given in equation (4), is defined analogously to Info(D), which is also known as the entropy of D.

$$\mathrm{Gain}(A) = \mathrm{Info}(D) - \mathrm{Info}_A(D) \tag{3}$$

$$\mathrm{SplitInfo}(A) = -\sum_{j=1}^{v} \frac{|D_j|}{|D|} \log_2\left(\frac{|D_j|}{|D|}\right) \tag{4}$$

where D_j is the subset of D that takes the j-th of the v values of attribute A.

C. Gini Index

The Gini index is used in CART (Classification and Regression Trees). It measures the impurity of D, a data partition or set of training tuples, as given in equation (5).

$$\mathrm{Gini}(D) = 1 - \sum_{i=1}^{m} P_i^2 \tag{5}$$

P_i is the probability that a tuple in D belongs to the class C_i.

IV. PROPOSED METHOD

The proposed work aims at predicting the availability of houses based on different features of the houses and the facilities available near their location. The work also includes price prediction for the houses based on the features of the house and the facilities near its location.
This work includes two parts, namely:

(i) A Decision Tree Classifier is used to predict the availability of houses as per the users' requirement constraints; it responds yes or no to tell whether a house is available or not.
(ii) Decision tree regression and multiple linear regression methods are used to predict the prices of the houses.

A real-time dataset is prepared by analyzing the location named Tadepalligudem in West Godavari District of Andhra Pradesh, India. The dataset contains the following features of the houses: number of bedrooms, age of the house, transport facility, schools available in the nearby location and shopping facilities.

The proposed method helps to search for houses in big cities based on these attributes.

The proposed work is implemented using Scikit Learn, a machine learning tool.

A. Scikit Learn

Scikit-Learn (SK Learn) is a Python scientific toolbox for machine learning and is based on SciPy, a well-established Python ecosystem for science, engineering and mathematics. Scikit-learn provides a rich environment with state-of-the-art implementations of many well-known machine learning algorithms, while maintaining an easy-to-use interface tightly integrated with the Python language [16],[17]. Scikit-learn features various functionalities such as clustering, regression and classification algorithms, including random forests, gradient boosting, support vector machines, k-means and DBSCAN, and it is designed to interoperate with the Python scientific and numerical libraries SciPy and NumPy.

The step-by-step implementation using SK Learn is as follows.

Step 1: Import the required libraries.
Step 2: Load the dataset.
Step 3: Assign the values of columns 1 to 6 in the dataset to "X".
Step 4: Assign the values of column 7, which is the class label, to "Y".
Step 5: Fit the decision tree classifier to the dataset.
Step 6: Predict the class label for the test data.

The decision tree classifier shown in Fig. 2 is constructed using Scikit Learn, and its specifications are shown below. It uses the Gini index as the measure to select the relevant attributes for testing and splitting the training set.

DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
    max_features=None, max_leaf_nodes=None,
    min_impurity_decrease=0.0, min_impurity_split=None,
    min_samples_leaf=1, min_samples_split=2,
    min_weight_fraction_leaf=0.0, presort=False,
    random_state=None, splitter='best')

Fig. 2. Decision Tree classifier for house availability

V. EXPERIMENTAL EVALUATION

A. House Availability Prediction

The Decision Tree output for classifying the availability of houses takes discrete binary values, Yes or No. The output of the decision tree regression used for house price prediction is continuous; the continuous values (prices) are predicted with the help of a decision tree regression model.

Table 1 shows a sample of ten records with some essential features for the area, Tadepalligudem in West Godavari District of Andhra Pradesh. The original dataset consists of 50 records of different combinations. In the table, attribute 3 (travelling facility) ranges from 1 to 3, where 1 denotes that a bus facility is available nearby, 2 denotes that bus and train facilities are available nearby and 3 denotes that both bus and train facilities are farther from the house location. In the shopping facility attribute, 1 denotes limited shopping facilities with a vegetable market and small grocery shops, 2 denotes departmental stores and some small malls and 3 denotes supermarkets with all facilities. In the school attribute, 1 denotes that only government schools are available nearby, 2 denotes that government and private schools are available nearby and 3 denotes that government, private and CBSE schools are all available
near the house location. The Scikit Learn tool splits the data into training and test sets in the ratio 80:20.

number 25 is predicted accurately and record numbers 10 and 34 are predicted with less deviation.
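The six implementation steps listed in Section IV.A, together with the 80:20 split, can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: the 50 records are generated randomly instead of being loaded from the Tadepalligudem dataset, the rule that labels a house as available is hypothetical, the toy feature matrix has five columns rather than the paper's column layout, and `random_state=0` is chosen only for reproducibility.

```python
# Step 1: import the required libraries.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Step 2 (stand-in): build a hypothetical 50-record dataset instead of loading a file.
rng = np.random.default_rng(0)
bedrooms = rng.integers(1, 5, size=50)    # 1 to 4 bedrooms
age = rng.integers(1, 31, size=50)        # age of the house in years
travel = rng.integers(1, 4, size=50)      # travelling facility code, 1 to 3
school = rng.integers(1, 4, size=50)      # school facility code, 1 to 3
shopping = rng.integers(1, 4, size=50)    # shopping facility code, 1 to 3

# Step 3: assign the feature columns to X.
X = np.column_stack([bedrooms, age, travel, school, shopping])

# Step 4: assign the class label to y.
# Hypothetical labelling rule, used only to give the toy data a class column.
y = np.where((bedrooms >= 2) & (travel <= 2), "Yes", "No")

# Split the records into training and test sets in the ratio 80:20.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

# Step 5: fit a decision tree classifier that uses the Gini index.
clf = DecisionTreeClassifier(criterion="gini", random_state=0)
clf.fit(X_train, y_train)

# Step 6: predict the class label for the test data.
y_pred = clf.predict(X_test)
print("Predicted availability:", y_pred.tolist())
print("Test accuracy:", clf.score(X_test, y_test))
```

Only `criterion` is set explicitly here. The specification printed for Fig. 2 lists further parameters such as `presort` and `min_impurity_split`, which recent scikit-learn releases no longer accept, so the sketch leaves all other parameters at their defaults.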
Step 5. Price prediction on test data.

The coefficient values of the developed model for the various attributes are given below.

Coeff 1 to 5 = [[5.8056887 -0.22656665 1.30837976 0.86415419 0.53264177]]
Intercept 0 = [1.6068795]

The influence of each feature on price is shown in Fig. 3 with multiple regression lines.

Fig. 3. Multiple Linear Regression curves for house price prediction

Table 4. Predicted price values

Record No.   Actual Price (Lakhs)   Predicted Price (Lakhs)
5            20                     21.04942136
11           21                     16.86709527
17           19                     21.12892981
20           23                     22.97885908
23           14                     17.73124946
25           13                     12.59635298
29           20                     16.3344535
36           24                     26.0487519
40           11                     11.39291899

The red line in the figure denotes the number of bedrooms, the blue line the age of the house, the green line the travelling facility, the yellow line the shopping facility and the black line the school facility. According to Fig. 3, the number of bedrooms is the attribute with the most influence on price and the age of the house is the attribute with the least influence on price. The predicted price values are given in Table 4. The prices of record numbers 5, 20, 25 and 40 are predicted to within about 1 lakh of the actual price.

The performance metrics for multiple linear regression, namely the MAE, MSE and RMSE values, are given below.

Mean Absolute Error: 1.9527234112192413
Mean Squared Error: 6.0653477870232635
Root Mean Squared Error: 2.462792680479472

The performance of multiple linear regression is better than that of decision tree regression for predicting the prices of the houses. The developed model can be used to predict the availability and prices of houses for any new record as per the user constraints. In general, the accuracy of prediction can be improved by (i) having a large amount of data to get the best possible prediction, (ii) minimizing or eliminating bad assumptions, and (iii) identifying the best features, those that have more correlation with the output price.

VI. CONCLUSION AND FUTURE SCOPE

This article uses fundamental machine learning algorithms, namely decision tree classifier, decision tree regression and multiple linear regression. The work is implemented using the Scikit-Learn machine learning tool. It helps users to predict the availability of houses in the city and also to predict the prices of the houses. Two algorithms, decision tree regression and multiple linear regression, were used in predicting the prices of the houses. Comparatively, the performance of multiple linear regression is found to be better than that of decision tree regression in predicting the house prices. In future, the dataset can be prepared with more features and advanced machine learning techniques can be used for constructing the house price prediction model.

REFERENCES

[1] Jiawei Han, Micheline Kamber, "Data Mining Concepts and Techniques", pp. 279-328, 2001.
[2] J. R. Quinlan, "Simplifying decision trees", Int. J. Human-Computer Studies.
[3] Maria-Luiza Antonie, et al., "Application of Data Mining Techniques for Medical Image Classification", Proceedings of the Second International Workshop on Multimedia Data Mining (MDM/KDD'2001) in conjunction with ACM SIGKDD conference, San Francisco, USA, August 26, 2001.
[4] Nikita Patel and Saurabh Upadhyay, "Study of Various Decision Tree Pruning Methods with their Empirical Comparison in WEKA", International Journal of Computer Applications, Volume 60, No. 12, December 2012, pp. 20-25.
[5] J.R. Quinlan, "C4.5: Programs for Machine Learning", Morgan Kaufmann, New York, 1993.
[6] J.R. Quinlan, "Induction of Decision Trees", Machine Learning 1, 1986, pp. 81-106.
[7] Sam Drazin and Matt Montag, "Decision Tree Analysis using Weka", Machine Learning-Project II, University of Miami.
[8] Gang-Zhi Fan, Seow Eng Ong and Hian Chye Koh, "Determinants of House Price: A Decision Tree Approach", Urban Studies, Vol. 43, No. 12, November 2006, pp. 2301-2315.
[9] Ong, S. E., Ho, K. H. D. and Lim, C. H., "A constant-quality price index for resale public housing flats in Singapore", Urban Studies, 40(13), 2003, pp. 2705-2729.
[10] Berry, J., McGreal, S., Stevenson, S., "Estimation of apartment submarkets in Dublin, Ireland", Journal of Real Estate Research, 25(2), 2003, pp. 159-170.
[11] Neelam Shinde, Kiran Gawande, "Valuation of house prices using Predictive Techniques", International Journal of Advances in Electronics and Computer Science, ISSN: 2393-2835, Volume-5, Issue-6, Jun.-2018, pp. 34-40.
[12] Adyan Nur Alfiyatin, Hilman Taufiq, Ruth Ema Febrita and Wayan Firdaus Mahmudy, "Modeling House Price Prediction using Regression Analysis and Particle Swarm Optimization", (IJACSA) International Journal of Advanced Computer Science and Applications, Vol. 8, No. 10, 2017, pp. 323-326.
[13] Timothy C. Au, "Random Forests, Decision Trees, and Categorical Predictors: The Absent Levels Problem", Journal of Machine Learning Research 19, 2018, pp. 1-30.
[14] K.C. Tan, E.J. Teoh, Q. Yu, K.C. Goh, "A hybrid evolutionary algorithm for attribute selection in data mining", Department of Electrical and Computer Engineering, National University of Singapore, 4 Engineering Drive 3, Singapore 117576, Singapore. Rochester Institute of Technology, USA.
[15] Liangxiao Jiang, Chaoqun Li, "An Empirical Study on Attribute Selection Measures in Decision Tree Learning", Journal of Computational Information Systems 6:1, 2010, pp. 105-112.
[16] http://scikit-learn.org/stable/index.html
[17] http://scikit-learn.org/stable/auto_examples/index.html

Authors' Profiles

Dr. M. Thamarai received the Ph.D. degree in Digital Image Processing from Anna University, Chennai in 2014. She has been working as a Professor in the ECE department at Sri Vasavi Engineering College, Andhra Pradesh since 2018. She has participated and published papers in many National and International Conferences and has also published 15 papers in National and International journals. Her research interests are Digital Image Processing, Video Coding, Machine Learning and VLSI implementation of image processing algorithms.

Dr. SP. Malarvizhi received the Ph.D. degree in Data Mining from Anna University, Chennai in 2016. She has been working as an Associate Professor in the CSE department at Sri Vasavi Engineering College, Andhra Pradesh since 2017. She has participated and published papers in many National and International Conferences and has also published 8 papers in National and International journals. Her research interests are Data Mining, Big Data and Machine Learning.
How to cite this paper: M. Thamarai, S P. Malarvizhi, "House Price Prediction Modeling Using Machine Learning", International Journal of Information Engineering and Electronic Business (IJIEEB), Vol.12, No.2, pp. 15-20, 2020. DOI: 10.5815/ijieeb.2020.02.03