Project Report Gr-12
FUTURE INSTITUTE OF
ENGINEERING AND MANAGEMENT
CERTIFICATE OF THE PROJECT WORK
We hereby certify that the work presented in the Major Project Report entitled
“HOUSE PRICE PREDICTOR”, in partial fulfilment of the requirements for the award of the
Bachelor of Computer Application and submitted to the Department of BCA of Future Institute
of Engineering and Management, Kolkata, WB, is an authentic record of our own work carried out
during the period from 15.05.2022 to 14.06.2022 under the supervision of Prof. Bidisha Patra. The
matter presented in this thesis has not been submitted by us for the award of any other degree
elsewhere.
29401219023
29401219035
29401219055
29401219057
29401219051
This is to certify that the above statement made by the candidates is correct to the best of my
knowledge.
Name of Supervisor
[Designation of Supervisor]
Date
Acknowledgement
I take this opportunity to express my deep gratitude and sincerest thanks to my Project
mentor, Prof. Bidisha Patra, for the most valuable suggestions and helpful guidance.
I would like to give special recognition to my colleagues. Last but not least, I am
grateful to all the faculty members of our department for their support.
TABLE OF CONTENTS
Abstract
1. INTRODUCTION
1.1 AIM and IMPORTANCE
1.1.1 Aim
1.1.2 Need and Motivation
1.1.3 Methodology
1.1.4 Evaluation metrics, Computer specifications
2. DATASET
4. RESULTS AND DISCUSSIONS
Abstract
House Price Index (HPI) is commonly used to estimate the changes in housing price.
Since housing price is strongly correlated to other factors such as location, area,
population, it requires other information apart from HPI to predict individual housing
price. A considerable number of papers have adopted traditional machine
learning approaches to predict housing prices accurately, but they rarely examine
the performance of individual models, and they neglect the less popular yet complex models.
As a result, to explore various impacts of features on prediction methods, this paper will
apply both traditional and advanced machine learning approaches to investigate the
difference among several advanced models. This paper will also comprehensively
INTRODUCTION
Aim
• Identify the important home price attributes which feed the model’s predictive power.
Need and Motivation
Having lived in India for so many years, if there is one thing that I had been taking for
granted, it is that housing and rental prices continue to rise. Since the housing crisis of
2008, housing prices have recovered remarkably well, especially in major housing
markets. However, in the 4th quarter of 2016, I was surprised to read that Kolkata
housing prices had fallen the most in the last 4 years. In fact, median resale prices for
condos and coops fell 6.3%, marking the first time there was a decline since Q1 of 2017.
The decline has been partly attributed to political uncertainty domestically and abroad
and the 2014 election. A prediction model can therefore help maintain transparency for
customers and make price comparison easy. If a customer finds that the price of a house on
a given website is higher than the price predicted by the model, they can reject that
house.
Methodology
The experiment pre-processes the data and evaluates the prediction accuracy of the
models. It has multiple stages that are required to get the prediction results. These
stages are:
Pre-processing: both datasets will be checked and pre-processed. There are
various ways of handling the data, so the preprocessing is done over multiple iterations,
and each time the accuracy is evaluated for the combination used.
Data splitting: dividing the dataset into two parts is essential, so that the model is trained
on one part and evaluated on the other. The dataset will be split 75% for training and
25% for testing.
Evaluation: the accuracy on both datasets will be evaluated by measuring R² and
RMSE when training the model, alongside a comparison of the actual prices in the
test dataset with the prices predicted by the model.
Performance: alongside the evaluation metrics, the required time to train the model
Correlation: the correlation between the available features and the house price will be
evaluated using the Pearson correlation coefficient to identify whether the features
Evaluation Metrics
The prediction accuracy will be evaluated by measuring the Root Mean Square Error (RMSE)
of the model used in training. RMSE measures the typical size of the error between the actual and
predicted prices, expressed in the same units as the price.
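As a minimal sketch of these metrics, RMSE and R² can be computed with NumPy as shown here. The helper names `rmse` and `r2` and the price values are made up for illustration; they are not results or code from this project.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root Mean Square Error: square root of the mean squared residual."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return float(1.0 - ss_res / ss_tot)

# Hypothetical actual and predicted prices (illustrative only).
actual = [20.0, 25.0, 30.0, 35.0]
predicted = [22.0, 24.0, 29.0, 38.0]

print("RMSE:", rmse(actual, predicted))
print("R2:", r2(actual, predicted))
```

A lower RMSE means the predictions are closer to the actual prices, while an R² closer to 1 means the model explains more of the variation in price.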
Computer Specifications
The time needed to train the model depends on the capability of the system used during the
experiment. Some libraries use GPU resources instead of the CPU to train a model in less
time.
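One simple way to record training time on a given machine is to wrap the `fit` call in a timer. The sketch below assumes scikit-learn and uses synthetic data; the data, the model settings and the variable names are illustrative assumptions, not the project's actual setup.

```python
import time

import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic data standing in for the real features and prices (assumption).
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5))
y = X @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) + rng.normal(scale=0.1, size=500)

model = RandomForestRegressor(n_estimators=50, random_state=0)

# perf_counter is a monotonic clock suited to measuring short durations.
start = time.perf_counter()
model.fit(X, y)
elapsed = time.perf_counter() - start

print(f"Training time: {elapsed:.3f} s")
```

The measured value will differ between machines, which is why the system specification should be reported next to any timing.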
DATASET
We scraped the data from the “UCI Machine Learning Repository” website,
which is a collection of databases, domain theories, and data generators that are used by
the machine learning community for the empirical analysis of machine learning
algorithms.
Data Exploration
Data exploration is the first step in data analysis and typically involves summarizing the main
characteristics of a data set, including its size, accuracy, initial patterns in the data and other
attributes. It is commonly conducted by data analysts using visual analytics tools, but it can
also be done with more advanced statistical software, such as Python. Before it can conduct analysis on
data collected by multiple data sources and stored in data warehouses, an organization must
know how many cases are in a data set, what variables are included, how many missing
values there are and what general hypotheses the data is likely to support. An initial
exploration of the data set can help answer these questions by familiarizing analysts with the
We divided the data 80:20 into training and testing sets, respectively.
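The exploration questions above (number of cases, variables, missing values) and the 80:20 split can be sketched with pandas and scikit-learn. The column names `rooms`, `age` and `price` and the synthetic values are assumptions made for illustration, not the project's actual dataset.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Small synthetic table standing in for the housing data (assumption).
rng = np.random.default_rng(42)
df = pd.DataFrame({
    "rooms": rng.normal(6.0, 0.7, size=100),
    "age": rng.uniform(5.0, 100.0, size=100),
})
df["price"] = 5.0 * df["rooms"] - 0.05 * df["age"] + rng.normal(0.0, 1.0, size=100)
df.loc[[3, 17], "rooms"] = np.nan  # inject two missing values for the demo

# Basic exploration: size, missing values per column, summary statistics.
print(df.shape)
print(df.isnull().sum())
print(df.describe())

# 80:20 split into training and testing sets.
# Handling of the missing values is outside the scope of this sketch.
X = df.drop(columns="price")
y = df["price"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
print(len(X_train), len(X_test))
```

Fixing `random_state` makes the split reproducible, so repeated runs compare models on the same test rows.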
Data Visualization
By using visual elements like charts, graphs, and maps, data visualization tools provide an
accessible way to see and understand trends, outliers, and patterns in data. In the
world of Big Data, data visualization tools and technologies are essential to analyse
Data Selection
Data selection is defined as the process of determining the appropriate data type and
source, as well as suitable instruments to collect data. Data selection precedes the
actual practice of data collection. This definition distinguishes data selection from
selective data reporting (selectively excluding data that is not supportive of a research
hypothesis) and interactive/active data selection (using collected data for monitoring
The primary objective of data selection is the determination of appropriate data type,
the nature of the investigation, existing literature, and accessibility to necessary data
sources.
Correlation Scatter Matrix
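A correlation matrix of this kind can be computed with pandas, using the Pearson coefficient named in the Methodology. The sketch below uses synthetic data with hypothetical column names (`rooms`, `age`, `price`); it does not reproduce the figure from the original report.

```python
import numpy as np
import pandas as pd

# Synthetic features and price (assumption, for illustration only).
rng = np.random.default_rng(1)
n = 200
rooms = rng.normal(6.0, 0.7, size=n)
age = rng.uniform(5.0, 100.0, size=n)
price = 5.0 * rooms - 0.05 * age + rng.normal(0.0, 1.0, size=n)
df = pd.DataFrame({"rooms": rooms, "age": age, "price": price})

# Pairwise Pearson correlation between all numeric columns.
corr = df.corr(method="pearson")

# Correlation of each feature with the target, strongest positive first.
print(corr["price"].sort_values(ascending=False))
```

For the graphical version, `pandas.plotting.scatter_matrix(df)` draws the matching grid of pairwise scatter plots.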
Data Transformation
The log transformation can be used to make highly skewed distributions less skewed. This
can be valuable both for making patterns in the data more interpretable and for helping to
It is hard to discern a pattern in the upper panel whereas the strong relationship is shown
clearly in the lower panel. The comparison of the means of log-transformed data is actually a
comparison of geometric means. This occurs because the anti-log of the
arithmetic mean of log-transformed values is the geometric mean.
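Both effects of the log transformation, the reduced skew and the link to geometric means, can be checked numerically. The sketch below uses a synthetic log-normal sample and a hand-written `skewness` helper; both are illustrative assumptions and not part of the project.

```python
import numpy as np

# Right-skewed synthetic sample (assumption, for illustration only).
rng = np.random.default_rng(0)
x = rng.lognormal(mean=3.0, sigma=1.0, size=1000)

def skewness(a):
    """Sample skewness: third central moment over variance to the power 1.5."""
    a = np.asarray(a, dtype=float)
    centered = a - a.mean()
    return float(np.mean(centered ** 3) / np.mean(centered ** 2) ** 1.5)

log_x = np.log(x)
print("skew before:", skewness(x))
print("skew after: ", skewness(log_x))

# Geometric mean computed directly as the n-th root of the product ...
geo_mean_direct = float(np.prod(x ** (1.0 / len(x))))
# ... and as the anti-log of the arithmetic mean of the logs.
geo_mean_via_logs = float(np.exp(log_x.mean()))
print(geo_mean_direct, geo_mean_via_logs)
```

The two geometric-mean values agree up to floating-point error, which is why comparing means on the log scale amounts to comparing geometric means on the original scale.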
LANGUAGE AND MODELS USED
Python
Pandas
NumPy
Matplotlib
scikit-learn
Anaconda
Jupyter notebook
MODELS USED
Regression Model
• It is mostly used for finding out the relationship between variables and forecasting.
• A decision tree can be used to fit a sine curve with additional noisy observations. As a result, it
learns local linear regressions approximating the sine curve.
• Bagging, in the Random Forest method, involves training each decision tree on a
different data sample where sampling is done with replacement.
• The basic idea behind this is to combine multiple decision trees in determining the
final output rather than relying on individual decision trees.
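The three model families described above can be compared along the following lines. This is a sketch under stated assumptions: the data is a synthetic noisy sine curve rather than the housing dataset, the hyperparameters are scikit-learn defaults apart from `n_estimators` and `random_state`, and the labels LR, DTR and RFR are used only as dictionary keys.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Noisy sine curve as a stand-in regression problem (assumption).
rng = np.random.default_rng(0)
X = rng.uniform(0.0, 5.0, size=(400, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.2, size=400)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

models = {
    "LR": LinearRegression(),
    "DTR": DecisionTreeRegressor(random_state=0),
    # By default each tree is trained on a bootstrap sample, that is,
    # rows drawn with replacement (bagging), and predictions are averaged.
    "RFR": RandomForestRegressor(n_estimators=100, random_state=0),
}

rmse = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    rmse[name] = float(np.sqrt(mean_squared_error(y_test, pred)))
    print(f"{name}: RMSE = {rmse[name]:.4f}")
```

Which model scores best depends on the data, so the ranking on this toy problem says nothing about the ranking on the housing dataset.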
RESULTS AND DISCUSSIONS
The Random Forest Regression model displayed the best performance on this dataset and can
be used for deployment.
The Decision Tree Regressor and Linear Regression models are far behind, so they cannot be
recommended for deployment.
[Figure: RMSE bar graph for the LR, DTR and RFR models. Values shown on the chart: 4.14796932816945, 4.14512470542556, 2.90346586503298.]
SCREENSHOTS OF THE PROJECT
Train-Test splitting
Predicting the price
Future scope and further enhancement of the Project
Since this project was built using machine learning, it can be
further enhanced using more advanced machine learning and data analysis technologies.
Conclusion
Our aim is achieved, as we have successfully met all the objectives listed in the
Aim section. Circle rate is seen to be the most effective attribute in predicting the
house price, and the Random Forest Regression model is the most effective model for our
dataset.
References/Bibliography
• https://scikit-learn.org/
• http://stackoverflow.com/