
Project Report Gr-12

This document presents a project report for developing a house price predictor model. It includes the names and roll numbers of the 5 students working on the project under the guidance of Prof. Bidisha Patra. The aim is to create an effective price prediction model using machine learning algorithms and validate the model's accuracy. Multiple regression models will be tested and evaluated based on their RMSE to identify the best performing model. The document outlines the methodology, data processing steps, evaluation metrics and computer specifications used for the project.

Uploaded by

Samik Dey

HOUSE PRICE PREDICTOR

A Project Report for Major Project

NAME UNIVERSITY ROLL NO.


I. Samik Dey 29401219023
II. Swapnomoy Ghosh 29401219035
III. Abhijit Chakraborty 29401219055
IV. Sattick Nag 29401219057
V. Piyali Mondal 29401219051

Under the guidance of


Prof. Bidisha Patra

FUTURE INSTITUTE OF
ENGINEERING AND MANAGEMENT

CERTIFICATE OF THE PROJECT WORK

We hereby certify that the work presented in the Major Project Report entitled “HOUSE PRICE PREDICTOR”, in partial fulfilment of the requirements for the award of the Bachelor of Computer Application and submitted to the Department of BCA of Future Institute of Engineering and Management, Kolkata, WB, is an authentic record of our own work carried out during the period from 15.05.2022 to 14.06.2022 under the supervision of Prof. Bidisha Patra. The matter presented in this thesis has not been submitted by us for the award of any other degree elsewhere.

Name & Signature of the Candidate(s)

29401219023

29401219035

29401219055

29401219057

29401219051
This is to certify that the above statement made by the candidates is correct to the best of my
knowledge.

Signature of the Supervisor

Name of Supervisor
[Designation of Supervisor]
Date

[Signature of the Head of the Department]

[Signature of the Panel members]

Acknowledgement

I take this opportunity to express my deep gratitude and sincerest thanks to my project mentor, Prof. Bidisha Patra, for the most valuable suggestions, helpful guidance and encouragement in the execution of this project work.

I would like to give special recognition to my colleagues. Last but not least, I am grateful to all the faculty members of our department for their support.

TABLE OF CONTENTS

Chapter Number   Contents

Abstract
1       INTRODUCTION
1.1     AIM and IMPORTANCE
1.1.1   Aim
1.1.2   Need and Motivation
1.1.3   Methodology
1.1.4   Evaluation metrics, Computer specifications

2       DATASET
2.1.1   Data Exploration
2.1.2   Data Visualization
2.1.3   Data Selection
2.1.4   Data Transformation

3       LANGUAGE AND MODELS USED
3.1     Python
3.1.1   Jupyter Notebook
3.1.2   NumPy
3.1.3   Pandas
3.1.4   Matplotlib
3.2     Models Used
3.2.1   Multiple Linear Regression
3.2.2   Decision Tree Regressor
3.2.3   Random Forest Regressor

4       RESULTS AND DISCUSSIONS
4.1     BEST SUITED MODEL

5       SCREENSHOTS OF THE PROJECT

6       Future scope and further enhancement of the Project, Conclusion, Bibliography

Abstract

House Price Index (HPI) is commonly used to estimate changes in housing prices. Since the price of a house is strongly correlated with other factors such as location, area and population, information apart from HPI is required to predict individual housing prices. A considerably large number of papers have adopted traditional machine learning approaches to predict housing prices accurately, but they are rarely concerned with the performance of individual models and neglect the less popular yet complex models. As a result, to explore the impact of various features on prediction methods, this paper applies both traditional and advanced machine learning approaches to investigate the differences among several advanced models. This paper also comprehensively validates multiple techniques in model implementation on regression and provides an optimistic result for housing price prediction.

INTRODUCTION

AIM and IMPORTANCE

Aim

These are the parameters on which we will evaluate ourselves:

• Create an effective price prediction model

• Validate the model’s prediction accuracy

• Identify the important home price attributes which feed the model’s predictive power.

Need and Motivation

Having lived in India for so many years, if there is one thing that I had been taking for granted, it is that housing and rental prices continue to rise. Since the housing crisis of 2008, housing prices have recovered remarkably well, especially in major housing markets. However, in the 4th quarter of 2016, I was surprised to read that Kolkata housing prices had fallen the most in the last 4 years. In fact, median resale prices for condos and coops fell 6.3%, marking the first time there was a decline since Q1 of 2017. The decline has been partly attributed to political uncertainty domestically and abroad and the 2014 election. This model can therefore help maintain transparency for customers and make price comparison easy: if a customer finds that the price of a house listed on a website is higher than the price predicted by the model, they can reject that house.

Methodology

The experiment pre-processes the data and evaluates the prediction accuracy of the models. It has multiple stages that are required to get the prediction results. These stages can be defined as:

• Pre-processing: both datasets will be checked and pre-processed. Pre-processing methods handle data in various ways, so pre-processing is done over multiple iterations, and the accuracy is evaluated each time with the combination used.

• Data splitting: dividing the dataset into two parts is essential, so that the model is trained on one part and evaluated on the other. The dataset will be split 75% for training and 25% for testing.

• Evaluation: the accuracy on both datasets will be evaluated by measuring the R² and RMSE when training the model, alongside a comparison of the actual prices in the test dataset with the prices predicted by the model.

• Performance: alongside the evaluation metrics, the time required to train each model will be measured to show how the algorithms vary in training time.

• Correlation: the correlation between the available features and the house price will be evaluated using the Pearson correlation coefficient to identify whether each feature has a negative, positive or zero correlation with the house price.
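The data splitting and correlation stages above can be sketched in plain Python. This is only an illustration under stated assumptions: the helper names `train_test_split` and `pearson`, the fixed seed, and the toy `houses` records are hypothetical and are not taken from the project's notebook, which may well rely on library routines instead.

```python
import math
import random


def train_test_split(rows, test_fraction=0.25, seed=0):
    """Shuffle a copy of the rows and split them into (train, test)."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical toy records: (area, age, price) for each house.
houses = [(50, 30, 20.0), (60, 25, 24.5), (75, 20, 30.0), (80, 18, 33.0),
          (95, 12, 38.5), (110, 10, 45.0), (120, 5, 49.0), (140, 2, 58.0)]

# 75% of the rows go to training, 25% to testing.
train, test = train_test_split(houses, test_fraction=0.25)
print(len(train), len(test))  # 6 2

areas = [h[0] for h in houses]
ages = [h[1] for h in houses]
prices = [h[2] for h in houses]
print(pearson(areas, prices))  # strongly positive
print(pearson(ages, prices))   # strongly negative
```

A coefficient near +1 or -1 marks a feature that moves closely with (or against) the price, while a value near 0 suggests no linear relationship.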

Evaluation Metrics

The prediction accuracy will be evaluated by measuring the Root Mean Square Error (RMSE) of the model used in training. RMSE shows the typical size of the error between the actual and predicted data, which in this case are the house prices. It is expressed in the same units as the prices rather than as a percentage.
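As a minimal sketch of how this metric is computed, the snippet below implements RMSE together with R², which the Methodology also lists. The function names and the four sample prices are hypothetical and chosen only so the arithmetic is easy to follow.

```python
import math


def rmse(actual, predicted):
    """Root mean square error, in the same units as the target."""
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


def r2_score(actual, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - ss_res / ss_tot


# Hypothetical actual and predicted prices.
actual = [24.0, 21.5, 34.5, 30.0]
predicted = [26.0, 21.5, 32.5, 29.0]

# Errors are -2, 0, 2 and 1, so the mean squared error is 9 / 4 = 2.25.
print(rmse(actual, predicted))      # 1.5
print(r2_score(actual, predicted))  # about 0.913
```

A lower RMSE means predictions sit closer to the actual prices, while an R² closer to 1 means the model explains more of the variation in price.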

Computer Specifications

The time needed to train the model depends on the capability of the system used during the experiment. Some libraries use GPU resources rather than the CPU to shorten the time taken to train a model.

Component          Client Machine                  Server Machine

HDD                1 TB                            1 TB
Processor          2.30 GHz Dual-Core processor    1.6 GHz Quad-Core processor
Memory             4 GB                            8 GB
Operating System   Windows 10                      Windows 10

DATASET

We web-scraped the data from the “UCI Machine Learning Repository” website, which is a collection of databases, domain theories, and data generators that are used by the machine learning community for the empirical analysis of machine learning algorithms.

The dataset looks as follows:

Data Exploration

Data exploration is the first step in data analysis and typically involves summarizing the main

characteristics of a data set, including its size, accuracy, initial patterns in the data and other

attributes. It is commonly conducted by data analysts using visual analytics tools, but it can

also be done in more advanced statistical software, such as Python. Before it can conduct analysis on

data collected by multiple data sources and stored in data warehouses, an organization must

know how many cases are in a data set, what variables are included, how many missing

values there are and what general hypotheses the data is likely to support. An initial

exploration of the data set can help answer these questions by familiarizing analysts with the

data with which they are working.

We divided the data 8:2 for training and testing respectively.

Data Visualization

Data visualization is the graphical representation of information and data. By using

visual elements like charts, graphs, and maps, data visualization tools provide an

accessible way to see and understand trends, outliers, and patterns in data. In the

world of Big Data, data visualization tools and technologies are essential to analyse

massive amounts of information and make data-driven decisions.

Data Selection

Data selection is defined as the process of determining the appropriate data type and

source, as well as suitable instruments to collect data. Data selection precedes the

actual practice of data collection. This definition distinguishes data selection from

selective data reporting (selectively excluding data that is not supportive of a research

hypothesis) and interactive/active data selection (using collected data for monitoring

activities/events, or conducting secondary data analyses). The process of selecting

suitable data for a research project can impact data integrity.

The primary objective of data selection is the determination of appropriate data type,

source, and instrument(s) that allow investigators to adequately answer research

questions. This determination is often discipline-specific and is primarily driven by

the nature of the investigation, existing literature, and accessibility to necessary data

sources.

Correlation Scatter Matrix

Data Transformation

The log transformation can be used to make highly skewed distributions less skewed. This

can be valuable both for making patterns in the data more interpretable and for helping to

meet the assumptions of inferential statistics.

It is hard to discern a pattern in the upper panel whereas the strong relationship is shown

clearly in the lower panel. The comparison of the means of log-transformed data is actually a

comparison of geometric means. This occurs because, as shown below, the anti-log of the

arithmetic mean of log-transformed values is the geometric mean.
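The identity behind this statement is exp((1/n) Σ ln x_i) = (x_1 · x_2 · … · x_n)^(1/n). A small numerical check, using hypothetical skewed values that are not taken from the project dataset, might look like this:

```python
import math

# Hypothetical right-skewed values (each one ten times the previous).
values = [1.0, 10.0, 100.0, 1000.0]

# Log-transform, then take the ordinary arithmetic mean of the logs.
logs = [math.log(v) for v in values]
mean_of_logs = sum(logs) / len(logs)

# The anti-log of that mean ...
back_transformed = math.exp(mean_of_logs)

# ... matches the geometric mean, the n-th root of the product.
geometric_mean = math.prod(values) ** (1 / len(values))

# The arithmetic mean of the raw values is pulled up by the largest value.
arithmetic_mean = sum(values) / len(values)

print(back_transformed)  # about 31.62
print(geometric_mean)    # about 31.62
print(arithmetic_mean)   # 277.75
```

The gap between 31.62 and 277.75 illustrates why the log transformation reduces the influence of a few very large values in a skewed distribution.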

LANGUAGE AND MODELS USED

Python

Python is widely used in scientific and numeric computing:

• SciPy is a collection of packages for mathematics, science, and engineering.
• Pandas is a data analysis and modelling library.
• IPython is a powerful interactive shell that features easy editing and recording of a work session, and supports visualizations and parallel computing.
• The Software Carpentry Course teaches basic skills for scientific computing, running bootcamps and providing open-access teaching materials.

Libraries and software used for this project include:

• Pandas
• NumPy
• Matplotlib
• Scikit Learn
• Anaconda
• Jupyter notebook

MODELS USED

Regression Model

• Linear Regression is a machine learning algorithm based on supervised learning.

• It performs a regression task. Regression models a target prediction value based on independent variables.

• It is mostly used for finding out the relationship between variables and forecasting.
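To make the idea concrete, the sketch below fits a straight line by ordinary least squares with a single independent variable. The project itself uses multiple linear regression, so this is a deliberately reduced illustration. The `fit_line` helper and the area and price values are hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical data generated from price = 0.5 * area + 10.
areas = [40, 60, 80, 100]
prices = [30.0, 40.0, 50.0, 60.0]

slope, intercept = fit_line(areas, prices)
print(slope, intercept)        # 0.5 10.0

# Forecast the price of an unseen house with area 90.
print(slope * 90 + intercept)  # 55.0
```

The fitted slope expresses the relationship between the variable and the price, and the same equation is then reused for forecasting.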

Decision Tree Regressor Model


• Decision tree regression observes the features of an object and trains a model in the structure of a tree to predict future data and produce meaningful continuous output.

• For example, when a decision tree is fit to a sine curve with added noisy observations, it learns local, piecewise approximations of the sine curve.

Random Forest Regression Model

• A Random Forest is an ensemble technique capable of performing both regression and classification tasks with the use of multiple decision trees and a technique called Bootstrap Aggregation, commonly known as bagging.

• Bagging, in the Random Forest method, involves training each decision tree on a
different data sample where sampling is done with replacement.

• The basic idea behind this is to combine multiple decision trees in determining the
final output rather than relying on individual decision trees.
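The bagging idea described above can be sketched without any library. The example below is an assumption-laden simplification: each "tree" is a one-split stump on a single hypothetical feature, and the random feature selection that a full Random Forest also performs is left out. All function names and data are illustrative.

```python
import random


def fit_stump(xs, ys):
    """Fit a one-split regression tree (a stump) on a single feature."""
    best = None
    # Try every distinct value except the largest as a split threshold,
    # so that both sides of the split are never empty.
    for threshold in sorted(set(xs))[:-1]:
        left = [y for x, y in zip(xs, ys) if x <= threshold]
        right = [y for x, y in zip(xs, ys) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((y - left_mean) ** 2 for y in left)
               + sum((y - right_mean) ** 2 for y in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    if best is None:  # all x values identical: predict the overall mean
        mean_y = sum(ys) / len(ys)
        return (float("inf"), mean_y, mean_y)
    return best[1:]


def predict_stump(stump, x):
    threshold, left_mean, right_mean = stump
    return left_mean if x <= threshold else right_mean


def fit_bagged_stumps(xs, ys, n_trees=25, seed=0):
    """Train each stump on a bootstrap sample drawn with replacement."""
    rng = random.Random(seed)
    n = len(xs)
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]
        forest.append(fit_stump([xs[i] for i in idx], [ys[i] for i in idx]))
    return forest


def predict_bagged(forest, x):
    """Combine the trees by averaging their individual predictions."""
    return sum(predict_stump(s, x) for s in forest) / len(forest)


# Hypothetical data: low prices for small x, high prices for large x.
xs = list(range(1, 11))
ys = [10.0] * 5 + [30.0] * 5

forest = fit_bagged_stumps(xs, ys)
low = predict_bagged(forest, 3)
high = predict_bagged(forest, 8)
print(low, high)
```

Because every stump sees a different resample of the data, the individual split points differ, and averaging them smooths out the dependence on any single tree.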

RESULTS AND DISCUSSIONS

Best Suited Model

Our study showed that the Random Forest Regression Model displayed the best performance for this dataset and can be used for deployment.

The Decision Tree Regressor Model and Linear Regression are far behind, so they cannot be recommended for deployment.

[Figure: RMSE bar graph for LR, DTR and RFR; bar labels 4.14796932816945, 4.14512470542556 and 2.90346586503298]

SCREENSHOTS OF THE PROJECT

Train-Test splitting

Selecting a desired model

Testing the model on test data

Predicting the price

Future scope and further enhancement of the Project

Since this project uses Machine Learning, it can be further enhanced using more advanced Machine Learning and Data Analysis technologies.

Conclusion

Our aim is achieved, as we have successfully met all the parameters listed in the Aim section. Circle rate is seen to be the most effective attribute in predicting the house price, and the Random Forest Regression Model is the most effective model for our dataset, with a final RMSE score of 2.9034658650329894.

References/Bibliography

• UCI Machine Learning Repository

• https://scikit-learn.org/

• Yuxi (Hayden) Liu, Python Machine Learning By Example

• http://stackoverflow.com/
