
A PROJECT REPORT ON

House price prediction


SUBMITTED IN PARTIAL FULFILMENT FOR THE AWARD OF THE

BACHELOR OF DATA SCIENCE BATCH 2020 – 2023

PROJECT SUBMITTED BY:


B. DHARANI - 107020539006
D. VARALAKSHMI - 107020539012
G. DIVYA - 107020539015
Y. PAVAN KALYAN - 107020539042

UNDER THE GUIDANCE OF
D NEETHA, MCA

(Department of Computer Science)

BABU JAGJIVAN RAM GOVERNMENT DEGREE COLLEGE


(AFFILIATED TO OSMANIA UNIVERSITY)

CERTIFICATE

This is to certify that the project titled “HOUSE PRICE PREDICTION” is a

bonafide work done and submitted in partial fulfillment of the Bachelor's degree
in Data Science by B DHARANI – 107020539006, D VARALAKSHMI – 107020539012,
G DIVYA – 107020539015 and Y PAVAN KALYAN – 107020539042 under the guidance of
D NEETHA, MCA (Department of Computer Science). To the best of our
knowledge, the matter presented in this project report has not been submitted
earlier to any other university.

Signature of Guide External Examiner


ACKNOWLEDGEMENT

Apart from our own efforts, the success of any project depends largely on
the encouragement and guidance of many others. We take this opportunity to
express our gratitude to the people who have been instrumental in the
successful completion of this project.

We are greatly proud of our BJR GOVERNMENT DEGREE COLLEGE for
providing us the needful education. We thank our Principal DR. P. V.
GEETHA LAKSHMI PATNAIK and our Head of the Department DR.
SAMBASHIVA RAO for their cordial support and for giving us permission
to use all the required equipment and the necessary material to complete the
project.

We would like to extend our sincerest gratitude to our lecturer NEETHA
ma'am for her guidance and supervision, for providing the necessary
information regarding the project, and for her support in completing the
project.

Finally, we also extend our heartiest thanks to our parents, friends
and well-wishers for being with us and extending encouragement
throughout the project.

Abstract
House price forecasting is an important topic in real estate. The literature attempts to
derive useful knowledge from historical data of property markets. Machine learning
techniques are applied to analyze historical property transactions in India to discover useful
models for house buyers and sellers. The analysis reveals a high discrepancy between house
prices in the most expensive and most affordable suburbs of the city of Bangalore. Moreover,
experiments demonstrate that Multiple Linear Regression based on mean squared error
measurement is a competitive approach.

Table of Contents

House price prediction
SUBMITTED IN PARTIAL FULFILMENT FOR THE AWARD OF THE
BACHELOR OF DATA SCIENCE BATCH 2020 – 2023
BABU JAGJIVAN RAM GOVERNMENT DEGREE COLLEGE
CERTIFICATE
ACKNOWLEDGEMENT
1 INTRODUCTION
  1.1 AIM AND IMPORTANCE
  1.2 DRAWBACKS OF MODEL
  1.3 ADVANTAGES OF MODEL
  1.4 FEASIBILITY STUDY
  SYSTEM SPECIFICATION
    HARDWARE REQUIREMENTS
    SOFTWARE REQUIREMENTS
2 LANGUAGE AND MODELS USED
  HTML
  CSS
  JAVA SCRIPT
  Libraries Used for this Project
    PANDAS
    NUMPY
    MATPLOTLIB
    SKLEARN
  MODELS USED
3 Folder structure of project
4 DATASET
5 EXPLORATORY DATA ANALYSIS
6 DATA PREPROCESSING
  REMOVING OUTLIERS
7 LINEAR REGRESSION
8 K FOLD CROSS VALIDATION
9 FLASK
10 SOURCE CODE
  10.1 Python ML code
  10.2 Util.py
  10.3 server.py
  10.4 HTML
  10.5 CSS
  10.6 JAVA SCRIPT
11 Conclusion
12 REFERENCES
1 INTRODUCTION

1.1 AIM and IMPORTANCE

Aim

These are the parameters on which we will evaluate ourselves:

• Identify the important home price attributes which feed the model’s predictive
power.

• Create an effective price prediction model

• Validate the model’s prediction accuracy

Need and Motivation


Having lived in India for many years, if there is one thing that I had been taking for
granted, it is that housing and rental prices continue to rise. Since the housing crisis of 2008,
housing prices have recovered remarkably well, especially in major housing markets.
However, in the fourth quarter of 2016, I was surprised to read that Bombay housing prices had
fallen the most in the last four years. In fact, median resale prices for condos and co-ops fell
6.3%, marking the first such decline since Q1 of 2017. The decline has been partly attributed
to political uncertainty, domestic and abroad, and the 2014 election. A model like ours helps
maintain transparency for customers and makes comparison easy: if we have a website which
tells us the price of a house given certain features like area, number of bedrooms, etc., and a
customer finds the asking price of a house higher than the price predicted by the model,
he can reject that house.

1.2 DRAWBACKS OF MODEL

Real estate is a field which always keeps updating. Prices are not constant; they always
change, and there may be many different factors behind those changes, so we cannot take all
the factors into consideration. We built the model on basic factors like area, number of
bedrooms, etc., which are the factors everyone cares about. Our model is trained on a static
snapshot of data, so it predicts based on that data.

1.3 ADVANTAGES OF MODEL

This model can be very useful for anyone who wants to buy a house in Bangalore city.

 It can be helpful to people who are planning to buy a house, letting them know the
expected price range of the house.

 It is also helpful to people who want to sell a house; they can know the price range
they can ask for their house.

 It helps drive real estate efficiency.

 We can analyse which locations have higher prices and which locations have lower prices.

1.4 FEASIBILITY STUDY

The feasibility analysis is an analytical program through which the project manager
determines the project's success ratio and sees whether the project is good enough to release
into the market. The key considerations involved in the feasibility analysis are:

• Economic Feasibility

• Technical Feasibility

• Operational Feasibility
Economic Feasibility

Economic feasibility is the cost and logistical outlook for a business project or
endeavor. Prior to embarking on a new venture, most businesses conduct an economic
feasibility study, which is a study that analyzes data to determine whether the cost of the
prospective new venture will ultimately be profitable to the company. Economic feasibility is
sometimes determined within an organization, while other times companies hire an external
company that specializes in conducting economic feasibility studies for them.
Since there is no cost involved in using this model, it is economically feasible.

Technical Feasibility

Technical feasibility can be described as the formal process of assessing whether
it is technically possible to manufacture a product or service. Before launching a new offering or
taking up a client project, it is essential to plan and prepare for every step of the operation.
Technical feasibility helps determine the efficacy of the proposed plan by analyzing the process,
including tools, technology, material, labour and logistics.

The software technologies used are Python, Flask, HTML, CSS and JavaScript. It is possible to
update the system in the future, and no special hardware is required for using this model. Hence this
is technically feasible.

Operational Feasibility

Operational feasibility is the measure of how well a proposed system solves the
problems, and takes advantage of the opportunities identified during scope definition and how it
satisfies the requirements identified in the requirements analysis phase of system development.
The operational feasibility assessment focuses on the degree to which the proposed
development project fits in with the existing business environment and objectives with regard to
development schedule, delivery date, corporate culture and existing business processes.
To ensure success, desired operational outcomes must be imparted during design and
development. These include such design-dependent parameters as reliability, maintainability,
supportability, usability, producibility, disposability, sustainability, affordability and others. These
parameters are required to be considered at the early stages of design if desired operational
behaviours are to be realized. A system design and development requires appropriate and
timely application of engineering and management efforts to meet the previously mentioned
parameters.
A system may serve its intended purpose most effectively when its technical and
operating characteristics are engineered into the design. Therefore, operational feasibility is a
critical aspect of systems engineering that needs to be an integral part of the early design
phases.
There are no advanced features on this website; the user only needs to select the appropriate
values, so even a non-technical person can use this model.

SYSTEM SPECIFICATION

HARDWARE REQUIREMENTS

• System : Core i5
• Hard disk : 500 GB
• RAM : 4 GB

SOFTWARE REQUIREMENTS

Software is a group of programs that a computer needs to do a particular task. It is an
essential requirement of a computer system.

The software used to develop the project is:

• Operating system : Windows 10
• Python 3.11
• NUMPY
• PANDAS
• SCIKIT-LEARN
• MATPLOTLIB
• FLASK
2 LANGUAGE AND MODELS USED

Python

Python is a popular programming language which is used in various fields like data science,
machine learning, data analytics, artificial intelligence and backend development.
Python is an interpreted, object-oriented, high-level programming language with dynamic
semantics. Its high-level built in data structures, combined with dynamic typing and dynamic
binding, make it very attractive for Rapid Application Development, as well as for use as a
scripting or glue language to connect existing components together. Python's simple, easy to
learn syntax emphasizes readability and therefore reduces the cost of program maintenance.
Python supports modules and packages, which encourages program modularity and code
reuse. The Python interpreter and the extensive standard library are available in source or
binary form without charge for all major platforms, and can be freely distributed.

It contains various modules

 NUMPY
It is used for numerical analysis and to create multi-dimensional arrays.

 PANDAS
It is used for data manipulation, data preprocessing and data exploration.

 MATPLOTLIB
It is used for plotting different types of charts.

 SCIKIT-LEARN
It is used for developing various kinds of machine learning models.

 TENSORFLOW
It is used for developing deep learning models.
HTML

HTML is an acronym which stands for Hyper Text Markup Language which is used for
creating web pages and web applications. Let's see what is meant by Hypertext Markup
Language, and Web page.

 Hyper Text: HyperText simply means "text within text". A text that has a link within it is
hypertext. Whenever you click on a link which brings you to a new webpage, you have
clicked on a hypertext. HyperText is a way to link two or more web pages (HTML
documents) with each other.

 Markup language: A markup language is a computer language that is used to apply


layout and formatting conventions to a text document. Markup language makes text
more interactive and dynamic. It can turn text into images, tables, links, etc.

 Web Page: A web page is a document which is commonly written in HTML and
translated by a web browser. A web page can be identified by entering a URL. A web
page can be of the static or dynamic type. With the help of HTML alone, we can create
static web pages.

Hence, HTML is a markup language which is used for creating attractive web pages with
the help of styling, so that they look nicely formatted in a web browser. An HTML document is
made of many HTML tags and each HTML tag contains different content.

CSS

CSS stands for Cascading Style Sheets. It is a style sheet language which is used to
describe the look and formatting of a document written in markup language. It provides an
additional feature to HTML. It is generally used with HTML to change the style of web pages
and user interfaces. It can also be used with any kind of XML documents including plain XML,
SVG and XUL.

CSS is used along with HTML and JavaScript in most websites to create user interfaces
for web applications and user interfaces for many mobile applications.

o You can add new looks to your old HTML documents.


o You can completely change the look of your website with only a few changes in CSS
code.
JAVA SCRIPT

JavaScript is a scripting or programming language that allows you to implement complex
features on web pages. Every time a web page does more than just sit there and display static
information for you to look at (displaying timely content updates, interactive maps, animated 2D/3D
graphics, scrolling video jukeboxes, etc.), you can bet that JavaScript is probably involved. It is the
third layer of the layer cake of standard web technologies, two of which (HTML and CSS) are
covered above.

The core client-side JavaScript language consists of some common programming features that
allow you to do things like:

 Store useful values inside variables. For instance, we might ask for a new name to be
entered and then store that name in a variable called name.
 Operations on pieces of text (known as "strings" in programming). For example, we can
take the string "Player 1: " and join it to the name variable to create the complete
text label, e.g. "Player 1: Chris".
 Running code in response to certain events occurring on a web page. For example, a
click event can be used to detect when a label is clicked and then run code that
updates the label's text.

What is even more exciting however is the functionality built on top of the client-side
JavaScript language. So called Application Programming Interfaces (APIs) provide you with
extra superpowers to use in your JavaScript code.

Libraries Used for this Project include

• PANDAS
• NUMPY
• MATPLOTLIB
• SCIKIT-LEARN
• FLASK
PANDAS

Pandas is an open-source library that is made mainly for working with relational or
labeled data both easily and intuitively. It provides various data structures and operations for
manipulating numerical data and time series. This library is built on top of the NumPy library.
Pandas is fast and it has high performance & productivity for users.

Advantages

 Fast and efficient for manipulating and analyzing data.

 Data from different file objects can be loaded.

 Easy handling of missing data (represented as NaN) in floating point as well as
non-floating point data.

 Size mutability: columns can be inserted and deleted from DataFrame and
higher dimensional objects.

 Data set merging and joining.

 Flexible reshaping and pivoting of data sets

 Provides time-series functionality.

 Powerful group by functionality for performing split-apply-combine operations

Installation
pip install pandas

NUMPY

An array in NUMPY is a table of elements (usually numbers), all of the same type, indexed
by a tuple of positive integers. In NUMPY, the number of dimensions of the array is called the rank
of the array. A tuple of integers giving the size of the array along each dimension is known as the
shape of the array. The array class in NUMPY is called ndarray. Elements in NUMPY arrays
are accessed by using square brackets and can be initialized by using nested Python lists.
Installation
pip install numpy

MATPLOTLIB

MATPLOTLIB is an amazing visualization library in Python for 2D plots of arrays.
MATPLOTLIB is a multi-platform data visualization library built on NUMPY arrays and designed
to work with the broader SCIPY stack. It was introduced by John Hunter in the year 2002. One
of the greatest benefits of visualization is that it gives us visual access to huge amounts of data
in easily digestible visuals. MATPLOTLIB provides several plot types like line, bar, scatter,
histogram, etc.

Installation: for Windows, Linux and macOS, MATPLOTLIB and most of its dependencies are
available as wheel packages. Run the following command to install the MATPLOTLIB package:

pip install matplotlib

SKLEARN

SCIKIT-LEARN (SKLEARN) is the most useful and robust library for machine learning in
Python. It provides a selection of efficient tools for machine learning and statistical modeling,
including classification, regression, clustering and dimensionality reduction, via a consistent
interface in Python. This library, which is largely written in Python, is built upon NUMPY, SCIPY
and MATPLOTLIB.

Installation:
pip install scikit-learn

MODELS USED

• Linear Regression Model


• Decision Tree Regressor Model
• Lasso Regression Model
• Ridge Regression
• SVM Regressor
Linear Regression Model

• Linear Regression is a machine learning algorithm based on supervised learning.

• It performs a regression task. Regression models a target prediction value based on
independent variables.

• It is mostly used for finding out the relationship between variables and forecasting.
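
As an illustration, a minimal sketch of fitting a linear regression with scikit-learn on toy data (the numbers below are made up, not taken from the project's dataset):

import numpy as np
from sklearn.linear_model import LinearRegression

# toy data: price (in lakhs) roughly proportional to area in sqft
X = np.array([[600], [850], [1000], [1250], [1500]])
y = np.array([30, 45, 52, 65, 80])

model = LinearRegression()
model.fit(X, y)

print(model.coef_, model.intercept_)   # learned slope and intercept
print(model.predict([[1100]]))         # predicted price for an 1100 sqft house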

DECISION TREE REGRESSOR MODEL

A Decision Tree is a decision-making tool that uses a flowchart-like tree structure; it is a
model of decisions and all of their possible results, including outcomes, input costs, and utility.
The decision-tree algorithm falls under the category of supervised learning algorithms. It works for
both continuous as well as categorical output variables.
The branches/edges represent the result of a node, and the nodes hold either:

1. Conditions [Decision Nodes]

2. Results [End Nodes]

The branches/edges represent the truth or falsity of the statement, and the tree makes a
decision based on that. A typical illustration is a decision tree that evaluates the smallest of
three numbers.
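
For regression, a minimal sketch of a regression tree with scikit-learn on toy data (the feature values below are illustrative, not from the project's dataset):

import numpy as np
from sklearn.tree import DecisionTreeRegressor

# toy data: price (in lakhs) as a function of area in sqft and number of bedrooms
X = np.array([[600, 1], [850, 2], [1000, 2], [1250, 3], [1500, 3], [2000, 4]])
y = np.array([30, 45, 52, 65, 80, 110])

tree = DecisionTreeRegressor(max_depth=3, random_state=0)
tree.fit(X, y)

print(tree.predict([[1100, 2]]))   # predicted price for a 1100 sqft, 2 BHK house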
LASSO REGRESSION MODEL

Lasso Regression is another linear model derived from Linear Regression which
shares the same hypothetical function for prediction. The cost function of Linear Regression is
represented by J.

A Linear Regression model considers all the features equally relevant for prediction. When
there are many features in the dataset, some of them may not be relevant for the
predictive model. This makes the model more complex, with inaccurate predictions on the
test set (overfitting). Such a model with high variance does not generalize to new data.
So Lasso Regression comes to the rescue. It introduces an L1 penalty (equal to the
absolute value of the magnitude of the weights) into the cost function of Linear Regression. Lasso
Regression performs both variable selection and regularization.

Mathematical Intuition:

During gradient descent optimization, the added L1 penalty shrinks weights to zero or
close to zero. The weights which are shrunk to zero eliminate the corresponding features from the
hypothetical function. Due to this, irrelevant features do not participate in the predictive model.
This penalization of weights makes the hypothesis simpler, which encourages sparsity
(a model with few parameters). If an intercept is added, it remains unchanged.
We can control the strength of regularization through the hyperparameter lambda.

Different cases for tuning values of lambda:

1. If lambda is set to 0, Lasso Regression equals Linear Regression.

2. If lambda is set to infinity, all weights are shrunk to zero.

If we increase lambda, bias increases; if we decrease lambda, variance increases. As
lambda increases, more and more weights are shrunk to zero, eliminating features from the
model.
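
A minimal sketch with scikit-learn on synthetic data, showing how the L1 penalty drives some coefficients exactly to zero (alpha here corresponds to the lambda discussed above):

from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso

# synthetic data with 10 features, of which only 3 are informative
X, y = make_regression(n_samples=200, n_features=10, n_informative=3,
                       noise=5.0, random_state=0)

lasso = Lasso(alpha=1.0)
lasso.fit(X, y)

# several coefficients come out exactly zero: implicit feature selection
print(lasso.coef_)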
Ridge Regression

A Ridge regressor is basically a regularized version of a Linear Regressor, i.e., to the
original cost function of the linear regressor we add a regularized term that forces the learning
algorithm to fit the data while keeping the weights as low as possible. The regularized term
has the parameter 'alpha' which controls the regularization of the model, i.e., it helps in reducing
the variance of the estimates.
The cost function for the Ridge regressor is the usual mean squared error plus an L2 penalty,
i.e. alpha times the sum of the squared weights.

Mathematical Intuition:

During gradient descent optimization of its cost function, the added L2 penalty term
reduces the weights of the model to zero or close to zero. Due to the penalization of weights,
our hypothesis gets simpler, more generalized, and less prone to overfitting. The weights are
shrunk towards zero by an amount controlled by lambda, so we can control the strength of
regularization through the hyperparameter lambda.
Different cases for tuning values of lambda:
1. If lambda is set to 0, Ridge Regression equals Linear Regression.
2. If lambda is set to infinity, all weights are shrunk to zero.

So, we should set lambda somewhere in between 0 and infinity.
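
A minimal sketch with scikit-learn on synthetic data (alpha corresponds to the lambda discussed above):

from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge

X, y = make_regression(n_samples=200, n_features=10, noise=5.0, random_state=0)

ridge = Ridge(alpha=1.0)   # strength of the L2 penalty
ridge.fit(X, y)

# weights are shrunk towards zero but, unlike Lasso, rarely become exactly zero
print(ridge.coef_)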

SVM REGRESSOR (SVR)

Support Vector Regression (SVR) is a type of machine learning algorithm used for

regression analysis. The goal of SVR is to find a function that approximates the relationship

between the input variables and a continuous target variable, while minimizing the prediction

error.
Unlike Support Vector Machines (SVMs) used for classification tasks, SVR seeks to find a

hyperplane that best fits the data points in a continuous space. This is achieved by mapping the

input variables to a high-dimensional feature space and finding the hyperplane that maximizes

the margin (distance) between the hyperplane and the closest data points, while also minimizing

the prediction error.

SVR can handle non-linear relationships between the input variables and the target variable by

using a kernel function to map the data to a higher-dimensional space. This makes it a powerful

tool for regression tasks where there may be complex relationships between the input variables

and the target variable.

Support Vector Regression (SVR) uses the same principle as SVM, but for regression problems.

Let’s spend a few minutes understanding the idea behind SVR.

The Idea Behind Support Vector Regression

The problem of regression is to find a function that approximates mapping from an input domain

to real numbers on the basis of a training sample. So let’s now dive deep and understand how

SVR works actually.


Consider two red lines as the decision boundary and a green line as the
hyperplane. Our objective, when working with SVR, is to consider only the
points that are within the decision boundary lines. Our best fit line is the hyperplane that
contains a maximum number of points.

The first thing to understand is the decision boundary (the red lines mentioned above).
Consider these lines as being at some distance, say 'a', from the hyperplane; they are the
lines that we draw at distance '+a' and '-a' from the hyperplane. This 'a' is usually
referred to as epsilon.

Assuming that the equation of the hyperplane is as follows:

Y = wx + b (equation of the hyperplane)

Then the equations of the decision boundaries become:

wx + b = +a
wx + b = -a

Thus, any hyperplane that satisfies our SVR should satisfy:

-a < Y - (wx + b) < +a
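
A minimal sketch of SVR with scikit-learn on toy data (the parameter values are illustrative):

import numpy as np
from sklearn.svm import SVR

# toy one-dimensional regression problem
X = np.linspace(0, 10, 50).reshape(-1, 1)
y = np.sin(X).ravel()

# epsilon plays the role of the 'a' above; the RBF kernel handles non-linearity
svr = SVR(kernel='rbf', C=10.0, epsilon=0.1)
svr.fit(X, y)

print(svr.predict([[2.5]]))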

GRID SEARCH CV

Grid searching is a method to find the best possible combination of hyperparameters at
which the model achieves the highest accuracy. Before applying grid searching on any
algorithm, the data is divided into a training and a validation set; the validation set is used to
validate the models. A model with all possible combinations of hyperparameters is tested on
the validation set to choose the best combination.
Grid searching can be applied to any algorithm whose performance can be improved by
tuning its hyperparameters. For example, we can apply grid searching on K-Nearest Neighbors
by validating its performance on a set of values of K. We can do the same thing with Logistic
Regression by using a set of values of the learning rate to find the learning rate at which
Logistic Regression achieves the best accuracy. A short sketch is shown below.
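
As a small, self-contained sketch (on synthetic data, not the project's dataset), grid searching the number of neighbours K for a K-Nearest Neighbors regressor could look like this:

from sklearn.datasets import make_regression
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsRegressor

# synthetic regression data standing in for a real dataset
X, y = make_regression(n_samples=300, n_features=5, noise=10.0, random_state=0)

# each value of K is evaluated with 5-fold cross validation
param_grid = {'n_neighbors': [3, 5, 7, 9, 11]}
gs = GridSearchCV(KNeighborsRegressor(), param_grid, cv=5)
gs.fit(X, y)

print(gs.best_params_, gs.best_score_)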
3 Folder structure of project

Front end of the website of our model

Flask code

• Artifacts folder contains the json file of the names of the columns and pickle file of our
model.

• Util.py file contains all the utility functions.

• Client folder contains the code of html, css and java script.

• Model folder contains the source code of model .ipynb file.

• Requirements is a text file which contains all the important packages required to run
the model.
4 DATASET

We took the data from the Kaggle website, which hosts several datasets for machine
learning.

Link for the dataset: https://www.kaggle.com/amitabhajoy/bengaluru-house-price-data

Our data contains information about Bangalore houses only.

The dataset looks as follows:

This dataset contains 9 columns; of these, bath, balcony and price are numerical columns
and the remaining columns are categorical features. It contains 13,247 rows. The size column is
also numerical, but its values end with "BHK", so in the preprocessing step we remove that
suffix to make it a numerical feature.
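
For illustration, a minimal way to load and inspect the file with pandas (the file name follows the source-code section later in this report; the exact path depends on where the CSV is stored):

import pandas as pd

df1 = pd.read_csv("Bengaluru_House_Data.csv")

print(df1.shape)     # number of rows and columns
print(df1.dtypes)    # which columns are numerical and which are categorical
print(df1.head())    # first few rows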
5 EXPLORATORY DATA ANALYSIS

Exploratory data analysis is also called EDA. Exploratory data analysis was promoted by
John Tukey to encourage statisticians to explore data, and possibly formulate hypotheses that
might lead to new data collection and experiments. EDA focuses on checking the
assumptions required for model fitting and hypothesis testing, as well as handling
missing values and making transformations of variables as needed. EDA builds a robust
understanding of the data and of the issues associated with either the information or the
process. It is a scientific approach to getting the story of the data.

Data exploration is the first step in data analysis and typically involves summarizing the main
characteristics of a data set, including its size, accuracy, initial patterns in the data and other
attributes. It is commonly conducted by data analysts using visual analytics tools, but it can also
be done in more advanced statistical software such as Python. Before it can conduct analysis on data
collected by multiple data sources and stored in data warehouses, an organization must know
how many cases are in a data set, what variables are included, how many missing values there
are and what general hypotheses the data is likely to support. An initial exploration of the data
set can help answer these questions by familiarizing analysts with the data with which they are
working.

 Checking the correlation of every feature

1. From the correlation matrix, price_per_sqft is not strongly correlated with the price feature, so it
does not play an important role in predicting the price of a house.
2. bhk and bath are highly correlated.
3. We can create a new feature using the bhk and bath features.

 Checking the histogram of the price per sqft feature

1. From the histogram, most of the values lie between 5000 and 10000.

 Checking the histogram of the bath feature

1. From the histogram, most houses have 2-5 bathrooms. These checks can be reproduced as sketched below.
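
A rough sketch of these checks, assuming a preprocessed DataFrame (called df5 later in this report) that already contains the bhk and price_per_sqft columns:

import matplotlib.pyplot as plt

# correlation of the numerical features
print(df5[['price', 'bath', 'bhk', 'total_sqft', 'price_per_sqft']].corr())

# histogram of price per square foot
plt.hist(df5.price_per_sqft, bins=50)
plt.xlabel("price per square foot")
plt.ylabel("count")
plt.show()

# histogram of the number of bathrooms
plt.hist(df5.bath, bins=20)
plt.xlabel("number of bathrooms")
plt.ylabel("count")
plt.show()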


6 DATA PREPROCESSING

Data preprocessing is an important step in the data mining process. It refers to the cleaning,
transforming, and integrating of data in order to make it ready for analysis. The goal of data
preprocessing is to improve the quality of the data and to make it more suitable for the specific
data mining task.
Some common steps in data preprocessing include:

 Data cleaning: this step involves identifying and removing missing, inconsistent, or irrelevant
data. This can include removing duplicate records, filling in missing values, and handling
outliers.

 Data integration: this step involves combining data from multiple sources, such as databases,
spreadsheets, and text files. The goal of integration is to create a single, consistent view of
the data.

 Data transformation: this step involves converting the data into a format that is more suitable
for the data mining task. This can include normalizing numerical data, creating dummy
variables, and encoding categorical data.

 Data reduction: this step is used to select a subset of the data that is relevant to the data
mining task. This can include feature selection (selecting a subset of the variables) or
feature extraction (extracting new variables from the data).

 Data discretization: this step is used to convert continuous numerical data into categorical
data, which can be used for decision tree and other categorical data mining techniques.

By performing these steps, the data mining process becomes more efficient and the
results become more accurate.

Preprocessing in Data Mining:

Data preprocessing is a data mining technique which is used to transform the raw data in a
useful and efficient format.
Steps Involved in Data Preprocessing:

1. Data Cleaning:

The data can have many irrelevant and missing parts. To handle this part, data
cleaning is done. It involves handling of missing data, noisy data etc.

Missing Data:

This situation arises when some data is missing in the data. It can be handled in
various ways.
Some of them are:

Ignore the tuples:

This approach is suitable only when the dataset we have is quite large and multiple
values are missing within a tuple.

Fill the Missing values:

There are various ways to do this task. You can choose to fill the missing values
manually, by attribute mean or the most probable value.

Noisy Data:

Noisy data is meaningless data that cannot be interpreted by machines. It can be
generated due to faulty data collection, data entry errors, etc. It can be handled in the
following ways:
Binning Method:

This method works on sorted data in order to smooth it. The whole data is
divided into segments of equal size and then various methods are performed to
complete the task. Each segment is handled separately. One can replace all
data in a segment by its mean, or boundary values can be used to complete the
task.

Regression:

Here data can be made smooth by fitting it to a regression function. The
regression used may be linear (having one independent variable) or multiple
(having multiple independent variables).

Clustering:

This approach groups similar data into a cluster. The outliers may go undetected,
or they will fall outside the clusters.

2. Data Transformation:

This step is taken in order to transform the data into forms appropriate for the mining
process. It involves the following ways:

Normalization:

It is done in order to scale the data values in a specified range (-1.0 to 1.0 or 0.0
to 1.0)

Attribute Selection:

In this strategy, new attributes are constructed from the given set of attributes to
help the mining process.

Discretization:

This is done to replace the raw values of numeric attribute by interval levels or
conceptual levels.

3. Data Reduction:

Data mining is a technique that is used to handle huge amounts of data, and when
working with huge volumes of data, analysis becomes harder. To deal with this, we use
data reduction techniques, which aim to increase storage efficiency and reduce data storage
and analysis costs.
The various steps to data reduction are:
Data Cube Aggregation:

Aggregation operation is applied to data for the construction of the data cube.

Attribute Subset Selection:

Only the highly relevant attributes should be used; the rest can be discarded. For
performing attribute selection, one can use the level of significance and p-value of the
attribute; an attribute having a p-value greater than the significance level can be
discarded.

Numerosity Reduction:

This enables storing a model of the data instead of the whole data, for example
regression models.

Dimensionality Reduction:

This reduces the size of the data through encoding mechanisms. It can be lossy or lossless.
If the original data can be retrieved after reconstruction from the compressed data, the
reduction is called lossless; otherwise it is called lossy. Two effective methods of
dimensionality reduction are wavelet transforms and PCA (Principal Component Analysis).

DATA CLEANING

Data cleaning is the most important step because in the real world the data may not be good
enough to apply a machine learning algorithm directly on it. Data scientists spend roughly 80% of
their time on data cleaning and data engineering.

In our dataset we have a feature, size, whose values look like "2 BHK" or "3 BHK". The ML
algorithm cannot understand alphabetic characters, so we need to give it only numbers, turning
"2 BHK" into 2. We created a column bhk:

df3['bhk'] = df3['size'].apply(lambda x: int(x.split(' ')[0]))


HANDLING NULL VALUES

Our dataset does not have all the values; some of the values are missing, so we need to
handle them, either by filling them in or by removing them directly. Most of the time we remove
them to save time, but information is lost when we remove them. Instead, we can use
statistical measures like mean, median and mode to fill them.

• Mean is used when we do not have too many outliers and the column is numerical.
• Median is used when we have outliers in the data.
• Mode is used when we have a categorical column.

Our data contains null values; we drop them, as sketched below.
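
A small sketch of this step, assuming df2 is the frame after dropping the unused columns (as in the source-code section later):

# count the missing values in each column
print(df2.isnull().sum())

# only a small number of rows are affected, so we simply drop them
df3 = df2.dropna()
print(df3.isnull().sum())

# a numerical column could instead be filled with its median, for example:
# df2['bath'] = df2['bath'].fillna(df2['bath'].median())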

FEATURE ENGINEERING

Feature engineering is a crucial step where we create new features from the existing
features, making the data easier for an ML algorithm to use.
We created a new feature, price per square foot, which is strongly correlated with the price
feature. The price is multiplied by 100,000 because it is given in lakhs:

df5['price_per_sqft'] = df5['price']*100000/df5['total_sqft']

Dimensionality Reduction

Dimensionality reduction is the process of reducing the number of features (or dimensions) in
a dataset while retaining as much information as possible. This can be done for a variety of
reasons, such as to reduce the complexity of a model, to improve the performance of a
learning algorithm, or to make it easier to visualize the data. There are several techniques for
dimensionality reduction, including principal component analysis (PCA), singular value
decomposition (SVD), and linear discriminant analysis (LDA). Each technique uses a different
method to project the data onto a lower-dimensional space while preserving important
information.
In machine learning classification problems, there are often too many factors on the basis of
which the final classification is done. These factors are basically variables called features. The
higher the number of features, the harder it gets to visualize the training set and then work on
it. Sometimes, most of these features are correlated, and hence redundant. This is where
dimensionality reduction algorithms come into play. Dimensionality reduction is the process of
reducing the number of random variables under consideration, by obtaining a set of principal
variables. It can be divided into feature selection and feature extraction.

Components of Dimensionality Reduction

There are two components of dimensionality reduction:

 Feature selection: In this, we try to find a subset of the original set of variables, or features,
to get a smaller subset which can be used to model the problem. It usually involves
three ways:

1. Filter
2. Wrapper
3. Embedded
 Feature extraction: This reduces the data in a high dimensional space to a lower
dimension space, i.e. a space with a smaller number of dimensions.

Methods of Dimensionality Reduction

The various methods used for dimensionality reduction include:

 Principal Component Analysis (PCA)


 Linear Discriminant Analysis (LDA)
 Generalized Discriminant Analysis (GDA)

In our data the location feature has too many distinct values; we cannot use them directly, as
that may reduce the performance of the machine learning algorithm, so we apply a filter to the
feature to reduce the number of location values. Across the roughly 13,200 rows there are far
too many distinct locations, so we keep only the locations which appear more than 10 times in
the dataset. This reduces the feature to 240 unique locations; the locations which appear 10 or
fewer times are grouped as "other". In total we have 241 unique location values, as sketched below.
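
A sketch of how the rare locations can be grouped, consistent with the location_stats_less_than_10 variable that appears in the source code later (the construction of location_stats itself is our reconstruction, assuming the preprocessed frame df5):

# count how often each location occurs
location_stats = df5.groupby('location')['location'].agg('count').sort_values(ascending=False)

# locations that occur 10 times or fewer are grouped into a single 'other' category
location_stats_less_than_10 = location_stats[location_stats <= 10]
df5.location = df5.location.apply(lambda x: 'other' if x in location_stats_less_than_10 else x)

print(len(df5.location.unique()))   # far fewer unique location values remain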

REMOVING OUTLIERS

An outlier is an object that deviates significantly from the rest of the objects. Outliers can be
caused by measurement or execution errors. The analysis of outlier data is referred to as outlier
analysis or outlier mining.
Most data mining methods discard outliers as noise or exceptions; however, in some applications
such as fraud detection, the rare events can be more interesting than the more regularly
occurring ones, and hence outlier analysis becomes important in such cases.

Detecting outliers: one approach is clustering-based outlier detection using distance to the
closest cluster. In the K-Means clustering technique, each cluster has a mean value, and objects
belong to the cluster whose mean value is closest to them. In order to identify an outlier, we first
initialize a threshold value such that any data point whose distance from its nearest cluster is
greater than the threshold is identified as an outlier. Then we find the distance of the test data
to each cluster mean. Now, if the distance between the test data and the closest cluster is greater
than the threshold value, we classify the test data as an outlier.

Outliers are suspicious values; such data points are errors in the data. We need to remove them,
otherwise those errors are also fed to the algorithm and the algorithm learns them too. There is a
saying, "garbage in, garbage out": we need to give good data to the algorithm for it to perform
better.

In our data we have some outliers. Normally the square footage per bedroom is at least 300
(i.e. a 2 BHK apartment is at minimum 600 sqft). If, for example, you have a 400 sqft apartment
listed as 2 BHK, that seems suspicious and can be removed as an outlier. We will remove such
outliers by keeping our minimum threshold per BHK at 300 sqft, as sketched below.
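
This matches the filter used in the source code later in this report (df5 and df6 being the frames before and after the filter):

# keep only rows that have at least 300 sqft per bedroom
df6 = df5[~(df5.total_sqft / df5.bhk < 300)]
print(df5.shape, df6.shape)   # rows before and after the filter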

The data also contains outliers with respect to prices: if we look at the price per sqft for
2 BHK and 3 BHK apartments in locations such as Rajajinagar and Hebbal, 2 BHK and 3 BHK
flats sometimes have very similar prices, which looks suspicious, so such rows are treated as
outliers and removed per location.

The data also contains outliers with respect to bathrooms: some data points have two or more
bathrooms beyond the number of bedrooms, which seems unusual, so they are also removed.
After all of this we have 7,239 rows and 5 columns [location, total_sqft, bath, bhk, price],
as in the sketch below.
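
A one-line sketch of the bathroom filter (the frame names df7 and df8 are illustrative, not the project's verified ones):

# keep only rows where the bathroom count does not exceed bedrooms + 1
df8 = df7[df7.bath < df7.bhk + 2]
print(df8.shape)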
ONE HOT ENCODING

Most real-life datasets we encounter during our data science project development have
columns of mixed data type. These datasets consist of both categorical as well as numerical
columns. However, various Machine Learning models do not work with categorical data and to
fit this data into the machine learning model it needs to be converted into numerical data. For
example, suppose a dataset has a Gender column with categorical elements like Male
and Female. These labels have no specific order of preference, and since the data consists of
string labels, machine learning models may misinterpret them as having some sort of hierarchy.
One approach to solve this problem can be label encoding where we will assign a numerical
value to these labels for example Male and Female mapped to 0 and 1. But this can add bias
in our model as it will start giving higher preference to the Female parameter as 1>0 but
ideally, both labels are equally important in the dataset. To deal with this issue we will use the
One Hot Encoding technique. One hot encoding is a technique that we use to represent categorical
variables as numerical values in a machine learning model.

The advantages of using one hot encoding include:

 It allows the use of categorical variables in models that require numerical input.

 It can improve model performance by providing more information to the model about
the categorical variable.

 It can help to avoid the problem of ordinality, which can occur when a categorical
variable has a natural ordering (e.g. “small”, “medium”, “large”).

The disadvantages of using one hot encoding include:

 It can lead to increased dimensionality, as a separate column is created for each
category in the variable. This can make the model more complex and slow to train.

 It can lead to sparse data, as most observations will have a value of 0 in most of the
one-hot encoded columns.

 It can lead to overfitting, especially if there are many categories in the variable and
the sample size is relatively small.
 One-hot-encoding is a powerful technique to treat categorical data, but it can lead to
increased dimensionality, sparsity, and overfitting. It is important to use it cautiously
and consider other methods such as ordinal encoding or binary encoding.

Here location is not a numerical feature, and the ML algorithm takes only numerical features.
To convert the location feature to numerical values we use one hot encoding, which takes each
row value and converts it into a dummy feature.

Every unique location value is converted to a new feature; after creating all the dummy
features we remove one of them so that there won't be any dummy variable trap for the
machine learning algorithm. A sketch with pandas is shown below.
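
A sketch with pandas get_dummies (the frame names df10 and df11 are illustrative; dropping the 'other' dummy assumes the rare-location grouping described earlier):

import pandas as pd

# one dummy column per location; drop the 'other' dummy to avoid the dummy variable trap
dummies = pd.get_dummies(df10.location)
df11 = pd.concat([df10.drop('location', axis='columns'),
                  dummies.drop('other', axis='columns')],
                 axis='columns')
print(df11.shape)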
Now our dataset has the shape 7,239 rows and 244 columns, i.e., features.

This is a supervised learning model, so we divide our dataset into dependent and independent
features. Price is our dependent feature, the value our model is going to predict. The
remaining 243 features are independent features.
Next we divide the data into training and testing sets using scikit-learn's train_test_split
function, with an 80:20 train-test ratio:

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=10)
7 LINEAR REGRESSION

Linear regression is one of the easiest and most popular Machine Learning algorithms. It is a
statistical method that is used for predictive analysis. Linear regression makes predictions for
continuous/real or numeric variables such as sales, salary, age, product price, etc.

The linear regression algorithm shows a linear relationship between a dependent (y) variable
and one or more independent (x) variables, hence the name linear regression. Since linear
regression shows a linear relationship, it finds how the value of the dependent variable
changes according to the value of the independent variable.

EQUATION: y = a0 + a1x + ε

Y = Dependent variable (target variable)
X = Independent variable (predictor variable)
a0 = intercept of the line (gives an additional degree of freedom)
a1 = linear regression coefficient (scale factor applied to each input value)
ε = random error
The values for x and y variables are training datasets for Linear Regression model representation.

Finding the best fit line:


When working with linear regression, our main goal is to find the best fit line, meaning the
error between predicted values and actual values should be minimized. The best fit line will have
the least error.
Different values for the weights or coefficients of the line (a0, a1) give different regression
lines, so we need to calculate the best values for a0 and a1 to find the best fit line; to
calculate this we use the cost function.

Cost function:

o Different values for the weights or coefficients of the line (a0, a1) give different lines of regression,
and the cost function is used to estimate the values of the coefficients for the best fit line.
o The cost function optimizes the regression coefficients or weights. It measures how well a linear
regression model is performing.
o We can use the cost function to find the accuracy of the mapping function, which maps the input
variable to the output variable. This mapping function is also known as the hypothesis function.

For Linear Regression, we use the Mean Squared Error (MSE) cost function, which is the
average of the squared errors between the predicted values and the actual values. For the
above linear equation, it can be written as:

MSE = (1/N) * Σ (Yi - (a1xi + a0))²

Where,

N = total number of observations
Yi = actual value
a1xi + a0 = predicted value

Residuals: the distance between an actual value and the corresponding predicted value is called a
residual. If the observed points are far from the regression line, the residuals will be high and so
the cost function will be high. If the scatter points are close to the regression line, the residuals
will be small and hence the cost function will be small.

Gradient Descent:

o Gradient descent is used to minimize the MSE by calculating the gradient of the cost
function.
o A regression model uses gradient descent to update the coefficients of the line by
reducing the cost function.
o It starts from a random selection of coefficient values and then iteratively updates them
to reach the minimum of the cost function, as sketched below.
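
A small self-contained sketch of gradient descent for the simple line y = a0 + a1x (the data here is synthetic):

import numpy as np

# synthetic data: y is roughly 3x + 4
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = 3 * x + 4 + np.random.normal(scale=0.1, size=x.shape)

a0, a1 = 0.0, 0.0      # intercept and slope, initialised arbitrarily
lr = 0.01              # learning rate
n = len(x)

for _ in range(5000):
    y_pred = a1 * x + a0
    error = y_pred - y
    # gradients of the MSE cost with respect to a0 and a1
    grad_a0 = (2 / n) * error.sum()
    grad_a1 = (2 / n) * (error * x).sum()
    a0 -= lr * grad_a0
    a1 -= lr * grad_a1

print(a0, a1)          # should end up close to 4 and 3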

8 K FOLD CROSS VALIDATION

In machine learning, we couldn’t fit the model on the training data and can’t say
that the model will work accurately for the real data. For this, we must assure that our model got
the correct patterns from the data, and it is not getting up too much noise. For this purpose, we
use the cross-validation technique.
Cross validation is a technique used in machine learning to evaluate the performance of a model
on unseen data. It involves dividing the available data into multiple folds or subsets, using one
of these folds as a validation set, and training the model on the remaining folds. This process is
repeated multiple times, each time using a different fold as the validation set. Finally, the results
from each validation step are averaged to produce a more robust estimate of the model’s
performance.

The main purpose of cross validation is to prevent overfitting, which occurs when a model is
trained too well on the training data and performs poorly on new, unseen data. By evaluating the
model on multiple validation sets, cross validation provides a more realistic estimate of the
model’s generalization performance, i.e., its ability to perform well on new, unseen data.

There are several types of cross validation techniques, including k-fold cross validation,
leave-one-out cross validation, and stratified cross validation. The choice of technique depends on
the size and nature of the data, as well as the specific requirements of the modeling problem.
A sketch of cross validation with scikit-learn is shown below.
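
A sketch of cross validation for our model, assuming X and y are the independent features and the price column prepared above (ShuffleSplit matches the splitter used in the grid search code below):

from sklearn.linear_model import LinearRegression
from sklearn.model_selection import ShuffleSplit, cross_val_score

cv = ShuffleSplit(n_splits=5, test_size=0.2, random_state=0)
scores = cross_val_score(LinearRegression(), X, y, cv=cv)

print(scores)          # one R^2 score per split
print(scores.mean())   # averaged estimate of model performance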
We also tried different types of algorithms: Decision Tree Regressor and Lasso Regression.

IMPLEMENTATION OF LASSO AND DECISION TREE MODELS

from sklearn.model_selection import GridSearchCV, ShuffleSplit
from sklearn.linear_model import Lasso
from sklearn.tree import DecisionTreeRegressor
import pandas as pd


def find_best_model_using_gridsearchcv(X, y):
    algos = {
        'lasso': {
            'model': Lasso(),
            'params': {
                'alpha': [1, 2],
                'selection': ['random', 'cyclic']
            }
        },
        'decision_tree': {
            'model': DecisionTreeRegressor(),
            'params': {
                # 'squared_error' was called 'mse' in older scikit-learn versions
                'criterion': ['squared_error', 'friedman_mse'],
                'splitter': ['best', 'random']
            }
        }
    }
    scores = []
    cv = ShuffleSplit(n_splits=5, test_size=0.2, random_state=0)
    for algo_name, config in algos.items():
        gs = GridSearchCV(config['model'], config['params'], cv=cv, return_train_score=False)
        gs.fit(X, y)
        scores.append({
            'model': algo_name,
            'best_score': gs.best_score_,
            'best_params': gs.best_params_
        })

    return pd.DataFrame(scores, columns=['model', 'best_score', 'best_params'])

find_best_model_using_gridsearchcv(X, y)

We then create a function that converts the given location into the corresponding dummy
column (0 or 1) and predicts the price. Based on the above results, linear regression achieved
the highest accuracy.
Finally, we save the model as a pickle file, which can be used to predict directly without
re-running the whole training, as sketched below.
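
A rough sketch of this final step; the file names and the variable names lr_clf (the trained model) and X (the feature matrix) are illustrative, not the project's exact ones:

import json
import pickle

# save the trained model to the artifacts folder (file name is illustrative)
with open('house_prices_model.pickle', 'wb') as f:
    pickle.dump(lr_clf, f)

# save the column names so the web app can rebuild the input vector in the same order
columns = {'data_columns': [col.lower() for col in X.columns]}
with open('columns.json', 'w') as f:
    f.write(json.dumps(columns))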

9 FLASK

Flask is a Python framework that allows us to build web applications. It was developed by
Armin Ronacher. Flask's framework is more explicit than Django's framework and is also
easier to learn because it needs less base code to implement a simple web application. A web
application framework, or web framework, is a collection of modules and libraries that helps
the developer write applications without writing low-level code such as protocols,
thread management, etc. Flask is based on the WSGI (Web Server Gateway Interface) toolkit and
the Jinja2 template engine.
The Flask application is started by calling the run() function. The method should be restarted
manually for any change in the code. To overcome this, the debug support is enabled so as to
track any error.
app.debug = True

app.run()

app.run(debug = True)

Routing:

Modern web frameworks provide a routing technique so that users can work with meaningful URLs and reach a page directly without navigating from the home page. In Flask this is done with the route() decorator, which binds a URL to a function:
# decorator to route the URL
@app.route('/hello')
# the decorated function runs when the URL is requested
def hello_world():
    return 'hello world'

If a user visits the http://localhost:5000/hello URL, the output of the hello_world() function is rendered in the browser. The add_url_rule() function of the application object can also be used to bind a URL to a function, instead of the decorator shown above.
Flask routes support the standard HTTP methods for sending and retrieving data; the most common ones are:

Method – Description
GET – Sends data to the server without encryption, appended to the URL as a query string.
HEAD – Same as GET, but the response contains only the headers and no response body.
POST – Sends form data to the server in the request body; data received by the POST method is not cached by the server.
PUT – Replaces the current representation of the target resource with the uploaded content.
DELETE – Deletes the target resource identified by the given URL.
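As a small illustration (a sketch only, not a route from this project), a Flask view can declare which methods it accepts and read the submitted data accordingly:

from flask import Flask, request

app = Flask(__name__)

@app.route('/greet', methods=['GET', 'POST'])
def greet():
    if request.method == 'POST':
        # form data arrives in the request body of a POST
        name = request.form.get('name', 'world')
    else:
        # query-string data arrives with a GET request
        name = request.args.get('name', 'world')
    return 'hello ' + name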

Handling Static Files:

A web application often requires static files such as JavaScript or CSS files to render the web page in the browser. In production the web server is usually configured to serve them, but during development Flask serves them from the static folder inside your package or next to your module.
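For example (a minimal sketch, assuming a file named app.css in the static folder), url_for() can be used to build the path under which Flask serves a static file:

from flask import Flask, url_for

app = Flask(__name__)

@app.route('/css-path')
def css_path():
    # resolves to /static/app.css, served from the static folder during development
    return url_for('static', filename='app.css')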

Other Important Flask Functions:


redirect(): returns a response object that redirects the user to another target location with a specified status code.

Syntax: flask.redirect(location, statuscode, response)

// location is the URL to redirect to
// statuscode is the HTTP status sent in the header, 302 by default
// response is used to instantiate the response

abort(): It is used to abort request handling and return an HTTP error code to the client.

Syntax: flask.abort(code)

The code parameter can take the following values to handle the error accordingly:
 400 – Bad Request
 401 – Unauthorized
 403 – Forbidden
 404 – Not Found
 406 – Not Acceptable
 415 – Unsupported Media Type
 429 – Too Many Requests

File-Uploading in Flask:

File uploading in Flask is straightforward. It needs an HTML form with the enctype="multipart/form-data" attribute and a URL handler that fetches the file and saves it to the desired location. Files are stored temporarily on the server before being moved to that location.
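The following is a minimal sketch of such a handler; the route, the uploads folder and the form field name are assumptions for illustration, and the folder is assumed to exist:

import os
from flask import Flask, request

app = Flask(__name__)
UPLOAD_FOLDER = 'uploads'

@app.route('/upload', methods=['POST'])
def upload_file():
    # the uploaded file is available in request.files under its form field name
    f = request.files['file']
    # save it to the desired location on the server
    f.save(os.path.join(UPLOAD_FOLDER, f.filename))
    return 'file uploaded successfully'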
10 SOURCE CODE

10.1 Python ML code

Importing libraries

import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
%matplotlib inline
import matplotlib
matplotlib.rcParams["figure.figsize"] = (20,10)

loading data

df1 = pd.read_csv("/content/drive/MyDrive/bhp
project/Bengaluru_House_Data.csv")
df1.head()

data preprocessing

df2 = df1.drop(['area_type','society','balcony','availability'], axis='columns')
df3 = df2.dropna()
df3['bhk'] = df3['size'].apply(lambda x: int(x.split(' ')[0]))
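# NOTE: the intermediate steps that produce df4, df5 and location_stats_less_than_10
# are not reproduced in this listing. The sketch below shows how they are assumed to
# be built, based on how df5 is used in the following lines.
def convert_sqft_to_num(x):
    # ranges such as "2100 - 2850" are replaced by their average
    tokens = x.split('-')
    if len(tokens) == 2:
        return (float(tokens[0]) + float(tokens[1])) / 2
    try:
        return float(x)
    except:
        return None
df4 = df3.copy()
df4.total_sqft = df4.total_sqft.apply(convert_sqft_to_num)
df4 = df4[df4.total_sqft.notnull()]
df5 = df4.copy()
df5.location = df5.location.apply(lambda x: x.strip())
location_stats = df5['location'].value_counts(ascending=False)
location_stats_less_than_10 = location_stats[location_stats <= 10]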
df5['price_per_sqft'] = df5['price']*100000/df5['total_sqft']
df5.location = df5.location.apply(lambda x: 'other' if x in location_stats_less_than_10 else x)
# drop rows where the area per bedroom is unrealistically small
df6 = df5[~(df5.total_sqft/df5.bhk < 300)]

def remove_pps_outliers(df):
    # keep, per location, only rows within one standard deviation of the mean price per sqft
    df_out = pd.DataFrame()
    for key, subdf in df.groupby('location'):
        m = np.mean(subdf.price_per_sqft)
        st = np.std(subdf.price_per_sqft)
        reduced_df = subdf[(subdf.price_per_sqft > (m-st)) & (subdf.price_per_sqft <= (m+st))]
        df_out = pd.concat([df_out, reduced_df], ignore_index=True)
    return df_out

df7 = remove_pps_outliers(df6)

def remove_bhk_outliers(df):
    # drop n-BHK flats that cost less per sqft than the mean of (n-1)-BHK flats in the same location
    exclude_indices = np.array([])
    for location, location_df in df.groupby('location'):
        bhk_stats = {}
        for bhk, bhk_df in location_df.groupby('bhk'):
            bhk_stats[bhk] = {
                'mean': np.mean(bhk_df.price_per_sqft),
                'std': np.std(bhk_df.price_per_sqft),
                'count': bhk_df.shape[0]
            }
        for bhk, bhk_df in location_df.groupby('bhk'):
            stats = bhk_stats.get(bhk-1)
            if stats and stats['count'] > 5:
                exclude_indices = np.append(exclude_indices, bhk_df[bhk_df.price_per_sqft < (stats['mean'])].index.values)
    return df.drop(exclude_indices, axis='index')

df8 = remove_bhk_outliers(df7)
# keep only rows where the number of bathrooms is not more than bhk + 1
df9 = df8[df8.bath < df8.bhk+2]
df10 = df9.drop(['size','price_per_sqft'], axis='columns')
# one-hot encode the location column and drop the original text column
dummies = pd.get_dummies(df10.location)
df11 = pd.concat([df10, dummies.drop('other', axis='columns')], axis='columns')
df12 = df11.drop('location', axis='columns')

model building

X = df12.drop(['price'],axis='columns')
y = df12.price
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=10)

from sklearn.linear_model import LinearRegression


lr_clf = LinearRegression()
lr_clf.fit(X_train,y_train)
lr_clf.score(X_test,y_test)
output: 0.8629132245229443
from sklearn.model_selection import ShuffleSplit
from sklearn.model_selection import cross_val_score

cv = ShuffleSplit(n_splits=5, test_size=0.2, random_state=0)

cross_val_score(LinearRegression(), X, y, cv=cv)
output:array([0.82702546, 0.86027005, 0.85322178, 0.8436466 , 0.85481502])

10.2 Util.py

import pickle
import json
import numpy as np

__locations = None
__data_columns = None
__model = None

def get_estimated_price(location, sqft, bhk, bath):
    try:
        loc_index = __data_columns.index(location.lower())
    except:
        loc_index = -1

    x = np.zeros(len(__data_columns))
    x[0] = sqft
    x[1] = bath
    x[2] = bhk
    if loc_index >= 0:
        x[loc_index] = 1

    return round(__model.predict([x])[0], 2)

def load_saved_artifacts():
    print("loading saved artifacts...start")
    global __data_columns
    global __locations

    with open("./artifacts/columns.json", "r") as f:
        __data_columns = json.load(f)['data_columns']
        __locations = __data_columns[3:]  # first 3 columns are sqft, bath, bhk

    global __model
    if __model is None:
        with open('./artifacts/banglore_home_prices_model.pickle', 'rb') as f:
            __model = pickle.load(f)
    print("loading saved artifacts...done")

def get_location_names():
    return __locations

def get_data_columns():
    return __data_columns

if __name__ == '__main__':
    load_saved_artifacts()
    print(get_location_names())
    print(get_estimated_price('1st Phase JP Nagar', 1000, 3, 3))

10.3 server.py

from flask import Flask, request, jsonify
import util

app = Flask(__name__)

@app.route('/get_location_names', methods=['GET'])
def get_location_names():
    response = jsonify({
        'locations': util.get_location_names()
    })
    response.headers.add('Access-Control-Allow-Origin', '*')
    return response

@app.route('/predict_home_price', methods=['GET', 'POST'])
def predict_home_price():
    total_sqft = float(request.form['total_sqft'])
    location = request.form['location']
    bhk = int(request.form['bhk'])
    bath = int(request.form['bath'])

    response = jsonify({
        'estimated_price': util.get_estimated_price(location, total_sqft, bhk, bath)
    })
    response.headers.add('Access-Control-Allow-Origin', '*')
    return response

if __name__ == "__main__":
    print("Starting Python Flask Server For Home Price Prediction...")
    util.load_saved_artifacts()
    app.run()

10.4 HTML

<!DOCTYPE html>
<html>
<head>
<title>Banglore Home Price Prediction</title>
<script src="https://ajax.googleapis.com/ajax/libs/jquery/3.4.1/jquery.min.js"></script>
<script src="app.js"></script>
<link rel="stylesheet" href="app.css">
</head>
<body>
<div class="img"></div>
<form class="form">
<h2>Area (Square Feet)</h2>
<input class="area floatLabel" type="text" id="uiSqft" name="Squareft" value="1000">
<h2>BHK</h2>
<div class="switch-field">
<input type="radio" id="radio-bhk-1" name="uiBHK" value="1"/>
<label for="radio-bhk-1">1</label>
<input type="radio" id="radio-bhk-2" name="uiBHK" value="2"
checked/>
<label for="radio-bhk-2">2</label>
<input type="radio" id="radio-bhk-3" name="uiBHK" value="3"/>
<label for="radio-bhk-3">3</label>
<input type="radio" id="radio-bhk-4" name="uiBHK" value="4"/>
<label for="radio-bhk-4">4</label>
<input type="radio" id="radio-bhk-5" name="uiBHK" value="5"/>
<label for="radio-bhk-5">5</label>
</div>
</form>
<form class="form">
<h2>Bath</h2>
<div class="switch-field">
<input type="radio" id="radio-bath-1" name="uiBathrooms"
value="1"/>
<label for="radio-bath-1">1</label>
<input type="radio" id="radio-bath-2" name="uiBathrooms"
value="2" checked/>

<label for="radio-bath-2">2</label>
<input type="radio" id="radio-bath-3" name="uiBathrooms"
value="3"/> <label for="radio-bath-3">3</label>
<input type="radio" id="radio-bath-4" name="uiBathrooms"
value="4"/> <label for="radio-bath-4">4</label>
<input type="radio" id="radio-bath-5" name="uiBathrooms"
value="5"/>
<label for="radio-bath-5">5</label>
</div>
<h2>Location</h2>
<div>
<select class="location" name="" id="uiLocations">
<option value="" disabled="disabled" selected="selected">Choose a Location</option>
<option>Electronic City</option>
<option>Rajaji Nagar</option>

</select>
</div>
<button class="submit" onclick="onClickedEstimatePrice()" type="button">Estimate Price</button>
<div id="uiEstimatedPrice" class="result"> <h2></h2> </div>
</form>
</body>
</html>
10.5 CSS

@import url(https://fonts.googleapis.com/css?family=Roboto:300);
.switch-field {
display: flex;
margin-bottom: 36px;
overflow: hidden;
}

.switch-field input {
position: absolute !important;
clip: rect(0, 0, 0, 0);
height: 1px;
width: 1px;
border: 0;
overflow: hidden;
}

.switch-field label {
background-color: #e4e4e4;
color: rgba(0, 0, 0, 0.6);
font-size: 14px;
line-height: 1;
text-align: center;
padding: 8px 16px;
border: 1px solid
rgba(0, 0, 0, 0.2);
box-shadow: inset 0 1px 3px
rgba(0, 0, 0, 0.3), 0 1px
rgba(255, 255, 255, 0.1);
transition: all 0.1s ease-in-out;
}

.switch-field label:hover {
cursor: pointer;
}
.switch-field input:checked + label


{
background-color: #a5dc86;
box-shadow: none;
}

.switch-field label:first-of-type
{
border-radius: 4px 0 0 4px;
}

.switch-field label:last-of-type {
border-radius: 0 4px 4px 0;
}
.form
{
max-width: 270px;
font-family: "Lucida Grande", Tahoma, Verdana, sans-serif;
font-weight: normal;
line-height: 1.625;
margin: 8px auto;
padding-left: 16px;
z-index: 2;
}
h2 {
font-size: 18px;
margin-bottom: 8px;
}
.area{
font-family: "Roboto", sans-serif;
outline: 0;
background: #f2f2f2;
width: 76%;
border: 0;
margin: 0 0 10px;
padding: 10px;
box-sizing: border-box;
font-size: 15px;
height: 35px;
border-radius: 5px;
}
.location{
font-family: "Roboto", sans-serif;
outline: 0;
background: #f2f2f2;
width: 76%;
border: 0;
margin: 0 0 10px;
padding: 10px;
box-sizing: border-box;
font-size: 15px;
height: 40px;
border-radius: 5px;
}

.submit{
background: #a5dc86;
width: 76%;
border: 0;
margin: 25px 0 10px;
box-sizing: border-box;
font-size: 15px;
height: 35px;
text-align: center;
border-radius: 5px;
}

.result{
background: #dcd686;
width: 76%;
border: 0;
margin: 25px 0 10px;
box-sizing: border-box;
font-size: 15px;
height: 35px;
text-align: center;
}
.img
{
background: url('https://images.unsplash.com/photo-1564013799919ab600027ffc6?ixlib=rb-1.2.1&auto=format&fit=crop&w=1350&q=80');
background-repeat: no-repeat;
background-size: auto;
background-size:100% 100%;
-webkit-filter: blur(5px);
-moz-filter: blur(5px);
-o-filter: blur(5px);
-ms-filter: blur(5px);
filter: blur(15px);
position: fixed;
width: 100%;
height: 100%;
top: 0;
left: 0;

z-index: -1;
}
body, html {
height: 100%;
}

10.6 JAVA SCRIPT

function getBathValue() {
  var uiBathrooms = document.getElementsByName("uiBathrooms");
  for(var i in uiBathrooms) {
    if(uiBathrooms[i].checked) {
      return parseInt(i)+1;
    }
  }
  return -1; // Invalid Value
}

function getBHKValue() {
  var uiBHK = document.getElementsByName("uiBHK");
  for(var i in uiBHK) {
    if(uiBHK[i].checked) {
      return parseInt(i)+1;
    }
  }
  return -1; // Invalid Value
}
function onClickedEstimatePrice() {
  console.log("Estimate price button clicked");
  var sqft = document.getElementById("uiSqft");
  var bhk = getBHKValue();
  var bathrooms = getBathValue();
  var location = document.getElementById("uiLocations");
  var estPrice = document.getElementById("uiEstimatedPrice");

  // use this URL when calling the Flask server directly (no nginx in front)
  var url = "http://127.0.0.1:5000/predict_home_price";
  // var url = "/api/predict_home_price"; // use this when nginx proxies /api to the Flask server

  $.post(url, {
    total_sqft: parseFloat(sqft.value),
    bhk: bhk,
    bath: bathrooms,
    location: location.value
  }, function(data, status) {
    console.log(data.estimated_price);
    estPrice.innerHTML = "<h2>" + data.estimated_price.toString() + " Lakh</h2>";
    console.log(status);
  });
}

function onPageLoad() {
  console.log("document loaded");
  var url = "http://127.0.0.1:5000/get_location_names";
  // var url = "/api/get_location_names";
  $.get(url, function(data, status) {
    console.log("got response for get_location_names request");
    if(data) {
      var locations = data.locations;
      var uiLocations = document.getElementById("uiLocations");
      $('#uiLocations').empty();
      for(var i in locations) {
        var opt = new Option(locations[i]);
        $('#uiLocations').append(opt);
      }
    }
  });
}

window.onload = onPageLoad;

Image of web page


11 CONCLUSION

Our study showed that Linear Regression displayed the best performance for this dataset and can be used for deployment. Decision Tree Regressor and Lasso Regression scored noticeably lower, so they cannot be recommended for deployment here.

Our aim is achieved, as we have successfully met all the objectives we set out with. Area (square feet), BHK and bath are the most effective attributes in predicting the house price, and Linear Regression is the most effective model for our dataset, with an accuracy score of around 86%.

12 REFERENCES

[1] Aurélien Géron, “Hands-On Machine Learning with Scikit-Learn, Keras and TensorFlow: Concepts, Tools, and Techniques to Build Intelligent Systems”.

[2] Margaret H. Dunham, “Data Mining: Introductory and Advanced Topics”.

[3] Hal Daumé III, “A Course in Machine Learning” (CIML).

[4] Dhaval Patel, “Project on Machine Learning”.
