Report Final
UNDER THE GUIDANCE OF
D NEETHA, MCA
CERTIFICATE
Apart from my own efforts, the success of any project depends largely on the
encouragement and guidance of many others. I take this
opportunity to express my gratitude to the people who have been
instrumental in the successful completion of this project.
Abstract
House price forecasting is an important topic in real estate. The literature attempts to
derive useful knowledge from historical data of property markets. Machine learning
techniques are applied to analyze historical property transactions in India and to discover useful
models for house buyers and sellers. The analysis reveals a high discrepancy between house
prices in the most expensive and most affordable suburbs of the city of Bangalore. Moreover,
experiments demonstrate that Multiple Linear Regression, evaluated on mean squared error, gives the most competitive predictions for this dataset.
Table of Contents
Aim
• Identify the important home price attributes that drive the model's predictive power.
granted, it’s that housing and rental prices continue to rise. Since the housing crisis of 2008,
housing prices have recovered remarkably well, especially in major housing markets.
However, in the 4th quarter of 2016, I was surprised to read that Bombay housing prices had
fallen the most in the last 4 years. In fact, median resale prices for condos and coops fell
6.3%, marking the first time there was a decline since Q1 of 2017. The decline has been
partly attributed to political uncertainty domestically and abroad and the 2014 election. So, to
maintain the transparency among customers and also the comparison can be made easy
through this model. If we had a website which tells us the price of the house with certain
features like area number of bedrooms etc. If customer finds the price of house higher than
Real estate is a field which keeps changing. Prices are not constant; they change
continuously, and there may be many different factors behind the changes, not all of which we
can take into consideration. So we built a model on basic factors like area, number of
bedrooms, etc., which are the factors everyone cares about. Our model is therefore prepared on a
stable set of basic factors.
This model can be very useful to anyone who wants to buy a house in Bangalore city.
It can be helpful to people who are planning to buy a house, letting them know the
price range of the house.
It is also helpful to people who want to sell a house; they can know the price
range they can put on their house.
We can also analyse which locations have higher prices and which locations have lower prices.
A feasibility analysis is an analytical exercise through which the project manager determines the
project's success ratio and is able to see whether the project is good
enough to release into the market or not. The key considerations involved in the feasibility analysis
are:
• Economical Feasibility
• Technical Feasibility
• Operational Feasibility
Economical Feasibility
Economic feasibility is the cost and logistical outlook for a business project or
endeavor. Prior to
embarking on a new venture, most businesses conduct an economic feasibility study, which is a
study that analyzes data to determine whether the cost of the prospective new venture will
ultimately be profitable to the company. Economic feasibility is sometimes determined within an
organization, while other times companies hire an external company that specializes in
conducting economic feasibility studies for them.
There is no cost involved in using this model, so it is economically feasible.
Technical Feasibility
Technical feasibility can be described as the formal process of assessing whether
it is technically possible to manufacture a product or service. Before launching a new offering or
taking up a client project, it is essential to plan and prepare for every step of the operation.
Technical feasibility helps determine the efficacy of the proposed plan by analyzing the process,
including tools, technology, material, labour and logistics.
The software technologies used are Python, Flask, HTML, CSS and JavaScript. It is possible to
update the system in the future. No special hardware is required for using this model. Hence this
is technically feasible.
Operational Feasibility
Operational feasibility is the measure of how well a proposed system solves the
problems, and takes advantage of the opportunities identified during scope definition and how it
satisfies the requirements identified in the requirements analysis phase of system development.
The operational feasibility assessment focuses on the degree to which the proposed
development project fits in with the existing business environment and objectives with regard to
development schedule, delivery date, corporate culture and existing business processes.
To ensure success, desired operational outcomes must be imparted during design and
development. These include such design-dependent parameters as reliability, maintainability,
supportability, usability, producibility, disposability, sustainability, affordability and others. These
parameters are required to be considered at the early stages of design if desired operational
behaviours are to be realized. A system design and development requires appropriate and
timely application of engineering and management efforts to meet the previously mentioned
parameters.
A system may serve its intended purpose most effectively when its technical and
operating characteristics are engineered into the design. Therefore, operational feasibility is a
critical aspect of systems engineering that needs to be an integral part of the early design
phases.
There are no advanced features on this website; the user just needs to select the appropriate
values, so even a non-technical person can use this model.
SYSTEM SPECIFICATION
HARDWARE REQUIREMENTS
• System : Intel Core i5
• Hard disk : 500 GB
• RAM : 4 GB
SOFTWARE REQUIREMENTS
Python
Python is a popular programming language which is used in various fields like data science,
machine learning, data analytics, artificial intelligence and backend web development.
Python is an interpreted, object-oriented, high-level programming language with dynamic
semantics. Its high-level built in data structures, combined with dynamic typing and dynamic
binding, make it very attractive for Rapid Application Development, as well as for use as a
scripting or glue language to connect existing components together. Python's simple, easy to
learn syntax emphasizes readability and therefore reduces the cost of program maintenance.
Python supports modules and packages, which encourages program modularity and code
reuse. The Python interpreter and the extensive standard library are available in source or
binary form without charge for all major platforms, and can be freely distributed.
NUMPY
It is used for numerical analysis and to create multi-dimensional arrays.
PANDAS
It is used for data manipulation, data preprocessing and data exploration.
MATPLOTLIB
It is used for plotting different types of charts.
SCIKIT-LEARN
It is used for developing various kinds of machine learning models.
TENSORFLOW
It is used for developing deep learning models.
HTML
HTML is an acronym which stands for Hyper Text Markup Language which is used for
creating web pages and web applications. Let's see what is meant by Hypertext Markup
Language, and Web page.
Hyper Text: Hypertext simply means "text within text". A text that has a link within it is
hypertext. Whenever you click on a link which brings you to a new webpage, you have
clicked on hypertext. Hypertext is a way to link two or more web pages (HTML
documents) with each other.
Web Page: A web page is a document which is commonly written in HTML and
rendered by a web browser. A web page can be identified by entering a URL, and it
can be of the static or dynamic type. With the help of HTML alone, we can create only
static web pages.
Hence, HTML is a markup language which is used for creating attractive web pages with
the help of styling, so that they appear in a nice format in a web browser. An HTML document is
made of many HTML tags and each HTML tag contains different content.
CSS
CSS stands for Cascading Style Sheets. It is a style sheet language which is used to
describe the look and formatting of a document written in markup language. It provides an
additional feature to HTML. It is generally used with HTML to change the style of web pages
and user interfaces. It can also be used with any kind of XML documents including plain XML,
SVG and XUL.
CSS is used along with HTML and JavaScript in most websites to create user interfaces
for web applications and user interfaces for many mobile applications.
The core client-side JavaScript language consists of some common programming features that
allow you to do things like:
Store useful values inside variables. For example, we might ask for a
new name to be entered and then store that name in a variable called name.
Perform operations on pieces of text (known as "strings" in programming). For example,
we might take the string "Player 1: " and join it to the name variable to create the complete
text label, e.g. "Player 1: Chris".
Run code in response to certain events occurring on a web page. For example,
a click event can be used to detect when a label is clicked and then run the
code that updates the text label.
What is even more exciting however is the functionality built on top of the client-side
JavaScript language. So called Application Programming Interfaces (APIs) provide you with
extra superpowers to use in your JavaScript code.
• PANDAS
• NUMPY
• MATPLOTLIB
• SCIKIT-LEARN
• FLASK
PANDAS
Pandas is an open-source library that is made mainly for working with relational or
labeled data both easily and intuitively. It provides various data structures and operations for
manipulating numerical data and time series. This library is built on top of the NumPy library.
Pandas is fast and it has high performance & productivity for users.
Advantages
Size mutability: columns can be inserted and deleted from DataFrame and
higher dimensional objects
Installation
pip install pandas
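As a quick illustration of the size-mutability advantage mentioned above, the following minimal sketch (with made-up values, not the project dataset) inserts and then deletes a DataFrame column:

import pandas as pd

# small illustrative DataFrame (made-up values)
df = pd.DataFrame({"total_sqft": [1056, 2600], "bath": [2, 5]})

df["sqft_per_bath"] = df["total_sqft"] / df["bath"]   # insert a new column
df = df.drop("sqft_per_bath", axis="columns")         # delete it again
print(df.shape)                                       # (2, 2)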
NUMPY
An array in NumPy is a table of elements (usually numbers), all of the same type, indexed
by a tuple of non-negative integers. In NumPy, the number of dimensions of the array is called the
rank of the array. A tuple of integers giving the size of the array along each dimension is known
as the shape of the array. The array class in NumPy is called ndarray. Elements in NumPy arrays
are accessed using square brackets, and arrays can be initialized using nested Python lists.
Installation
pip install numpy
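A minimal sketch of the rank, shape and indexing behaviour described above (the values are arbitrary):

import numpy as np

# 2-D array created from nested Python lists
arr = np.array([[1, 2, 3], [4, 5, 6]])

print(arr.ndim)    # rank (number of dimensions): 2
print(arr.shape)   # size along each dimension: (2, 3)
print(arr[1, 2])   # element access with square brackets: 6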
MATPLOTLIB
It is used for plotting different types of charts.
Installation
pip install matplotlib
SKLEARN
SCIKIT-LEARN (SKLEARN) is the most useful and robust library for machine learning in
Python. It provides a selection of efficient tools for machine learning and statistical modeling,
including classification, regression, clustering and dimensionality reduction, via a consistent
interface in Python. This library, which is largely written in Python, is built upon NUMPY, SCIPY
and MATPLOTLIB.
Installation:
pip install scikit-learn
MODELS USED
LINEAR REGRESSION MODEL
• It is mostly used for finding out the relationship between variables and for forecasting.
DECISION TREE REGRESSION MODEL
• The branches/edges of a decision tree represent the truth or falsity of a statement, and a
decision is made based on that; for example, a decision tree can evaluate the smallest of three numbers.
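As a hedged sketch (not the report's exact configuration), the two models named above can be instantiated in scikit-learn as follows:

from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor

# Linear regression: models a linear relationship between the features and the price
lin_model = LinearRegression()

# Decision tree regressor: splits the feature space with true/false conditions at each branch
tree_model = DecisionTreeRegressor(criterion="friedman_mse", splitter="best")

# Both expose the same interface: model.fit(X_train, y_train); model.predict(X_test)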
LASSO REGRESSION MODEL
Lasso Regression is another linear model derived from Linear Regression which
shares the same hypothesis function for prediction. The cost function of Linear Regression is
represented by J.
The Linear Regression model considers all the features equally relevant for prediction. When
there are many features in the dataset, some of them may not be relevant for the
predictive model; this makes the model more complex, with inaccurate predictions on the
test set (overfitting). Such a model with high variance does not generalize to new data.
So Lasso Regression comes to the rescue: it introduces an L1 penalty (equal to the
absolute value of the magnitude of the weights) into the cost function of Linear Regression. Lasso
Regression thus performs both variable selection and regularization.
Mathematical Intuition:
During gradient descent optimization, the added L1 penalty shrinks weights to zero or
close to zero. The weights which are shrunk to zero eliminate the corresponding features from the
hypothesis function; due to this, irrelevant features do not participate in the predictive model.
This penalization of weights makes the hypothesis simpler, which encourages sparsity
(a model with few parameters).
If the intercept is added, it remains unchanged. We can control the strength of regularization
through the hyperparameter lambda.
RIDGE REGRESSION MODEL
Ridge Regression is another linear model derived from Linear Regression; it adds an L2 penalty
to the cost function.
Mathematical Intuition:
During gradient descent optimization of its cost function, the added L2 penalty term reduces
the weights of the model to zero or close to zero. Due to this penalization of weights,
our hypothesis gets simpler, more generalized, and less prone to overfitting. All weights are
reduced by the same factor lambda. We can control the strength of regularization through the
hyperparameter lambda.
Different cases for tuning values of lambda.
1. If lambda is set to be 0, Ridge Regression equals Linear Regression
2. If lambda is set to be infinity, all weights are shrunk to zero.
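A minimal scikit-learn sketch of the two penalized models discussed above (alpha plays the role of lambda here; the value 1.0 is an arbitrary choice for illustration):

from sklearn.linear_model import Lasso, Ridge

# L1 penalty: can shrink some coefficients exactly to zero (implicit feature selection)
lasso = Lasso(alpha=1.0)

# L2 penalty: shrinks all coefficients towards zero but rarely to exactly zero
ridge = Ridge(alpha=1.0)

# Usage: lasso.fit(X_train, y_train); ridge.fit(X_train, y_train)
# Coefficients that Lasso drives to zero can be inspected via lasso.coef_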
SVM REGRESSON
Support Vector Regression (SVR) is a type of machine learning algorithm used for
regression analysis. The goal of SVR is to find a function that approximates the relationship
between the input variables and a continuous target variable, while minimizing the prediction
error.
Unlike Support Vector Machines (SVMs) used for classification tasks, SVR seeks to find a
hyperplane that best fits the data points in a continuous space. This is achieved by mapping the
input variables to a high-dimensional feature space and finding the hyperplane that maximizes
the margin (distance) between the hyperplane and the closest data points, while also minimizing
the prediction error.
SVR can handle non-linear relationships between the input variables and the target variable by
using a kernel function to map the data to a higher-dimensional space. This makes it a powerful
tool for regression tasks where there may be complex relationships between the input variables
and the target variable.
Support Vector Regression (SVR) uses the same principle as SVM, but for regression problems.
The problem of regression is to find a function that approximates mapping from an input domain
to real numbers on the basis of a training sample. So let's now dive deeper and understand how
SVR actually works.
The first thing to understand is the decision boundary. Consider two lines drawn at a distance,
say 'a', on either side of the hyperplane, i.e. at '+a' and '-a' from it. This 'a' is the margin of
tolerance around the hyperplane, and the two boundary lines are:
wx + b = +a
wx + b = -a
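An illustrative scikit-learn sketch of SVR (the kernels and parameter values are assumptions, not settings from this report); epsilon plays the role of the margin 'a' described above:

from sklearn.svm import SVR

# epsilon defines the tube |y - (wx + b)| <= epsilon inside which errors are ignored
svr_linear = SVR(kernel="linear", epsilon=0.1)

# a non-linear kernel implicitly maps the inputs to a higher-dimensional feature space
svr_rbf = SVR(kernel="rbf", C=1.0, epsilon=0.1)

# Usage: svr_rbf.fit(X_train, y_train); predictions = svr_rbf.predict(X_test)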
GRID SEARCH CV
GridSearchCV performs an exhaustive search over a specified grid of hyperparameter values,
evaluating each combination with cross validation and returning the best-scoring configuration; we
use it to compare the candidate models and their settings.
Flask code
• The artifacts folder contains the JSON file with the column names and the pickle file of our
model.
• The client folder contains the HTML, CSS and JavaScript code.
• Requirements is a text file which lists all the important packages required to run
the model.
4 DATASET
We took the data from the Kaggle website, which hosts several datasets for machine
learning.
This dataset contains 9 columns, of which bath, balcony and price are numerical columns
and the remaining columns are categorical features; it contains 13247 rows. The size column is also
numerical in nature, but its values end with "BHK", so in the preprocessing step we remove that text
to make it a numerical feature.
5 EXPLORATORY DATA ANALYSIS
Exploratory data analysis is also called EDA. It was promoted by John Tukey to encourage
statisticians to explore data and possibly formulate hypotheses that might lead to new data
collection and experiments. EDA focuses on checking the assumptions required for model fitting
and hypothesis testing, and it also covers handling missing values and making transformations of
variables as needed. EDA builds a robust understanding of the data and of the issues associated
with either the data or the process. It is a scientific approach to getting the story of the data.
Data exploration is the first step in data analysis and typically involves summarizing the main
characteristics of a data set, including its size, accuracy, initial patterns in the data and other
attributes. It is commonly conducted by data analysts using visual analytics tools, but it can also
be done in more advanced statistical software such as Python. Before it can conduct analysis on data
collected by multiple data sources and stored in data warehouses, an organization must know
how many cases are in a data set, what variables are included, how many missing values there
are and what general hypotheses the data is likely to support. An initial exploration of the data
set can help answer these questions by familiarizing analysts with the data with which they are
working.
1. From the correlation plot shown above, price_per_sqft is not strongly correlated with the price
feature, so it does not play an important role in predicting the price of a house.
2. bhk and bath are highly correlated.
3. We can create a new feature using the bhk and bath features.
We also check the histogram of the price per sqft feature, as sketched below.
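A hedged sketch of how such checks can be produced with pandas and Matplotlib (the tiny DataFrame below is a made-up stand-in; the real analysis uses the cleaned dataset built in Section 10):

import pandas as pd
import matplotlib.pyplot as plt

# tiny synthetic stand-in for the cleaned dataset
df = pd.DataFrame({
    "price": [80, 120, 95, 150],               # lakhs
    "bath": [2, 3, 2, 4],
    "bhk": [2, 3, 2, 4],
    "price_per_sqft": [7500, 8200, 6900, 9100],
})

# correlation of each numeric column with price
print(df.corr()["price"])

# histogram of price per square foot
plt.hist(df["price_per_sqft"], rwidth=0.8, bins=10)
plt.xlabel("Price per square feet")
plt.ylabel("Count")
plt.show()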
Data preprocessing is an important step in the data mining process. It refers to the cleaning,
transforming, and integrating of data in order to make it ready for analysis. The goal of data
preprocessing is to improve the quality of the data and to make it more suitable for the specific
data mining task.
Some common steps in data preprocessing include:
Data cleaning: this step involves identifying and removing missing, inconsistent, or irrelevant
data. This can include removing duplicate records, filling in missing values, and handling
outliers.
Data integration: this step involves combining data from multiple sources, such as databases,
spreadsheets, and text files. The goal of integration is to create a single, consistent view of
the data.
Data transformation: this step involves converting the data into a format that is more suitable
for the data mining task. This can include normalizing numerical data, creating dummy
variables, and encoding categorical data.
Data reduction: this step is used to select a subset of the data that is relevant to the data
mining task. This can include feature selection (selecting a subset of the variables) or
feature extraction (extracting new variables from the data).
Data discretization: this step is used to convert continuous numerical data into categorical
data, which can be used for decision tree and other categorical data mining techniques.
By performing these steps, the data mining process becomes more efficient and the
results become more accurate.
Data preprocessing is a data mining technique which is used to transform the raw data in a
useful and efficient format.
Steps Involved in Data Preprocessing:
1. Data Cleaning:
The data can have many irrelevant and missing parts. To handle this part, data
cleaning is done. It involves handling of missing data, noisy data etc.
Missing Data:
This situation arises when some values are missing in the data. It can be handled in
various ways.
Some of them are:
Ignore the tuples: This approach is suitable only when the dataset we have is quite large and
multiple values are missing within a tuple.
Fill the missing values: There are various ways to do this task. You can choose to fill the missing
values manually, by the attribute mean or by the most probable value.
Noisy Data:
Binning Method: This method works on sorted data in order to smooth it. The whole data is
divided into segments of equal size and then various methods are performed to
complete the task. Each segment is handled separately. One can replace all
data in a segment by its mean, or boundary values can be used to complete the
task.
Regression: Data can be smoothed by fitting the data points to a regression function.
Clustering: This approach groups similar data into clusters. Outliers may go undetected,
or they will fall outside the clusters.
2. Data Transformation:
This step is taken in order to transform the data into forms appropriate for the mining
process. It involves the following ways:
Normalization:
It is done in order to scale the data values in a specified range (-1.0 to 1.0 or 0.0
to 1.0)
Attribute Selection:
In this strategy, new attributes are constructed from the given set of attributes to
help the mining process.
Discretization:
This is done to replace the raw values of numeric attribute by interval levels or
conceptual levels.
3. Data Reduction:
Data mining is a technique used to handle huge amounts of data, and analysis becomes
harder when working with such volumes. In order to get rid of this problem, we use data
reduction techniques, which aim to increase storage efficiency and reduce data storage and
analysis costs.
The various steps to data reduction are:
Data Cube Aggregation:
An aggregation operation is applied to the data for the construction of the data cube.
Attribute Subset Selection:
The highly relevant attributes should be used; the rest can be discarded. For performing
attribute selection, one can use the level of significance and the p-value of the attribute:
an attribute having a p-value greater than the significance level can be discarded.
Numerosity Reduction:
This enables storing a model of the data instead of the whole data, for example
regression models.
Dimensionality Reduction:
This reduces the size of the data through encoding mechanisms, which can be lossy or lossless.
If the original data can be retrieved after reconstruction from the compressed data, the
reduction is called lossless; otherwise it is called lossy. Two effective methods of
dimensionality reduction are wavelet transforms and PCA (Principal Component Analysis).
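For illustration only (these generic steps are not all applied in this project's pipeline), normalization, discretization and PCA-based dimensionality reduction can be performed with scikit-learn as follows:

import numpy as np
from sklearn.preprocessing import MinMaxScaler, KBinsDiscretizer
from sklearn.decomposition import PCA

X = np.array([[1200.0, 2], [2600.0, 5], [1800.0, 3], [1000.0, 2]])  # toy data

# Normalization: scale each column into the 0.0-1.0 range
X_scaled = MinMaxScaler(feature_range=(0, 1)).fit_transform(X)

# Discretization: replace raw numeric values with interval levels (bins)
X_binned = KBinsDiscretizer(n_bins=3, encode="ordinal", strategy="uniform").fit_transform(X)

# Dimensionality reduction: project onto one principal component
X_reduced = PCA(n_components=1).fit_transform(X_scaled)
print(X_scaled.shape, X_binned.shape, X_reduced.shape)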
DATA CLEANING
Data cleaning is the most important step because, in the real world, the data may not be good
enough to use a machine learning algorithm on it directly. Data scientists spend roughly 80% of
their time on data cleaning and data engineering.
In our dataset we have a feature called size whose values look like "2 BHK" or "3 BHK". An ML
algorithm cannot understand alphabetic text, so we need to give it only numbers; we need to turn
"2 BHK" into 2. We therefore created a new column called bhk.
Our dataset does not have all the values; some values are missing, so we need to
handle them by either filling them in or removing them. Most of the time we remove them to save
time, but information is lost when we do so. Alternatively, we can use statistical measures like the
mean, median and mode to fill them, as sketched after the list below.
• Mean is used when we do not have too many outliers and the column is numerical.
• Median is used when we have outliers in the data.
• Mode is used when we have a categorical column.
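A minimal pandas sketch of these three imputation options (the column names are illustrative, not the project's exact code, which simply drops missing rows):

import pandas as pd
import numpy as np

df = pd.DataFrame({
    "total_sqft": [1056, np.nan, 1440],
    "price": [39.1, 62.0, np.nan],
    "area_type": ["Super built-up", None, "Plot"],
})

df["total_sqft"] = df["total_sqft"].fillna(df["total_sqft"].mean())   # mean: numeric, few outliers
df["price"] = df["price"].fillna(df["price"].median())                # median: robust to outliers
df["area_type"] = df["area_type"].fillna(df["area_type"].mode()[0])   # mode: categorical column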
FEATURE ENGINEERING
Feature engineering is a crucial step where we create new features from the existing
features, making the data easier for an ML algorithm to use well.
We created a new feature, price per square feet, which is more correlated with the price
feature:
df5['price_per_sqft'] = df5['price']*100000/df5['total_sqft']
Dimensionality Reduction
Dimensionality reduction is the process of reducing the number of features (or dimensions) in
a dataset while retaining as much information as possible. This can be done for a variety of
reasons, such as to reduce the complexity of a model, to improve the performance of a
learning algorithm, or to make it easier to visualize the data. There are several techniques for
dimensionality reduction, including principal component analysis (PCA), singular value
decomposition (SVD), and linear discriminant analysis (LDA). Each technique uses a different
method to project the data onto a lower-dimensional space while preserving important
information.
In machine learning classification problems, there are often too many factors on the basis of
which the final classification is done. These factors are basically variables called features. The
higher the number of features, the harder it gets to visualize the training set and then work on
it. Sometimes, most of these features are correlated, and hence redundant. This is where
dimensionality reduction algorithms come into play. Dimensionality reduction is the process of
reducing the number of random variables under consideration, by obtaining a set of principal
variables. It can be divided into feature selection and feature extraction.
Feature selection: In this, we try to find a subset of the original set of variables, or features,
to get a smaller subset which can be used to model the problem. It usually involves
three ways:
1. Filter
2. Wrapper
3. Embedded
Feature extraction: This reduces the data in a high-dimensional space to a lower-dimensional
space, i.e. a space with fewer dimensions.
In our data, the location feature has too many distinct values; we cannot use them directly, as that
may reduce the performance of the machine learning algorithm. We apply a filter to the feature to
reduce the number of location values. We have about 13200 location entries, so we keep the
locations which appear more than 10 times in the dataset; this reduces them to 240 unique
locations, and the locations which appear 10 times or fewer are labelled as "other". In total we
have 241 unique location values.
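A sketch of this filtering, assuming df5 is the preprocessed DataFrame built in Section 10 (the variable names mirror that listing):

# count how many rows each location has
location_stats = df5.groupby('location')['location'].agg('count').sort_values(ascending=False)

# locations appearing 10 times or fewer are grouped into a single 'other' bucket
location_stats_less_than_10 = location_stats[location_stats <= 10]
df5.location = df5.location.apply(lambda x: 'other' if x in location_stats_less_than_10 else x)

print(len(df5.location.unique()))   # roughly 241 unique values after grouping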
REMOVING OUTLIERS
An outlier is an object that deviates significantly from the rest of the objects. They can be
caused by measurement or execution error. The analysis of outlier data is referred to as outlier
analysis or outlier mining.
Most data mining methods discard outliers as noise or exceptions; however, in some applications
such as fraud detection, the rare events can be more interesting than the regularly occurring
ones, and hence outlier analysis becomes important in such cases.
Detecting Outlier: Clustering based outlier detection using distance to the closest cluster. In
the K-Means clustering technique, each cluster has a mean value. Objects belong to the
cluster whose mean value is closest to it. In order to identify the Outlier, firstly we need to
initialize the threshold value such that any distance of any data point greater than it from its
nearest cluster identifies it as an outlier for our purpose. Then we need to find the distance of
the test data to each cluster mean. Now, if the
distance between the test data and the closest cluster to it is greater than the threshold value
then we will classify the test data as an outlier.
Outliers are suspicious values; such data points are errors in the data. We need to remove them,
otherwise those errors are fed to the algorithm and the algorithm learns them as well. There is a
saying, "garbage in, garbage out": we need to give good data to the algorithm so that it performs
better.
In our data we have some outliers. Normally the area per bedroom is at least 300 sqft (i.e. a 2 BHK
apartment is at minimum 600 sqft). If, for example, a 400 sqft apartment is listed as 2 BHK, that seems
suspicious and can be removed as an outlier. We remove such outliers by keeping our minimum
threshold per BHK at 300 sqft.
The data also contains outliers with respect to prices. If we look at the price per sqft for 2 BHK and
3 BHK apartments in locations such as Rajajinagar and Hebbal, we find cases where 2 BHK and
3 BHK apartments have similar prices per sqft; such cases are treated as outliers and removed per location.
It also contains outliers with respect to bathrooms: some data points have 2 or more bathrooms
than the number of bedrooms, which seems unusual, so they are also removed. At the end we have
7239 rows and 5 columns [location, total_sqft, bath, bhk, price].
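The three outlier rules described above boil down to simple pandas filters; a condensed sketch follows (the full versions, including the per-location price-per-sqft filter, appear in Section 10):

# Rule 1: drop homes with less than 300 sqft per bedroom
df6 = df5[~(df5.total_sqft / df5.bhk < 300)]

# Rule 2: per location, drop rows whose price_per_sqft is more than one standard
# deviation away from that location's mean (see remove_pps_outliers in Section 10)

# Rule 3: drop homes having 2 or more bathrooms than bedrooms
df9 = df6[df6.bath < df6.bhk + 2]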
ONE HOT ENCODING
Most real-life datasets we encounter during our data science project development have
columns of mixed data type. These datasets consist of both categorical as well as numerical
columns. However, various Machine Learning models do not work with categorical data and to
fit this data into the machine learning model it needs to be converted into numerical data. For
example, suppose a dataset has a Gender column with categorical elements like Male
and Female. These labels have no specific order of preference, and since the data are
string labels, machine learning models may misinterpret them as having some sort of
hierarchy.
One approach to solve this problem can be label encoding where we will assign a numerical
value to these labels for example Male and Female mapped to 0 and 1. But this can add bias
in our model as it will start giving higher preference to the Female parameter as 1>0 but
ideally, both labels are equally important in the dataset. To deal with this issue we will use the
One Hot Encoding technique. One hot encoding is a technique that we use to represent categorical
variables as numerical values in a machine learning model.
• It allows the use of categorical variables in models that require numerical input.
• It can improve model performance by providing more information to the model about
the categorical variable.
• It can help to avoid the problem of ordinality, which can occur when a categorical
variable has a natural ordering (e.g. "small", "medium", "large").
• It can, however, lead to sparse data, as most observations will have a value of 0 in most of the
one-hot encoded columns.
• It can also lead to overfitting, especially if there are many categories in the variable and
the sample size is relatively small.
One-hot encoding is a powerful technique to treat categorical data, but it can lead to
increased dimensionality, sparsity and overfitting. It is important to use it cautiously
and to consider other methods such as ordinal encoding or binary encoding.
Here, location is not a numerical feature, and the ML algorithm takes only numerical features. To
convert the location feature to numerical values we use one hot encoding, which takes each row
value and converts it into a dummy feature.
Every unique location value is converted into a new feature. After creating all the dummy
features we remove one of them so that there won't be a dummy variable trap for the machine
learning algorithm.
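As a tiny aside, the Gender example from earlier in this section can be encoded the same way; drop_first removes one dummy column, which is the same trick used here to avoid the dummy variable trap:

import pandas as pd

df = pd.DataFrame({"Gender": ["Male", "Female", "Female", "Male"]})

# one hot encoding; drop_first=True keeps a single 0/1 column and avoids the dummy trap
dummies = pd.get_dummies(df["Gender"], drop_first=True)
print(dummies.head())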
Now our dataset has a shape of 7239 rows and 244 columns, i.e., features.
This is a supervised learning model, so we divide our dataset into dependent and independent
features. Price is our dependent feature, the value our model is going to predict. The
remaining 243 features are independent features.
Next we divide the data into training and testing sets using the scikit-learn train_test_split
function, with an 80:20 train-test ratio.
7 LINEAR REGRESSION
Linear regression is one of the easiest and most popular Machine Learning algorithms. It is a
statistical method that is used for predictive analysis. Linear regression makes predictions for
continuous/real or numeric variables such as sales, salary, age, product price, etc.
The linear regression algorithm shows a linear relationship between a dependent variable (y) and
one or more independent variables (x), hence the name linear regression. Since linear regression
shows a linear relationship, it finds how the value of the dependent variable changes according to
the value of the independent variable.
EQUATION: y = a0 + a1x + ε
Cost function:
o Different values for the weights or coefficients of the line (a0, a1) give different regression lines,
and the cost function is used to estimate the values of the coefficients for the best-fit line.
o The cost function optimizes the regression coefficients or weights. It measures how well a linear
regression model is performing.
o We can use the cost function to find the accuracy of the mapping function, which maps the input
variable to the output variable. This mapping function is also known as the hypothesis function.
For Linear Regression, we use the Mean Squared Error (MSE) cost function, which is the
average of the squared errors between the predicted values and the actual values. It can be
written as:
MSE = (1/N) Σi (yi − (a0 + a1xi))²
Where N is the total number of observations, yi is the actual value and (a0 + a1xi) is the
predicted value for the i-th observation.
Residuals: The distance between the actual value and the predicted value is called the residual. If
the observed points are far from the regression line, the residuals will be high and so will the cost
function. If the scatter points are close to the regression line, the residuals will be small and hence
so will the cost function.
Gradient Descent:
o Gradient descent is used to minimize the MSE by calculating the gradient of the cost
function.
o A regression model uses gradient descent to update the coefficients of the line by
reducing the cost function.
o It starts with randomly selected coefficient values and then iteratively updates them to reach
the minimum of the cost function (a minimal sketch follows).
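A minimal NumPy sketch of gradient descent for the simple model y = a0 + a1x described above, using synthetic data (the learning rate and iteration count are arbitrary choices; this is illustrative only, not the project's training code):

import numpy as np

# synthetic data roughly following y = 2 + 3x
x = np.linspace(0, 10, 50)
y = 2 + 3 * x + np.random.normal(0, 1, 50)

a0, a1 = 0.0, 0.0          # starting coefficient values
lr = 0.01                  # learning rate
for _ in range(2000):
    y_pred = a0 + a1 * x
    # gradients of the MSE cost with respect to a0 and a1
    grad_a0 = -2 * np.mean(y - y_pred)
    grad_a1 = -2 * np.mean((y - y_pred) * x)
    a0 -= lr * grad_a0
    a1 -= lr * grad_a1

print(round(a0, 2), round(a1, 2))   # should be close to 2 and 3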
In machine learning, we couldn’t fit the model on the training data and can’t say
that the model will work accurately for the real data. For this, we must assure that our model got
the correct patterns from the data, and it is not getting up too much noise. For this purpose, we
use the cross-validation technique.
Cross validation is a technique used in machine learning to evaluate the performance of a model
on unseen data. It involves dividing the available data into multiple folds or subsets, using one
of these folds as a validation set, and training the model on the remaining folds. This process is
repeated multiple times, each time using a different fold as the validation set. Finally, the results
from each validation step are averaged to produce a more robust estimate of the model’s
performance.
The main purpose of cross validation is to prevent overfitting, which occurs when a model is
trained too well on the training data and performs poorly on new, unseen data. By evaluating the
model on multiple validation sets, cross validation provides a more realistic estimate of the
model’s generalization performance, i.e., its ability to perform well on new, unseen data.
There are several types of cross validation techniques, including k-fold cross validation,
leave-one-out cross validation, and stratified cross validation. The choice of technique depends on
the size and nature of the data, as well as the specific requirements of the modeling problem.
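A short k-fold sketch with scikit-learn, assuming X and y are the feature matrix and target prepared earlier (the fold count and estimator are illustrative; the actual scores obtained in this project are shown in Section 10):

from sklearn.model_selection import KFold, cross_val_score
from sklearn.linear_model import LinearRegression

# 5-fold cross validation: each fold serves once as the validation set
kf = KFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(LinearRegression(), X, y, cv=kf)
print(scores, scores.mean())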
We also tried other algorithms, Decision Tree Regressor and Lasso Regression, in addition to
Linear Regression, comparing them with GridSearchCV. The lines below show the tail of the helper
function used for this comparison and how it is called; a fuller sketch follows.
return pd.DataFrame(scores,columns=['model','best_score','best_params'])
find_best_model_using_gridsearchcv(X,y)
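The two lines above are the tail of a helper function whose body is not reproduced here. A plausible sketch of the full function, consistent with those fragments (the exact hyperparameter grids are assumptions):

from sklearn.model_selection import GridSearchCV, ShuffleSplit
from sklearn.linear_model import LinearRegression, Lasso
from sklearn.tree import DecisionTreeRegressor
import pandas as pd

def find_best_model_using_gridsearchcv(X, y):
    # candidate models and assumed hyperparameter grids
    algos = {
        'linear_regression': {'model': LinearRegression(), 'params': {}},
        'lasso': {'model': Lasso(), 'params': {'alpha': [1, 2]}},
        'decision_tree': {'model': DecisionTreeRegressor(),
                          'params': {'criterion': ['squared_error', 'friedman_mse']}},
    }
    scores = []
    cv = ShuffleSplit(n_splits=5, test_size=0.2, random_state=0)
    for algo_name, config in algos.items():
        gs = GridSearchCV(config['model'], config['params'], cv=cv, return_train_score=False)
        gs.fit(X, y)
        scores.append({'model': algo_name,
                       'best_score': gs.best_score_,
                       'best_params': gs.best_params_})
    return pd.DataFrame(scores, columns=['model', 'best_score', 'best_params'])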
We then create a function that converts the given location into its 0/1 dummy encoding and
predicts the price. Based on the results above, linear regression achieved the highest accuracy.
Finally, we save the model as a pickle file, which can be used to predict directly without rerunning
the whole training notebook.
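A sketch of this saving step, assuming the trained linear regression object is called lr_clf and that the artifact names match those loaded in util.py below:

import pickle
import json

# save the trained model
with open('banglore_home_prices_model.pickle', 'wb') as f:
    pickle.dump(lr_clf, f)

# save the column names so the server can rebuild the input vector in the same order
columns = {'data_columns': [col.lower() for col in X.columns]}
with open('columns.json', 'w') as f:
    f.write(json.dumps(columns))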
9 FLASK
Flask is a lightweight Python web framework. A Flask application is started by calling the run()
method of the application object:
app.run()
Passing debug = True enables the debugger and automatic reloading during development:
app.run(debug = True)
Routing:
Nowadays, web frameworks provide a routing technique so that users can remember
URLs. It is useful to access a web page directly without navigating from the home page. This is
done through the route() decorator, which binds a URL to a function.
# decorator to route URL
@app.route('/hello')
# binding to the function of route
def hello_world():
    return 'hello world'
If a user visits http://localhost:5000/hello URL, the output of the hello_world() function will be
rendered in the browser. The add_url_rule() function of the application object can also be used
to bind a URL to a function, as in the example above.
Flask supports various HTTP methods for data retrieval from the specified URL. These can be
defined as:
Method – Description
GET – This is used to send the data to the server in unencrypted form.
POST – Sends the form data to the server. Data received by the POST method is not cached by
the server.
A web application often requires static files such as a JavaScript or a CSS file to render the
display of the web page in the browser. Usually, the web server is configured to serve them, but
during development these files are served from the static folder in your package or next to the
module.
The code parameter can take the following values to handle the corresponding errors:
400 – For Bad Request
401 – For Unauthorized
403 – For Forbidden request
404 – For Not Found
406 – For Not Acceptable
415 – For Unsupported Media Type
429 – For Too Many Requests
File-Uploading in Flask:
File uploading in Flask is very easy. It needs an HTML form with the enctype attribute set to
multipart/form-data and a URL handler that fetches the file and saves it to the desired location.
Files are temporarily stored on the server and then saved to the desired location.
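A minimal upload handler sketch (the route name and save folder are assumptions, not part of this project):

from flask import Flask, request
import os

app = Flask(__name__)

@app.route('/upload', methods=['POST'])
def upload_file():
    f = request.files['file']                      # file field from the multipart form
    save_path = os.path.join('uploads', f.filename)  # assumes an 'uploads' directory exists
    f.save(save_path)                              # write it to the desired location
    return 'file uploaded successfully'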
10 SOURCE CODE
10.1 Model building
Importing libraries
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
%matplotlib inline
import matplotlib
matplotlib.rcParams["figure.figsize"] = (20,10)
loading data
df1 = pd.read_csv("/content/drive/MyDrive/bhp project/Bengaluru_House_Data.csv")
df1.head()
data preprocessing
df2 = df1.drop(['area_type','society','balcony','availability'], axis='columns')
df3 = df2.dropna()
df3['bhk'] = df3['size'].apply(lambda x: int(x.split(' ')[0]))
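# Note: an intermediate cleaning step (not shown in this listing) is assumed here to convert
# total_sqft values such as ranges into single numbers, producing df5 from df3.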
df5['price_per_sqft'] = df5['price']*100000/df5['total_sqft']
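# location_stats_less_than_10 (used below) is assumed to hold the location names that appear
# 10 times or fewer in the dataset, e.g. computed from df5.groupby('location')['location'].agg('count').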
df5.location = df5.location.apply(lambda x: 'other' if x in location_stats_less_than_10 else x)
df6 = df5[~(df5.total_sqft/df5.bhk<300)]
def remove_pps_outliers(df):
    df_out = pd.DataFrame()
    for key, subdf in df.groupby('location'):
        m = np.mean(subdf.price_per_sqft)
        st = np.std(subdf.price_per_sqft)
        reduced_df = subdf[(subdf.price_per_sqft > (m - st)) & (subdf.price_per_sqft <= (m + st))]
        df_out = pd.concat([df_out, reduced_df], ignore_index=True)
    return df_out
df7 = remove_pps_outliers(df6)
def remove_bhk_outliers(df):
    exclude_indices = np.array([])
    for location, location_df in df.groupby('location'):
        bhk_stats = {}
        for bhk, bhk_df in location_df.groupby('bhk'):
            bhk_stats[bhk] = {
                'mean': np.mean(bhk_df.price_per_sqft),
                'std': np.std(bhk_df.price_per_sqft),
                'count': bhk_df.shape[0]
            }
        for bhk, bhk_df in location_df.groupby('bhk'):
            stats = bhk_stats.get(bhk - 1)
            if stats and stats['count'] > 5:
                exclude_indices = np.append(exclude_indices,
                    bhk_df[bhk_df.price_per_sqft < (stats['mean'])].index.values)
    return df.drop(exclude_indices, axis='index')
df8 = remove_bhk_outliers(df7)
df9 = df8[df8.bath<df8.bhk+2]
df10 = df9.drop(['size','price_per_sqft'],axis='columns')
dummies = pd.get_dummies(df10.location)
df11 = pd.concat([df10,dummies.drop('other',axis='columns')],axis='columns')
df12 = df11.drop('location',axis='columns')
model building
X = df12.drop(['price'],axis='columns')
y = df12.price
from sklearn.model_selection import train_test_split, cross_val_score, ShuffleSplit
from sklearn.linear_model import LinearRegression
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=10)
# cross-validation splitter (assumed configuration; 5 splits to match the 5 scores below)
cv = ShuffleSplit(n_splits=5, test_size=0.2, random_state=0)
cross_val_score(LinearRegression(), X, y, cv=cv)
output:array([0.82702546, 0.86027005, 0.85322178, 0.8436466 , 0.85481502])
10.2 Util.py
import pickle
import json
import numpy as np
__locations = None
__data_columns = None
__model = None
def get_estimated_price(location, sqft, bhk, bath):
    try:
        loc_index = __data_columns.index(location.lower())
    except:
        loc_index = -1

    x = np.zeros(len(__data_columns))
    x[0] = sqft
    x[1] = bath
    x[2] = bhk
    if loc_index >= 0:
        x[loc_index] = 1

    return round(__model.predict([x])[0], 2)
def load_saved_artifacts():
    print("loading saved artifacts...start")
    global __data_columns
    global __locations
    global __model

    # load the column names saved in the artifacts folder (file name and JSON key assumed)
    with open('./artifacts/columns.json', 'r') as f:
        __data_columns = json.load(f)['data_columns']
        __locations = __data_columns[3:]  # first three columns are sqft, bath and bhk

    if __model is None:
        with open('./artifacts/banglore_home_prices_model.pickle', 'rb') as f:
            __model = pickle.load(f)
    print("loading saved artifacts...done")
def get_location_names():
    return __locations

def get_data_columns():
    return __data_columns
if __name__ == '__main__':
    load_saved_artifacts()
    print(get_location_names())
    print(get_estimated_price('1st Phase JP Nagar', 1000, 3, 3))
10.3 server.py
from flask import Flask, jsonify
import util

app = Flask(__name__)

@app.route('/get_location_names', methods=['GET'])
def get_location_names():
    response = jsonify({
        'locations': util.get_location_names()
    })
    response.headers.add('Access-Control-Allow-Origin', '*')
    return response

if __name__ == "__main__":
    print("Starting Python Flask Server For Home Price Prediction...")
    util.load_saved_artifacts()
    app.run()
10.4 HTML
<!DOCTYPE html>
<html>
<head>
<title>Banglore Home Price Prediction</title>
<script src="https://ajax.googleapis.com/ajax/libs/jquery/3.4.1/jquery.min.js"></script>
<script src="app.js"></script>
<link rel="stylesheet" href="app.css">
</head>
<body>
<div class="img"></div>
<form class="form">
<h2>Area (Square Feet)</h2>
<input class="area floatLabel" type="text" id="uiSqft" name="Squareft" value="1000">
<h2>BHK</h2>
<div class="switch-field">
<input type="radio" id="radio-bhk-1" name="uiBHK" value="1"/>
<label for="radio-bhk-1">1</label>
<input type="radio" id="radio-bhk-2" name="uiBHK" value="2"
checked/>
<label for="radio-bhk-2">2</label>
<input type="radio" id="radio-bhk-3" name="uiBHK" value="3"/>
<label for="radio-bhk-3">3</label>
<input type="radio" id="radio-bhk-4" name="uiBHK" value="4"/>
<label for="radio-bhk-4">4</label>
<input type="radio" id="radio-bhk-5" name="uiBHK" value="5"/>
<label for="radio-bhk-5">5</label>
</div>
</form>
<form class="form">
<h2>Bath</h2>
<div class="switch-field">
<input type="radio" id="radio-bath-1" name="uiBathrooms"
value="1"/>
<label for="radio-bath-1">1</label>
<input type="radio" id="radio-bath-2" name="uiBathrooms"
value="2" checked/>
<label for="radio-bath-2">2</label>
<input type="radio" id="radio-bath-3" name="uiBathrooms"
value="3"/> <label for="radio-bath-3">3</label>
<input type="radio" id="radio-bath-4" name="uiBathrooms"
value="4"/> <label for="radio-bath-4">4</label>
<input type="radio" id="radio-bath-5" name="uiBathrooms"
value="5"/>
<label for="radio-bath-5">5</label>
</div>
<h2>Location</h2>
<div>
<select class="location" name="" id="uiLocations">
<option value="" disabled="disabled" selected="selected">Choose a
Location</option>
<option>Electronic City</option>
<option>Rajaji Nagar</option>
</select>
</div>
<button class="submit" onclick="onClickedEstimatePrice()" type="button">Estimate Price</button>
<div id="uiEstimatedPrice" class="result"> <h2></h2> </div>
</form>
</body>
</html>
10.5 CSS
@import url(https://fonts.googleapis.com/css?family=Roboto:300);
.switch-field {
display: flex;
margin-bottom: 36px;
overflow: hidden;
}
.switch-field input {
position: absolute !important;
clip: rect(0, 0, 0, 0);
height: 1px;
width: 1px;
border: 0;
overflow: hidden;
}
.switch-field label {
background-color: #e4e4e4;
color: rgba(0, 0, 0, 0.6);
font-size: 14px;
line-height: 1;
text-align: center;
padding: 8px 16px;
border: 1px solid
rgba(0, 0, 0, 0.2);
box-shadow: inset 0 1px 3px
rgba(0, 0, 0, 0.3), 0 1px
rgba(255, 255, 255, 0.1);
transition: all 0.1s ease-in-out;
}
.switch-field label:hover {
cursor: pointer;
}
.switch-field label:first-of-type {
border-radius: 4px 0 0 4px;
}
.switch-field label:last-of-type {
border-radius: 0 4px 4px 0;
}
.form
{
max-width: 270px;
font-family: "Lucida Grande", Tahoma, Verdana, sans-serif;
font-weight: normal;
line-height: 1.625;
margin: 8px auto;
padding-left: 16px;
z-index: 2;
}
h2 {
font-size: 18px;
margin-bottom: 8px;
}
.area{
font-family: "Roboto", sans-serif;
outline: 0;
background: #f2f2f2;
width: 76%;
border: 0;
margin: 0 0 10px;
padding: 10px;
box-sizing: border-box;
font-size: 15px;
height: 35px;
border-radius: 5px;
}
.location{
font-family: "Roboto", sans-serif;
outline: 0;
background: #f2f2f2;
width: 76%;
border: 0;
margin: 0 0 10px;
padding: 10px;
box-sizing: border-box;
font-size: 15px;
height: 40px;
border-radius: 5px;
}
.submit{
background: #a5dc86;
width: 76%;
border: 0;
margin: 25px 0 10px;
box-sizing: border-box;
font-size: 15px;
height: 35px;
text-align: center;
border-radius: 5px;
}
.result{
background: #dcd686;
width: 76%;
border: 0;
margin: 25px 0 10px;
box-sizing: border-box;
font-size: 15px;
height: 35px;
text-align: center;
}
.img
{
background: url('https://images.unsplash.com/photo-1564013799919ab600027ffc6?ixlib=rb-1.2.1&auto=format&fit=crop&w=1350&q=80');
background-repeat: no-repeat;
background-size: auto;
background-size:100% 100%;
-webkit-filter: blur(5px);
-moz-filter: blur(5px);
-o-filter: blur(5px);
-ms-filter: blur(5px);
filter: blur(15px);
position: fixed;
width: 100%;
height: 100%;
top: 0;
left: 0;
z-index: -1;
}
body, html {
height: 100%;
}
10.6 JavaScript
function getBathValue() {
var uiBathrooms = document.getElementsByName("uiBathrooms");
for(var i in uiBathrooms) {
if(uiBathrooms[i].checked) {
return parseInt(i)+1;
}
}
return -1; // Invalid Value
}
function getBHKValue() {
var uiBHK = document.getElementsByName("uiBHK");
for(var i in uiBHK) {
if(uiBHK[i].checked) {
return parseInt(i)+1;
}
}
return -1; // Invalid Value
}
function onClickedEstimatePrice() {
  console.log("Estimate price button clicked");
  var sqft = document.getElementById("uiSqft");
  var bhk = getBHKValue();
  var bathrooms = getBathValue();
  var location = document.getElementById("uiLocations");
  var estPrice = document.getElementById("uiEstimatedPrice");
  // The POST request that sends these values to the Flask prediction endpoint
  // and writes the returned estimate into estPrice is omitted from this listing.
}
function onPageLoad() {
console.log( "document loaded" );
var url = "http://127.0.0.1:5000/get_location_names";
//var url = "/api/get_location_names";
$.get(url,function(data, status) {
console.log("got response for get_location_names request");
if(data) {
var locations = data.locations;
var uiLocations = document.getElementById("uiLocations");
$('#uiLocations').empty();
for(var i in locations) {
var opt = new Option(locations[i]);
$('#uiLocations').append(opt);
}
}
});
}
window.onload = onPageLoad;
11 CONCLUSION
Linear Regression displayed the best performance for this dataset and can be used for
deployment. Decision Tree Regressor and Lasso Regression are far behind, so they can't be
preferred for deployment.
12 REFERENCES
[1] Aurelien Geron, "Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow: Concepts, Tools, and Techniques to Build Intelligent Systems", O'Reilly Media.