
MBA

Semester – IV

Research Project – Interim Report

Name: Surya Lakshmi VS

Project: House Price Prediction

Group: 25

Date of Submission: 16/06/2024

A study on “House Price Prediction”

Research Project submitted to Jain Online (Deemed-to-be University)


In partial fulfillment of the requirements for the award of:
Master of Business Administration

Submitted by:

Surya Lakshmi VS

USN:
221VMBR05878

Under the guidance of:

Dr. C S Jyothirmayee

(Faculty-JAIN Online)

Jain Online (Deemed-to-be University)


Bangalore
2022-23

DECLARATION

I, Surya Lakshmi VS, hereby declare that the Research Project Report titled “House
Price Prediction” has been prepared by me under the guidance of Dr. C S
Jyothirmayee. I declare that this project work is towards the partial fulfillment of the
University Regulations for the award of the degree of Master of Business
Administration by Jain University, Bengaluru. I have undergone a project for a
period of eight weeks. I further declare that this project is based on original study
undertaken by me and has not been submitted for the award of any degree/diploma
from any other University/Institution.

Place: Bangalore ______________________


Date: 16-06-2024 Surya Lakshmi VS
USN:221VMBR05878

EXECUTIVE SUMMARY

Exploratory data analysis (EDA) is an important step in any data analysis or data science
project. EDA involves generating summary statistics for the numerical data in a dataset and
creating graphical representations to understand the data better. The goal of EDA is to identify
patterns, anomalies, and relationships in the data that can inform subsequent steps in the data
science process, such as building models or identifying insights. EDA helps us look at the data
before making any assumptions: it can surface obvious errors, clarify patterns within the data,
detect outliers or anomalous events, and reveal interesting relations among the variables. It
also helps answer questions about standard deviations, categorical variables, and confidence
intervals. Finally, once EDA is complete and insights are drawn, its findings can be used for
more sophisticated data analysis or modelling, including machine learning.

Data scientists can use exploratory analysis to ensure the results they produce are valid and
applicable to the desired business outcomes and goals. EDA also helps stakeholders by
confirming that they are asking the right questions.

In this report, we illustrate EDA with an example dataset, using the Python language. We use
the Pandas, NumPy, Matplotlib, Seaborn, and opendatasets libraries. We load the dataset into
a data frame with pandas; view the columns and rows of the data; compute descriptive statistics
to better understand the features in the dataset; record observations; and find missing values
and duplicate rows. We discover anomalies in the given data and remove them. We then
produce univariate visualizations of each field in the raw dataset with summary statistics,
followed by bivariate visualizations and summary statistics that assess the relationship between
each variable in the dataset and the target variable of interest. Predictive models, such as
linear regression, then use statistics and data to predict outcomes.

After plotting graphs of different attributes and analyzing the dataset, we apply several
regression algorithms to determine which fits the house price data best, using model metrics:
Mean Squared Error, Mean Absolute Error, Root Mean Squared Error, and R-squared. These
metrics are tabulated for all algorithms to identify the best fit.

Some of the most common data science tools used for EDA include Python and Jupyter
notebooks. The common packages used are pandas, NumPy, Matplotlib, Seaborn, etc.

One important benefit of conducting exploratory data analysis is that it can help you organize a
dataset before you model it. This can help you start to make assumptions and predictions about
your dataset. Another benefit of EDA is that it can help you understand the variables in your
dataset. This can help you organize your dataset and begin to pinpoint relationships between
variables, which is an integral part of data analysis.

Conducting EDA can also help you identify the relationships between the variables in your
dataset. Identifying the relationships between variables is a critical part of drawing conclusions
from a dataset.

Another important benefit of EDA is that it helps you choose the right model for your dataset.
You can use all of the information gained from conducting an EDA to select a data model.
Choosing the right data model is important because it makes it easier for everyone in your
organization to understand your data.

You can also use EDA to help you find patterns in a dataset. Finding patterns in a dataset is
important because it can help you make predictions and estimations. This can help your
organization plan for the future and anticipate problems and solutions.

TABLE OF CONTENTS

Executive Summary
List of Tables
List of Graphs
Chapter 1: Introduction and Background
Chapter 2: Research Methodology
Chapter 3: Data Analysis and Interpretation
Annexures

List of Tables

Table No. | Table Title
1 | Model Evaluation Comparison between all models

List of Graphs

Graph No. | Graph Title
2.2.4 | Bar graph for Univariate
2.2.4 | Scatter plot for Bivariate
2.2.4 | Heat map for Multi-variate
2.2.4.1 | Histogram plot
2.2.4.2 | Box plot
3.1 | Scatter plot for Linear regression model
3.1 | Distplot for Linear regression model
3.2 | Scatter plot for Ridge regression model
3.3 | Scatter plot for Lasso regression
3.4 | Scatter plot for Support Vector Regression
3.4 | Distplot for Support Vector Regression
3.5 | Scatter plot for Random forest regressor
3.5 | Distplot for Random forest regressor

CHAPTER 1

INTRODUCTION AND BACKGROUND

1.1 Executive Summary

The executive summary of this study is presented in full at the beginning of this report (see the
Executive Summary section above).

1.2 Introduction and Background

Ask any home buyer to describe their dream house, and there is a high chance their description
will not begin with the height of the basement ceiling or the proximity to a commercial building.
Yet thousands of people place their homes on the market hoping to arrive at a reasonable price.
Generally, assessors apply their experience and common knowledge to gauge a home based on
characteristics such as its location, amenities, and dimensions. Regression analysis, however,
offers another approach that produces more reliable home price predictions. Better still, assessor
experience can help guide the modeling process to fine-tune the final predictive model, so the
model helps both home buyers and home sellers. The required dataset is gathered from an
ongoing competition hosted by Kaggle.com [1]. The competition dataset furnishes a good amount
of information that is useful in price negotiations beyond the basic features of a home, and it also
supports advanced machine learning techniques such as random forests and gradient boosting.

The real estate sector is an important industry with many stakeholders, ranging from regulatory
bodies to private companies and investors. Among these stakeholders there is a high demand
for a better understanding of the industry's operating mechanisms and driving factors. Today a
large amount of data is available on relevant statistics as well as on additional contextual
factors, and it is natural to try to make use of these data in order to improve our understanding
of the industry.

Suppose we want to build a data science project for house price prediction. Before we model
the data, we have to analyze all of the information present in the dataset, such as the price of
each house, the area of the house, and the living measures. All of these steps of analyzing and
preparing the data come under EDA.

Exploratory data analysis (EDA) is an approach used to analyze data and discover trends and
patterns, or to check assumptions, with the help of statistical summaries and graphical
representations.

The main goal of the project is to produce accurate price predictions for houses/properties for
the upcoming years. The step-by-step process involved is:

1. Requirement gathering – gather the information and extract the key information from it.
2. Normalizing the data.
3. Detecting outliers in the data.
4. Analysis and visualization using the data.

Types of EDA

Depending on the number of columns we are analyzing, we can divide EDA into three types.
1. Univariate Analysis – In univariate analysis, we analyze or deal with only one variable at a
time. The analysis of univariate data is thus the simplest form of analysis since the
information deals with only one quantity that changes. It does not deal with causes or
relationships and the main purpose of the analysis is to describe the data and find patterns
that exist within it.

2. Bi-Variate analysis – This type of data involves two different variables. The analysis of this
type of data deals with causes and relationships and the analysis is done to find out the
relationship between the two variables.

3. Multivariate Analysis – When the data involves three or more variables, it is categorized
under multivariate.

Depending on the type of analysis we can also subcategorize EDA into two parts.

1. Non-graphical Analysis – In non-graphical analysis, we analyze data using statistical
   measures such as the mean, median, mode, or skewness.

2. Graphical Analysis – In graphical analysis, we use visualizations and charts to reveal
   trends and patterns in the data.

Data Encoding

There are some models like Linear Regression which does not work with categorical dataset in
that case we should try to encode categorical dataset into the numerical column. We can use
different methods for encoding like Label encoding or One-hot encoding. Pandas and sklearn
provide different functions for encoding in our case we will use the Label Encoding function
from sklearn to encode.
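A minimal sketch of both options with scikit-learn and pandas; the location column and its values are illustrative:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Illustrative categorical data; in the project this comes from the dataset.
df = pd.DataFrame({"location": ["Rajajinagar", "Whitefield", "Rajajinagar"]})

# Label encoding: map each category to an integer code.
le = LabelEncoder()
df["location_label"] = le.fit_transform(df["location"])

# One-hot encoding: one dummy column per category, which avoids the
# artificial ordering that integer codes imply for linear models.
dummies = pd.get_dummies(df["location"], prefix="location")
print(df.join(dummies))
```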


1.3 Problem Statement

A house's value is determined by more than location and square footage. Like the features that
make up a person, an educated party wants to know all of the aspects that give a house its
value. For example, suppose you want to sell a house and you don't know what price to expect:
it can't be too low or too high. To find the house price, you usually try to find similar properties
in your neighborhood and, based on the gathered data, assess your own house's price.

1.4 Objective of Study

• Create an effective price prediction model

• Validate the model's prediction accuracy

• Identify the important home price attributes that feed the model's predictive power

Take advantage of all of the feature variables listed below and use them to analyse and predict
house prices.

1. area_type: type of built-up area
2. availability: date on which the house will be available
3. location: location of the house
4. size: size of the house in BHK
5. society: locality/society of the property
6. total_sqft: square footage of the home
7. bath: number of bathrooms
8. balcony: number of balconies available
9. price: total price of the property

1.5 Company and industry overview

The real estate market is one of the most competitive in terms of pricing, and prices tend to
vary significantly based on many factors. Forecasting property prices is an important module in
decision making for both buyers and investors, supporting budget allocation, property-finding
strategies, and the determination of suitable policies. It has therefore become one of the prime
fields in which to apply machine learning concepts to optimize and predict prices with high
accuracy. The industry review gives a clear picture and will serve as support for future projects.
Most authors have concluded that artificial neural networks have the most influence in
prediction, but in the real world there are other algorithms that should be taken into
consideration. Investors' decisions are based on market trends to reap maximum returns.
Developers are interested in knowing future trends for their decision making; this helps them
understand the pros and cons and plan their projects. To accurately estimate property prices
and future trends, a large amount of data that influences land prices is required for analysis,
modeling, and forecasting. The factors that affect land prices have to be studied, and their
impact on price has to be modeled. It is inferred that a simple linear regression relationship for
these time-series data is not viable for prediction. Hence it becomes imperative to establish a
non-linear model that fits the data characteristics well, in order to analyze and predict future
trends. As real estate is a fast-developing sector, the analysis and prediction of land prices
using mathematical modeling and other techniques is an urgent need for decision making by
all those concerned.

1.6 Overview of Theoretical Concepts

Machine learning (ML) is a subset of artificial intelligence (AI) that focuses on developing
algorithms that enable computers to learn from and make predictions or decisions based on
data. In the context of real estate, ML can be used to analyze vast amounts of historical and
real-time data to predict house prices with a high degree of accuracy.

Key Theoretical Concepts

1. Regression Analysis
2. Decision Trees and Random Forests
3. Gradient Boosting Machines (GBMs)
4. Neural Networks
5. Feature Engineering and Selection
6. Cross-Validation and Model Evaluation
7. Handling Missing Data

Understanding these theoretical concepts is fundamental to developing accurate and robust
house price prediction models. By leveraging various machine learning algorithms and
techniques, we can analyze and interpret complex real estate data, ultimately providing
valuable insights and predictions that aid in making informed decisions in the real estate
market.

CHAPTER 2

RESEARCH METHODOLOGY

2.1 Scope of the Study


This study has been organized through theoretical research and a practical implementation of
regression algorithms. The theoretical part relies on peer-reviewed articles to answer the
research questions. The practical part is performed according to the design described and
detailed below.

2.2 Methodology
The methodology section outlines the research design, data collection methods, and analytical
techniques employed in developing the house price prediction model. The primary objective is
to create a robust model that accurately predicts house prices based on various property
features.

2.2.1 Research Design

The research adopts a quantitative approach, utilizing historical data and statistical techniques
to build predictive models. The study is exploratory, aiming to identify key factors influencing
house prices and apply machine learning algorithms to predict these prices accurately.

Research Objectives

• Objective 1: Identify and collect relevant data on residential properties.
• Objective 2: Preprocess the data to handle missing values and encode categorical variables.
• Objective 3: Develop and train machine learning models to predict house prices.
• Objective 4: Evaluate the models using appropriate metrics and validate their performance.
• Objective 5: Deploy the best-performing model for real-time price prediction.

2.2.2 Data Collection

Data Sources

Data for the study is collected from various sources, including:

• Real Estate Listings: Data from online real estate platforms providing information on
  property features and prices.
• Government Records: Publicly available data on property sales and transactions.
• Census Data: Demographic information relevant to real estate valuation.
• Geospatial Data: Location-specific data such as proximity to amenities, crime rates,
  and school quality.

Data Features

The dataset includes the following key features:

• Square Footage: Total area of the property in square feet.
• Location: Geographic location of the property.
• Number of Bedrooms: Total number of bedrooms.
• Number of Bathrooms: Total number of bathrooms.
• Age of Property: Age of the property in years.
• Other Features: Additional features such as parking spaces, garden, pool, etc.

2.2.3 Sampling Method (if applicable)


2.2.4 Data Analysis Tools

Data analysis tools are essential for processing, analyzing, and visualizing data in
the house price prediction project. These tools facilitate data cleaning, feature
engineering, model development, evaluation, and deployment. This section
outlines the key tools and libraries used in the project.

Data Collection and Preparation Tools

Python: A versatile programming language widely used for data analysis, machine learning,
and web development.

Applications: Data collection, preprocessing, model building, and deployment.

Pandas: A powerful data manipulation and analysis library for Python.

Features:
o DataFrame objects for data manipulation.
o Functions for reading and writing data in various formats (CSV, Excel, SQL, etc.).
o Data cleaning and transformation operations.

Applications: Handling missing data, merging datasets, aggregating data, and performing
exploratory data analysis (EDA).

NumPy: A fundamental package for scientific computing with Python, providing support for
large, multi-dimensional arrays and matrices.

Features:
o Mathematical functions for array operations.
o Linear algebra, Fourier transform, and random number capabilities.

Applications: Efficient numerical computations, data manipulation, and preprocessing.

Data Visualization Tools

Matplotlib: A plotting library for Python that provides an object-oriented API for embedding
plots.

Features:
o Line plots, scatter plots, bar charts, histograms, etc.
o Customizable plots with various styles and formats.

Applications: Visualizing data distributions, trends, and relationships between features.

Seaborn: A data visualization library built on top of Matplotlib that provides a high-level
interface for drawing attractive and informative statistical graphics.

Features:
o Improved aesthetics and themes.
o Functions for visualizing categorical data, distributions, and matrix plots.

Applications: Creating informative and attractive statistical graphics to explore data patterns
and correlations.

2.3 Period of Study


The period of study refers to the specific timeframe during which data is collected, analyzed,
and utilized for developing the house price prediction model. Defining the period of study is
crucial as it impacts the relevance and accuracy of the model in reflecting current market
conditions.

Historical Data Collection

The historical data for this study spans a period of five years, from January 2018
to December 2022. This timeframe provides a comprehensive dataset that
captures various market cycles, trends, and seasonal variations in house prices.

Data Updates

To ensure the model remains relevant and accurate, data is updated quarterly. This
involves incorporating new property listings, sales transactions, and any
significant market changes. The regular updates help in refining the model and
adapting it to recent market conditions.

2.4 Utility of Research


The research on house price prediction has significant utility across various sectors and
stakeholders. Accurate and reliable predictions of house prices can inform decision-making,
optimize investments, and enhance understanding of the real estate market dynamics. This
section outlines the practical applications and benefits of the research.
In this project, I used Python's powerful libraries to make the machine learning models efficient.
Three essential libraries, NumPy, Pandas, and scikit-learn, were used across all of the machine
learning models. NumPy is a powerful library for scientific computing with Python; its most
important object is the homogeneous multidimensional array [16]. NumPy saves us from writing
inefficient and tiresome calculations and provides a far more elegant solution for mathematical
computation in Python. It offers an alternative to regular Python lists: a NumPy array is similar
to a regular Python list, with the additional feature that calculations can be performed over
entire arrays easily and very quickly. Pandas is a flexible, open-source Python library with
high-performance, expressive data structures. Pandas works best with relational and labeled
data. Though Python is great for data mining and preparation, it lags in practical, real-world
data analysis and modeling [17]; Pandas helps greatly in filling these gaps and is often called
the most powerful tool for data analysis and data manipulation. Scikit-learn is a great
open-source package providing a good range of supervised and unsupervised algorithms [18].
Scikit-learn is built on scientific Python (SciPy) and is primarily focused on modeling data. A few
popular facilities of scikit-learn are clustering, cross-validation, ensemble methods, feature
extraction, and feature selection [18].

Getting the dataset:

In this section I will discuss how to load a dataset. In this project, the pandas library was used
to load all the dataset files. Pandas is powerful and very efficient at analyzing data, and it
enables us to read data in different formats. I chose the CSV format because it makes it very
easy to transfer large datasets between programs. The pandas read_csv function is used to
read the data; this function assumes that the fields are comma-separated by default. When a
CSV is loaded, we get an object called a DataFrame, which is made up of rows and columns;
from it we can extract summaries such as the mean and standard deviation of the dataset, as
sketched below.
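A minimal sketch of loading and summarizing the data, assuming the competition CSV is saved locally as house_prices.csv (the file name is illustrative):

```python
import pandas as pd

# Load the dataset into a DataFrame; read_csv assumes comma-separated fields.
df = pd.read_csv("house_prices.csv")

# Inspect the rows and columns.
print(df.shape)
print(df.head())

# Descriptive statistics, including the mean and standard deviation
# of every numerical column.
print(df.describe())
```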

Handling missing data:

An important part, and problem, of data preprocessing is handling missing values in the
dataset. Data scientists must manage missing values because they can adversely affect the
behaviour of machine learning models. Data can be imputed: in such a procedure, missing
values are filled in based on the other observations.

Techniques involved in imputing unknown or missing observations include:

1. Deleting the whole rows or columns with unknown or missing observations.
2. Inferring missing values with averaging techniques such as the mean, median, or mode.
3. Imputing missing observations with the most frequent values.
4. Imputing missing observations by exploring correlations.
5. Imputing missing observations by exploring similarities between cases.

Missing values are usually represented as 'nan', 'NA', or 'null'. [Figure: variables with missing
values in the train dataset]

Data cleaning, handling NA values: a sketch follows.
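A minimal sketch with pandas, assuming the df loaded above; which columns to impute is an illustrative choice:

```python
# Count missing values and duplicate rows.
print(df.isnull().sum())
print(df.duplicated().sum())

# Impute a numerical column with its median and a count-like column
# with its most frequent value (mode); then drop remaining NA rows
# and exact duplicates.
df["bath"] = df["bath"].fillna(df["bath"].median())
df["balcony"] = df["balcony"].fillna(df["balcony"].mode()[0])
df = df.dropna().drop_duplicates()
```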

[Figure: before and after removal of outliers, Rajajinagar]
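The cleanup behind that figure can be sketched as follows, assuming a price_per_sqft column (added in the feature engineering section below) and a per-location one-standard-deviation rule, which is an assumption:

```python
import pandas as pd

def remove_pps_outliers(df):
    """Keep, within each location, only rows whose price per square foot
    lies within one standard deviation of that location's mean."""
    cleaned = []
    for _, subdf in df.groupby("location"):
        mean = subdf["price_per_sqft"].mean()
        std = subdf["price_per_sqft"].std()
        cleaned.append(subdf[(subdf["price_per_sqft"] > mean - std) &
                             (subdf["price_per_sqft"] <= mean + std)])
    return pd.concat(cleaned, ignore_index=True)

df = remove_pps_outliers(df)
```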

Uni-Variate, Bi-Variate, Multi-Variate:

Uni-variate: for univariate analysis in house price prediction, an attribute such as price is
chosen, since price can be examined on its own.

Bi-variate: for bivariate analysis, attributes such as price and total_sqft are chosen, since price
is calculated from total_sqft, so the two variables depend on each other.

Multi-variate: for multivariate analysis, attributes such as price, total_sqft, area, and bhk are
chosen, since area and bhk determine total_sqft and price is calculated from total_sqft, so
these four variables depend on one another. A sketch of all three views follows.
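A minimal sketch of the three views with Matplotlib and Seaborn, assuming the df loaded earlier with numeric price and total_sqft columns (the grouping key is illustrative):

```python
import matplotlib.pyplot as plt
import seaborn as sns

# Univariate: bar graph of mean price for the ten priciest locations.
df.groupby("location")["price"].mean().nlargest(10).plot(kind="bar")
plt.ylabel("price")
plt.show()

# Bivariate: scatter plot of price against total square footage.
plt.scatter(df["total_sqft"], df["price"], alpha=0.3)
plt.xlabel("total_sqft")
plt.ylabel("price")
plt.show()

# Multivariate: heat map of pairwise correlations between numeric columns.
sns.heatmap(df.select_dtypes("number").corr(), annot=True)
plt.show()
```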

2.2.4.1 Plots: Histogram plot

[Figure: histogram plot]

2.2.4.2 Plots: Box plot

[Figure: box plot]

A sketch of both plots follows.
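A minimal sketch with Matplotlib, assuming the df used throughout; plotting the price column is an illustrative choice:

```python
import matplotlib.pyplot as plt

# Histogram: distribution of a numerical feature.
plt.hist(df["price"].dropna(), bins=50)
plt.xlabel("price")
plt.ylabel("count")
plt.show()

# Box plot: spread and outliers of the same feature.
plt.boxplot(df["price"].dropna())
plt.ylabel("price")
plt.show()
```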

CHAPTER 3

DATA ANALYSIS AND INTERPRETATION


3.1 Linear regression model

A linear regression model shows a linear relationship between a dependent variable (y) and
one or more independent variables (x), hence the name linear regression. Since linear
regression shows a linear relationship, it finds how the value of the dependent variable changes
with the value of the independent variable.

[Figure: scatter plot for the linear regression model]

Feature Engineering:
Add a new integer feature for bhk (Bedrooms, Hall, Kitchen); a sketch follows.
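A minimal sketch, assuming the size column holds strings such as "2 BHK" or "4 Bedroom" as in the Kaggle dataset:

```python
# Drop rows where size is missing, then take the leading integer
# from strings such as "2 BHK" or "4 Bedroom".
df = df.dropna(subset=["size"])
df["bhk"] = df["size"].apply(lambda s: int(s.split(" ")[0]))
```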

Explore the total_sqft feature:

The output above shows that total_sqft can be a range (e.g. 2100-2850). For such cases we
can simply take the average of the minimum and maximum values of the range. There are
other cases, such as 34.46 sq. metre, which could be converted to square feet with a unit
conversion; I am going to drop such corner cases to keep things simple. A sketch follows.
For the row below, total_sqft is shown as 2475, which is the average of the range 2100-2850
(i.e. (2100 + 2850) / 2 = 2475).

Add a new feature called price per square foot; a sketch follows.
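A minimal sketch, assuming (as in the Kaggle dataset) that price is quoted in lakhs of rupees, hence the factor of 100,000:

```python
# Price is in lakhs (1 lakh = 100,000 rupees), so convert to rupees
# before dividing by the area in square feet.
df["price_per_sqft"] = df["price"] * 100000 / df["total_sqft"]
```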

Dimensionality Reduction

Any location having fewer than 10 data points is tagged as "other". This reduces the number of
location categories by a huge amount, and later, when we apply one-hot encoding, it leaves us
with far fewer dummy columns. A sketch follows.
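A minimal sketch of the "other" tagging, assuming the df from above:

```python
# Count data points per location and collapse rare locations into "other".
df["location"] = df["location"].astype(str).str.strip()
location_counts = df["location"].value_counts()
rare_locations = set(location_counts[location_counts < 10].index)
df["location"] = df["location"].apply(lambda s: "other" if s in rare_locations else s)
```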

Use K-fold cross-validation to measure the accuracy of our linear regression model; a sketch
follows.
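A minimal sketch with scikit-learn, assuming X is the one-hot-encoded feature DataFrame and y the price target:

```python
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import ShuffleSplit, cross_val_score

# 5 randomized train/test splits, each holding out 20% of the data.
cv = ShuffleSplit(n_splits=5, test_size=0.2, random_state=0)
scores = cross_val_score(LinearRegression(), X, y, cv=cv)
print(scores)  # one R^2 score per split
```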

We can see that in 5 iterations we get a score above 80% every time. This is pretty good, but we
want to test a few other regression algorithms to see if we can get an even better score. We will
use GridSearchCV for this purpose.
Find the best model using GridSearchCV; a sketch follows.
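A minimal sketch comparing a few candidate regressors, assuming the same X and y; the candidate models and parameter grids are illustrative:

```python
from sklearn.linear_model import LinearRegression, Lasso
from sklearn.model_selection import GridSearchCV, ShuffleSplit
from sklearn.tree import DecisionTreeRegressor

candidates = {
    "linear_regression": (LinearRegression(), {}),
    "lasso": (Lasso(), {"alpha": [0.5, 1.0], "selection": ["random", "cyclic"]}),
    "decision_tree": (DecisionTreeRegressor(),
                      {"criterion": ["squared_error", "friedman_mse"]}),
}

cv = ShuffleSplit(n_splits=5, test_size=0.2, random_state=0)
for name, (model, params) in candidates.items():
    gs = GridSearchCV(model, params, cv=cv)
    gs.fit(X, y)
    print(name, gs.best_score_, gs.best_params_)
```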

Based on the above results we can say that linear regression gives the best score; hence we
will use it.
Test the model on a few properties; a sketch follows.
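A minimal sketch of a prediction helper, assuming the fitted model and the one-hot column layout of X (sqft, bath, and bhk first, then one dummy column per location):

```python
import numpy as np

def predict_price(model, X, location, sqft, bath, bhk):
    """Build a single feature row matching X's columns and predict its price."""
    x = np.zeros(len(X.columns))
    x[0], x[1], x[2] = sqft, bath, bhk
    loc_index = np.where(X.columns == location)[0]
    if loc_index.size > 0:
        x[loc_index[0]] = 1  # one-hot flag for the chosen location
    return model.predict([x])[0]

# Example usage (the location name is illustrative):
# predict_price(model, X, "Rajajinagar", 1000, 2, 2)
```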

ANNEXURE

1. Data Description

1.1 Data Sources

• Property Listings: Data collected from various real estate websites and agencies.
• Geospatial Data: Location-based data including proximity to amenities, transport links,
  and neighborhood demographics.

1.2 Data Features

• Total Square Feet: The total area of the property in square feet.
• Number of BHK (Bedrooms, Hall, Kitchen): The configuration of the property.
• Number of Bathrooms: The total number of bathrooms in the property.
• Location: The geographical location or neighborhood of the property.
• Price: The listed or transaction price of the property.

2. Methodology

2.1 Data Collection

• Timeframe: Data collected from January 2018 to December 2022.
• Frequency: Data updated quarterly to include new listings and transactions.

2.2 Data Preprocessing

• Cleaning: Handling missing values, outliers, and ensuring data consistency.
• Normalization: Scaling features to ensure uniformity in the dataset.

2.3 Model Development

• Feature Engineering: Creating new features from existing data to improve model
  performance.
• Model Selection: Evaluating various machine learning models such as Linear Regression,
  Decision Trees, Random Forest, and Gradient Boosting.
• Training and Validation: Splitting data into training and validation sets to evaluate
  model performance.

3. Model Details

3.1 Linear Regression Model

• Equation: Price = β0 + β1 × sqft + β2 × bath + β3 × bhk + Σi βi × location_i
• Coefficients: The weights assigned to each feature, indicating their impact on the price.

3.2 Model Evaluation

• Metrics Used: Mean Absolute Error (MAE), Mean Squared Error (MSE), Root Mean
  Squared Error (RMSE), and R-squared (R²), computed as sketched below.
• Cross-Validation: K-fold cross-validation to ensure model robustness.
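A minimal sketch of computing these metrics with scikit-learn, assuming held-out values y_test and model predictions y_pred:

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

mae = mean_absolute_error(y_test, y_pred)
mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)  # RMSE is the square root of MSE
r2 = r2_score(y_test, y_pred)
print(f"MAE={mae:.2f} MSE={mse:.2f} RMSE={rmse:.2f} R2={r2:.3f}")
```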

4. Results

4.1 Model Performance

• Training Accuracy: The accuracy of the model on the training dataset.
• Validation Accuracy: The accuracy of the model on the validation dataset.
• Error Metrics: Detailed error metrics for both training and validation sets.

5. Deployment

5.1 Flask API

• Endpoints:
o /get_location_names: Fetches the list of available locations.
o /predict_home_price: Predicts the price of a property based on input features.
• Integration: Integration with a web application for user interaction; a sketch of the API
  follows.
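A minimal sketch of the two endpoints with Flask; locations, model, X, and predict_price are assumed to come from the trained pipeline above:

```python
from flask import Flask, jsonify, request

app = Flask(__name__)

@app.route("/get_location_names", methods=["GET"])
def get_location_names():
    # locations would be loaded from the saved column metadata.
    return jsonify({"locations": sorted(locations)})

@app.route("/predict_home_price", methods=["POST"])
def predict_home_price():
    # Read the form fields posted by the web page.
    total_sqft = float(request.form["total_sqft"])
    location = request.form["location"]
    bhk = int(request.form["bhk"])
    bath = int(request.form["bath"])
    price = predict_price(model, X, location, total_sqft, bath, bhk)
    return jsonify({"estimated_price": round(price, 2)})

if __name__ == "__main__":
    app.run()
```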

6. User Interface

6.1 Web Application

• Frontend: HTML, CSS, JavaScript, and jQuery for the user interface.
• Backend: Flask framework for handling API requests and responses.

6.2 Functionality

• Input Fields: Fields for entering square feet, number of BHK, number of bathrooms, and
  selecting a location.
• Output: Displaying the estimated price based on user inputs.

7. References

• Data Sources:
o Real estate websites (e.g., Zillow, Realtor.com)
• Academic References:
o Research papers and articles on real estate price prediction
o Documentation of machine learning algorithms used
