Interim Report: House Price Prediction
Semester – IV
Research Project – Interim Report
Group 25
A study on "House Price Prediction"
Submitted by:
Surya Lakshmi VS
USN: 221VMBR05878
Dr. C S Jyothirmayee
(Faculty-JAIN Online)
DECLARATION
I, Surya Lakshmi VS, hereby declare that the Research Project Report titled "House Price Prediction" has been prepared by me under the guidance of Dr. C S Jyothirmayee. I declare that this project work is towards the partial fulfillment of the University Regulations for the award of the degree of Master of Business Administration by Jain University, Bengaluru. I have undertaken this project over a period of eight weeks. I further declare that this project is based on an original study undertaken by me and has not been submitted for the award of any degree/diploma from any other University/Institution.
EXECUTIVE SUMMARY
Exploratory Data Analysis (EDA) is an important step in any data analysis or data science project. EDA involves generating summary statistics for the numerical data in the dataset and creating various graphical representations to understand the data better. The goal of EDA is to identify patterns, anomalies, and relationships in the data that can inform subsequent steps in the data science process, such as building models or identifying insights. EDA helps analysts look at data before making any assumptions: it can expose obvious errors, reveal patterns within the data, detect outliers or anomalous events, and find interesting relationships among the variables. It also helps answer questions about standard deviations, categorical variables, and confidence intervals. Once EDA is complete and insights are drawn, its findings can be used for more sophisticated data analysis or modelling, including machine learning.
Data scientists can use exploratory analysis to ensure the results they produce are valid and applicable to the desired business outcomes and goals. EDA also helps stakeholders by confirming that they are asking the right questions.
In this report, we will understand EDA with the help of an example dataset, using the Python language. We use the Pandas, NumPy, Matplotlib, Seaborn, and opendatasets libraries. We load the dataset into a data frame and read it using pandas, view the columns and rows of the data, and perform descriptive statistics to learn more about the features inside the dataset. We then write up observations, find missing values and duplicate rows, and discover and remove anomalies in the data. Univariate visualization of each field in the raw dataset is produced with summary statistics, followed by bivariate visualizations and summary statistics that assess the relationship between each variable in the dataset and the target variable. Predictive models, such as linear regression, then use statistics and data to predict outcomes.
After plotting graphs of different attributes of the dataset and analyzing it, we apply regression algorithms to understand which fits the dataset best for house price prediction, using model metrics: Mean Squared Error, Mean Absolute Error, Root Mean Squared Error, and R-squared. These metrics are analyzed for all algorithms in the form of a table to identify the best fit.
Some of the most common data science tools used for EDA include Python and Jupyter Notebook. The common packages used are pandas, NumPy, Matplotlib, Seaborn, etc.
One important benefit of conducting exploratory data analysis is that it helps you organize a dataset before you model it, which lets you start making assumptions and predictions about your dataset. Another benefit of EDA is that it helps you understand the variables in your dataset, so you can begin to pinpoint relationships between variables, which is an integral part of data analysis.
Conducting EDA can also help you identify the relationships between the variables in your dataset. Identifying these relationships is a critical part of drawing conclusions from a dataset.
Another important benefit of EDA is that it helps you choose the right model for your dataset. You can use all of the information gained from conducting an EDA to choose a data model; choosing the right model is important because it makes it easier for everyone in your organization to understand your data.
You can also use EDA to find patterns in a dataset. Finding patterns is important because it helps you make predictions and estimations, which in turn helps your organization plan for the future and anticipate problems and solutions.
TABLE OF CONTENTS
List of Tables
List of Graphs
Annexures
List of Tables
Table No.  Table Title
1  Model Evaluation Comparison between all models

List of Graphs
Graph No.  Graph Title
2.2.4  Bar graph for Univariate
2.2.4  Scatter plot for Bivariate
2.2.4  Heat map for Multivariate
2.2.2.1  Histogram plot
2.2.2.2  Box plot
3.1  Scatter plot for Linear regression model
3.1  Distplot for Linear regression model
3.2  Scatter plot for Ridge regression model
3.3  Scatter plot for Lasso regression
3.4  Scatter plot for Support Vector Regression
3.4  Distplot for Support Vector Regression
3.5  Scatter plot for Random forest regressor
3.5  Distplot for Random forest regressor
CHAPTER 1
INTRODUCTION AND BACKGROUND
If you ask any random home buyer to describe their dream house, there is a high chance their description will not start with aspects such as the height of the basement ceiling or the proximity to a commercial building. Thousands of people seek to place their home on the market with the aim of arriving at a reasonable price. Generally, assessors apply their experience and common knowledge to gauge a home based on its various characteristics, such as its location, amenities, and dimensions. Regression analysis, however, offers another approach that produces better home prices with reliable predictions. Better still, assessor experience can help guide the modeling process to fine-tune a final predictive model. This model will therefore help both home buyers and home sellers. There is an ongoing competition hosted by Kaggle.com, from which I am gathering the required dataset [1]. The competition dataset furnishes a good amount of information that helps in price negotiations beyond the obvious features of a home. This dataset also supports advanced machine learning techniques like random forests and gradient boosting.
The real estate sector is an important industry with many stakeholders, ranging from regulatory bodies to private companies and investors. Among these stakeholders, there is high demand for a better understanding of the industry's operational mechanisms and driving factors. Today a large amount of data is available on relevant statistics as well as on additional contextual factors, and it is natural to try to make use of these in order to improve our understanding of the industry.
Suppose we want to build a data science project on house price prediction for a company. Before we build a model on this data, we have to analyze all the information present across the dataset, such as the price of the house, the price sellers are actually getting, the area of the house, and the living measures. All these steps of analyzing and modifying the data come under EDA.
Exploratory Data Analysis (EDA) is an approach used to analyze data and discover trends and patterns, or to check assumptions in data, with the help of statistical summaries and graphical representations.
The main goal of the project is to produce accurate price predictions for houses and properties for the upcoming years. The step-by-step process involved is:
1. Requirement Gathering – We gather the information and extract the key details from it.
Types of EDA
Depending on the number of columns we are analyzing, we can divide EDA into three types.
1. Univariate Analysis – In univariate analysis, we analyze or deal with only one variable at a time. The analysis of univariate data is thus the simplest form of analysis, since the information deals with only one quantity that changes. It does not deal with causes or relationships; the main purpose of the analysis is to describe the data and find the patterns that exist within it.
2. Bivariate Analysis – This type of data involves two different variables. The analysis of this type of data deals with causes and relationships, and the analysis is done to find out the relationship between the two variables.
3. Multivariate Analysis – When the data involves three or more variables, it is categorized under multivariate analysis.
Depending on how the findings are presented, we can also subcategorize EDA into two parts: non-graphical analysis, which relies on numerical summaries, and graphical analysis, which relies on visualizations.
Data Encoding
Some models, such as linear regression, do not work with categorical data; in that case we should encode the categorical columns into numerical ones. We can use different methods for encoding, such as label encoding or one-hot encoding. Pandas and scikit-learn provide different functions for encoding; in our case we will use the LabelEncoder class from scikit-learn, as sketched below.
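As a minimal sketch of both approaches (the location column and its values are illustrative, not taken from the project dataset):

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Illustrative categorical column.
df = pd.DataFrame({'location': ['Whitefield', 'Indiranagar', 'Whitefield']})

# Label encoding: each category is mapped to an integer code.
le = LabelEncoder()
df['location_encoded'] = le.fit_transform(df['location'])

# One-hot encoding alternative via pandas: one dummy column per category.
df_onehot = pd.get_dummies(df['location'], prefix='location')

print(df)
print(df_onehot)
```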
A house's value is more than just location and square footage. Like the features that make up a person, an educated party would want to know all the aspects that give a house its value. For example, suppose you want to sell a house and you don't know what price to expect; it can't be too low or too high. To find the house price you usually try to find similar properties in your neighborhood and, based on the gathered data, assess your own house's price. The objectives are to:
• Identify the important home price attributes that feed the model's predictive power.
• Take advantage of all the available feature variables and use them to analyse and predict house prices.
1.5 Company and industry overview
The real estate market is one of the most competitive in terms of pricing, and prices tend to vary significantly based on many factors. Forecasting property prices is an important module in decision making for both buyers and investors, supporting budget allocation, property-finding strategies, and the determination of suitable policies. It is therefore one of the prime fields in which to apply machine learning concepts to optimize and predict prices with high accuracy. The industry review gives a clear idea of the field and will serve as support for future projects. Most authors have concluded that artificial neural networks have more influence in predicting prices, but in the real world there are other algorithms that should be taken into consideration. Investors' decisions are based on market trends to reap maximum returns. Developers are interested in knowing future trends for their decision making; this helps them understand the pros and cons and also helps them plan projects. To accurately estimate property prices and future trends, a large amount of data that influences land prices is required for analysis, modeling, and forecasting. The factors that affect land prices have to be studied, and their impact on price also has to be modeled. It is inferred that establishing a simple linear regression relationship for these time-series data is not viable for prediction. Hence it becomes imperative to establish a non-linear model that fits the data characteristics well, to analyze and predict future trends. As real estate is a fast-developing sector, the analysis and prediction of land prices using mathematical modeling and other techniques is an urgent need for decision making by all those concerned.
Machine learning (ML) is a subset of artificial intelligence (AI) that focuses on developing algorithms that enable computers to learn from data and make predictions or decisions based on it. In the context of real estate, ML can be used to analyze vast amounts of historical and real-time data to predict house prices with a high degree of accuracy. The key techniques involved include:
1. Regression Analysis
2. Decision Trees and Random Forests
3. Gradient Boosting Machines (GBMs)
4. Neural Networks
5. Feature Engineering and Selection
6. Cross-Validation and Model Evaluation
7. Handling Missing Data
Together, these techniques provide valuable insights and predictions that aid in making informed decisions in the real estate market.
CHAPTER 2
RESEARCH METHODOLOGY
2.2 Methodology
The methodology section outlines the research design, data collection methods, and analytical
techniques employed in developing the house price prediction model. The primary objective is
to create a robust model that accurately predicts house prices based on various property
features.
Research Objectives
Data Sources
Real Estate Listings: Data from online real estate platforms providing information on
property features and prices.
Government Records: Publicly available data on property sales and transactions.
Census Data: Demographic information relevant to real estate valuation.
Geospatial Data: Location-specific data such as proximity to amenities, crime rates,
and school quality.
Data Features
The key features used for prediction include the total square feet of the property, the number of BHK (bedrooms, hall, kitchen), the number of bathrooms, the location, and the price, which serves as the target variable.
Data Analysis Tools
Data analysis tools are essential for processing, analyzing, and visualizing data in the house price prediction project. These tools facilitate data cleaning, feature engineering, model development, evaluation, and deployment. This section outlines the key tools and libraries used in the project.
Pandas: A data manipulation and analysis library for Python. Applications: handling missing data, merging datasets, aggregating data, and performing exploratory data analysis (EDA).
Matplotlib: A plotting library for Python that provides an object-oriented API for embedding plots.
Historical Data Collection
The historical data for this study spans a period of five years, from January 2018
to December 2022. This timeframe provides a comprehensive dataset that
captures various market cycles, trends, and seasonal variations in house prices.
Data Updates
To ensure the model remains relevant and accurate, data is updated quarterly. This
involves incorporating new property listings, sales transactions, and any
significant market changes. The regular updates help in refining the model and
adapting it to recent market conditions.
The dataset is loaded with the pandas read_csv function, which assumes by default that the fields are comma separated. When a CSV is loaded, we get an object called a DataFrame, which is made up of rows and columns. [Figure: part of the loaded data frame.]
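A minimal sketch of this loading step (the file name house_prices.csv is illustrative):

```python
import pandas as pd

# read_csv assumes comma-separated fields by default.
df = pd.read_csv('house_prices.csv')

# Inspect the shape, the columns, and the first few rows of the DataFrame.
print(df.shape)
print(df.columns)
print(df.head())

# Descriptive statistics for the numerical features.
print(df.describe())
```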
An important and challenging part of data preprocessing is handling missing values in the dataset. Data scientists must manage missing values because they can adversely affect the operation of machine learning models. In such a procedure data can be imputed: missing values are filled in based on the other observations.
One approach is imputing missing observations by exploring similarities between cases.
Missing values are usually represented as 'nan', 'NA', or 'null'. Below is the list of variables with missing values in the train dataset.
Data cleaning: handling NA values
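A hedged sketch of how this inspection and cleanup might look with pandas (the file and column names are assumptions for illustration):

```python
import pandas as pd

df = pd.read_csv('house_prices.csv')  # illustrative file name

# Count missing values per column, largest first.
missing = df.isnull().sum().sort_values(ascending=False)
print(missing[missing > 0])

# Drop exact duplicate rows.
df = df.drop_duplicates()

# Example strategies: drop rows missing the target, impute a numeric
# column with its median and a categorical column with its mode.
df = df.dropna(subset=['price'])
df['bath'] = df['bath'].fillna(df['bath'].median())
df['location'] = df['location'].fillna(df['location'].mode()[0])
```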
Univariate: For univariate analysis in house price prediction, a single attribute such as price is chosen, since it is examined independently of the other variables.
Bivariate: For bivariate analysis, attributes such as price and total_sqft are chosen, because price is calculated from total_sqft, so these two variables are dependent on each other.
Multivariate: For multivariate analysis, attributes such as price, total_sqft, area, and bhk are chosen, because area and bhk determine total_sqft, and price in turn is calculated from total_sqft, so these four variables are interdependent.
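A minimal plotting sketch for the three views above, assuming df contains the named columns (bath is used in place of area for the numeric heat map):

```python
import matplotlib.pyplot as plt
import seaborn as sns

# Univariate: bar graph of listing counts for the ten most common locations.
df['location'].value_counts().head(10).plot(kind='bar')
plt.xlabel('location')
plt.ylabel('count')
plt.show()

# Bivariate: scatter plot of price against total_sqft.
plt.scatter(df['total_sqft'], df['price'], alpha=0.3)
plt.xlabel('total_sqft')
plt.ylabel('price')
plt.show()

# Multivariate: heat map of pairwise correlations between numeric features.
sns.heatmap(df[['price', 'total_sqft', 'bath', 'bhk']].corr(), annot=True)
plt.show()
```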
2.2.4.2 Plots: Box plot
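The histogram and box plots listed among the graphs are not reproduced here; a short sketch of how such plots are generated for the price column (assumed numeric):

```python
import matplotlib.pyplot as plt

# Histogram: distribution and skew of house prices.
plt.hist(df['price'].dropna(), bins=50)
plt.xlabel('price')
plt.ylabel('frequency')
plt.show()

# Box plot: highlights outliers beyond the whiskers.
plt.boxplot(df['price'].dropna())
plt.ylabel('price')
plt.show()
```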
CHAPTER 3
DATA ANALYSIS AND INTERPRETATION
A linear regression model shows a linear relationship between a dependent variable (y) and one or more independent variables (x), hence the name linear regression. Since linear regression models a linear relationship, it finds how the value of the dependent variable changes according to the value of the independent variable.
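A hedged sketch of fitting such a model with scikit-learn (the cleaned file name and the presence of a numeric price target column are assumptions):

```python
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Assumes a cleaned, fully numeric DataFrame with a 'price' target column.
df = pd.read_csv('house_prices_clean.csv')  # illustrative file name
X = df.drop('price', axis=1)
y = df['price']

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=10)

model = LinearRegression()
model.fit(X_train, y_train)

# R-squared score on the held-out test set.
print(model.score(X_test, y_test))
```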
Feature Engineering:
Add a new integer feature for bhk (Bedrooms, Hall, Kitchen), as sketched below.
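In the raw data the BHK count typically has to be parsed out of a textual size column; a sketch assuming entries like '2 BHK' or '4 Bedroom':

```python
import pandas as pd

# Illustrative rows; in the project, df comes from the loaded dataset.
df = pd.DataFrame({'size': ['2 BHK', '4 Bedroom', '3 BHK']})

# Take the leading integer of the size string as the bhk count.
df['bhk'] = df['size'].apply(lambda s: int(s.split(' ')[0]))
print(df)
```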
Explore the total_sqft feature:
The above shows that total_sqft can be a range (e.g. 2100-2850). For such cases we can just take the average of the min and max values of the range. There are other cases, such as 34.46Sq. Meter, which one could convert to square feet using unit conversion; I am going to drop such corner cases to keep things simple.
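A sketch of that conversion as described (total_sqft is assumed to be read as strings):

```python
import pandas as pd

def convert_sqft_to_num(x):
    """Ranges like '2100-2850' become the average of their endpoints;
    unparseable corner cases like '34.46Sq. Meter' become None."""
    tokens = str(x).split('-')
    if len(tokens) == 2:
        return (float(tokens[0]) + float(tokens[1])) / 2
    try:
        return float(x)
    except ValueError:
        return None

df = pd.DataFrame({'total_sqft': ['2100-2850', '1200', '34.46Sq. Meter']})
df['total_sqft'] = df['total_sqft'].apply(convert_sqft_to_num)
df = df.dropna(subset=['total_sqft'])  # drop the corner cases
print(df)
```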
For example, a row with total_sqft given as the range 2100-2850 is converted to 2475, the average of the endpoints.
Dimensionality Reduction
Any location having fewer than 10 data points is tagged as the "other" location. This reduces the number of categories by a huge amount; later, when we do one-hot encoding, it leaves us with far fewer dummy columns.
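A sketch of that bucketing step (assumes df has a location column; the file name is illustrative):

```python
import pandas as pd

df = pd.read_csv('house_prices_clean.csv')  # illustrative file name

# Count data points per location.
location_stats = df['location'].value_counts()

# Locations with fewer than 10 listings are collapsed into 'other'.
rare_locations = location_stats[location_stats < 10].index
df['location'] = df['location'].apply(
    lambda loc: 'other' if loc in rare_locations else loc)

print(df['location'].nunique())
```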
Use K Fold cross validation to measure accuracy of our Linear Regression model
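A sketch with scikit-learn's cross_val_score over five shuffled splits (X and y are the prepared feature matrix and price target from the earlier steps):

```python
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import ShuffleSplit, cross_val_score

# Five shuffled train/test splits, holding out 20% each time.
cv = ShuffleSplit(n_splits=5, test_size=0.2, random_state=0)
scores = cross_val_score(LinearRegression(), X, y, cv=cv)
print(scores)  # one R-squared score per iteration
```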
We can see that in five iterations we get a score above 80% every time. This is pretty good, but we want to test a few other regression algorithms to see if we can get an even better score. We will use GridSearchCV for this purpose.
Find best model using GridSearchCV
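A hedged sketch of such a search; the candidate models and hyperparameter grids are illustrative choices, not the report's definitive configuration:

```python
import pandas as pd
from sklearn.linear_model import LinearRegression, Lasso
from sklearn.model_selection import GridSearchCV, ShuffleSplit
from sklearn.tree import DecisionTreeRegressor

def find_best_model(X, y):
    # Candidate algorithms with illustrative hyperparameter grids.
    candidates = {
        'linear_regression': (LinearRegression(), {}),
        'lasso': (Lasso(), {'alpha': [1, 2],
                            'selection': ['random', 'cyclic']}),
        'decision_tree': (DecisionTreeRegressor(),
                          {'criterion': ['squared_error', 'friedman_mse'],
                           'splitter': ['best', 'random']}),
    }
    cv = ShuffleSplit(n_splits=5, test_size=0.2, random_state=0)
    rows = []
    for name, (model, params) in candidates.items():
        gs = GridSearchCV(model, params, cv=cv)
        gs.fit(X, y)
        rows.append({'model': name,
                     'best_score': gs.best_score_,
                     'best_params': gs.best_params_})
    return pd.DataFrame(rows)

# find_best_model(X, y) returns one row of scores per algorithm.
```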
Based on the above results we can say that LinearRegression gives the best score; hence we will use it.
Test the model for a few properties
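A sketch of a prediction helper. It assumes the model was trained on a one-hot encoded feature matrix whose first three columns are total_sqft, bath, and bhk, followed by one dummy column per location; that layout is an assumption for illustration:

```python
import numpy as np

def predict_price(model, X_columns, location, sqft, bath, bhk):
    # Build one input row in the column order the model was trained on.
    x = np.zeros(len(X_columns))
    x[0], x[1], x[2] = sqft, bath, bhk
    matches = np.where(X_columns == location)[0]
    if len(matches) > 0:
        x[matches[0]] = 1  # switch on the matching location dummy
    return model.predict([x])[0]

# Example call (location name is illustrative):
# predict_price(model, X.columns, '1st Phase JP Nagar', 1000, 2, 2)
```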
ANNEXURE
1. Data Description
Property Listings: Data collected from various real estate websites and agencies.
Geospatial Data: Location-based data including proximity to amenities, transport links,
and neighborhood demographics.
Total Square Feet: The total area of the property in square feet.
Number of BHK (Bedrooms, Hall, Kitchen): The configuration of the property.
Number of Bathrooms: The total number of bathrooms in the property.
Location: The geographical location or neighborhood of the property.
Price: The listed or transaction price of the property.
2. Methodology
Feature Engineering: Creating new features from existing data to improve model
performance.
Model Selection: Evaluating various machine learning models such as Linear Regression,
Decision Trees, Random Forest, and Gradient Boosting.
Training and Validation: Splitting data into training and validation sets to evaluate
model performance.
3. Model Details
Metrics Used: Mean Absolute Error (MAE), Mean Squared Error (MSE), Root Mean Squared Error (RMSE), and R-squared (R²); a computation sketch follows this list.
Cross-Validation: K-fold cross-validation to ensure model robustness.
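These metrics can be computed with scikit-learn; a hedged sketch, where y_test and the fitted model come from the evaluation step described above:

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

def evaluate(y_true, y_pred):
    mse = mean_squared_error(y_true, y_pred)
    return {
        'MAE': mean_absolute_error(y_true, y_pred),
        'MSE': mse,
        'RMSE': np.sqrt(mse),
        'R2': r2_score(y_true, y_pred),
    }

# Example: evaluate(y_test, model.predict(X_test)) for each trained model,
# then tabulate the results to compare algorithms.
```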
4. Results
5. Deployment
Endpoints:
o /get_location_names: Fetches the list of available locations.
o /predict_home_price: Predicts the price of a property based on input
features.
Integration: Integration with a web application for user interaction; a minimal server sketch follows below.
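A hedged sketch of how the two endpoints might be served with Flask; the framework choice and placeholder data are assumptions, while the endpoint names come from the report:

```python
from flask import Flask, jsonify, request

app = Flask(__name__)

# Placeholder list; in the real app this comes from the trained model's
# saved column metadata.
LOCATIONS = ['1st Phase JP Nagar', 'Whitefield', 'other']

@app.route('/get_location_names', methods=['GET'])
def get_location_names():
    # Fetches the list of available locations.
    return jsonify({'locations': LOCATIONS})

@app.route('/predict_home_price', methods=['POST'])
def predict_home_price():
    # Predicts the price of a property based on input features.
    total_sqft = float(request.form['total_sqft'])
    location = request.form['location']
    bhk = int(request.form['bhk'])
    bath = int(request.form['bath'])
    # estimated_price = predict_price(model, X_columns, location,
    #                                 total_sqft, bath, bhk)
    estimated_price = 0.0  # placeholder until the trained model is wired in
    return jsonify({'estimated_price': estimated_price})

if __name__ == '__main__':
    app.run()
```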
6. User Interface
6.2 Functionality
Input Fields: Fields for entering square feet, number of BHK, number of bathrooms, and
selecting location.
Output: Displaying the estimated price based on user inputs.
7. References
Data Sources:
o Real estate websites (e.g., Zillow, Realtor.com)
Academic References:
o Research papers and articles on real estate price prediction
o Documentation of machine learning algorithms used