Report Capstone Project House Price Prediction

The document discusses the importance of exploratory data analysis (EDA) in predicting house prices, emphasizing that various features beyond location and square footage contribute to a home's value. It outlines the objective to create an effective price prediction model using multiple variables and methodologies, including regression analysis. The document also highlights the significance of understanding data patterns and relationships to improve predictive accuracy in the real estate market.

House Price Prediction

Problem statement:
A house's value is more than its location and square footage. Like the features that make up a person,
many aspects give a house its value, and an informed party would want to know all of them. For example,
suppose you want to sell a house and do not know what price to expect; it should be neither too low nor
too high. To estimate the price, you would usually look for similar properties in your neighbourhood and
assess your own house against the data you gather.

Objective:
Use all of the feature variables listed below to analyse and predict house prices.
1. cid: a notation for a house
2. dayhours: Date house was sold
3. price: Price is prediction target
4. room_bed: Number of Bedrooms/House
5. room_bath: Number of bathrooms/bedrooms
6. living_measure: square footage of the home
7. lot_measure: square footage of the lot
8. ceil: Total floors (levels) in house
9. coast: House which has a view to a waterfront
10. sight: Has been viewed
11. condition: How good the condition is (Overall)
12. quality: grade given to the housing unit, based on grading system
13. ceil_measure: square footage of house apart from basement
14. basement_measure: square footage of the basement
15. yr_built: Built Year
16. yr_renovated: Year when house was renovated
17. zipcode: zip
18. lat: Latitude coordinate
19. long: Longitude coordinate
20. living_measure15: Living room area in 2015 (implies some renovations); this may or may not have
affected the lot size area
21. lot_measure15: Lot size area in 2015 (implies some renovations)
22. furnished: Based on the quality of room
23. total_area: Measure of both living and lot

TABLE OF CONTENTS

List of Graphs
Executive Summary
Chapter 1: Introduction and background
Chapter 2: Research Methodology
Chapter 3: Data analysis and interpretation
Chapter 4: Findings, Recommendations and Conclusion

List of Graphs
Graph No.  Graph Title
2.2.4      Bar graph for Univariate
2.2.4      Scatter plot for Bivariate
2.2.4      Heat map for Multi-variate
2.2.2.1    Histogram plot
2.2.2.2    Box plot
2.2.2.3    Correlation between variables
3.1        Scatter plot for Linear regression model
3.1        Distplot for Linear regression model
3.2        Scatter plot for Ridge regression model
3.2.1      Distplot for Ridge regression model
3.3        Scatter plot for Lasso regression
3.4        Scatter plot for Support Vector Regression
3.4        Distplot for Support Vector Regression
3.5        Scatter plot for Random forest regressor
3.5        Distplot for Random forest regressor

List of Tables
Table No.  Table Title
1          Model Evaluation Comparison between all models

CHAPTER 1

INTRODUCTION AND BACKGROUND

1.1 EXECUTIVE SUMMARY

Exploratory data analysis (EDA) is an important step in any data analysis or data science project.
It involves generating summary statistics for the numerical data in a dataset and creating
graphical representations to understand the data better. The goal is to look at the data before
making any assumptions: to identify obvious errors, patterns, outliers or anomalous events, and
interesting relationships among the variables that can inform later steps such as building models.
EDA also helps answer questions about standard deviations, categorical variables, and confidence
intervals.

Data scientists can use exploratory analysis to ensure that the results they produce are valid and
applicable to the desired business outcomes and goals. EDA also helps stakeholders by confirming
that they are asking the right questions. Once EDA is complete and insights are drawn, its findings
can be used for more sophisticated data analysis or modelling, including machine learning.

This report works through EDA on an example dataset using the Python language and the pandas,
NumPy, matplotlib, seaborn, and open datasets libraries. The steps are to load the dataset into a
pandas data frame, view its rows and columns, compute descriptive statistics to learn more about
the features, record observations, find missing values and duplicate rows, and detect and remove
anomalies. Univariate visualizations with summary statistics are produced for each field in the raw
dataset, and bivariate visualizations and summary statistics are used to assess the relationship
between each variable and the target variable. Predictive models, such as linear regression, then
use statistics and data to predict outcomes.

Graphs are plotted for different attributes of the dataset and analysed. Several regression
algorithms are then fitted to see which is the better fit for house price prediction, using the
evaluation metrics mean absolute error (MAE), mean squared error (MSE), root mean squared error
(RMSE), and R-squared. The metrics for all algorithms are collected in a table to identify the
best fit.

Common tools for carrying out EDA include Python and Jupyter. Commonly used packages include
pandas, NumPy, matplotlib, and seaborn.

One important benefit of exploratory data analysis is that it helps you organize a dataset before
you model it, so that you can start to make assumptions and predictions about it. EDA also helps
you understand the variables in your dataset and pinpoint the relationships between them, which is
a critical part of drawing conclusions from the data.
Another important benefit of EDA is helping you choose the right model for your dataset, using all
of the information gained during the analysis. It is important to choose the right data model
because it can make it easier for everyone in your organization to understand your data. The models
compared in this report are linear regression, ridge regression, lasso regression, support vector
regression, and random forest regression.
You can also use EDA to find patterns in a dataset. Finding patterns is important because it helps
you make predictions and estimations, which in turn helps your organization plan for the future and
anticipate problems and solutions.

1.2 Introduction and Background


If you ask a random home buyer to describe their dream house, the description is unlikely to start
with aspects such as the height of the basement ceiling or the proximity to a commercial building.
Thousands of people put their homes on the market hoping to arrive at a reasonable price.
Generally, assessors apply their experience and common knowledge to value a home based on
characteristics such as its location, amenities, and dimensions. Regression analysis offers another
approach, which can provide more reliable price predictions. Better still, assessor experience can
help guide the modelling process to fine-tune a final predictive model. Such a model helps both
home buyers and home sellers. The dataset used here comes from an ongoing competition hosted by
Kaggle.com [1]. It provides a good amount of information about the features of a home, which helps
in price negotiations, and it supports advanced machine learning techniques such as random forests
and gradient boosting.

The real estate sector is an important industry with many stakeholders ranging from regulatory
bodies to private companies and investors. Among these stakeholders, there is a high demand for
a better understanding of the industry operational mechanism and driving factors. Today there is
a large amount of data available on relevant statistics as well as on additional contextual factors,
and it is natural to try to make use of these in order to improve our understanding of the industry.

Suppose we want to carry out a data science project on house price prediction. Before building a
model, we have to analyse all the information present in the dataset, such as the price of each
house, its area, and its living measures. All of these steps of analysing and modifying the data
come under EDA.

Exploratory Data Analysis (EDA) is an approach that is used to analyze the data and discover
trends, patterns, or check assumptions in data with the help of statistical summaries and
graphical representations.

The main goal of the project is to predict the prices of houses and properties accurately for the
upcoming years. The step-by-step process is as follows:

1. Requirement gathering – gather the information and extract the main information from it.
2. Normalizing the data
3. Detecting outliers in the data
4. Analysis and visualisation of the data
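
Steps 2 and 3 above can be sketched as follows. This is a minimal illustration, not the report's own code: it assumes z-score normalization and the 1.5 × IQR rule for outliers (the rule that a box plot's whiskers usually follow), and the `prices` values are hypothetical.

```python
import numpy as np

def zscore(values):
    """Normalize values to zero mean and unit standard deviation."""
    values = np.asarray(values, dtype=float)
    return (values - values.mean()) / values.std()

def iqr_outlier_mask(values, k=1.5):
    """Flag values lying more than k * IQR outside the quartiles."""
    values = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    return (values < q1 - k * iqr) | (values > q3 + k * iqr)

# Hypothetical prices in thousands; the last one is deliberately extreme.
prices = np.array([210.0, 250.0, 265.0, 280.0, 300.0, 320.0, 2500.0])

scaled = zscore(prices)              # step 2: normalizing
outliers = iqr_outlier_mask(prices)  # step 3: detecting outliers
print(prices[outliers])              # the flagged value(s)
```

Whether flagged rows should be dropped, capped, or kept is a judgement call that depends on the dataset.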

Types of EDA

Depending on the number of columns being analysed, EDA can be divided into three types.
1. Univariate analysis – We analyse only one variable at a time. This is the simplest form of
analysis, since the information deals with only one quantity that changes. It does not deal
with causes or relationships; its main purpose is to describe the data and find patterns
that exist within it.
2. Bivariate analysis – This involves two different variables. The analysis deals with causes
and relationships and is done to find out the relationship between the two variables.
3. Multivariate analysis – When the data involves three or more variables, it is categorized
as multivariate.
Depending on the type of analysis, EDA can also be subcategorized into two parts.

1. Non-graphical analysis – We analyse data using statistical measures such as the mean, median,
mode, or skewness.
2. Graphical analysis – We use charts to visualize trends and patterns in the data.

Data Encoding

Some models, such as linear regression, do not work with categorical data, so categorical columns
must be encoded as numerical columns. Different encoding methods are available, such as label
encoding and one-hot encoding, and both pandas and sklearn provide functions for them. In our case
we use label encoding from sklearn.
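
A minimal sketch of both encodings is given below. The column name `condition_label` and its values are hypothetical and are not taken from the dataset; `LabelEncoder` and `pandas.get_dummies` are the standard sklearn and pandas tools for label and one-hot encoding.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical categorical column, for illustration only.
df = pd.DataFrame({"condition_label": ["good", "fair", "good", "poor"]})

# Label encoding: each category becomes an integer code
# (classes are sorted alphabetically: fair=0, good=1, poor=2).
encoder = LabelEncoder()
df["condition_code"] = encoder.fit_transform(df["condition_label"])

# One-hot encoding: one indicator column per category.
one_hot = pd.get_dummies(df["condition_label"], prefix="condition")

print(df)
print(one_hot)
```

Label encoding imposes an arbitrary order on the categories, which a linear model will treat as meaningful; one-hot encoding avoids this at the cost of extra columns. The sklearn documentation describes `LabelEncoder` as intended for target labels and offers `OrdinalEncoder` for input features.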

1.3 Problem Statement


A house's value is more than its location and square footage. Like the features that make up a
person, many aspects give a house its value, and an informed party would want to know all of them.
For example, suppose you want to sell a house and do not know what price to expect; it should be
neither too low nor too high. To estimate the price, you would usually look for similar properties
in your neighbourhood and assess your own house against the data you gather.

1.4 Objective of the study:

• Create an effective price prediction model
• Validate the model's prediction accuracy
• Identify the important home price attributes which feed the model's predictive power

Use all of the feature variables listed below to analyse and predict house prices.
1. cid: a notation for a house
2. dayhours: Date house was sold
3. price: Price is prediction target
4. room_bed: Number of Bedrooms/House
5. room_bath: Number of bathrooms/bedrooms
6. living_measure: square footage of the home
7. lot_measure: square footage of the lot
8. ceil: Total floors (levels) in house
9. coast: House which has a view to a waterfront
10. sight: Has been viewed
11. condition: How good the condition is (Overall)
12. quality: grade given to the housing unit, based on grading system
13. ceil_measure: square footage of house apart from basement
14. basement_measure: square footage of the basement
15. yr_built: Built Year
16. yr_renovated: Year when house was renovated
17. zipcode: zip
18. lat: Latitude coordinate
19. long: Longitude coordinate
20. living_measure15: Living room area in 2015 (implies some renovations); this may or may not
have affected the lot size area
21. lot_measure15: Lot size area in 2015 (implies some renovations)
22. furnished: Based on the quality of room
23. total_area: Measure of both living and lot

1.5 Literature Survey

The real estate market is one of the most competitive in terms of pricing, and prices tend to vary
significantly based on many factors. Forecasting property prices is an important input to decision
making for both buyers and investors: it supports budget allocation, property-finding strategies,
and the choice of suitable policies. This makes it one of the prime fields in which to apply
machine learning to predict prices with high accuracy. The literature review gives a clear picture
of the field and will serve as support for future projects. Most of the authors have concluded that
artificial neural networks have the most influence in predicting prices, but in the real world
there are other algorithms that should also be taken into consideration. Investors base their
decisions on market trends in order to reap maximum returns. Developers are interested in future
trends for their decision making, which helps them weigh the pros and cons and build their
projects. To estimate property prices and future trends accurately, a large amount of data that
influences land prices is required for analysis, modelling, and forecasting. The factors that
affect land prices have to be studied, and their impact on price has to be modelled. It is inferred
that a simple linear regression relationship is not viable for predicting these time-series data.
Hence it became imperative to establish a non-linear model that fits the characteristics of the
data well in order to analyse and predict future trends. As real estate is a fast-developing
sector, the analysis and prediction of land prices using mathematical modelling and other
techniques is an urgent need for decision making by all those concerned.

CHAPTER 2

RESEARCH METHODOLOGY

2.1.1 Implementation

The mean of the dataset:

The median of the dataset:

The standard deviation of the dataset:
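
The three statistics listed above can be computed with pandas as in the following sketch. The data frame here is a small hypothetical sample; only the column names `price` and `living_measure` come from the feature list.

```python
import pandas as pd

# Hypothetical sample rows, for illustration only.
df = pd.DataFrame({
    "price": [300000, 450000, 520000, 610000],
    "living_measure": [1200, 1800, 2100, 2600],
})

means = df.mean()      # column-wise mean
medians = df.median()  # column-wise median
stds = df.std()        # column-wise sample standard deviation (ddof=1)

print(means, medians, stds, sep="\n\n")
```

`df.describe()` returns the same statistics together with the count, minimum, maximum, and quartiles in a single table.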

2.1.2 Handling Missing Data:

Handling missing values is an important part of data preprocessing. Data scientists must manage
missing values because they can adversely affect machine learning models. Missing values can be
imputed, that is, filled in based on the other observations.

Techniques involved in imputing unknown or missing observations include:

1. Deleting the whole rows or columns with unknown or missing observations.

2. Missing values can be inferred by averaging techniques like mean, median, mode.

3. Imputing missing observations with the most frequent values.

4. Imputing missing observations by exploring correlations.

5. Imputing missing observations by exploring similarities between cases.

Missing values are usually represented as 'nan', 'NA', or 'null' (refer to image 5). Below is the
list of variables with missing values in the train dataset.
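
A minimal sketch of counting missing values and applying two of the techniques listed above (deleting rows, and imputing with the median) is shown here. The rows are hypothetical; only the column names come from the feature list.

```python
import numpy as np
import pandas as pd

# Hypothetical rows with two missing entries, for illustration only.
df = pd.DataFrame({
    "room_bed": [3, np.nan, 4, 2],
    "living_measure": [1200, 1800, np.nan, 2600],
    "price": [300000, 450000, 520000, 610000],
})

# Number of missing values in each column.
missing_counts = df.isnull().sum()

# Technique 1: delete every row that contains a missing value.
dropped = df.dropna()

# Technique 2: fill each missing value with the median of its column.
imputed = df.fillna(df.median())

print(missing_counts)
print(imputed)
```

Deleting rows is simplest but discards information; median imputation keeps every row and is less sensitive to extreme values than mean imputation.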

2.1.3 Uni-Variate, Bi-Variate, Multi-Variate:

Uni-Variate: For univariate analysis in house price prediction, a single attribute such as price
is chosen and examined on its own, independently of the other attributes.

Bi-Variate: For bivariate analysis, two attributes such as price and living_measure are chosen,
because price depends on living_measure, so the two variables are related to each other.

Multi-Variate: For multivariate analysis, attributes such as price, living_measure, ceil_measure,
and basement_measure are chosen, because ceil_measure and basement_measure together make up
living_measure, and price in turn depends on living_measure, so these four variables are related
to one another.

2.2.2.1 Plots: Histogram plot

2.2.2.2 Plots: Box plot

2.2.2.3 The correlation between variables:
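
The correlation matrix behind a heat map of this kind can be computed as in the sketch below. The rows are hypothetical and only the column names come from the feature list; Pearson correlation is assumed, as named in Chapter 4.

```python
import pandas as pd

# Hypothetical sample rows, for illustration only.
df = pd.DataFrame({
    "living_measure": [1000, 1500, 2000, 2500, 3000],
    "yr_built": [2010, 1990, 2000, 1970, 1980],
    "price": [200000, 310000, 390000, 520000, 600000],
})

# Pairwise Pearson correlation between all numeric columns.
corr = df.corr(method="pearson")

# Correlation of each feature with the target, strongest positive first.
price_corr = corr["price"].drop("price").sort_values(ascending=False)

print(corr.round(2))
print(price_corr.round(2))
```

The `corr` frame can be passed to `seaborn.heatmap` to draw the heat map itself.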

CHAPTER 3

DATA ANALYSIS AND INTERPRETATION

3.1 Linear regression model

A linear regression model describes a linear relationship between a dependent variable (y) and one
or more independent variables (x), hence the name. In other words, it finds how the value of the
dependent variable changes with the values of the independent variables.
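
As a minimal sketch of this idea, the example below fits a line to hypothetical, noise-free data in which price is exactly 150 × living area + 50000, so the fitted slope and intercept should recover those two numbers. The data are invented for illustration and are not the report's dataset.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical, noise-free data: price = 150 * living area + 50000.
living = np.array([[1000.0], [1500.0], [2000.0], [2500.0], [3000.0]])
price = 150.0 * living.ravel() + 50000.0

model = LinearRegression()
model.fit(living, price)

slope = model.coef_[0]        # change in price per unit of living area
intercept = model.intercept_  # predicted price at zero living area
predicted = model.predict(np.array([[1800.0]]))[0]

print(slope, intercept, predicted)
```

On real data the points do not lie exactly on a line, and the fitted coefficients are those that minimize the sum of squared differences between actual and predicted prices.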

Scatter Plot for Linear Regression Model

Distplot for Linear Regression Model

3.1.1 Model evaluation against Linear Regression


Mean Absolute Error (MAE): the simplest error metric used in regression problems. It is the
average of the absolute differences between the predicted and actual values.

Mean Squared Error (MSE): like MAE, except that the difference between the actual and predicted
values is squared, rather than taken in absolute value, before averaging.

Root Mean Squared Error (RMSE): the square root of the MSE. It expresses the typical size of the
difference between the estimated and measured values in the same units as the target.

R Squared (R2): generally used for explanatory purposes, it indicates the goodness of fit of a
set of predicted output values to the actual output values.
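
The four metrics can be written out directly with NumPy, as in this sketch. The function name `regression_metrics` and the sample values are hypothetical; the formulas are the standard definitions.

```python
import numpy as np

def regression_metrics(y_true, y_pred):
    """Return MAE, MSE, RMSE and R2 for two equal-length sequences."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    errors = y_true - y_pred

    mae = np.mean(np.abs(errors))   # average absolute error
    mse = np.mean(errors ** 2)      # average squared error
    rmse = np.sqrt(mse)             # back in the units of the target

    ss_res = np.sum(errors ** 2)                    # unexplained variation
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total variation
    r2 = 1.0 - ss_res / ss_tot

    return {"MAE": mae, "MSE": mse, "RMSE": rmse, "R2": r2}

# Hypothetical actual and predicted values.
metrics = regression_metrics([3.0, 5.0, 7.0, 9.0], [2.0, 5.0, 8.0, 11.0])
print(metrics)
```

Because R2 compares the model with a baseline that always predicts the mean, it becomes negative whenever the model does worse than that baseline.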

3.2 Ridge regression model


Ridge regression is a technique for analysing multiple linear regression data that suffer from
multicollinearity. It adds a penalty on the squared size of the coefficients and is also known as
L2 regularization.
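
The effect of the L2 penalty under multicollinearity can be sketched as follows. The data are synthetic: two almost identical predictors, which is the situation ridge regression is meant for. The value `alpha=1.0` is an arbitrary choice for illustration, not the report's setting.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

# Synthetic data with two nearly identical (collinear) predictors.
rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = x1 + rng.normal(scale=0.01, size=200)  # almost a copy of x1
X = np.column_stack([x1, x2])
y = 3.0 * x1 + rng.normal(scale=0.1, size=200)

ols = LinearRegression().fit(X, y)
ridge = Ridge(alpha=1.0).fit(X, y)  # alpha sets the penalty strength

# The penalty keeps the overall size of the coefficient vector in check.
ols_norm = np.linalg.norm(ols.coef_)
ridge_norm = np.linalg.norm(ridge.coef_)

print("OLS coefficients:  ", ols.coef_)
print("Ridge coefficients:", ridge.coef_)
```

With collinear predictors, ordinary least squares can assign large offsetting coefficients to the two columns, while ridge tends to share the effect between them.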

3.2.1 Model evaluation against Ridge Regression

3.3 Lasso Regression


Lasso stands for Least Absolute Shrinkage and Selection Operator. It is a shrinkage technique:
coefficient estimates are shrunk towards a central point, zero, and some can become exactly zero.
Lasso is also known as L1 regularization.
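
The "selection" part of the name can be illustrated with a short sketch on synthetic data in which only two of five features affect the target. The value `alpha=0.5` is an arbitrary choice for illustration, not the report's setting.

```python
import numpy as np
from sklearn.linear_model import Lasso

# Synthetic data: only the first two of five features influence y.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = 4.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=200)

lasso = Lasso(alpha=0.5).fit(X, y)  # alpha sets the L1 penalty strength

# Coefficients of irrelevant features are driven to exactly zero.
n_zero = int(np.sum(lasso.coef_ == 0.0))

print("Coefficients:", lasso.coef_)
print("Number of zero coefficients:", n_zero)
```

If `alpha` is set too high, the relevant coefficients are shrunk to zero as well and the model underfits, so the penalty strength is usually tuned, for example by cross-validation.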

3.3.1 Model evaluation against Lasso Regression

3.4 Support Vector Regression (SVR)


Support Vector Regression (SVR) is a type of machine learning algorithm used for regression
analysis. The goal of SVR is to find a function that approximates the relationship between the
input variables and a continuous target variable, while minimizing the prediction error.
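
A minimal sketch of fitting an SVR with sklearn is shown below. The data are synthetic, and the kernel, `C`, and `epsilon` values are assumptions chosen for illustration rather than the settings used in this study. The features are standardized first, since SVR is sensitive to feature scale.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Synthetic data: living area versus price in thousands.
rng = np.random.default_rng(0)
X = rng.uniform(500, 4000, size=(200, 1))
y = 0.1 * X.ravel() + rng.normal(scale=5.0, size=200)

# epsilon is the width of the tube within which errors are ignored;
# C controls how strongly errors outside the tube are penalized.
model = make_pipeline(
    StandardScaler(),
    SVR(kernel="rbf", C=100.0, epsilon=1.0),
)
model.fit(X, y)

r2_train = model.score(X, y)  # R2 on the training data
print(r2_train)
```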

3.4.1 Model evaluation against SVR

3.5 Random forest regression

Random Forest is an ensemble technique capable of performing both regression and classification
tasks. It uses multiple decision trees together with a technique called bootstrap aggregation,
commonly known as bagging. The basic idea is to combine multiple decision trees in determining the
final output rather than relying on an individual tree. Each tree is a base learning model trained
on a sample dataset formed by randomly sampling rows and features from the original dataset.
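
The row and feature sampling described above corresponds to the `bootstrap` and `max_features` parameters of sklearn's `RandomForestRegressor`, as in this sketch. The data are synthetic and the parameter values are assumptions for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic data: only the first of three features drives the target.
rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(300, 3))
y = 10.0 * X[:, 0] + rng.normal(scale=0.1, size=300)

forest = RandomForestRegressor(
    n_estimators=50,   # number of decision trees in the ensemble
    bootstrap=True,    # each tree sees rows sampled with replacement
    max_features=2,    # features considered at each split
    random_state=0,
)
forest.fit(X, y)

# The final prediction is the average over all trees.
prediction = forest.predict(np.array([[0.5, 0.5, 0.5]]))[0]
importances = forest.feature_importances_

print(prediction, importances)
```

In sklearn the feature sampling happens at every split rather than once per tree, which is a slight variation on the per-model sampling described above.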

3.5.1 Model evaluation against random forest regression

Model Evaluation Comparison between all Models

Sl. No.  Algorithm                          MAE    MSE    RMSE   R-squared
1        Linear Regression                  0.53   0.52   0.72    0.47
2        Ridge Regression                   0.60   0.63   0.79    0.35
3        Lasso Regression                   0.73   0.98   0.99   -6.06
4        Epsilon-Support Vector Regression  0.50   0.50   0.70    0.49
5        Random Forest Regression           0.53   0.55   0.74    0.43

MAE: Mean Absolute Error; MSE: Mean Squared Error; RMSE: Root Mean Squared Error.
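
To identify the best fit from the comparison table, the reported values can be loaded into a data frame and ranked, as sketched here. The numbers are copied from the table above; the variable names are hypothetical.

```python
import pandas as pd

# Metric values copied from the model comparison table.
results = pd.DataFrame(
    {
        "MAE": [0.53, 0.60, 0.73, 0.50, 0.53],
        "MSE": [0.52, 0.63, 0.98, 0.50, 0.55],
        "RMSE": [0.72, 0.79, 0.99, 0.70, 0.74],
        "R2": [0.47, 0.35, -6.06, 0.49, 0.43],
    },
    index=[
        "Linear Regression",
        "Ridge Regression",
        "Lasso Regression",
        "Epsilon-Support Vector Regression",
        "Random Forest Regression",
    ],
)

ranked = results.sort_values("RMSE")     # lowest error first
best_by_rmse = results["RMSE"].idxmin()  # model with the smallest RMSE
best_by_r2 = results["R2"].idxmax()      # model with the largest R2

print(ranked)
print(best_by_rmse, best_by_r2)
```

On these reported figures both criteria select Epsilon-Support Vector Regression, with Linear Regression a close second.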

CHAPTER 4

FINDINGS, RECOMMENDATIONS AND CONCLUSION

4.1 Findings Based on Observations

The experiment pre-processes the data and evaluates the prediction accuracy of the models. It has
multiple stages that are required to get the prediction results:
• Pre-processing: the datasets are checked and pre-processed. The pre-processing methods handle
the data in different ways, so pre-processing is done over multiple iterations, and the accuracy
is evaluated each time for the combination of methods used.
• Data splitting: the dataset is divided into two parts, one to train the model and the other to
evaluate it. The split is 80% for training and 20% for testing.
• Evaluation: the accuracy of the model is evaluated by measuring R2 and RMSE during training,
alongside a comparison of the actual prices in the test dataset with the prices predicted by the
model.
• Performance: alongside the evaluation metrics, the time required to train each model is
measured to show how the algorithms vary in terms of time.
• Correlation: the correlation between the available features and the house price is evaluated
using the Pearson correlation coefficient to identify whether each feature has a negative,
positive, or zero correlation with the house price.
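
The splitting, evaluation, and timing stages above can be sketched as follows. The data are synthetic and the model is a plain linear regression chosen for brevity; only the 80/20 split and the R2 and RMSE metrics come from the text, and `random_state=42` is an arbitrary choice.

```python
import time

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic features and target, for illustration only.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 4))
y = X @ np.array([3.0, -2.0, 0.5, 0.0]) + rng.normal(scale=0.5, size=500)

# Data splitting: 80% for training, 20% for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Performance: measure the time needed to train the model.
start = time.perf_counter()
model = LinearRegression().fit(X_train, y_train)
train_seconds = time.perf_counter() - start

# Evaluation: compare predicted and actual values on the test set.
y_pred = model.predict(X_test)
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
r2 = r2_score(y_test, y_pred)

print(f"RMSE={rmse:.3f}  R2={r2:.3f}  train time={train_seconds:.4f}s")
```

Repeating the same loop for each algorithm gives the rows of a comparison table.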

4.2 Findings Based on Analysis of Data

• Pre-processing methods played a significant role in the final prediction accuracy, as shown in
the experiment sequence on both public and local data.
• The outlier-handling method that had been suggested gave a worse outcome than Isolation Forest,
which improved the prediction accuracy.

• The performance of the trained models has been measured using the RMSE, R2, MAE, and MSE
metrics.
• The accuracy has been evaluated by plotting the actual prices against the predicted values.

4.3 Recommendation based on findings

4.3 Experiment Results

• Many machine learning algorithms are used for prediction, and previous research has compared
them.
• Using these algorithms is therefore beneficial, so that the results can be as near as possible
to the claimed results.
• However, the prediction accuracy of these algorithms depends heavily on the data given when
training the model.
• If the data is in bad shape, the model will be overfitted and inefficient, which means that
data pre-processing is an important part of this experiment and will affect the final results.
• Thus, multiple combinations of pre-processing methods need to be tested before the data is
ready to be used in training.

4.4 Scope for future research

Future work on this study could be divided into several main areas to improve the results even
further:
• The pre-processing methods used do help the prediction accuracy. However, experimenting with
different combinations of pre-processing methods could achieve better prediction accuracy.
• Make use of the available features and explore whether they could be combined, as binning
features has been shown to improve the data.
• Train the datasets with different regression methods, such as Elastic Net regression, which
combines both the L1 and L2 norms, in order to expand the comparison and check the performance.
• The correlation has shown the association in the local data. Enhancing the local data is
therefore required, to enrich it with features that vary and can provide a strong correlation.
• The factors studied here have a weak correlation with the sale price. Adding more factors that
affect the house price to the local dataset, such as GDP, average income, and population, would
increase the number of factors that have an impact on house prices. This could also lead to
better findings for questions 1 and 2.
