Report Capstone Project House Price Prediction
House Price Prediction
Problem statement:
A house's value is more than just its location and square footage. Like the features that make up a person, an educated party wants to know all the aspects that give a house its value. For example, suppose you want to sell a house and you do not know what price to expect: it cannot be too low or too high. To estimate the price, you would usually look for similar properties in your neighbourhood and, based on the gathered data, assess the value of your own house.
Objective:
Take advantage of all of the feature variables described below and use them to analyse and predict house prices.
1. cid: a notation for a house
2. dayhours: Date house was sold
3. price: Price is prediction target
4. room_bed: Number of Bedrooms/House
5. room_bath: Number of bathrooms/bedrooms
6. living_measure: square footage of the home
7. lot_measure: square footage of the lot
8. ceil: Total floors (levels) in house
9. coast: House which has a view to a waterfront
10. sight: Has been viewed
11. condition: How good the condition is (Overall)
12. quality: grade given to the housing unit, based on grading system
13. ceil_measure: square footage of house apart from basement
14. basement_measure: square footage of the basement
15. yr_built: Built Year
16. yr_renovated: Year when house was renovated
17. zipcode: zip
18. lat: Latitude coordinate
19. long: Longitude coordinate
20. living_measure15: Living room area in 2015 (implies some renovations); this might or might not have affected the lot size area
21. lot_measure15: Lot size area in 2015 (implies some renovations)
22. furnished: Based on the quality of room
23. total_area: Measure of both living and lot
TABLE OF CONTENTS
List of Graphs 4
List of Graphs
Graph No. Graph Title Page No.
2.2.4 Bar graph for Univariate 17
2.2.4 Scatter plot for Bivariate 18
2.2.4 Heat map for Multi-variate 18
2.2.2.1 Histogram plot 19
2.2.2.2 Box plot 19
2.2.2.3 Correlation between variables 20
3.1 Scatter plot for Linear regression model 21
3.1 Distplot for Linear regression model 22
3.2 Scatter plot for Ridge regression model 23
3.2.1 Distplot for Ridge regression model 23
3.3 Scatter plot for Lasso regression 24
3.4 Scatter plot for Support Vector Regression 25
3.4 Distplot for Support Vector Regression 25
3.5 Scatter plot for Random forest regressor 26
3.5 Distplot for Random forest regressor 27
List of Tables
Table No. Table Title Page No.
1 Model Evaluation Comparison between all models 27
CHAPTER 1
Exploratory Data Analysis (EDA) is an important step in any data analysis or data science project. EDA involves generating summary statistics for the numerical data in the dataset and creating various graphical representations to understand the data better. The goal of EDA is to identify patterns, anomalies, and relationships in the data that can inform subsequent steps in the data science process, such as building models or identifying insights. EDA helps us look at data before making any assumptions. It can help identify obvious errors, better understand patterns within the data, detect outliers or anomalous events, and find interesting relations among the variables. It also helps answer questions about standard deviations, categorical variables, and confidence intervals. Once EDA is complete and insights are drawn, its findings can be used for more sophisticated data analysis or modelling, including machine learning.
Data scientists can use exploratory analysis to ensure the results they produce are valid and applicable to the desired business outcomes and goals. EDA also helps stakeholders by confirming they are asking the right questions.
In this report, we carry out EDA on the house price dataset described above, using the Python language. We use the Pandas, NumPy, matplotlib, seaborn, and opendatasets libraries. The dataset is loaded into a data frame and read with pandas; we view the columns and rows of the data, perform descriptive statistics to better understand the features in the dataset, record observations, and find the missing values and duplicate rows. We then discover anomalies in the data and remove them. Univariate visualization summarizes each field in the raw dataset with summary statistics. Bivariate visualizations and summary statistics allow us to assess the relationship between each variable in the dataset and the target variable we are looking at. Predictive models, such as linear regression, then use statistics and data to predict outcomes.
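A minimal sketch of these loading and inspection steps, assuming the dataset is stored in a CSV file named house_data.csv (the file name is an assumption, not the project's actual path):

import pandas as pd

# "house_data.csv" is a hypothetical file name for the house price dataset
df = pd.read_csv("house_data.csv")

print(df.shape)               # number of rows and columns
print(df.head())              # first few rows
df.info()                     # column types and non-null counts
print(df.describe())          # descriptive statistics for the numeric columns

print(df.isnull().sum())      # missing values per column
print(df.duplicated().sum())  # number of duplicate rows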
We plot graphs of different attributes of the dataset and analyse the given data. We then apply several regression algorithms to understand which one best fits the dataset for house price prediction, using model metrics such as Mean Squared Error, Mean Absolute Error, Root Mean Squared Error, and R-Squared. These metrics are analysed for all algorithms in the form of a table to identify the best fit.
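As an illustration only, such a comparison table could be produced along the following lines; the model settings and column handling here are assumptions rather than the project's actual code:

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.svm import SVR
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

df = pd.read_csv("house_data.csv")                        # hypothetical file name
X = df.drop(columns=["price"]).select_dtypes("number")    # numeric features only
y = df["price"]
# assumes missing values have already been handled in preprocessing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

models = {
    "Linear regression": LinearRegression(),
    "Ridge regression": Ridge(),
    "Lasso regression": Lasso(),
    "Support Vector Regression": SVR(),
    "Random forest regressor": RandomForestRegressor(n_estimators=100, random_state=42),
}

rows = []
for name, model in models.items():
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    mse = mean_squared_error(y_test, pred)
    rows.append({"Model": name,
                 "MAE": mean_absolute_error(y_test, pred),
                 "MSE": mse,
                 "RMSE": np.sqrt(mse),
                 "R2": r2_score(y_test, pred)})

print(pd.DataFrame(rows))     # the comparison table of model metrics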
Some of the most common data science tools used for EDA are Python and Jupyter Notebook. The common packages used are pandas, NumPy, matplotlib, seaborn, etc.
One important benefit of conducting exploratory data analysis is that it can help you organize a
dataset before you model it. This can help you start to make assumptions and predictions about
your dataset. Another benefit of EDA is that it can help you understand the variables in your
dataset. This can help you organize your dataset and begin to pinpoint relationships between
variables, which is an integral part of data analysis.
Conducting EDA can also help you identify the relationships between the variables in your
dataset. Identifying the relationships between variables is a critical part of drawing conclusions
from a dataset.
Another important benefit of EDA is helping you choose the right model for your dataset. You can use all of the information gained from conducting an EDA to help you choose a data model. Choosing the right data model is important because it makes it easier for everyone in your organization to understand your data, and there are several commonly used data models to choose from.
You can also use EDA to help you find patterns in a dataset. Finding patterns in a dataset is
important because it can help you make predictions and estimations. This can help your
organization plan for the future and anticipate problems and solutions.
The real estate sector is an important industry with many stakeholders, ranging from regulatory bodies to private companies and investors. Among these stakeholders there is a high demand for a better understanding of the industry's operational mechanisms and driving factors. Today a large amount of data is available on the relevant statistics as well as on additional contextual factors, and it is natural to try to make use of it in order to improve our understanding of the industry.
Suppose we want to build a data science project for house price prediction. Before we build a model on this data, we have to analyse all the information present across the dataset, such as the price of the house, the price actually obtained, the area of the house, and the living measures. All these steps of analysing and modifying the data come under EDA.
Exploratory Data Analysis (EDA) is an approach that is used to analyze the data and discover
trends, patterns, or check assumptions in data with the help of statistical summaries and
graphical representations.
The main goal of the project is to produce accurate price predictions for houses and properties in the upcoming years. The step-by-step process involved is as follows:
1. Requirement gathering: gather the data and extract the key information from it.
2. Normalizing the data (see the sketch after this list).
3. Detecting outliers in the data.
4. Analysis and visualization of the data.
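A small illustrative sketch of the normalization step, using some of the numeric columns named in the objective; the file name and column selection are assumptions:

import pandas as pd
from sklearn.preprocessing import StandardScaler   # MinMaxScaler is an alternative

df = pd.read_csv("house_data.csv")                  # hypothetical file name
num_cols = ["living_measure", "lot_measure", "total_area"]

# scale the selected columns to zero mean and unit variance
scaler = StandardScaler()
df[num_cols] = scaler.fit_transform(df[num_cols])
print(df[num_cols].describe())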
Types of EDA
Depending on the number of columns we are analyzing, we can divide EDA into three types.
1. Univariate Analysis – In univariate analysis, we analyze or deal with only one variable at a
time. The analysis of univariate data is thus the simplest form of analysis since the
information deals with only one quantity that changes. It does not deal with causes or
relationships and the main purpose of the analysis is to describe the data and find patterns
that exist within it.
2. Bi-Variate analysis – This type of data involves two different variables. The analysis of
this type of data deals with causes and relationships and the analysis is done to find out the
relationship between the two variables.
3. Multivariate Analysis – When the data involves three or more variables, it is categorized
under multivariate.
Depending on the type of analysis, we can also subcategorize EDA into two parts: non-graphical analysis, based on summary statistics, and graphical analysis, based on visualizations.
Data Encoding
Some models, such as Linear Regression, do not work with categorical data. In that case we should encode the categorical columns into numerical columns. We can use different methods for encoding, such as Label Encoding or One-Hot Encoding. Both pandas and sklearn provide functions for encoding; in our case we use the Label Encoding function from sklearn.
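A short sketch of Label Encoding with sklearn as described above; the choice of zipcode as the example column and the file name are assumptions:

import pandas as pd
from sklearn.preprocessing import LabelEncoder

df = pd.read_csv("house_data.csv")                   # hypothetical file name

# Label encoding: map each category to an integer
le = LabelEncoder()
df["zipcode_encoded"] = le.fit_transform(df["zipcode"])

# One-hot encoding with pandas, shown for comparison
df = pd.get_dummies(df, columns=["zipcode"], prefix="zip")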
The real estate market is one of the most competitive in terms of pricing, and prices tend to vary significantly based on many factors. Forecasting property prices is therefore an important module in decision making for both buyers and investors, supporting budget allocation, property-finding strategies, and the determination of suitable policies. Hence it has become one of the prime fields in which to apply machine learning concepts to optimize and predict prices with high accuracy. The literature review gives a clear idea of the field and serves as support for future projects. Most authors have concluded that artificial neural networks have the greatest influence in price prediction, but in the real world there are other algorithms that should also be taken into consideration. Investors' decisions are based on market trends so as to reap maximum returns. Developers are interested in knowing future trends for their decision making; this helps them understand the pros and cons and also helps them plan projects. To accurately estimate property prices and future trends, a large amount of data that influences land prices is required for analysis, modelling, and forecasting. The factors that affect the land price have to be studied, and their impact on price also has to be modelled. It has been inferred that establishing a simple linear regression relationship for such time-series data is not viable for prediction. Hence it becomes imperative to establish a non-linear model that fits the data characteristics well enough to analyse and predict future trends. As real estate is a fast-developing sector, the analysis and prediction of land prices using mathematical modelling and other techniques is an urgent need for decision making by all those concerned.
2.1.1 Implementation
An important part of data preprocessing, and one of its main problems, is handling missing values in the dataset. Data scientists must manage missing values because they can adversely affect the operation of machine learning models. Data imputation is one such procedure: missing values are filled in based on the other observations.
Missing values can be inferred by averaging techniques such as the mean, median, or mode. Missing values are usually represented as 'nan', 'NA', or 'null' (refer to Image 5). Below is the list of variables with missing values in the train dataset.
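That list is obtained directly from the data (for example with df.isnull().sum()). The averaging techniques mentioned above could be applied roughly as follows; the column choices are illustrative assumptions:

import pandas as pd

df = pd.read_csv("house_data.csv")                   # hypothetical file name

# numeric columns: impute with the mean or the median
df["living_measure"] = df["living_measure"].fillna(df["living_measure"].mean())
df["lot_measure"] = df["lot_measure"].fillna(df["lot_measure"].median())

# categorical-like columns: impute with the mode (most frequent value)
df["zipcode"] = df["zipcode"].fillna(df["zipcode"].mode()[0])

print(df.isnull().sum())      # verify these columns no longer have missing values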
Uni-Variate: For univariate analysis in house price prediction we chose an attribute such as price, because price can be examined on its own, independently of the other variables.
Bi-Variate: For bivariate analysis we chose attributes such as price and living_measure, because the price is largely determined by living_measure, so these two variables are dependent on each other.
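An illustrative sketch of these univariate, bivariate, and multivariate plots with seaborn; the styling is an assumption, and the newer histplot is used in place of the deprecated distplot:

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

df = pd.read_csv("house_data.csv")                   # hypothetical file name

# univariate: distribution of the target variable price
sns.histplot(df["price"], kde=True)
plt.show()

# bivariate: price against living_measure
sns.scatterplot(x="living_measure", y="price", data=df)
plt.show()

# multivariate: correlation heat map of the numeric columns
sns.heatmap(df.select_dtypes("number").corr(), cmap="coolwarm")
plt.show()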
A linear regression model shows a linear relationship between a dependent variable (y) and one or more independent variables (x), hence the name linear regression. Since linear regression shows a linear relationship, it finds how the value of the dependent variable changes according to the value of the independent variable.
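A minimal sketch of fitting such a model with scikit-learn; using living_measure as the single independent variable is an assumption made only for illustration:

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

df = pd.read_csv("house_data.csv")                   # hypothetical file name
X = df[["living_measure"]]                           # independent variable
y = df["price"]                                      # dependent variable

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

lr = LinearRegression()
lr.fit(X_train, y_train)
print(lr.coef_, lr.intercept_)                       # slope and intercept of the fitted line
y_pred = lr.predict(X_test)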
Mean Squared Error (MSE): MSE is like the MAE, but the difference is that it squares the difference between the actual and predicted output values before summing them, instead of using the absolute value.
Root Mean Squared Error (RMSE): RMSE provides information about the short-term performance of a model by allowing a term-by-term comparison of the actual difference between the estimated and the measured value.
R-Squared (R2): The R-Squared metric is generally used for explanatory purposes and provides an indication of the goodness of fit of a set of predicted output values to the actual output values.
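For reference, these metrics can be written out explicitly with NumPy; y_true and y_pred are assumed to be arrays of actual and predicted prices:

import numpy as np

def mae(y_true, y_pred):
    # Mean Absolute Error: average absolute difference
    return np.mean(np.abs(y_true - y_pred))

def mse(y_true, y_pred):
    # Mean Squared Error: average squared difference
    return np.mean((y_true - y_pred) ** 2)

def rmse(y_true, y_pred):
    # Root Mean Squared Error: square root of the MSE
    return np.sqrt(mse(y_true, y_pred))

def r2(y_true, y_pred):
    # R-Squared: 1 minus the ratio of residual to total sum of squares
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    return 1 - ss_res / ss_tot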
CHAPTER 4
The experiment is done to pre-process the data and evaluate the prediction accuracy of the
models. The experiment has multiple stages that are required to get the prediction results.
These stages can be defined as:
Pre-processing: The dataset is checked and pre-processed using the chosen methods. These methods handle the data in various ways, so preprocessing is done over multiple iterations, and each time the accuracy is evaluated with the combination used.
Data splitting: Dividing the dataset into two parts is essential so that the model can be trained on one part and evaluated on the other. The dataset is split into 80% for training and 20% for testing.
Evaluation: The accuracy is evaluated by measuring R2 and RMSE when training the model, alongside a comparison of the actual prices in the test dataset with the prices predicted by the model.
Performance: Alongside the evaluation metrics, the time required to train each model is measured to show how the algorithms vary in terms of time.
Correlation: The correlation between the available features and the house price is evaluated using the Pearson correlation coefficient to identify whether each feature has a negative, positive, or zero correlation with the house price.
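A sketch of the splitting, evaluation, and correlation stages described above; the 80/20 split and the price target come from the text, while the file name and the model used here are assumptions:

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

df = pd.read_csv("house_data.csv")                   # hypothetical file name
X = df.drop(columns=["price"]).select_dtypes("number")
y = df["price"]

# 80% of the data for training, 20% for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = LinearRegression().fit(X_train, y_train)
pred = model.predict(X_test)
print("RMSE:", np.sqrt(mean_squared_error(y_test, pred)))
print("R2  :", r2_score(y_test, pred))

# Pearson correlation of every numeric feature with the house price
print(X.corrwith(y).sort_values(ascending=False))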
Pre-processing methods played a significant role in the final prediction accuracy, as shown in the experiment sequence on both the public and the local data. Removing outliers with the suggested method gave a worse outcome than Isolation Forest, which improved the prediction accuracy.
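A hedged sketch of outlier removal with Isolation Forest as mentioned above; the contamination rate is an assumption, not a tuned value:

import pandas as pd
from sklearn.ensemble import IsolationForest

df = pd.read_csv("house_data.csv")                   # hypothetical file name
numeric = df.select_dtypes("number")

iso = IsolationForest(contamination=0.01, random_state=42)
labels = iso.fit_predict(numeric)                    # -1 marks an outlier, 1 an inlier

df_clean = df[labels == 1]
print(len(df), "->", len(df_clean), "rows after outlier removal")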
The performance of the trained models has been measured by evaluating the RMSE, R2, MAE, and MSE metrics. The accuracy has been evaluated by plotting the actual prices against the predicted values, as shown below.
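A minimal sketch of such a plot; y_test and pred are assumed to come from a model that has already been trained, as in the sketches above:

import matplotlib.pyplot as plt

def plot_actual_vs_predicted(y_test, pred):
    # scatter the predicted prices against the actual prices
    plt.scatter(y_test, pred, alpha=0.3)
    lo, hi = min(y_test), max(y_test)
    plt.plot([lo, hi], [lo, hi], color="red")        # ideal 45-degree line
    plt.xlabel("Actual price")
    plt.ylabel("Predicted price")
    plt.title("Actual vs predicted house prices")
    plt.show()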
4.3 Recommendation based on findings
Future work on this study could be divided into several main areas to improve the results even further. This can be done by:
Experimenting with different combinations of pre-processing methods, since the methods used do help the prediction accuracy and other combinations may achieve better results.
Making better use of the available features, and combining them where possible, since binning features has been shown to improve the data.
Training the datasets with different regression methods, such as Elastic Net regression, which combines both the L1 and L2 norms, in order to expand the comparison and check the performance (see the sketch after this list).
Enhancing the local data, since the correlation analysis has shown the associations in the local data; it needs to be made rich with features that vary and can provide strong correlation relationships.
Adding more factors that affect the house price, such as GDP, average income, and population, to the local dataset, since the factors studied here have a weak correlation with the sale price. Increasing the number of factors that have an impact on house prices could also lead to better findings for questions 1 and 2.
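A sketch of the Elastic Net suggestion from the list above; the alpha and l1_ratio values are placeholders rather than tuned results, and the file name is an assumption:

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import ElasticNet
from sklearn.metrics import mean_squared_error, r2_score

df = pd.read_csv("house_data.csv")                   # hypothetical file name
X = df.drop(columns=["price"]).select_dtypes("number")
y = df["price"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# l1_ratio blends the L1 (Lasso) and L2 (Ridge) penalties
enet = ElasticNet(alpha=1.0, l1_ratio=0.5).fit(X_train, y_train)
pred = enet.predict(X_test)
print("RMSE:", np.sqrt(mean_squared_error(y_test, pred)))
print("R2  :", r2_score(y_test, pred))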