Making Predictions

This document outlines a structured approach to making predictions using Machine Learning, specifically focusing on predicting house prices in Boston. It covers essential steps such as understanding the problem, hypothesis generation, data exploration, preprocessing, feature engineering, model training, and evaluation. The document emphasizes the importance of following a systematic process to achieve accurate predictions and includes practical coding examples and assignments for hands-on learning.

Uploaded by

semeriuss

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

15 views13 pages

Making Predictions

Uploaded by

semeriuss

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

You are on page 1/ 13

Getting familiarized with Making predictions process using Machine Learning

Learning Objectives
At the end of this session you will be able to:
● Understand machine learning based making prediction process
● Apply correlation technique for feature selection purpose
● Build machine learning predictor for Boston Housing
● Learn how to evaluate the predictor
Introduction
Making predictions using Machine Learning isn't just about grabbing the data and feeding it to
algorithms. The algorithm might spit out some prediction but that's not what you are aiming for.
The difference between good data science professionals and naive data science aspirants is that
the former set follows this process religiously. The process is as follows: 1. Understand the
problem: Before getting the data, we need to understand the problem we are trying to solve. If
you know the domain, think of which factors could play an epic role in solving the problem. If you
don't know the domain, read about it. 2. Hypothesis Generation: This is quite important, yet it is
often forgotten. In simple words, hypothesis generation refers to creating a set of features which
could influence the target variable given a confidence interval ( taken as 95% all the time). We
can do this before looking at the data to avoid biased thoughts. This step often helps in creating
new features. 3. Get Data: Now, we download the data and look at it. Determine which features
are available and which aren't, how many features we generated in hypothesis generation hit the
mark, and which ones could be created. Answering these questions will set us on the right track.
4. Data Exploration: We can't determine everything by just looking at the data. We need to dig
deeper. This step helps us understand the nature of variables ( missing, zero variance feature)
so that they can be treated properly. It involves creating charts, graphs (univariate and bivariate
analysis), and cross-tables to understand the behavior of features. 5. *Data Preprocessing: *Here,
we impute missing values and clean string variables (remove space, irregular tabs, data time
format) and anything that shouldn't be there. This step is usually followed along with the data
exploration stage. 6. Feature Engineering: Now, we create and add new features to the data set.
Most of the ideas for these features come during the hypothesis generation stage. 7. Model
Training: Using a suitable algorithm, we train the model on the given data set. 8. Model Evaluation:
Once the model is trained, we evaluate the model's performance using a suitable error metric.
Here, we also look for variable importance, i.e., which variables have proved to be significant in
determining the target variable. And, accordingly we can shortlist the best variables and train the
model again. 9. Model Testing: Finally, we test the model on the unseen data (test data) set.

We'll follow this process in the project to arrive at our final predictions. Let's get started.
1.Understand the problem
This lab aims at predicting house prices (residential) in Boston, USA. I believe this problem
statement is quite self-explanatory and doesn't need more explanation. Hence, we move to the
next step.

2. Hypothesis Generation
Well, this is going to be interesting. What factors can you think of right now which can influence
house prices ? As you read this, I want you to write down your factors as well, then we can match
them with the data set. Defining a hypothesis has two parts: Null Hypothesis (Ho) and Alternate
Hypothesis(Ha). They can be understood as:

Ho - There exists no impact of a particular feature on the dependent variable. Ha - There exists a
direct impact of a particular feature on the dependent variable.
Based on a decision criterion (say, 5% significance level), we always 'reject' or 'fail to reject' the
null hypothesis in statistical parlance. Practically, while model building we look for probability (p)
values. If p value < 0.05, we reject the null hypothesis. If p > 0.05, we fail to reject the null
hypothesis. Some factors which I can think of that directly influence house prices are the following:
Per capita crime rate by town
Proportion of residential land zoned for lots over 25,000 sq. ft
Proportion of non-retail business acres per town
Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
Nitric oxide concentration (parts per 10 million)
Average number of rooms per dwelling
Proportion of owner-occupied units built prior to 1940
Weighted distances to five Boston employment centers
Index of accessibility to radial highways
Full-value property tax rate per $10,000
Pupil-teacher ratio by town
1000(Bk — 0.63)², where Bk is the proportion of [people of African American descent] by town
LSTAT: Percentage of lower status of the population
Median value of owner-occupied homes in $1000s
…keep thinking. I am sure you can come up with many more apart from these.
3. Get Data
You can download the data from this https://www.kaggle.com/altavish/boston-housing-dataset
and load it in your python IDE. Also, check the competition page where all the details about the
data and variables are given. The data set consists of 13 explanatory variables. Yes, it's going to
be one heck of a data exploration ride. But, we'll learn how to deal with so many variables. The
target variable is MEDV. As you can see the data set comprises numeric, categorical, and ordinal
variables.
4. Data Exploration
Data Exploration is the key to getting insights from data. Practitioners say a good data exploration
strategy can solve even complicated problems in a few hours. A good data exploration strategy
comprises the following:

1. Univariate Analysis - It is used to visualize one variable in one plot. Examples: histogram,
density plot, etc.
2. Bivariate Analysis - It is used to visualize two variables (x and y axis) in one plot. Examples:
bar chart, line chart, area chart, etc.
3. Multivariate Analysis - As the name suggests, it is used to visualize more than two variables
at once. Examples: stacked bar chart, dodged bar chart, etc.
4. Cross Tables -They are used to compare the behavior of two categorical variables (used in
pivot tables as well).

Let's load the necessary libraries and data and start coding.

After we read the data, we can look at the data using:

The description of all the features is given below:

CRIM: Per capita crime rate by town
ZN: Proportion of residential land zoned for lots over 25,000 sq.
ft
INDUS: Proportion of non-retail business acres per town
CHAS: Charles River dummy variable (= 1 if tract bounds river; 0
otherwise)
NOX: Nitric oxide concentration (parts per 10 million)
RM: Average number of rooms per dwelling
AGE: Proportion of owner-occupied units built prior to 1940
DIS: Weighted distances to five Boston employment centers
RAD: Index of accessibility to radial highways
TAX: Full-value property tax rate per $10,000
PTRATIO: Pupil-teacher ratio by town
B: 1000(Bk — 0.63)², where Bk is the proportion of [people of
African American descent] by town
LSTAT: Percentage of lower status of the population
MEDV: Median value of owner-occupied homes in $1000s
The prices of the house indicated by the variable MEDV is our target
variable and the remaining are the feature variables based on
which we will predict the value of a house.

Alternatively, you can also check the data set information using the info() command.
5.Data Preprocessing
After loading the data, it’s a good practice to see if there are any missing values in the data. We count
the number of missing values for each feature using isnull()

Out of 14 features, 6 features have missing values. Let's check the percentage of missing values in
these columns.

We can infer that the all variables has 3.9% missing values. Let's look at a pretty picture explaining
these missing values using a bar plot.
Let's proceed and check the distribution of the target variable.
We see that the values of MEDV are distributed normally with few
outliers.
Next, we create a correlation matrix that measures the linear
relationships between the variables. The correlation matrix can be
formed by using the corr function from the pandas dataframe
library. We will use the heatmap function from the seaborn library to
plot the correlation matrix.

The correlation coefficient ranges from -1 to 1. If the value is close

to 1, it means that there is a strong positive correlation between the
two variables. When it is close to -1, the variables have a strong
negative correlation.
Observations:
To fit a linear regression model, we select those features which have
a high correlation with our target variable MEDV. By looking at the
correlation matrix we can see that RM has a strong positive
correlation with MEDV (0.7) where as LSTAT has a high negative
correlation with MEDV(-0.74).
6. Feature Engineering
An important point in selecting features for a linear regression
model is to check for multi-co-linearity. The features RAD, TAX
have a correlation of 0.91. These feature pairs are strongly
correlated to each other. We should not select both these features
together for training the model. Check this for an explanation.
Same goes for the features DIS and AGE which have a correlation
of -0.75.
Based on the above observations we will select RM and LSTAT as
our features. Using a scatter plot let’s see how these features vary
with MEDV.

Observations:
The prices increase as the value of RM increases linearly. There are
few outliers and the data seems to be capped at 50.
The prices tend to decrease with an increase in LSTAT. Though it
doesn’t look to be following exactly a linear line.

Feature Normalization
If you look at the values, note that house sizes are about 1000 times
the number of bedrooms. When features differ by orders of
magnitude, first performing feature scaling can make gradient
descent converge much more quickly. To normalize the dataset the
following steps is used:
Subtract the mean value of each feature from the dataset.
After subtracting the mean, additionally scale (divide) the feature
values by their respective “standard deviations."

The standard deviation is a way of measuring how much variation

there is in the range of values of a particular feature (most data
points will lie within ±2 standard deviations of the mean); this is
an alternative to taking the range of values (max-min). When
normalizing the features, the values used for normalization should
be kept for later use. After learning the parameters from the model,
we often want to predict the prices of houses we have not seen
before. Given a new x value (living room area and number of bed-
rooms), we must first normalize x using the mean and standard
deviation that we had previously computed from the training set.
Here the code snapshot that is used for the normalization purpose.

Model Training using Gradient descent

Here is the code snap shot for implementing the gradient descent
optimization techniques according to the formula discussed during
the lecture session.
One of the mechanisms to make sure that gradient descent is
working properly or not is to look at the value of the cost function
and check that it is decreasing with each iteration. After the
gradient descent is properly implemented, the value of the cost
function must decease and should converge to a steady value by the
end of the algorithm. The final values for the model parameters will
be used to make prediction based on the new observations. The
following code snap shoot and graph demonstrate the correct
behavior of the cost function.
Testing the model
Using the final value of the model parameters predict MEDV with
LSTAT 9.14 and RM 6.42. The values should be normalized.
Analyzing the Impact of Learning Rate
In this part you will apply different values of learning rates for the
dataset and find a learning rate that converges quickly. Here is the
code snap shoot and the graph of cost function against the number
of iterations for different values of learning rates.

7. Model Training
We concatenate the LSTAT and RM columns using np.c_ provided
by the numpy library.
Splitting the data into training and testing sets
Next, we split the data into training and testing sets. We train the
model with 80% of the samples and test with the remaining 20%.
We do this to assess the model’s performance on unseen data. To
split the data we use train_test_split function provided by scikit-
learn library. We finally print the sizes of our training and test set
to verify if the splitting has occurred properly.

Training and testing the model

We use scikit-learn’s LinearRegression to train our model on both
the training and test sets.

Model evaluation
We will evaluate our model using RMSE and R2-score.
Assignment
This question will use the Boston housing dataset once again. Again, create a test
set consisting of 1/2 of the data using the rest for training.
1. Build and evaluate the model by using additional one feature which has high
feature next to RM and LSTV
2. Fit a polynomial regression model to the training data.
3. Predict the labels for the corresponding test data.
4. Evaluate and generate the model parameters.
5. Out of these predictors used in this assignment, which would you choose as a
final model for the boston housing?

House Prices Prediction in King County
No ratings yet
House Prices Prediction in King County
10 pages
House Price Prdiction Mini Project Report
100% (2)
House Price Prdiction Mini Project Report
8 pages
House Price Prediction: Project Description
No ratings yet
House Price Prediction: Project Description
11 pages
Project Presentation On House Price Prediction System: Presented by Name: Simran B Solanki Roll No: 19020
100% (1)
Project Presentation On House Price Prediction System: Presented by Name: Simran B Solanki Roll No: 19020
32 pages
A Short Guide For Feature Engineering and Feature Selection
No ratings yet
A Short Guide For Feature Engineering and Feature Selection
32 pages
House Price Prediction 1
No ratings yet
House Price Prediction 1
27 pages
R Doc Ii Vee
No ratings yet
R Doc Ii Vee
24 pages
Chapter Four Statistical Quality Control (SQC)
No ratings yet
Chapter Four Statistical Quality Control (SQC)
31 pages
Real Estate Price Prediction Model
No ratings yet
Real Estate Price Prediction Model
3 pages
FALLSEM2021-22 MDI4001 ETH VL2021220104135 Reference Material I 09-Aug-2021 Data2 1
No ratings yet
FALLSEM2021-22 MDI4001 ETH VL2021220104135 Reference Material I 09-Aug-2021 Data2 1
9 pages
Prediction
100% (1)
Prediction
10 pages
Sberbank Project Report
No ratings yet
Sberbank Project Report
19 pages
Price Prediction
100% (1)
Price Prediction
13 pages
House Prices
No ratings yet
House Prices
5 pages
The Boston Housing Dataset
100% (2)
The Boston Housing Dataset
4 pages
Final
No ratings yet
Final
14 pages
Mean Worksheet
50% (2)
Mean Worksheet
3 pages
220 - 1 Leading Indicators Execution Phase
100% (3)
220 - 1 Leading Indicators Execution Phase
36 pages
House Ames Project
No ratings yet
House Ames Project
15 pages
20MIS1025 - Regression - Ipynb - Colaboratory
No ratings yet
20MIS1025 - Regression - Ipynb - Colaboratory
5 pages
Case Study 219302405
No ratings yet
Case Study 219302405
14 pages
DM Assignment
No ratings yet
DM Assignment
17 pages
Linear Reg
No ratings yet
Linear Reg
25 pages
Dawit House
No ratings yet
Dawit House
49 pages
House-Price-Prediction-Using-Regression-Techniques Retouch - Removed
No ratings yet
House-Price-Prediction-Using-Regression-Techniques Retouch - Removed
14 pages
Module 2
No ratings yet
Module 2
20 pages
4 - Học Máy Cơ Bản - Hồi Quy Tuyến Tính
No ratings yet
4 - Học Máy Cơ Bản - Hồi Quy Tuyến Tính
113 pages
ML Book Notes
No ratings yet
ML Book Notes
9 pages
Module 2notes
No ratings yet
Module 2notes
44 pages
House Pricing
No ratings yet
House Pricing
15 pages
Regression Dataset
No ratings yet
Regression Dataset
3 pages
UNIT I Introduction PPT Instrumentation
No ratings yet
UNIT I Introduction PPT Instrumentation
55 pages
Pratapa P Evidence of Learning 4
No ratings yet
Pratapa P Evidence of Learning 4
2 pages
Boston House Prediction - Colab1
No ratings yet
Boston House Prediction - Colab1
10 pages
Experiment Number: 3: Aim:-Study of The Linear Regression in The Machine Learning Using The Boston Housing Dataset. 1)
No ratings yet
Experiment Number: 3: Aim:-Study of The Linear Regression in The Machine Learning Using The Boston Housing Dataset. 1)
14 pages
Experiment Number: 3: Aim:-Study of The Linear Regression in The Machine Learning Using The Boston Housing Dataset. 1)
No ratings yet
Experiment Number: 3: Aim:-Study of The Linear Regression in The Machine Learning Using The Boston Housing Dataset. 1)
16 pages
Comprehensive Data Exploration With Python
No ratings yet
Comprehensive Data Exploration With Python
20 pages
Report
No ratings yet
Report
40 pages
Bi El
No ratings yet
Bi El
26 pages
Continuous Assessment
No ratings yet
Continuous Assessment
4 pages
Assignment
No ratings yet
Assignment
3 pages
Problem Statement
No ratings yet
Problem Statement
6 pages
Xgboost
No ratings yet
Xgboost
12 pages
ML Project Part A 1
No ratings yet
ML Project Part A 1
6 pages
T2 Summary VHA
No ratings yet
T2 Summary VHA
14 pages
Cap8 Predicting Continuous Target Variables With Regression Analysis - Thakur Ankita 2016 - Python Real World Data Science
No ratings yet
Cap8 Predicting Continuous Target Variables With Regression Analysis - Thakur Ankita 2016 - Python Real World Data Science
36 pages
Boston Housing
No ratings yet
Boston Housing
17 pages
Module 5
No ratings yet
Module 5
46 pages
Module 2
No ratings yet
Module 2
35 pages
Lec3 4 ML Project
No ratings yet
Lec3 4 ML Project
26 pages
Evaluation of Volumetric and Mechanistic Properties of Asphalt Mi
No ratings yet
Evaluation of Volumetric and Mechanistic Properties of Asphalt Mi
125 pages
AIMLlatestmodule 2notes Removed
No ratings yet
AIMLlatestmodule 2notes Removed
33 pages
BPCC-104 EM 2021-22 IHT 7980393936 - Watermarked
No ratings yet
BPCC-104 EM 2021-22 IHT 7980393936 - Watermarked
12 pages
Data Analysis Project MAIN
No ratings yet
Data Analysis Project MAIN
6 pages
Statistics For Management MCQs and Terminal Questions From All Units
No ratings yet
Statistics For Management MCQs and Terminal Questions From All Units
22 pages
Chapter - 3 Six Sigma
No ratings yet
Chapter - 3 Six Sigma
24 pages
B.B.A. (2024-25)
No ratings yet
B.B.A. (2024-25)
45 pages
ML Project CLG
No ratings yet
ML Project CLG
62 pages
Measures of Dispersion New
No ratings yet
Measures of Dispersion New
27 pages
Lesson3 - Unit 1
No ratings yet
Lesson3 - Unit 1
33 pages
Ese Lab File
No ratings yet
Ese Lab File
30 pages
11330-Article Text-22387-1-10-20220906
No ratings yet
11330-Article Text-22387-1-10-20220906
15 pages
House Price Prediction
No ratings yet
House Price Prediction
14 pages
IC Product Quality Control Chart Sample 11221
No ratings yet
IC Product Quality Control Chart Sample 11221
7 pages
The Moderating Role of Socioeconomic Status On Motivation of Adoles 2018 Sys
No ratings yet
The Moderating Role of Socioeconomic Status On Motivation of Adoles 2018 Sys
9 pages
PIIS1440244023004590
No ratings yet
PIIS1440244023004590
6 pages
Ctryprem
No ratings yet
Ctryprem
213 pages
Administered of The Raven's Standard Progressive Matrices With A Time Limit
No ratings yet
Administered of The Raven's Standard Progressive Matrices With A Time Limit
18 pages
Qa Pastpapers
No ratings yet
Qa Pastpapers
146 pages
Grade-9 Maths Real Life Connection 2
No ratings yet
Grade-9 Maths Real Life Connection 2
4 pages
STS 311
No ratings yet
STS 311
7 pages
Chapter Four
No ratings yet
Chapter Four
11 pages
Neural Mechanisms of Mindfulness-Based Interventions in Anxiety Disorders: A Systematic Review
No ratings yet
Neural Mechanisms of Mindfulness-Based Interventions in Anxiety Disorders: A Systematic Review
8 pages
PIIS0022391323006480
No ratings yet
PIIS0022391323006480
9 pages
Unit 5
No ratings yet
Unit 5
18 pages
QOCA单网平差
No ratings yet
QOCA单网平差
3 pages
Unit 2
No ratings yet
Unit 2
78 pages
Anomaly Construction in Climate Data
No ratings yet
Anomaly Construction in Climate Data
15 pages
Homework 5 Answers
No ratings yet
Homework 5 Answers
126 pages
Project Report
No ratings yet
Project Report
15 pages
Unit 1: Shobana T S Assistant Professor Dept. of ISE, BMSCE
No ratings yet
Unit 1: Shobana T S Assistant Professor Dept. of ISE, BMSCE
127 pages
House Value
No ratings yet
House Value
22 pages
Test 2.3a - Measures of Location and Spread
No ratings yet
Test 2.3a - Measures of Location and Spread
1 page
EDA Explanations
No ratings yet
EDA Explanations
22 pages
A Primer of Permutation Statistical Methods ISBN 3030209326, 9783030209322 EPUB DOCX PDF Download
No ratings yet
A Primer of Permutation Statistical Methods ISBN 3030209326, 9783030209322 EPUB DOCX PDF Download
15 pages
Kin Jer Ski 2006
No ratings yet
Kin Jer Ski 2006
7 pages
MS5107 Boston Housing, Corolla NUIG
No ratings yet
MS5107 Boston Housing, Corolla NUIG
6 pages
Regression Analysis - Lasso and Ridge Regularization
No ratings yet
Regression Analysis - Lasso and Ridge Regularization
17 pages
IJNRD2211112
No ratings yet
IJNRD2211112
10 pages

Making Predictions

Uploaded by

Making Predictions

Uploaded by

Getting familiarized with Making predictions process using Machine Learning

After we read the data, we can look at the data using:

The description of all the features is given below:

The correlation coefficient ranges from -1 to 1. If the value is close

The standard deviation is a way of measuring how much variation

Model Training using Gradient descent

Training and testing the model

You might also like