BDA Unit 4
Linear Regression
Logistic Regression
Polynomial Regression
Linear Regression
It is one of the most basic and widely used
machine learning algorithms.
It is a predictive modeling technique used to
predict a continuous dependent variable, given
one or more independent variables.
Simple linear regression
One independent and one dependent variable.
Multiple linear regression
More than one independent variable and one
dependent variable.
Here, the relationship between the
dependent and independent variable is
assumed to be linear; thus, when we plot
their relationship, we observe a straight
line rather than a curve.
Equation used to represent a linear
regression model:
Y = b0 + b1 * X
Where,
b0 - Y-intercept
b1 - slope of the line
Multiple linear regression
Extension of linear regression to
relationships involving more than two
variables.
In simple linear regression we have one
predictor and one response variable, but in
multiple regression we have more than one
predictor variable and one response variable.
Linear Regression – Worked Example
To calculate b0 and b1 (x = no. of hours studied, y = grade; x̄ = 48/10 = 4.8, ȳ = 778/10 = 77.8):

x    y     x - x̄    y - ȳ    (x - x̄)(y - ȳ)   (x - x̄)²
2    69    -2.8     -8.8      24.64             7.84
9    98     4.2     20.2      84.84            17.64
5    82     0.2      4.2       0.84             0.04
5    77     0.2     -0.8      -0.16             0.04
3    71    -1.8     -6.8      12.24             3.24
7    84     2.2      6.2      13.64             4.84
1    55    -3.8    -22.8      86.64            14.44
8    94     3.2     16.2      51.84            10.24
6    84     1.2      6.2       7.44             1.44
2    64    -2.8    -13.8      38.64             7.84
Σ = 48   Σ = 778              Σ = 320.6         Σ = 67.6

b1 = Σ(x - x̄)(y - ȳ) / Σ(x - x̄)² = 320.6 / 67.6 ≈ 4.74
b0 = ȳ - b1 · x̄ = 77.8 - 4.74 × 4.8 ≈ 55.04

To predict the grade when no. of hours studied = 3:
y = b0 + b1 · x ≈ 55.04 + 4.74 × 3 ≈ 69.26
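The same calculation as a minimal R sketch (data taken directly from the table above):

hours <- c(2, 9, 5, 5, 3, 7, 1, 8, 6, 2)
grade <- c(69, 98, 82, 77, 71, 84, 55, 94, 84, 64)
b1 <- sum((hours - mean(hours)) * (grade - mean(grade))) / sum((hours - mean(hours))^2)  # 320.6 / 67.6
b0 <- mean(grade) - b1 * mean(hours)
b0 + b1 * 3   # predicted grade for 3 hours of study, about 69.26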
Steps to Establish a Regression
Y = b0 + b1 * X1 + b2 * X2
Where,
b0 - Y-intercept
b1, b2 - regression coefficients of X1 and X2
X1 - Interest_Rate
X2 - Unemployment_Rate
Year Month Interest_Rate Unemployment_Rate Stock_Index_Price
2017 12 2.75 5.3 1464
2017 11 2.5 5.3 1394
2017 10 2.5 5.3 1357
2017 9 2.5 5.3 1293
2017 8 2.5 5.4 1256
2017 7 2.5 5.6 1254
2017 6 2.5 5.5 1234
2017 5 2.25 5.5 1195
2017 4 2.25 5.5 1159
2017 3 2.25 5.6 1167
2017 2 2 5.7 1130
2017 1 2 5.9 1075
2016 12 2 6 1047
2016 11 1.75 5.9 965
2016 10 1.75 5.8 943
2016 9 1.75 6.1 958
2016 8 1.75 6.2 971
2016 7 1.75 6.1 949
2016 6 1.75 6.1 884
2016 5 1.75 6.1 866
2016 4 1.75 5.9 876
2016 3 1.75 6.2 822
2016 2 1.75 6.2 704
2016 1 1.75 6.1 719
Step 2: Capture the data in R
Step 3: Apply multiple linear regression in R
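A minimal R sketch of Steps 2 and 3 (the file name stock.csv is a hypothetical placeholder for the table above):

# Step 2: capture the data in R
data <- read.csv("stock.csv")
# Step 3: fit the multiple linear regression
model <- lm(Stock_Index_Price ~ Interest_Rate + Unemployment_Rate, data = data)
summary(model)   # coefficients b0, b1, b2 give the regression equation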
Multiple linear regression equation:
Stock_Index_Price = b0 + b1 * Interest_Rate + b2 * Unemployment_Rate
(the values of b0, b1, b2 come from the fitted model)
Step 4: Make a prediction.
Predict the stock index price for the
following data:
Interest Rate = 1.5 (i.e., X1= 1.5)
Unemployment Rate = 5.8 (i.e., X2= 5.8)
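Continuing the R sketch above, the prediction for these values (using the same hypothetical variable names):

new_data <- data.frame(Interest_Rate = 1.5, Unemployment_Rate = 5.8)
predict(model, newdata = new_data)   # predicted Stock_Index_Price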
Example – II
By using sample.split() (from the caTools
package) we are actually creating a logical
vector with two values, TRUE and FALSE.
By setting the SplitRatio to 0.7, we are
splitting the original Revenue dataset of
1000 rows into 70% training and 30%
testing data (see the sketch below).
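A minimal R sketch (the revenue data frame and its Profit column are hypothetical names):

library(caTools)
set.seed(123)                                   # reproducible split
split <- sample.split(revenue$Profit, SplitRatio = 0.7)
train <- subset(revenue, split == TRUE)         # 70% of the rows
test  <- subset(revenue, split == FALSE)        # remaining 30%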
Logistic Regression
Use case – College Admission using Logistic Regression
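A minimal R sketch for this use case, assuming an admissions data frame with an admit column (0/1) and gre/gpa predictors - all hypothetical names:

admissions <- read.csv("admissions.csv")        # hypothetical file
model <- glm(admit ~ gre + gpa, data = admissions, family = binomial)
summary(model)
predict(model, newdata = data.frame(gre = 700, gpa = 3.7),
        type = "response")                      # probability of admission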
Polynomial Regression
y = θ0 + θ1·x + θ2·x² + … + θn·xⁿ
Where,
θ0 is the bias,
θ1, θ2, …, θn are the weights in the equation of the
polynomial regression, and
n is the degree of the polynomial.
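A minimal R sketch of polynomial regression (the data here are synthetic, for illustration only):

set.seed(1)
x <- 1:10
y <- 2 + 0.5 * x^2 + rnorm(10)                  # quadratic trend plus noise
model <- lm(y ~ poly(x, degree = 2, raw = TRUE))
summary(model)                                  # coefficients estimate θ0, θ1, θ2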
Co-variance
Covariance is a measure of how much two
random variables vary together
Its value lies between -infinity and +infinity.
A positive value shows that both variables
vary in the same direction, and a negative
value shows that they vary in opposite
directions.
Measure of correlation
Covariance: cov(X, Y) = Σ (xi - x̄)(yi - ȳ) / (n - 1)
Correlation: r(X, Y) = cov(X, Y) / (σX · σY)
Unlike covariance, correlation always lies between -1 and +1.
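A minimal R sketch (reusing the hours/grades data from the linear regression example):

x <- c(2, 9, 5, 5, 3, 7, 1, 8, 6, 2)
y <- c(69, 98, 82, 77, 71, 84, 55, 94, 84, 64)
cov(x, y)   # positive: the variables vary in the same direction
cor(x, y)   # same sign as the covariance, but scaled to [-1, 1]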
Given:
D = {t1, …, tn} where ti = <ti1, …, tih>
Database schema contains {A1, A2, …, Ah}
Classes C = {C1, …, Cm}
A Decision (or Classification) Tree is a tree associated with D such that:
Each internal node is labeled with an attribute, Ai
Each arc is labeled with a predicate that can be applied to the
attribute at its parent
Each leaf node is labeled with a class, Cj
Solving the classification problem using decision trees is
a two-step process (see the R sketch below):
1. Decision tree induction: construct a DT using the training data.
2. Classification: for each ti ∈ D, apply the DT to determine its class.
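A minimal R sketch of the two steps, using the rpart package and the built-in iris data (an assumption; the slides do not name a specific implementation):

library(rpart)
# Step 1: induction - build the tree from training data
tree <- rpart(Species ~ ., data = iris, method = "class")
# Step 2: classification - apply the tree to determine each tuple's class
predict(tree, iris, type = "class")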
Decision tree based algorithms
DT Splits Area: [diagram - the predictor space is split first on Gender (M / F), then on Height]
Comparing DT’s: [diagrams contrasting a balanced tree with a deep tree]
Random Forests
Random forest is a supervised learning algorithm that is
used for both classification and regression, though it is
mainly used for classification problems.
A forest is made up of trees, and more trees make the
forest more robust.
Similarly, the random forest algorithm creates decision trees
on data samples, gets a prediction from each of them, and
finally selects the best solution by means of voting.
It is an ensemble method that is better than a single
decision tree because it reduces over-fitting by
averaging the results (see the sketch below).
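A minimal R sketch using the randomForest package and the built-in iris data (an assumption; the slides do not name a package):

library(randomForest)
set.seed(42)
rf <- randomForest(Species ~ ., data = iris, ntree = 500)  # 500 trees, majority vote
print(rf)                 # OOB error estimate and confusion matrix
predict(rf, iris[1:5, ])  # predicted classes for the first five rows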