Machine Learning Techniques (KCS 055)
Unit 2
Regression Algorithm
House Price
Challenges in guessing the House price
Predicting the price with the help of ML model
Regression Model
Simple Linear Regression
Y = a + bX
• Y – Dependent variable
• X – Independent variable
• a – Y-intercept (the value of Y when X is 0)
• b – Slope (how much Y changes for a unit change in X)
Linear Regression
Area (sq. feet)   Price (in Lakhs)
100               10
200               20
300               30
[Scatter plot of Price (in Lakhs) vs. Area (sq. feet) for these data points]
Linear Regression
Y = a + bX
Y -> Price
X -> Area
[Scatter plot of Price vs. Area with the fitted regression line]
Linear Regression
Slope (b) = Sum of product of deviations / Sum of squares of deviation for X
Y-intercept (a) = Mean of Y – (b * Mean of X)

Area (X)    Price (Y)   Mean    Mean    Deviation (X)      Deviation (Y)    Product of    Square of
(sq. feet)  (Lakhs)     of X    of Y    X – mean(X)        Y – mean(Y)      Deviations    Deviation for X
100         10          200     20      100 – 200 = -100   10 – 20 = -10    1000          10,000
200         20                          200 – 200 = 0      20 – 20 = 0      0             0
300         30                          300 – 200 = 100    30 – 20 = 10     1000          10,000

Slope (b) = (1000 + 0 + 1000) / (10,000 + 0 + 10,000) = 2000 / 20,000 = 0.1
Y-intercept (a) = 20 – (0.1 * 200) = 0
So the fitted line is Price = 0.1 * Area.
[Scatter plot of Price vs. Area with outliers highlighted]
Outliers
An observation that lies an abnormal distance from other
values in a random sample from a population
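A small Python sketch of one common convention for flagging such points, the 1.5 * IQR rule (the choice of rule and the sample numbers below are illustrative, not from the slides):

```python
import numpy as np

def iqr_outliers(values):
    """Flag points more than 1.5 * IQR outside the inter-quartile range."""
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [v for v in values if v < low or v > high]

prices = [10, 12, 11, 13, 12, 45]   # 45 lies far from the rest
print(iqr_outliers(prices))          # [45]
```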
Predict the price of a pizza whose diameter is 20 inches.

Diameter (inches)   Price (Dollar)
8                   10
10                  13
12                  16
Diameter (X)   Price (Y)   Mean    Mean    Deviation (X)   Deviation (Y)    Product of    Square of
(inches)       (Dollar)    of X    of Y    X – mean(X)     Y – mean(Y)      Deviations    Deviation for X
8              10          10      13      8 – 10 = -2     10 – 13 = -3     6             4
10             13                          10 – 10 = 0     13 – 13 = 0      0             0
12             16                          12 – 10 = 2     16 – 13 = 3      6             4

Slope (b) = Sum of product of deviations / Sum of squares of deviation for X = 12 / 8 = 1.5
Y-intercept (a) = Mean of Y – (b * Mean of X) = 13 – (1.5 * 10) = -2

Price when X is 20:
Price = a + bX = -2 + 1.5 * 20 = 28
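A minimal Python sketch of the same least-squares calculation using the deviation formulas above; the data and the 20-inch query come from the slide, while the function name is only illustrative:

```python
def fit_simple_linear(xs, ys):
    """Least-squares slope and intercept from the deviation formulas."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of product of deviations and sum of squared deviations for X
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b = sxy / sxx                     # slope
    a = mean_y - b * mean_x           # y-intercept
    return a, b

diameters = [8, 10, 12]               # inches
prices = [10, 13, 16]                 # dollars
a, b = fit_simple_linear(diameters, prices)
print(a, b)                           # -2.0, 1.5
print(a + b * 20)                     # predicted price of a 20-inch pizza: 28.0
```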
[Chart "Pizza Price": the data points and the fitted line Price = -2 + 1.5 * Diameter, plotted for diameters 0 to 25 inches]
The world is not so linear
Multiple Linear Regression
• When the data has more than one independent variable.

Y = a + b1X1 + b2X2 + b3X3 + ………… + bnXn
Dataset
Use the following steps to fit a multiple linear
regression model to this dataset.
In our example, it is
Y = -6.867 + 3.148x1 – 1.656x2
Matrix Approach
Coefficients = ((XᵀX)⁻¹Xᵀ)Y

Dataset:
Y     X1    X2
1     1     4
6     2     5
8     3     8
12    4     2

       1  1  4                       1  1  1  1
X  =   1  2  5     (4x3)      Xᵀ  =  1  2  3  4     (3x4)
       1  3  8                       4  5  8  2
       1  4  2

       1
Y  =   6           (4x1)
       8
       12

Since XᵀX is 3x3, Xᵀ is 3x4 and Y is 4x1, the result ((XᵀX)⁻¹Xᵀ)Y is a 3x1 vector of coefficients.
Matrix Approach
Coefficients = ((XᵀX)⁻¹Xᵀ)Y

         1  1  1  1     1  1  4         4   10    19
XᵀX  =   1  2  3  4  *  1  2  5   =    10   30    46
         4  5  8  2     1  3  8        19   46   109
                        1  4  2

             3.15    −0.59    −0.30
(XᵀX)⁻¹  =  −0.59     0.20     0.016
            −0.30     0.016    0.054
Matrix Approach
Coefficients = ((XᵀX)⁻¹Xᵀ)Y

                 3.15    −0.59    −0.30       1  1  1  1
(XᵀX)⁻¹Xᵀ  =    −0.59     0.20     0.016   *  1  2  3  4
                −0.30     0.016    0.054      4  5  8  2

                 1.36      0.47    −1.02     0.19
(XᵀX)⁻¹Xᵀ  =    −0.32     −0.098    0.155    0.26
                −0.065     0.005    0.185   −0.125
Matrix Approach
Coefficients = ((XᵀX)⁻¹Xᵀ)Y

                      1.36      0.47    −1.02     0.19        1
((XᵀX)⁻¹Xᵀ)Y  =      −0.32     −0.098    0.155    0.26    *   6
                     −0.065     0.005    0.185   −0.125       8
                                                              12

                     −1.69       b0
((XᵀX)⁻¹Xᵀ)Y  =       3.48   =   b1
                     −0.05       b2

b0 = -1.69, b1 = 3.48, b2 = -0.05
Matrix Approach
Coefficients = ((XᵀX)⁻¹Xᵀ)Y

So, the coefficients are:
b0 = -1.69, b1 = 3.48, b2 = -0.05

Y = b0 + b1X1 + b2X2
Y = -1.69 + 3.48X1 – 0.05X2
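A short NumPy sketch of the same normal-equation computation on the dataset above (NumPy is an assumption; the slides do not name a library). The exact values differ slightly from the rounded slide figures:

```python
import numpy as np

# Design matrix with a leading column of ones (intercept term) and the target vector
X = np.array([[1, 1, 4],
              [1, 2, 5],
              [1, 3, 8],
              [1, 4, 2]], dtype=float)
Y = np.array([1, 6, 8, 12], dtype=float)

# Normal equation: coefficients = (X^T X)^-1 X^T Y
coeffs = np.linalg.inv(X.T @ X) @ X.T @ Y
print(coeffs)   # approximately [-1.70, 3.48, -0.05]  ->  b0, b1, b2
```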
Polynomial Regression Model
It is an extended version of the simple linear model.
Polynomial
• Zero degree polynomial
  Y = ax^0 = a = Constant
• One degree polynomial
  Y = a + b1x = Simple linear equation
• Two degree polynomial
  Y = a + b1x + b2x^2
• n degree polynomial
  Y = a + b1x + b2x^2 + b3x^3 + ………… + bnx^n
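A minimal sketch of fitting a two-degree polynomial with NumPy's polyfit; the x/y values are made-up illustration data, not from the slides:

```python
import numpy as np

# Hypothetical illustration data following a roughly quadratic trend
x = np.array([1, 2, 3, 4, 5], dtype=float)
y = np.array([2.1, 5.9, 12.2, 20.1, 29.8])

# Fit Y = a + b1*x + b2*x^2; np.polyfit returns the highest-degree coefficient first
b2, b1, a = np.polyfit(x, y, deg=2)
print(a, b1, b2)

# Predict at a new point
x_new = 6.0
print(a + b1 * x_new + b2 * x_new ** 2)
```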
Regression Model
Simple Linear Regression      • Y = a + bX
Multiple Linear Regression    • Y = a + b1X1 + b2X2 + b3X3 + ………… + bnXn
Polynomial Regression         • Y = a + b1X + b2X^2 + b3X^3 + ………… + bnX^n
Assumptions of Linear Regression
1 - Linear relationship
    Between the dependent and independent variables.
2 - Normal distribution of residuals
    The mean of the residuals should be zero.
3 - Very low / no multicollinearity
    The independent variables should not be related to each other.
4 - No auto-correlation
Logistic Regression
Y = σ(a + bX)   where σ is the sigmoid function
Y = 1 / (1 + e^-(a + bX))
Study Hours (X)   Exam Result (Y)
2                 0
3                 0
4                 0
5                 1
6                 1
7                 1
8                 1

• Supervised classification model
• Dependent variable (Y) is categorical or binary (0 or 1)
• Independent variable (X) is continuous
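A small scikit-learn sketch on the study-hours data above (scikit-learn is an assumption here; the slides do not name a library):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Study hours (X) and exam result (Y) from the table above
X = np.array([[2], [3], [4], [5], [6], [7], [8]])
y = np.array([0, 0, 0, 1, 1, 1, 1])

model = LogisticRegression()
model.fit(X, y)

# Probability of passing (class 1) for a student who studies 4.5 hours
print(model.predict_proba([[4.5]])[0][1])
print(model.predict([[4.5]]))          # predicted class (0 or 1)
```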
Linear Regression Vs Logistic Regression
What is error?

Support Vector Machine (SVM)
• Supervised machine learning algorithm.
• Used for binary classification.
• Vectors are the data points.
Basic Concepts in SVM
• Mathematical functions.
• Take data as input and transform it into the required output.
• Different kernel functions are (a small sketch follows below):
  – Linear kernel
  – Polynomial kernel
  – Gaussian kernel
  – Radial Basis Function (RBF) kernel
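A minimal sketch of what these kernels compute for two feature vectors; the gamma, degree and coef0 values are illustrative defaults, not from the slides:

```python
import numpy as np

def linear_kernel(x, z):
    """K(x, z) = x . z"""
    return np.dot(x, z)

def polynomial_kernel(x, z, degree=3, coef0=1.0):
    """K(x, z) = (x . z + coef0)^degree"""
    return (np.dot(x, z) + coef0) ** degree

def rbf_kernel(x, z, gamma=0.5):
    """Gaussian / RBF kernel: K(x, z) = exp(-gamma * ||x - z||^2)"""
    return np.exp(-gamma * np.sum((x - z) ** 2))

x = np.array([1.0, 2.0])
z = np.array([2.0, 3.0])
print(linear_kernel(x, z), polynomial_kernel(x, z), rbf_kernel(x, z))
```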
Linear Kernel
Naïve Bayes Classifier
Find the probability of playing tennis on the 15th day using a Naïve Bayes Classifier, where the Outlook is Sunny.
Step-1: Make a frequency table
Outlook     Yes   No
Overcast    5     0
Rainy       2     2
Sunny       3     2
Total       10    4
Step-2: Make Likelihood Table
Outlook     P(Outlook|Yes)   P(Outlook|No)
Overcast    5/10             0
Rainy       2/10             2/4
Sunny       3/10             2/4
Step-3: Apply Bayes' Theorem:

P(A|B) = P(B|A) * P(A) / P(B)

• First, we find the probability of Yes when it is Sunny:
P(Yes|Sunny) = P(Sunny|Yes) * P(Yes) / P(Sunny)
             = (3/10 * 10/14) / (5/14)
             = 3/5 = 0.60
• Second, we find the probability of No when it is Sunny:

P(No|Sunny) = P(Sunny|No) * P(No) / P(Sunny)
            = (2/4 * 4/14) / (5/14)
            = 2/5 = 0.40
• So, P(Yes|Sunny) > P(No|Sunny), i.e. 0.60 > 0.40.
Therefore, we can say that the player can play tennis on a sunny day.
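A tiny Python sketch of the same Bayes' theorem arithmetic, using the counts from the frequency table above (Fraction is used only to keep the exact ratios):

```python
from fractions import Fraction

# Counts from the frequency table: Sunny days split as 3 Yes / 2 No, totals 10 Yes / 4 No
p_sunny_given_yes = Fraction(3, 10)
p_sunny_given_no = Fraction(2, 4)
p_yes = Fraction(10, 14)
p_no = Fraction(4, 14)
p_sunny = Fraction(5, 14)

# Bayes' theorem: P(class | Sunny) = P(Sunny | class) * P(class) / P(Sunny)
p_yes_given_sunny = p_sunny_given_yes * p_yes / p_sunny
p_no_given_sunny = p_sunny_given_no * p_no / p_sunny

print(float(p_yes_given_sunny))   # 0.6
print(float(p_no_given_sunny))    # 0.4
print("Play" if p_yes_given_sunny > p_no_given_sunny else "Don't play")
```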
A second example: the 14-day Play Tennis dataset
P(Play Tennis = yes) = 9/14 = 0.64
P(Play Tennis = no) = 5/14 = 0.36
[Probability tables for Outlook (Sunny, Overcast, ...), Temperature (hot, mild, ...), Humidity (High, Normal) and Windy (true: 6/14, false: 8/14)]
Naïve Bayes: Advantages & Disadvantages
• Advantages:
  – Fast and easy algorithm.
  – Can be used for binary and multi-class classification.
  – Mostly used for text classification.
• Disadvantages:
  – Cannot learn relationships between features, as it assumes they are independent.
Bayesian Belief Network
• Probabilistic Graphical Model.
• Represents a set of variables and
their conditional dependencies
using a directed acyclic graph.
• Two major components:
• Directed Acyclic Graph (DAG)
• Table of Conditional
Probabilities
Bayesian Belief Network
[Bayesian network: Burglary (B) and Earthquake (E) are parents of Alarm (A); David Calls (D) and Sophia Calls (S) depend on Alarm. Each node has a conditional probability table.]

Calculate the probability that the alarm has sounded, but neither a burglary nor an earthquake has occurred, and David and Sophia have both called Harry.

P(A) = P(A|B,E)P(B)P(E) +
       P(A|B,¬E)P(B)P(¬E) +
       P(A|¬B,E)P(¬B)P(E) +
       P(A|¬B,¬E)P(¬B)P(¬E)
What is the probability that David called?

P(D) = P(D|A)P(A) + P(D|¬A)P(¬A)

P(¬A) = P(¬A|B,E)P(B)P(E) +
        P(¬A|B,¬E)P(B)P(¬E) +
        P(¬A|¬B,E)P(¬B)P(E) +
        P(¬A|¬B,¬E)P(¬B)P(¬E)
• P(A) = 0.00252
• P(¬A) = 0.99748
• P(D) = P(D|A)P(A) + P(D|¬A)P(¬A)
• P(D) = 0.91 * 0.00252 + 0.05 * 0.99748 ≈ 0.0522
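A small Python sketch of this "probability that David called" calculation; P(D|A) = 0.91, P(D|¬A) = 0.05 and P(A) = 0.00252 are the values stated on the slide:

```python
# Values taken from the slide
p_a = 0.00252           # P(Alarm), obtained by marginalising over Burglary and Earthquake
p_not_a = 1 - p_a       # P(not Alarm) = 0.99748
p_d_given_a = 0.91      # P(David calls | Alarm)
p_d_given_not_a = 0.05  # P(David calls | no Alarm)

# Total probability: P(D) = P(D|A)P(A) + P(D|not A)P(not A)
p_d = p_d_given_a * p_a + p_d_given_not_a * p_not_a
print(round(p_d, 4))    # ~0.0522
```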
EM Algorithm
• E -> Expectation
• M -> Maximization
• Used to find latent variables.
• Latent variable – a variable that is not directly observed.
• Basically used in many unsupervised clustering algorithms.
Steps involved in EM Algorithm
• Step 1 – A set of initial values is considered
  – A set of incomplete data is given to the system.
• Step 2 – Expectation Step or E-step
  – Use the observed data to estimate (guess) the values of the missing or latent variables.
• Step 3 – Maximization Step or M-step
  – Update the parameters using the values generated in the E-step.
• Step 4 – Check whether the values are converging or not
  – If converging – stop.
  – Otherwise, repeat steps 2 and 3 until convergence occurs.
(A minimal sketch of these steps follows below.)
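A minimal sketch of these E/M steps for a two-component 1-D Gaussian mixture (an assumed setting; the data is made up and a fixed iteration count stands in for the convergence check):

```python
import numpy as np

def em_gmm_1d(x, n_iter=50):
    """EM for a two-component 1-D Gaussian mixture (means, variances, weights)."""
    # Step 1: initial values (crude guesses taken from the data)
    mu = np.array([x.min(), x.max()], dtype=float)
    var = np.array([x.var(), x.var()], dtype=float)
    w = np.array([0.5, 0.5])

    for _ in range(n_iter):
        # E-step: responsibility of each component for each point
        pdf = np.exp(-(x[:, None] - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)
        resp = w * pdf
        resp /= resp.sum(axis=1, keepdims=True)

        # M-step: update the parameters from the responsibilities
        nk = resp.sum(axis=0)
        mu = (resp * x[:, None]).sum(axis=0) / nk
        var = (resp * (x[:, None] - mu) ** 2).sum(axis=0) / nk
        w = nk / len(x)
    return mu, var, w

# Hypothetical data drawn from two overlapping clusters
rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(0, 1, 200), rng.normal(5, 1, 200)])
print(em_gmm_1d(x))   # means near 0 and 5, variances near 1, weights near 0.5
```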
Usage of EM Algorithm
• Used to fill in missing data.
• Used for unsupervised clustering.
• Used to discover the values of latent variables.
• Used to calculate the Gaussian density of a function.
• Used to estimate the parameters of the Hidden Markov Model.
Advantages & Disadvantages
Advantages:
• Easy to implement, as it has only two steps: the E-step and the M-step.
• Likelihood increases after each iteration.
• The solution of the M-step exists in closed form.
Disadvantages:
• Slow convergence.
• Converges only to a local optimum.
• Requires both forward and backward probabilities.
Concept Learning
• “A task of acquiring a potential hypothesis (solution) that best fits the given training examples.”
Candidate Elimination: boundary trace
1) +ve
S1 = <Sunny, Warm, Normal, Strong, Warm, Same>
G1 = <?,?,?,?,?,?>
2) +ve
S2 = < Sunny, Warm, ?, Strong, Warm, Same>
G2 = <?,?,?,?,?,?>
3) –ve
S3 = < Sunny, Warm, ?, Strong, Warm, Same>
G3 = <<Sunny,?,?,?,?,?>, <?,Warm,?,?,?,?>, <?,?,?,?,?,Same>>
4) +ve
S4 = < Sunny, Warm, ?, Strong, ?, ?>
G4 = <<Sunny,?,?,?,?,?>,<?,Warm,?,?,?,?>>
S0 = <Փ, Փ, Փ, Փ, Փ, Փ>
Find-S Algorithm
• F1 -> A, B
• F2 -> X, Y
• Instance space: (A,X), (A,Y), (B,X), (B,Y) – 4 instances
• Hypothesis space: (A,X), (A,Y), (A,Փ), (A,?), (B,X), (B,Y), (B,Փ), (B,?), (Փ,X), (?,X), (Փ,Y), (?,Y), (Փ,Փ), (Փ,?), (?,Փ), (?,?) – 16 hypotheses
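A minimal Python sketch of the Find-S idea on EnjoySport-style data; the training rows are illustrative (the slides only show the S/G boundaries), and None stands in for the Փ value:

```python
def find_s(examples):
    """Find-S: start from the most specific hypothesis and generalise on positives."""
    n = len(examples[0][0])
    h = [None] * n                     # None plays the role of the 'Փ' (empty) value
    for attrs, label in examples:
        if label != "+":               # Find-S ignores negative examples
            continue
        for i, value in enumerate(attrs):
            if h[i] is None:           # first positive example: copy its values
                h[i] = value
            elif h[i] != value:        # conflicting value: generalise to '?'
                h[i] = "?"
    return h

# Illustrative EnjoySport-style training data (not quoted from the slides)
data = [
    (("Sunny", "Warm", "Normal", "Strong", "Warm", "Same"), "+"),
    (("Sunny", "Warm", "High",   "Strong", "Warm", "Same"), "+"),
    (("Rainy", "Cold", "High",   "Strong", "Warm", "Change"), "-"),
    (("Sunny", "Warm", "High",   "Strong", "Cool", "Change"), "+"),
]
print(find_s(data))   # ['Sunny', 'Warm', '?', 'Strong', '?', '?']
```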
List-Then-Eliminate Algorithm