ML Merged Endsem
STRONGLY RECOMMENDED:
• Linear algebra
– Matrices, vectors, systems of linear equations
– Eigenvectors, matrix rank
– Singular value decomposition
• Multivariable calculus
– Derivatives, integration, tangent planes
– Optimization, Lagrange multipliers
• Good programming skills: Python highly recommended
Source Materials
Example ML tasks:
• Spam vs. not spam
• Face recognition
• Temperature prediction (e.g., 72°F)
• Ranking: comparing items; web search
• Given an image, find similar images (http://www.tiltomo.com/)
• Collaborative filtering / recommendation systems (a machine learning competition with a $1 million prize)
• Clustering: a set of images [Goldberger et al.]; clustering web search results
• Embedding: visualizing data
– Embedding images: images have thousands or millions of pixels [Joseph Turian]
– Embedding words [Joseph Turian]
• Structured prediction: how do we choose the best one?
Occam’s Razor Principle
• William of Occam: a monk living in the 14th century
• Principle of parsimony: prefer the simplest hypothesis consistent with the data
[Samy Bengio]
Key Issues in Machine Learning
• How do we choose a hypothesis space?
– Often we use prior knowledge to guide this choice
• How can we gauge the accuracy of a hypothesis on unseen data?
– Occam’s razor: use the simplest hypothesis consistent with data! This will help us avoid overfitting.
– Learning theory will help us quantify our ability to generalize as a function of the amount of training data and the hypothesis space
• How do we find the best hypothesis?
– This is an algorithmic question, the main topic of computer science
• How do we model applications as machine learning problems? (engineering challenge)
Probability Theory refresher
Chapter 2
Bayesian Decision Theory
Key Principle
Bayes Theorem
Machine Learning Tapas Kumar Mishra 2
Bayes Theorem
X: the observed sample (also called evidence; e.g., the length of a fish)
H: the hypothesis (e.g., the fish belongs to the “salmon” category)
P(H): the prior probability that H holds (e.g., the probability of catching a salmon)
P(X|H): the likelihood of observing X given that H holds (e.g., the probability of observing a 3-inch length fish which is salmon)
P(X): the evidence probability that X is observed (e.g., the probability of observing a fish with 3-inch length)
P(H|X): the posterior probability that H holds given X (e.g., the probability of X being salmon given its length is 3 inches)
Bayes’ theorem: P(H|X) = P(X|H) P(H) / P(X)
[Thomas Bayes (1702–1761)]
The observation X for a test example (e.g., fish lightness) is used to convert the prior probability into the posterior probability.
Question
If a positive test result is returned for some person, does he/she have this kind of cancer or not?
No cancer!
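The slide’s numeric inputs are not shown above, so the values in this sketch are assumed for illustration (a rare disease with prior P(cancer) = 0.008, test sensitivity 0.98, false-positive rate 0.03). With these assumptions, Bayes’ theorem yields a posterior below 0.5, matching the “No cancer!” conclusion:

```python
# ASSUMED illustration values (not from the slide): P(cancer), P(+|cancer), P(+|no cancer)
p_cancer = 0.008
p_pos_given_cancer = 0.98
p_pos_given_no_cancer = 0.03

# Evidence: P(+) = P(+|cancer)P(cancer) + P(+|no cancer)P(no cancer)
p_pos = p_pos_given_cancer * p_cancer + p_pos_given_no_cancer * (1 - p_cancer)

# Bayes theorem: P(cancer|+) = P(+|cancer)P(cancer) / P(+)
p_cancer_given_pos = p_pos_given_cancer * p_cancer / p_pos

print(round(p_cancer_given_pos, 3))  # posterior ≈ 0.209 < 0.5, so decide "no cancer"
```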
Quantities to know: estimate the prior probabilities and the likelihoods by counting relative frequencies from collected samples (e.g., counting cars in sets of images).
Decision rule: decide ω₁ if P(ω₁|x) > P(ω₂|x); otherwise decide ω₂. The error rate is the probability that the chosen action is wrong.
Discriminant functions
Various discriminant functions can yield identical classification results: replacing gᵢ(x) by f(gᵢ(x)) for any monotonically increasing f leaves the decision unchanged.
Expected value
Discrete case: E[x] = Σᵢ xᵢ P(xᵢ)
Continuous case: E[x] = ∫ x f(x) dx
Variance
Discrete case: Var(x) = Σᵢ (xᵢ − μ)² P(xᵢ)
Continuous case: Var(x) = ∫ (x − μ)² f(x) dx
Expected vector: μ = E[x], where each component μᵢ = E[xᵢ] is computed from the marginal pdf on the i-th component.
Notation: the covariance matrix Σ is symmetric and positive semidefinite.
For Gaussian class-conditional densities, the discriminant functions can be written with a weight vector and a threshold/bias term: in the equal-covariance case gᵢ(x) = wᵢ⊤x + w_{i0}, involving the squared Mahalanobis distance (x − μᵢ)⊤Σ⁻¹(x − μᵢ); in the general case gᵢ(x) = x⊤Wᵢx + wᵢ⊤x + w_{i0}, with a quadratic matrix Wᵢ in addition to the weight vector and threshold/bias.
◼ Gaussian/Normal density: f(x) = (1/√(2πσ²)) exp(−(x − μ)²/(2σ²))
Complementary Event
P(A) = 1 − P(not A)
Joint Probability
Joint probability p(A ∩ B): the probability of two events in conjunction, i.e., of both events occurring together.
For the union of two events: p(A ∪ B) = p(A) + p(B) − p(A ∩ B)
Independent Events
Two events A and B are independent if
p(A ∩ B) = p(A) p(B)
Example on Independence
Three balls; E1: drawing ball 1, E2: drawing ball 2, E3: drawing ball 3, with P(E1) = P(E2) = P(E3) = 1/3.
Conditional probability: p(A|B) = p(A ∩ B) / p(B); independence means p(A ∩ B) = p(A) p(B).

Case 1: drawing with replacement of the ball.
The second draw is independent of the first draw:
p(E1|E2) = p(E1 ∩ E2) / p(E2) = (1/3 × 1/3) / (1/3) = 1/3

Case 2: drawing without replacement of the ball.
The second draw is dependent on the first draw:
p(E1|E2) = p(E1 ∩ E2) / p(E2) = (1/3 × 1/2) / (1/3) = 1/2
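The two cases above can be checked exactly with rational arithmetic, a minimal sketch:

```python
from fractions import Fraction

p_e2 = Fraction(1, 3)

# Case 1 (with replacement): the joint probability factorizes, draws are independent.
p_e1_and_e2 = Fraction(1, 3) * Fraction(1, 3)
assert p_e1_and_e2 / p_e2 == Fraction(1, 3)   # p(E1|E2) = 1/3 = p(E1)

# Case 2 (without replacement): the first draw removes one ball from the urn.
p_e1_and_e2 = Fraction(1, 3) * Fraction(1, 2)
assert p_e1_and_e2 / p_e2 == Fraction(1, 2)   # p(E1|E2) = 1/2 ≠ p(E1)
```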
Bayes’ Rule
We know that p(A ∩ B) = p(B ∩ A).
Using the conditional probability definition, we have
p(A|B) p(B) = p(B|A) p(A), hence p(A|B) = p(B|A) p(A) / p(B).
Law of Total Probability
p(A) = Σᵢ p(A|Bᵢ) p(Bᵢ)   (example: event A = {3})
A random variable is a function X : Ω → S that maps each event Eᵢ to a value xᵢ.
Random Variable Types
► Discrete Random Variable: possible values are discrete (countable sample space, e.g., integer values):
X : Ω → {1, 2, 3, 4, ...}, Eᵢ ↦ xᵢ
► Continuous Random Variable: possible values are continuous (uncountable space, real values), e.g.:
X : Ω → [1.4, 32.3], Eᵢ ↦ xᵢ
Discrete Random Variable
The probability distribution of a discrete random variable is called the Probability Mass Function (PMF):
p(xᵢ) = P(X = xᵢ)
Properties of the PMF: 0 ≤ p(xᵢ) ≤ 1 and Σᵢ p(xᵢ) = 1
Cumulative Distribution Function (CDF): P(X ≤ x) = Σ_{xᵢ ≤ x} p(xᵢ)
Mean value: μ_X = E(X) = Σᵢ₌₁ⁿ xᵢ p(xᵢ)
Discrete Random Variable
Mean (expected) value: μ_X = E(X) = Σᵢ₌₁ⁿ xᵢ p(xᵢ)
Variance (general equation):
V(X) = σ_X² = E[(X − E(X))²] = E(X²) − E(X)²
For a discrete RV:
V(X) = Σᵢ₌₁ⁿ (xᵢ − μ_X)² p(xᵢ) = Σᵢ₌₁ⁿ xᵢ² p(xᵢ) − (Σᵢ₌₁ⁿ xᵢ p(xᵢ))²
Discrete Random Variable: Example
Random Variable: Grades of the students
Student ID 1 2 3 4 5 6 7 8 9 10
Grade 3 2 3 1 2 3 1 3 2 2
p(2) = P(X = 2) = 4/10 = 0.4, satisfying 0 ≤ p(2) ≤ 1
p(3) = P(X = 3) = 4/10 = 0.4, satisfying 0 ≤ p(3) ≤ 1
Discrete Random Variable: Example
Random Variable: Grades of the students
Student ID 1 2 3 4 5 6 7 8 9 10
Grade 3 2 3 1 2 3 1 3 2 2
Σᵢ p(xᵢ) = p(1) + p(2) + p(3) = 1
P(X ≤ 3) = Σ_{xᵢ ≤ 3} p(xᵢ) = p(1) + p(2) + p(3) = 1
Discrete Random Variable: Example
Random Variable: Grades of the students
Student ID 1 2 3 4 5 6 7 8 9 10
Grade 3 2 3 1 2 3 1 3 2 2
μ_X = E(X) = Σᵢ xᵢ p(xᵢ) = 1 × 0.2 + 2 × 0.4 + 3 × 0.4 = 2.2
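The PMF, mean, and variance for the grade data above can be computed in a few lines:

```python
from collections import Counter

grades = [3, 2, 3, 1, 2, 3, 1, 3, 2, 2]  # the ten student grades from the table
n = len(grades)

# PMF: relative frequency of each grade
pmf = {g: c / n for g, c in Counter(grades).items()}

# Mean and variance via E(X) and E(X^2) - E(X)^2
mean = sum(g * p for g, p in pmf.items())
var = sum(g ** 2 * p for g, p in pmf.items()) - mean ** 2

print(pmf[2], pmf[3], round(mean, 1))  # → 0.4 0.4 2.2
```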
Continuous Random Variable
The probability distribution of a continuous random variable is called the Probability Density Function (PDF), f(x).
The probability of any single value is always 0: P(X = xᵢ) = 0, since the sample space is infinite.
For a continuous random variable, we instead compute P(a ≤ X ≤ b).
Properties of the PDF:
1. f(x) ≥ 0 for all x in R_X
2. ∫_{R_X} f(x) dx = 1
3. f(x) = 0 if x is not in R_X
Continuous Random Variable
Cumulative Distribution Function (CDF):
F(x) = P(X ≤ x) = ∫₋∞ˣ f(t) dt
P(a ≤ X ≤ b) = ∫ₐᵇ f(x) dx
Mean/expected value: μ_X = E(X) = ∫₋∞⁺∞ x f(x) dx
Variance: V(X) = ∫₋∞⁺∞ (x − μ_X)² f(x) dx = ∫₋∞⁺∞ x² f(x) dx − μ_X²
Discrete versus Continuous Random Variables
Discrete: P(X ≤ x) = Σ_{xᵢ ≤ x} p(xᵢ)
Continuous: F(x) = ∫₋∞ˣ f(t) dt; P(a ≤ X ≤ b) = ∫ₐᵇ f(x) dx; f(x) = 0 if x is not in R_X
Continuous Random Variables: Example
Exponential distribution Exp(μ):
f(x) = (1/μ) exp(−x/μ) for x ≥ 0, and 0 otherwise
Continuous Random Variables: Example
Customer waiting time modeled as Exp(μ = 2):
f(x) = (1/2) e^{−x/2} for x ≥ 0, and 0 otherwise
Continuous Random Variables: Example
Probability that the customer waits exactly 3 minutes:
P(X = 3) = P(3 ≤ X ≤ 3) = ∫₃³ (1/2) e^{−x/2} dx = 0
Probability that the customer waits between 2 and 3 minutes:
P(2 ≤ X ≤ 3) = ∫₂³ (1/2) e^{−x/2} dx = 0.145
Using the CDF: P(2 ≤ X ≤ 3) = F(3) − F(2) = (1 − e^{−3/2}) − (1 − e^{−1}) = 0.145
Continuous Random Variables: Example
E(X) = ∫₀⁺∞ x (1/2) e^{−x/2} dx = [−x e^{−x/2}]₀⁺∞ + ∫₀⁺∞ e^{−x/2} dx = 2
σ = √V(X) = 2
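The waiting-time probabilities above follow directly from the CDF F(x) = 1 − e^{−x/2}, a quick sketch:

```python
import math

# CDF of the Exp(mu = 2) waiting-time model from the example
def F(x):
    return 1 - math.exp(-x / 2)

p_2_to_3 = F(3) - F(2)      # P(2 <= X <= 3)
p_exactly_3 = F(3) - F(3)   # P(X = 3) is 0 for a continuous RV
mean = 2                    # E(X) = mu for the exponential distribution

print(round(p_2_to_3, 3), p_exactly_3)  # → 0.145 0.0
```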
Variance
Standard deviation: σ_X = √(σ_X²) = √V(X)
Coefficient of Variation
CV(X) = √V(X) / E(X) = σ_X / μ_X
Discrete Probability Distribution
Probability Mass Function (PMF)
Formally, the probability distribution or probability mass function (PMF) of a discrete random variable X is a function that gives the probability p(xᵢ) that the random variable equals xᵢ, for each value xᵢ:
p(xᵢ) = P(X = xᵢ)
It satisfies the following conditions:
0 ≤ p(xᵢ) ≤ 1
Σᵢ p(xᵢ) = 1
Continuous Random Variable
Probability Density Function (PDF)
For continuous variables, we do not ask what the probability of a single value like "1/6" is, because the answer is always 0. Rather, we ask what the probability is that the value falls in an interval (a, b).
So for continuous variables, we care about the derivative of the distribution function at a point (that's the derivative of an integral). This is called a probability density function (PDF).
The probability that a random variable has a value in a set A is the integral of the p.d.f. over that set A.
Probability Density Function (PDF)
The Probability Density Function (PDF) of a continuous random variable is a function that can be integrated to obtain the probability that the random variable takes a value in a given interval.
More formally, the probability density function f(x) of a continuous random variable X is the derivative of the cumulative distribution function F(x):
f(x) = (d/dx) F(x)
Since F(x) = P(X ≤ x), it follows that:
F(b) − F(a) = P(a ≤ X ≤ b) = ∫ₐᵇ f(x) dx
Cumulative Distribution Function (CDF)
For −∞ < x < +∞: F(x) = P(X ≤ x)
Cumulative Distribution Function (CDF)
Discrete case: F(x) = P(X ≤ x) = Σ_{xᵢ ≤ x} P(X = xᵢ) = Σ_{xᵢ ≤ x} p(xᵢ)
Continuous case: F(b) − F(a) = P(a ≤ X ≤ b) = ∫ₐᵇ f(x) dx
Cumulative Distribution Function (CDF)
► Example
► Discrete case: Suppose a random variable X has the
following probability mass function p(xi):
xi 0 1 2 3 4 5
p(xi) 1/32 5/32 10/32 10/32 5/32 1/32
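The CDF of this PMF is just the running sum of the table entries (the table is a Binomial(5, 1/2) distribution), a minimal sketch:

```python
from fractions import Fraction
from itertools import accumulate

# PMF from the table above, for x_i = 0, 1, ..., 5
pmf = [Fraction(c, 32) for c in (1, 5, 10, 10, 5, 1)]

# CDF: F(x) = P(X <= x), the cumulative sums of the PMF
cdf = list(accumulate(pmf))

print(cdf[2], cdf[-1])  # → 1/2 1
```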
Mean or Expected Value
Variance
V(X) = σ_X² = E[(X − E(X))²] = E(X²) − E(X)²
Standard deviation: σ_X = √(σ_X²) = √V(X)
Sampling Distributions, Confidence Intervals, Hypothesis Testing
Sampling Distribution of the Means
• Central Limit Theorem: if X̄ is the mean of a random sample of size n taken from a population with mean μ and finite variance σ², then, as n → ∞, the limiting form of the distribution of
z = (X̄ − μ) / (σ/√n)
is the standard normal distribution.
Theorem 2:
Example:
• Let X̄₁, X̄₂, …, X̄ₙ be the sample means of samples S₁, S₂, …, Sₙ drawn from an independent and identically distributed population with mean μ and standard deviation σ. From the central limit theorem we know that the sample means X̄ᵢ follow a normal distribution with mean μ and standard deviation σ/√n, and the variable Z = (X̄ᵢ − μ)/(σ/√n) follows the standard normal distribution. Hence
P(X̄ − Z_{α/2} σ/√n ≤ μ ≤ X̄ + Z_{α/2} σ/√n) = 1 − α
9/11/2023 @TKMISHRA ML NITRKL 14
CI for Different Significance Values
• That is, the probability that the population mean takes a value between X̄ − Z_{α/2} σ/√n and X̄ + Z_{α/2} σ/√n is 1 − α.
• The absolute values of Z_{α/2} for various values of α, together with the confidence interval for the population mean when the population standard deviation is known, are shown below:
α      |Z_{α/2}|   Confidence interval
0.10   1.64        X̄ ± 1.64 σ/√n
0.05   1.96        X̄ ± 1.96 σ/√n
0.02   2.33        X̄ ± 2.33 σ/√n
0.01   2.58        X̄ ± 2.58 σ/√n
(a) Calculate the 95% confidence interval for the population mean.
(b) What is the probability that the population mean is greater than
4.73 days?
Note that 4.73 is the upper limit of the 95% confidence interval from part (a), thus the probability
that the population mean is greater than 4.73 is approximately 0.025.
• William Gosset (“Student”, 1908) proved that if the population follows a normal distribution and the standard deviation is calculated from the sample, then the statistic below follows a t-distribution with (n − 1) degrees of freedom:
t = (X̄ − μ) / (S / √n)
• Here S is the standard deviation estimated from the sample (S/√n is the standard error). The t-distribution is very similar to the standard normal distribution; it has a bell shape and its mean, median, and mode are equal to zero, as in the case of the standard normal distribution.
The major difference between the t-distribution and the standard normal distribution is that the t-distribution has broader tails. However, as the degrees of freedom increase, the t-distribution converges to the standard normal distribution.
• In the above equation, t_{α/2, n−1} is the value of t under the t-distribution for which the cumulative probability F(t) = α/2, when the degrees of freedom is (n − 1).
• An online grocery store is interested in estimating the basket size (number of items
ordered by the customer) of its customers so that it can optimize its size of crates used
for delivering the grocery items. From a sample of 70 customers, the average basket size
was estimated as 24 and the standard deviation estimated from the sample was 3.8.
Calculate the 95% confidence interval for the basket size of the customer order.
Solution
We know that X̄ = 24, n = 70, S = 3.8, and t_{0.025, 69} = 1.995.
Thus the 95% confidence interval for the size of the basket is X̄ ± t_{0.025, 69} S/√n = 24 ± 1.995 × 3.8/√70, i.e., (23.09, 24.91).
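The basket-size interval can be reproduced with a short sketch (the critical value t_{0.025, 69} = 1.995 is taken from the slide rather than recomputed):

```python
import math

# Sample summary from the example: mean basket size, sample size, sample std. dev.
xbar, n, s = 24, 70, 3.8
t = 1.995  # t_{0.025, 69}, as given in the slide

half_width = t * s / math.sqrt(n)
lo, hi = xbar - half_width, xbar + half_width

print(round(lo, 2), round(hi, 2))  # → 23.09 24.91
```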
HYPOTHESIS TESTING
INTRODUCTION TO HYPOTHESIS TESTING
3) Identify the test statistic to be used for testing the validity of the null
hypothesis. Test statistic will enable us to calculate the evidence in
support of null hypothesis. The test statistic will depend on the
probability distribution of the sampling distribution; for example, if the
test is for mean value and the mean is calculated from a large sample
and if the population standard deviation is known, then the sampling
distribution will be a normal distribution and the test statistic will be a Z-
statistic (standard normal statistic).
HYPOTHESIS TESTING STEPS
4. Decide the criteria for rejection and retention of the null hypothesis.
This is called the significance value, traditionally denoted by the symbol α.
The value of α will depend on the context; usually 0.1, 0.05, and 0.01 are used.
6. Take the decision to reject or retain the null hypothesis based on the p-value and the significance value α. The null hypothesis is rejected when the p-value is less than α, and retained when the p-value is greater than or equal to α.
H0: μ ≤ 100,000
HA: μ > 100,000
Z-statistic = (X̄ − μ) / (σ / √n)
• The critical value in this case will depend on the significance value α and whether it is a one-tailed or two-tailed test:
α      One-tailed          Two-tailed
0.10   −1.28 or 1.28       −1.64 and 1.64
0.05   −1.64 or 1.64       −1.96 and 1.96
0.01   −2.33 or 2.33       −2.58 and 2.58
Z = (X̄ − μ) / (σ/√n) = (4250 − 4200) / (3200/√40000) = 3.125
16 16 30 37 25 22 19 35 27 32
34 28 24 35 24 21 32 29 24 35
28 29 18 31 28 33 32 24 25 22
21 27 41 23 23 16 24 38 26 28
σ/√n = 12.5/√40 = 1.9764
Solution Continued…
• The critical value of the left-tailed test for α = 0.05 is −1.644.
• Since the critical value is less than the Z-statistic value, we fail to reject the null hypothesis. The p-value for Z = −1.4926 is 0.06777, which is greater than the value of α.
Z = (X̄ − μ) / (σ/√n) = (84 − 82) / (11.03/√100) = 1.8132
601 627 330 364 562 353 583 254 528 470
408 601 593 729 402 530 708 599 439 762
292 636 444 286 636 667 252 335 457 632
t-statistic = (X̄ − μ) / (S/√n) = (429.55 − 500) / (195.0337/√40) = −2.2845
t = (X̄ − μ) / (S/√n) = (19.5 − 16.8) / (6.6/√50) = 2.8927
t = (d̄ − μ_D) / (S_d/√n) = (11.5 − 0) / (95.6757/√20) = 0.5375
Z = [(X̄₁ − X̄₂) − (μ₁ − μ₂)] / √(σ₁²/n₁ + σ₂²/n₂)
(Table: specialization, sample size, estimated mean salary in rupees, and population standard deviation for each group.)
Since the Z-statistic value is higher than the Z-critical value, we reject the null hypothesis.
where S_p² is the pooled variance of the two samples, given by
S_p² = [(n₁ − 1)S₁² + (n₂ − 1)S₂²] / (n₁ + n₂ − 2)
t = [(X̄₁ − X̄₂) − (μ₁ − μ₂)] / √(S_p² (1/n₁ + 1/n₂))
Group                       Sample Size   Increase in Height (in cm)   Standard Deviation
Drink health drink              80               7.6 cm                     1.1 cm
Do not drink health drink       80               6.3 cm                     1.3 cm
S₁ = 1.1 and S₂ = 1.3. The null and alternative hypotheses are
H0: μ₁ − μ₂ ≤ 1.2
HA: μ₁ − μ₂ > 1.2
Pooled variance:
S_p² = [(n₁ − 1)S₁² + (n₂ − 1)S₂²] / (n₁ + n₂ − 2) = (79 × 1.1² + 79 × 1.3²) / (80 + 80 − 2) = 1.45
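The pooled variance and the resulting two-sample t-statistic for the health-drink data can be sketched as:

```python
import math

# Health-drink example: sample sizes, means, and standard deviations
n1, x1, s1 = 80, 7.6, 1.1
n2, x2, s2 = 80, 6.3, 1.3
delta0 = 1.2  # hypothesized difference under H0

# Pooled variance of the two samples
sp2 = ((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / (n1 + n2 - 2)

# Pooled two-sample t-statistic
t = ((x1 - x2) - delta0) / math.sqrt(sp2 * (1 / n1 + 1 / n2))

print(round(sp2, 2), round(t, 3))  # → 1.45 0.525
```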
When the population standard deviations are unknown and unequal, the standard error of the difference is
S_u = √(S₁²/n₁ + S₂²/n₂)
and the test statistic is
t = [(X̄₁ − X̄₂) − (μ₁ − μ₂)] / √(S₁²/n₁ + S₂²/n₂)
Group                    Sample Size   Mean          Standard Deviation
Couples with no degree       120       10.1 years        2.4 years
Couples with degree          100        9.5 years        3.1 years

S_u = √(S₁²/n₁ + S₂²/n₂) = √(2.4²/120 + 3.1²/100)
August 3, 2022
Bayes Optimal classifier
Suppose you find a coin and it’s ancient and very valuable.
Naturally, you ask yourself, ”What is the probability that this coin
comes up heads when I toss it?”
You toss it n = 10 times and obtain the following sequence of
outcomes: D = {H, T , T , H, H, H, T , T , T , T }. Based on these
samples, how would you estimate P(H)?
P(D | θ) = ((nH + nT) choose nH) · θ^{nH} (1 − θ)^{nT}   (1)
Assume you have a hunch that θ is close to 0.5. But your sample size is small, so you don’t trust your estimate.
Simple fix: add 2m imaginary throws that would result in θ₀ (e.g., θ₀ = 0.5). Add m heads and m tails to your data:
θ̂ = (nH + m) / (nH + nT + 2m)
P(θ) = θ^{α−1} (1 − θ)^{β−1} / B(α, β)   (8)
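For the coin data above (nH = 4, nT = 6), the MLE and the smoothed estimate can be checked exactly, a minimal sketch:

```python
from fractions import Fraction

# Coin data from the slides: D = {H,T,T,H,H,H,T,T,T,T} -> nH = 4, nT = 6
n_h, n_t = 4, 6

# MLE: maximizes P(D | theta) = C(nH+nT, nH) theta^nH (1-theta)^nT
theta_mle = Fraction(n_h, n_h + n_t)

# Smoothed estimate: add 2m imaginary throws (m heads, m tails), here m = 1
m = 1
theta_smooth = Fraction(n_h + m, n_h + n_t + 2 * m)

print(theta_mle, theta_smooth)  # → 2/5 5/12
```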
August 7, 2022
Supervised ML Setup
D = {(x1 , y1 ), . . . , (xn , yn )} ⊆ Rd × C
where:
Rᵈ is the d-dimensional feature space
xᵢ is the input vector of the i-th sample
yᵢ is the label of the i-th sample
C is the label space
Tapas Kumar Mishra Bayes Classifier and Naive Bayes
The data points (xi , yi ) are drawn from some (unknown)
distribution P(X , Y ). Ultimately we would like to learn a function
h such that for a new pair (x, y ) ∼ P, we have h(x) = y with high
probability (or h(x) ≈ y ).
Our training consists of the set D = {(x1 , y1 ), . . . , (xn , yn )}
drawn from some unknown distribution P(X , Y ).
Because all pairs are sampled i.i.d., we obtain P(D) = ∏ᵢ₌₁ⁿ P(xᵢ, yᵢ).
Bayes Optimal Classifier
The Bayes optimal classifier predicts the most likely label: h_opt(x) = argmax_y P(y | x).
Assume for example an email x can either be classified as spam (+1) or ham (−1). For the same email x, suppose the conditional class probabilities are P(+1 | x) = 0.8 and P(−1 | x) = 0.2.
In this case the Bayes optimal classifier would predict the label y* = +1 as it is most likely, and its error rate would be ε_BayesOpt = 0.2.
So how can we estimate P̂(y|x)?
Previously we have derived that P̂(y) = Σᵢ₌₁ⁿ I(yᵢ = y) / n.
Similarly, P̂(x) = Σᵢ₌₁ⁿ I(xᵢ = x) / n and P̂(y, x) = Σᵢ₌₁ⁿ I(xᵢ = x ∧ yᵢ = y) / n.
We can put these two together:
P̂(y|x) = P̂(y, x) / P̂(x) = Σᵢ₌₁ⁿ I(xᵢ = x ∧ yᵢ = y) / Σᵢ₌₁ⁿ I(xᵢ = x)
The Venn diagram illustrates that the MLE method estimates
P̂(y|x) = |C| / |B|
where B is the set of training points with xᵢ = x, and C is its subset with yᵢ = y.
Problem: There is a big problem with this method. The MLE estimate is only good if there are many training vectors with exactly the same features as x!
In high-dimensional spaces (or with continuous x), this never happens! So |B| → 0 and |C| → 0.
E.g., for P(y = yes | x1 = dallas, x2 = female, x3 = 5), the matching set is always empty!
Naive Bayes
By Bayes’ rule,
P(y|x) = P(x|y) P(y) / P(x)
Estimating P(y) is easy. For example, if Y takes on discrete binary values, estimating P(Y) reduces to coin tossing: we simply need to count how many times we observe each outcome (in this case each class):
P(y = c) = Σᵢ₌₁ⁿ I(yᵢ = c) / n = π̂_c
Estimating P(x|y), however, is not easy! The additional assumption that we make is the Naive Bayes assumption.
Naive Bayes assumption
P(x|y) = ∏_{α=1}^{d} P(x_α | y), where x_α = [x]_α is the value for feature α
MLE
So, for now, let’s pretend the Naive Bayes assumption holds. Then the Bayes classifier can be defined as
h(x) = argmax_y P(y) ∏_{α=1}^{d} P(x_α | y)
Now that we know how to use our assumption to make the estimation of P(y|x) tractable, there are 3 notable cases in which we can use our naive Bayes classifier:
Case #1: Categorical features.
Case #2: Multinomial features.
Case #3: Continuous features (Gaussian Naive Bayes).
Case #1: Categorical features
Features: [x]_α ∈ {f₁, f₂, ⋯, f_{Kα}}
Each feature α falls into one of Kα categories. (Note that the case with binary features is just a specific case of this, where Kα = 2.) An example of such a setting may be medical data where one feature could be gender (male / female) or marital status (single / married / widowed).
Case #1: Categorical features
Model P(x_α | y):
P(x_α = j | y = c) = [θ_jc]_α, and Σ_{j=1}^{Kα} [θ_jc]_α = 1
Case #1: Categorical features
Parameter estimation (with smoothing parameter l):
[θ̂_jc]_α = (Σᵢ₌₁ⁿ I(yᵢ = c) I(x_{iα} = j) + l) / (Σᵢ₌₁ⁿ I(yᵢ = c) + l·Kα)   (1)
Case #1: Categorical features
Prediction:
argmax_y P(y = c | x) ∝ argmax_y π̂_c ∏_{α=1}^{d} [θ̂_jc]_α
= argmax_y (Σᵢ₌₁ⁿ I(yᵢ = c) / n) ∏_{α=1}^{d} (Σᵢ₌₁ⁿ I(yᵢ = c) I(x_{iα} = j) + l) / (Σᵢ₌₁ⁿ I(yᵢ = c) + l·Kα)
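The estimation and prediction rules for categorical features can be sketched end-to-end; the tiny dataset below is hypothetical, used only to exercise eq. (1) and the prediction rule:

```python
import numpy as np

# Hypothetical data: rows are samples, columns are categorical features in {0, ..., K_alpha - 1}
X = np.array([[0, 1], [0, 0], [1, 1], [1, 0], [0, 1]])
y = np.array([+1, +1, -1, -1, +1])
l = 1  # smoothing parameter

classes = np.unique(y)
K = X.max(axis=0) + 1  # number of categories per feature

# Priors: pi_c = (1/n) sum_i I(y_i = c)
priors = {c: np.mean(y == c) for c in classes}

# Eq. (1): theta[c][alpha][j] = (sum_i I(y_i=c) I(x_{i,alpha}=j) + l) / (sum_i I(y_i=c) + l*K_alpha)
theta = {
    c: [
        (np.bincount(X[y == c, a], minlength=K[a]) + l) / ((y == c).sum() + l * K[a])
        for a in range(X.shape[1])
    ]
    for c in classes
}

def predict(x):
    # argmax_c of log pi_c + sum_alpha log theta (logs avoid underflow)
    scores = {
        c: np.log(priors[c]) + sum(np.log(theta[c][a][x[a]]) for a in range(len(x)))
        for c in classes
    }
    return max(scores, key=scores.get)

print(predict([0, 1]))
```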
Example
Case #2: Multinomial features
Here each feature value x_α is a count (e.g., the number of occurrences of word α in a document of length m = Σ_α x_α).
Case #2: Multinomial features
Parameter estimation:
θ̂_αc = (Σᵢ₌₁ⁿ I(yᵢ = c) x_{iα} + l) / (Σᵢ₌₁ⁿ I(yᵢ = c) mᵢ + l·d)   (3)
Case #2: Multinomial features
Prediction:
argmax_c P(y = c | x) ∝ argmax_c π̂_c ∏_{α=1}^{d} (θ̂_αc)^{x_α}
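The multinomial case, eq. (3) plus the prediction rule, can be sketched with hypothetical word-count data (rows are documents, columns are vocabulary words):

```python
import numpy as np

# Hypothetical word counts: 4 documents, vocabulary of d = 3 words
X = np.array([[3, 0, 1], [2, 1, 0], [0, 3, 2], [1, 2, 3]])
y = np.array([+1, +1, -1, -1])
l, d = 1, 3  # smoothing parameter, number of features

classes = np.unique(y)
priors = {c: np.mean(y == c) for c in classes}

# Eq. (3): theta_{alpha c}; note m_i = sum_alpha x_{i alpha} is the length of document i,
# so the denominator is the total word count of class c plus l*d
theta = {c: (X[y == c].sum(axis=0) + l) / (X[y == c].sum() + l * d) for c in classes}

def predict(x):
    # log of pi_c * prod_alpha theta_{alpha c}^{x_alpha}
    scores = {c: np.log(priors[c]) + float(np.dot(x, np.log(theta[c]))) for c in classes}
    return max(scores, key=scores.get)

print(predict([3, 0, 0]))
```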
Case #3: Continuous features (Gaussian Naive Bayes)
Features: [x]_α ∈ ℝ (real-valued features).
Model each P(x_α | y = c) as a univariate Gaussian N(μ_αc, σ²_αc), with μ_αc and σ²_αc estimated from the training samples of class c.
Prediction: argmax_c π̂_c ∏_{α=1}^{d} N(x_α; μ̂_αc, σ̂²_αc)
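A minimal Gaussian Naive Bayes sketch on hypothetical one-dimensional data, fitting a Gaussian per class and predicting by the highest prior-weighted density:

```python
import math

# Hypothetical 1-D training data, grouped by class label
data = {+1: [1.0, 1.2, 0.8], -1: [3.0, 3.5, 2.5]}

def gaussian(x, mu, var):
    # Univariate Gaussian density N(x; mu, var)
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

# Estimate mean and variance per class from the class's samples
params = {}
for c, xs in data.items():
    mu = sum(xs) / len(xs)
    var = sum((v - mu) ** 2 for v in xs) / len(xs)
    params[c] = (mu, var)

n = sum(len(xs) for xs in data.values())
priors = {c: len(xs) / n for c, xs in data.items()}

def predict(x):
    # argmax_c pi_c * N(x; mu_c, var_c)
    scores = {c: priors[c] * gaussian(x, *params[c]) for c in priors}
    return max(scores, key=scores.get)

print(predict(1.1))
```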
Example
Naive Bayes is a linear classifier
That is, w⊤x + b > 0 ⟺ h(x) = +1.
Naive Bayes is a linear classifier
As before, we define P(x_α | y = +1) ∝ θ_{α+}^{x_α} and P(Y = +1) = π₊.
Naive Bayes is a linear classifier
Simplifying this further leads to

w⊤x + b > 0
⟺ Σ_{α=1}^{d} [x]_α (log(θ_{α+}) − log(θ_{α−})) + log(π₊) − log(π₋) > 0
   (plugging in the definitions [w]_α = log(θ_{α+}) − log(θ_{α−}) and b = log(π₊) − log(π₋))
⟺ exp( Σ_{α=1}^{d} [x]_α (log(θ_{α+}) − log(θ_{α−})) + log(π₊) − log(π₋) ) > 1
   (exponentiating both sides)
⟺ ∏_{α=1}^{d} exp(log θ_{α+}^{[x]_α} + log(π₊)) / exp(log θ_{α−}^{[x]_α} + log(π₋)) > 1
   (because a·log(b) = log(b^a) and exp(a − b) = e^a / e^b)
⟺ ∏_{α=1}^{d} (θ_{α+}^{[x]_α} π₊) / (θ_{α−}^{[x]_α} π₋) > 1
   (because exp(log(a)) = a and e^{a+b} = e^a e^b)
⟺ (∏_{α=1}^{d} P([x]_α | Y = +1) π₊) / (∏_{α=1}^{d} P([x]_α | Y = −1) π₋) > 1
   (because P([x]_α | Y = −1) = θ_{α−}^{x_α})
⟺ P(x | Y = +1) π₊ / (P(x | Y = −1) π₋) > 1   (by the naive Bayes assumption)
⟺ P(Y = +1 | x) / P(Y = −1 | x) > 1   (by Bayes’ rule, with π₊ = P(Y = +1))
⟺ P(Y = +1 | x) > P(Y = −1 | x)
⟺ argmax_y P(Y = y | x) = +1
i.e., the point x lies on the positive side of the hyperplane iff Naive Bayes predicts +1.
Naive Bayes is a linear classifier
44/45
Tapas Kumar Mishra Bayes Classifier and Naive Bayes
Validation of the Linear Regression Model
Validation of the Simple Linear Regression Model
The above measures and tests are essential, but not exhaustive.
Coefficient of Determination (R-Square or R2)
The simple linear regression model is

Yi = β0 + β1 Xi + εi

The variation in Y has two components: the variation in Y explained by the model and the variation in Y not explained by the model. In the absence of a predictive model for Yi, users would fall back on the mean value of Y; the total variation is therefore measured as the difference between Yi and the mean value of Y (i.e., Yi − Ȳ).
Description of total variation, explained variation and unexplained variation:
• Total variation (SST): Yi − Ȳ. Total variation is the difference between the actual value and the mean value.
• Variation explained by the model (SSR): Ŷi − Ȳ. Variation explained by the model is the difference between the estimated value of Yi and the mean value of Y.
• Variation not explained by the model (SSE): Yi − Ŷi. Variation not explained by the model is the difference between the actual value and the predicted value of Yi (the error in prediction).
The relationship between the total variation, explained variation and the unexplained variation is given as follows:

Yi − Ȳ = (Ŷi − Ȳ) + (Yi − Ŷi)
(total variation in Y) = (variation in Y explained by the model) + (variation in Y not explained by the model)

Squaring and summing over all observations gives SST = SSR + SSE, where SST is the sum of squares of total variation, SSR is the sum of squares of variation explained by the regression model and SSE is the sum of squares of errors or unexplained variation.
Coefficient of Determination or R-Square
The coefficient of determination (R²) is given by

R² = Explained variation / Total variation = SSR / SST = Σ_{i=1}^n (Ŷi − Ȳ)² / Σ_{i=1}^n (Yi − Ȳ)²

Equivalently,

R² = 1 − SSE/SST = 1 − Σ_{i=1}^n (Yi − Ŷi)² / Σ_{i=1}^n (Yi − Ȳ)²
Thus, R² is the proportion of variation in the response variable Y explained by the regression model. R² always lies between 0 and 1; a value close to 1 indicates that the model explains most of the variation in Y.
Facebook users versus helium poisoning in UK

Year | Facebook users | Deaths due to helium poisoning (UK)
2004 | 1    | 2
2005 | 6    | 2
2006 | 12   | 2
2007 | 58   | 2
2008 | 145  | 11
2009 | 360  | 21
2010 | 608  | 31
2011 | 845  | 40
2012 | 1056 | 51
The R-square value for regression model between the number of deaths due to
helium poisoning in UK and the number of Facebook users is 0.9928. That is,
99.28% variation in the number of deaths due to helium poisoning in UK is
explained by the number of Facebook users.
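The R-square figure quoted above can be reproduced directly from the table, using SST, SSR and SSE exactly as defined earlier:

```python
import numpy as np

# Facebook users (x) and helium-poisoning deaths (y), 2004-2012,
# from the table on the slide.
x = np.array([1, 6, 12, 58, 145, 360, 608, 845, 1056], dtype=float)
y = np.array([2, 2, 2, 2, 11, 21, 31, 40, 51], dtype=float)

# Least-squares slope and intercept for the SLR model Y = b0 + b1*X
b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()
y_hat = b0 + b1 * x

sst = np.sum((y - y.mean()) ** 2)       # total variation
sse = np.sum((y - y_hat) ** 2)          # unexplained variation
ssr = np.sum((y_hat - y.mean()) ** 2)   # explained variation

r2 = 1 - sse / sst
print(round(r2, 4))   # ≈ 0.993: high R² despite a nonsense relationship
```

The point of the example survives the computation: a very high R² says nothing about whether the relationship is causal or even meaningful.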
Hypothesis Test for Regression Co-efficient (t-Test)
In the above equation, Se is the standard error of estimate (or standard error of the residuals) that measures the accuracy of prediction and is given by

Se = √[ Σ_{i=1}^n (Yi − Ŷi)² / (n − 2) ]

The standard error of the estimate of β1 is

Se(β̂1) = Se / √[ Σ_{i=1}^n (Xi − X̄)² ]
The null and alternative hypotheses for the SLR model can be
stated as follows:
H0: There is no relationship between X and Y
HA: There is a relationship between X and Y
• β1 = 0 would imply that there is no linear relationship between the response variable Y and the explanatory variable X. Thus, the null and alternative hypotheses can be restated as follows:
H0: β1 = 0
HA: β1 ≠ 0
• The corresponding t-statistic is given as

t = (β̂1 − β1) / Se(β̂1) = (β̂1 − 0) / Se(β̂1) = β̂1 / Se(β̂1)
Confidence Interval for Regression Coefficients β0 and β1
The standard errors of the estimates of β0 and β1 are given by

Se(β̂0) = Se × √[ Σ_{i=1}^n Xi² / (n × SSX) ]
Se(β̂1) = Se / √SSX

where Se = √[ Σ_{i=1}^n (Yi − Ŷi)² / (n − 2) ] is the standard error of residuals and SSX = Σ_{i=1}^n (Xi − X̄)².

The interval estimate or (1 − α)100% confidence intervals for β0 and β1 are given by

β̂0 ∓ t_{α/2, n−2} Se(β̂0)
β̂1 ∓ t_{α/2, n−2} Se(β̂1)
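A minimal sketch of the t-test and confidence interval for β1. The x, y values are made up for illustration, and the hard-coded t critical value is an assumption taken from a standard t-table:

```python
import numpy as np

# Illustrative SLR data (roughly y = 1 + 2x plus small noise; made up).
x = np.array([2.0, 4.0, 5.0, 7.0, 8.0, 10.0, 12.0, 13.0, 15.0])
y = np.array([5.1, 8.9, 11.2, 14.8, 17.1, 21.0, 24.9, 27.2, 31.1])
n = len(x)

b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()
residuals = y - (b0 + b1 * x)

se = np.sqrt(np.sum(residuals ** 2) / (n - 2))   # std. error of residuals
ssx = np.sum((x - x.mean()) ** 2)
se_b1 = se / np.sqrt(ssx)                        # Se(beta1_hat)

t_stat = b1 / se_b1                              # tests H0: beta1 = 0
t_crit = 2.365                                   # t_{0.025, n-2} for n = 9
ci = (b1 - t_crit * se_b1, b1 + t_crit * se_b1)
print(t_stat, ci)
```

A |t| far above the critical value, as here, leads to rejecting H0: β1 = 0.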
Multiple Linear Regression
• Multiple linear regression means linear in the regression parameters (the beta values). The following are examples of multiple linear regression models:

Y = β0 + β1 x1 + β2 x2 + ... + βk xk + ε
Y = β0 + β1 x1 + β2 x2 + β3 x1 x2 + β4 x2² + ... + βk xk + ε
R² = 1 − SSE/SST = 1 − Σ_{i=1}^n (Yi − Ŷi)² / Σ_{i=1}^n (Yi − Ȳ)²

• SSE is the sum of squares of errors and SST is the sum of squares of total deviation. In the case of MLR, SSE will decrease as the number of explanatory variables increases, while SST remains constant. To correct for this, the adjusted R-square penalizes the number of variables:

Adjusted R-Square = 1 − [SSE/(n − k − 1)] / [SST/(n − 1)]
Statistical Significance of Individual Variables in MLR – t-test
The estimated regression coefficients are β̂ = (XᵀX)⁻¹XᵀY. For each individual coefficient:
• H0: βi = 0
• HA: βi ≠ 0
The corresponding test statistic is given by

t = (β̂i − 0) / Se(β̂i) = β̂i / Se(β̂i)
Validation of Overall Regression Model – F-test
H0: β1 = β2 = β3 = … = βk = 0
H1: Not all βs are zero.

F = [(SST − SSE)/k] / [SSE/(n − k − 1)] ~ F(k, n−k−1)
F-test for the overall fit of the model
• Where the critical value F(, k, n-k-1) can be found from an F-table.
• The existence of a regression relation by itself does not assure that
useful prediction can be made by using it.
• Note that when k = 1, this test reduces to the F-test for testing in simple linear regression whether or not β1 = 0.
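A sketch of the overall F-test and adjusted R-square for an MLR model. The data are synthetic (n, k and the coefficients are arbitrary illustration values):

```python
import numpy as np

# Synthetic MLR data with k = 2 explanatory variables (made up).
rng = np.random.default_rng(1)
n, k = 50, 2
X = rng.normal(size=(n, k))
y = 3.0 + 2.0 * X[:, 0] - 1.0 * X[:, 1] + rng.normal(scale=0.5, size=n)

Xd = np.column_stack([np.ones(n), X])          # design matrix with intercept
beta = np.linalg.solve(Xd.T @ Xd, Xd.T @ y)    # beta = (X'X)^{-1} X'y
y_hat = Xd @ beta

sst = np.sum((y - y.mean()) ** 2)
sse = np.sum((y - y_hat) ** 2)

r2 = 1 - sse / sst
adj_r2 = 1 - (sse / (n - k - 1)) / (sst / (n - 1))
F = ((sst - sse) / k) / (sse / (n - k - 1))    # ~ F(k, n-k-1) under H0
print(r2, adj_r2, F)
```

A large F relative to the F(k, n−k−1) critical value rejects H0 that all slope coefficients are zero; the adjusted R² is always below the plain R².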
Linear Regression
Supervised learning
Let's start by talking about a few examples of supervised learning problems. Suppose we have a dataset giving the living areas and prices of 47 houses from Delhi, India:
Tapas Kumar Mishra Linear Regression
Given data like this, how can we learn to predict the prices of other
houses in Delhi, as a function of the size of their living areas?
To establish notation for future use, we'll use
x^(i) to denote the input variables (living area in this example), also called input features, and
y^(i) to denote the output or target variable that we are trying to predict (price).
A pair (x^(i), y^(i)) is called a training example, and the dataset that we'll be using to learn, a list of m training examples {(x^(i), y^(i)); i = 1, ..., m}, is called a training set.
We will also use X to denote the space of input values, and Y the space of output values. In this example, X = Y = R.
To describe the supervised learning problem slightly more formally, our goal is, given a training set, to learn a function h : X → Y so that h(x) is a good predictor for the corresponding value of y. For historical reasons, this function h is called a hypothesis.
When the target variable that we're trying to predict is continuous, such as in our housing example, we call the learning problem a regression problem. When y can take on only a small number of discrete values (such as if, given the living area, we wanted to predict whether a dwelling is a house or an apartment, say), we call it a classification problem.
Linear Regression
To make our housing example more interesting, let's consider a slightly richer dataset in which we also know the number of bedrooms in each house:
Here, the x's are two-dimensional vectors in R². For instance, x1^(i) is the living area of the i-th house in the training set, and x2^(i) is its number of bedrooms.
To perform supervised learning, we must decide how we're going to represent functions/hypotheses h in a computer. As an initial choice, let's say we decide to approximate y as a linear function of x:

hθ(x) = θ0 + θ1 x1 + θ2 x2   (1)
Here, the θi's are the parameters (also called weights) parameterizing the space of linear functions mapping from X → Y. When there is no risk of confusion, we will drop the θ subscript in hθ(x), and write it more simply as h(x). To simplify our notation, we also introduce the convention of letting x0 = 1 (this is the intercept term), so that

h(x) = Σ_{j=0}^n θj xj = θᵀx.   (2)
Now, given a training set, how do we pick, or learn, the parameters θ? One reasonable method seems to be to make h(x) close to y, at least for the training examples we have. To formalize this, we will define a function that measures, for each value of the θ's, how close the h(x^(i))'s are to the corresponding y^(i)'s. We define the cost function:

J(θ) = (1/2) Σ_{i=1}^m (h(x^(i)) − y^(i))².   (3)
LMS Algorithm: least mean square
LMS for a single instance
Let's first work it out for the case where we have only one training example (x, y), so that we can neglect the sum in the definition of J. We have:
For a single training example, this gives the update rule:

θj := θj + α (y^(i) − h(x^(i))) xj^(i)   (LMS update rule) (5)
LMS for the full dataset
The reader can easily verify that the quantity in the summation in the update rule above is just ∂J(θ)/∂θj (for the original definition of J). So, this is simply gradient descent on the original cost function J. This method looks at every example in the entire training set on every step, and is called batch gradient descent.
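A minimal batch gradient descent sketch for J(θ), on a made-up 1-D dataset with the x0 = 1 intercept convention. The learning rate, iteration count, and the scaling of the gradient by m are illustrative choices, not prescribed by the slides:

```python
import numpy as np

# Made-up data: y ≈ 4 + 3*x1 plus small noise.
rng = np.random.default_rng(2)
m = 100
x1 = rng.uniform(0, 10, size=m)
y = 4.0 + 3.0 * x1 + rng.normal(scale=0.1, size=m)
X = np.column_stack([np.ones(m), x1])   # x0 = 1 convention

theta = np.zeros(2)
alpha = 0.005                           # learning rate (illustrative)
for _ in range(20000):
    grad = X.T @ (X @ theta - y)        # dJ/dtheta, summed over all m examples
    theta -= alpha * grad / m           # scaled by m to keep the step stable
print(theta)                            # close to [4, 3]
```

Every update touches all m examples, which is exactly what makes batch gradient descent costly on large training sets.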
J(θ) = (1/2) Σ_{i=1}^m (h(x^(i)) − y^(i))².   (6)
The ellipses shown above are the contours of a quadratic function. Also shown is the trajectory taken by gradient descent, which was initialized at (48, 30). The x's in the figure (joined by straight lines) mark the successive values of θ that gradient descent went through.
When we run batch gradient descent to fit θ on our previous dataset, to learn to predict housing price as a function of living area, we obtain θ0 = 71.27, θ1 = 0.1345. If we plot hθ(x) as a function of x (area), along with the training data, we obtain the following figure:
Stochastic Gradient Descent
Whereas batch gradient descent has to scan through the entire training set before taking a single step (a costly operation if m is large), stochastic gradient descent can start making progress right away, and continues to make progress with each example it looks at. Often, stochastic gradient descent gets θ close to the minimum much faster than batch gradient descent. But it may never converge.
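A sketch of stochastic gradient descent with the LMS update applied one example at a time. The dataset, learning rate and epoch count are made up for illustration:

```python
import numpy as np

# Made-up data: y ≈ 1 + 2*x1 plus small noise.
rng = np.random.default_rng(3)
m = 200
x1 = rng.uniform(0, 5, size=m)
y = 1.0 + 2.0 * x1 + rng.normal(scale=0.05, size=m)
X = np.column_stack([np.ones(m), x1])

theta = np.zeros(2)
alpha = 0.01
for epoch in range(50):
    for i in rng.permutation(m):            # visit examples in random order
        error = y[i] - X[i] @ theta
        theta = theta + alpha * error * X[i]  # one LMS step per example
print(theta)   # hovers near [1, 2] but, with constant alpha, never settles
```

With a constant step size θ keeps jittering around the minimum, which is the "may never converge" behavior noted above; decaying alpha over time is the usual remedy.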
Normal Equation
We need a θ such that Xθ = y⃗. So, we need to solve for θ:

Xθ = y⃗
⟹ XᵀXθ = Xᵀy⃗
⟹ (XᵀX)⁻¹(XᵀX)θ = (XᵀX)⁻¹Xᵀy⃗
⟹ θ = (XᵀX)⁻¹Xᵀy⃗
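The closed-form solution can be sketched as follows, on illustrative made-up data. Solving the linear system with np.linalg.solve, rather than forming the inverse explicitly, is a numerical-practice choice, not something the derivation requires:

```python
import numpy as np

# Made-up data: y ≈ -2 + 0.5*x1 plus small noise.
rng = np.random.default_rng(4)
m = 100
x1 = rng.uniform(0, 10, size=m)
y = -2.0 + 0.5 * x1 + rng.normal(scale=0.1, size=m)
X = np.column_stack([np.ones(m), x1])

# Normal equation: solve (X^T X) theta = X^T y
theta = np.linalg.solve(X.T @ X, X.T @ y)
print(theta)   # close to [-2, 0.5]
```

One linear solve replaces the whole iterative descent loop, at the cost of a d×d system that must be non-singular.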
Probabilistic Interpretation
In words, we assume that the data is drawn from a "line" wᵀx through the origin (one can always add a bias/offset through an additional dimension). For each data point with features x^(i), the label y is drawn from a Gaussian with mean wᵀx^(i) and variance σ². Our task is to estimate the slope w from the data.
Estimating with MLE
Estimating with MAP
w = argmax_w [ ∏_{i=1}^m P(y^(i) | x^(i), w) P(x^(i)) ] P(w)
  = argmax_w [ ∏_{i=1}^m P(y^(i) | x^(i), w) ] P(w)
  = argmax_w Σ_{i=1}^m log P(y^(i) | x^(i), w) + log P(w)
  = argmin_w (1/(2σ²)) Σ_{i=1}^m (wᵀx^(i) − y^(i))² + (1/(2τ²)) wᵀw
  = argmin_w (1/m) Σ_{i=1}^m (wᵀx^(i) − y^(i))² + λ‖w‖₂²,   where λ = σ²/(mτ²)
w = argmin_w (1/m) Σ_{i=1}^m (wᵀx^(i) − y^(i))² + λ‖w‖₂²,   where λ = σ²/(mτ²)
Ordinary Least Squares:
• min_w (1/m) Σ_{i=1}^m (xiᵀw − yi)²
• Squared loss.
• No regularization.
• Closed form: w = (XᵀX)⁻¹Xᵀy⃗

Ridge Regression:
• min_w (1/m) Σ_{i=1}^m (xiᵀw − yi)² + λ‖w‖₂²
• Squared loss.
• l2-regularization.
• Closed form: w = (XᵀX + λI)⁻¹Xᵀy⃗
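The two closed forms side by side, on synthetic data (w_true, λ and the dimensions are arbitrary illustration values):

```python
import numpy as np

# Synthetic data: y = X w_true + noise (all values made up).
rng = np.random.default_rng(5)
m, d = 60, 4
X = rng.normal(size=(m, d))
w_true = np.array([1.0, -2.0, 0.5, 3.0])
y = X @ w_true + rng.normal(scale=0.1, size=m)

# OLS closed form: w = (X'X)^{-1} X'y
w_ols = np.linalg.solve(X.T @ X, X.T @ y)

# Ridge closed form: w = (X'X + lam*I)^{-1} X'y
lam = 1.0
w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

print(w_ols)
print(w_ridge)   # shrunk toward zero relative to w_ols
```

Adding λI both regularizes the weights and makes the system solvable even when XᵀX is singular, which is one practical reason to prefer ridge.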
Locally weighted linear regression
Consider the problem of predicting y from x ∈ R. The leftmost figure below shows the result of fitting a line y = θ0 + θ1 x to a dataset. We see that the data doesn't really lie on a straight line, and so the fit is not very good. This is Underfitting: the structure of the data is not captured by the model.
Instead, if we had added an extra feature x², and fit y = θ0 + θ1 x + θ2 x², then we obtain a slightly better fit to the data. Naively, it might seem that the more features we add, the better.
However, there is a danger of adding too many features. The figure below is the result of fitting a 5th-order polynomial y = Σ_{j=0}^5 θj x^j. Even though the fitted curve passes through the data perfectly, it is not a good predictor of y (housing prices) for different x (living area). This is Overfitting.
In the original linear regression algorithm, to make a prediction at a query point x (i.e., to evaluate h(x)), we would:
1. Fit θ to minimize Σ_i (y^(i) − θᵀx^(i))².
2. Output θᵀx.
In contrast, the locally weighted linear regression algorithm does the following:
1. Fit θ to minimize Σ_i z^(i) (y^(i) − θᵀx^(i))².
2. Output θᵀx.
Here the z^(i) are non-negative weights.
If z^(i) is large for a particular value of i, then in picking θ, we will try hard to make (y^(i) − θᵀx^(i))² small.
If z^(i) is small for a particular value of i, then (y^(i) − θᵀx^(i))² is ignored in the fit.
A fairly standard choice for the weights is

z^(i) = exp( −(x^(i) − x)² / (2τ²) )

Note that the weights depend on the query point x:
If |x^(i) − x| is small, z^(i) is close to 1, and if |x^(i) − x| is large, z^(i) is small.
Hence, θ is chosen giving a much higher weight to the training examples close to the query point x.
τ is the bandwidth parameter.
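A sketch of locally weighted regression at a single query point, using the Gaussian weights above. The sin-shaped dataset and τ = 0.3 are illustrative assumptions:

```python
import numpy as np

# Clearly non-linear made-up data: y = sin(x) plus small noise.
rng = np.random.default_rng(7)
m = 200
x = np.sort(rng.uniform(-3, 3, size=m))
y = np.sin(x) + rng.normal(scale=0.05, size=m)
X = np.column_stack([np.ones(m), x])

def lwr_predict(x_query, tau=0.3):
    # z_i = exp(-(x_i - x)^2 / (2 tau^2)): weights centered on the query
    z = np.exp(-(x - x_query) ** 2 / (2 * tau ** 2))
    W = np.diag(z)
    # Weighted least squares: theta = (X'WX)^{-1} X'Wy, refit per query
    theta = np.linalg.solve(X.T @ W @ X, X.T @ W @ y)
    return np.array([1.0, x_query]) @ theta

print(lwr_predict(0.0))   # tracks sin(x) near the query point
```

Because θ is refit for every query point, LWR keeps the whole training set around at prediction time, which is exactly what makes it a non-parametric method.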
Parametric vs non-Parametric
Logistic Regression
Classification problem
Tapas Kumar Mishra Logistic Regression
Logistic regression

hθ(x) = g(θᵀx) = 1 / (1 + e^(−θᵀx))

where g(a) = 1/(1 + e^(−a)) is the logistic/sigmoid function.
Note that g(z) → 1 as z → ∞ and g(z) → 0 as z → −∞. Moreover, g(z) (and hence h(x)) is always bounded between 0 and 1. We always set x0 = 1, so that θᵀx = θ0 + Σ_{j=1}^n θj xj.
Useful property: g′(z) = g(z)(1 − g(z)).
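The derivative identity g′(z) = g(z)(1 − g(z)) can be verified numerically; a small sketch, not part of the slides:

```python
import numpy as np

def g(z):
    # logistic/sigmoid function
    return 1.0 / (1.0 + np.exp(-z))

z = np.linspace(-6, 6, 101)
eps = 1e-6
# Central finite difference approximation of g'(z)
numeric_grad = (g(z + eps) - g(z - eps)) / (2 * eps)
print(np.max(np.abs(numeric_grad - g(z) * (1 - g(z)))))   # tiny

print(g(np.array([-100.0, 0.0, 100.0])))   # close to [0, 0.5, 1]
```

This identity is what makes the gradient of the log-likelihood come out so cleanly in the derivation that follows.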
Let us assume that
P(y = 1|x; θ) = hθ(x);
P(y = 0|x; θ) = 1 − hθ(x).
This can be written compactly as

P(y|x; θ) = (hθ(x))^y (1 − hθ(x))^(1−y)
Our job is to maximize L(θ).
To maximize the likelihood, we use gradient ascent:

θ := θ + α ∇θ ℓ(θ).

We start by taking just one training example (x, y) and take derivatives to derive the stochastic gradient ascent rule.
This gives the stochastic ascent rule

θj := θj + α (y^(i) − hθ(x^(i))) xj^(i).

This looks just like the LMS update rule, even though here hθ(x^(i)) is the non-linear sigmoid of θᵀx^(i)!
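A minimal sketch of logistic regression trained with this ascent rule, applied here in batch form over a made-up dataset; all data and hyperparameters are illustrative:

```python
import numpy as np

# Made-up 1-D problem: true P(y=1|x) = sigmoid(2x - 1), i.e. theta = [-1, 2]
# once the x0 = 1 intercept column is included.
rng = np.random.default_rng(6)
m = 300
x1 = rng.normal(size=m)
p = 1.0 / (1.0 + np.exp(-(2.0 * x1 - 1.0)))
y = (rng.uniform(size=m) < p).astype(float)
X = np.column_stack([np.ones(m), x1])

theta = np.zeros(2)
alpha = 0.1
for _ in range(2000):
    h = 1.0 / (1.0 + np.exp(-(X @ theta)))   # h_theta(x) for every example
    theta += alpha * X.T @ (y - h) / m       # batch form of the ascent rule
print(theta)   # roughly recovers [-1, 2], up to sampling noise
```

The update is ascent (a plus sign) because we are maximizing the log-likelihood rather than minimizing a cost.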
Logistic Regression is the discriminative counterpart to Naive Bayes.
In Naive Bayes, we first model P(x|y) for each label y, and then obtain the decision boundary that best discriminates between these two distributions.
In Logistic Regression we do not attempt to model the data distribution P(x|y); instead, we model P(y|x) directly.
The fact that we don't make any assumption about P(x|y) allows logistic regression to be more flexible, but such flexibility also requires more data to avoid overfitting.
Typically, in scenarios with little data and if the modeling assumption is appropriate, Naive Bayes tends to outperform Logistic Regression. However, as data sets become large, logistic regression often outperforms Naive Bayes, which suffers from the fact that the assumptions made on P(x|y) are probably not exactly correct. If the assumptions hold exactly, i.e. the data is truly drawn from the distribution that we assumed in Naive Bayes, then Logistic Regression and Naive Bayes converge to the exact same result in the limit.
Optimizing the training process: Underfitting, overfitting, testing, and regularization
• Let’s say that we have to study for a test.
• Several things could go wrong during our study process.
• Maybe we didn’t study enough. There’s no way to fix that, and we’ll likely
perform poorly on our test. ---------- Underfitting
• What if we studied a lot but in the wrong way? For example, instead of focusing
on learning, we decided to memorize the entire textbook word for word. Will
we do well on our test? It’s likely that we won’t, because we simply memorized
everything without learning. ---------- Overfitting
• The best option, of course, would be to study for the exam properly and in a
way that enables us to answer new questions that we haven’t seen before on
the topic. -----------Generalization
• Notice that model 1 is too simple, because it is a line trying to fit a quadratic dataset. There is no way
we’ll find a good line to fit this dataset, because the dataset simply does not look like a line.
Therefore, model 1 is a clear example of underfitting.
• Model 2, in contrast, fits the data pretty well. This model neither overfits nor underfits.
• Model 3 fits the data extremely well, but it completely misses the point. The data is meant to look
like a parabola with a bit of noise, and the model draws a very complicated polynomial of degree 10
that manages to go through each one of the points but doesn’t capture the essence of the data.
Model 3 is a clear example of overfitting.
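The three regimes can be reproduced in a few lines. A sketch (the data and noise level are made up): fit degree-1, degree-2, and degree-10 polynomials to a noisy parabola and compare training errors; the degree-10 fit chases the noise.

```python
import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(-1, 1, 12)
y = x**2 + rng.normal(0, 0.05, x.size)      # quadratic data with a bit of noise

def train_error(degree):
    coeffs = np.polyfit(x, y, degree)       # least-squares polynomial fit
    return np.mean((np.polyval(coeffs, x) - y) ** 2)

errors = {d: train_error(d) for d in (1, 2, 10)}
# Degree 1 underfits (large error); degree 10 drives training error toward 0,
# but a tiny training error does not mean it generalizes better.
```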
How do we get the computer to pick the right
model?
By testing
• Testing a model consists of picking a small set of the points in the dataset and choosing to use
them not for training the model but for testing the model’s performance. This set of points is
called the testing set.
• The remaining set of points (the majority), which we use for training the
model, is called the training set.
• Once we’ve trained the model on the training set, we use the
testing set to evaluate the model.
• In this way, we make sure that the model is good at generalizing
to unseen data, as opposed to memorizing the training set.
• Going back to the exam analogy, let’s imagine training and testing this way.
• Let’s say that the book we are studying for in the exam has
100 questions at the end.
• We pick 80 of them to train, which means we study them carefully, look
up the answers, and learn them.
• Then we use the remaining 20 questions to test ourselves—we
try to answer them without looking at the book, as in an exam setting.
• Looking at the top row we can see that model 1 has a large training
error, model 2 has a small training error, and model 3 has a tiny
training error (zero, in fact). Thus, model 3 does the best job on the
training set.
• Model 1 still has a large testing error, meaning that this is simply a bad
model, underperforming on both the training and the testing set: it
underfits.
Can we use our testing data for training the model? No.
• We broke the golden rule in the previous example.
• Recall that we had three polynomial regression models: one of degree
1, one of degree 2, and one of degree 10, and we didn’t know which one
to pick.
• We used our training data to train the three models, and then we used
the testing data to decide which model to pick.
• We are not supposed to use the testing data to train our model or to
make any decisions on the model or its hyperparameters.
Solution: Validation Set
We break our dataset into the following three sets:
• Training set: for training all our models
• Validation set: for making decisions on which model to use
• Testing set: for checking how well our model did
• Imagine that we have a different and much more complex dataset, and we are trying to build a
polynomial regression model to fit it. We want to decide the degree of our model among the
numbers between 0 and 10 (inclusive).
• the way to decide which model to use is to pick the one that has the smallest validation error.
• However, plotting the training and validation errors can give us some valuable information and
help us examine trends.
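The degree-selection procedure above can be sketched in plain NumPy (in practice one would use scikit-learn's train_test_split; the sizes and seed here are made up): train every candidate degree on the training set, pick the one with the smallest validation error, and only then score on the test set.

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.uniform(-1, 1, 200)
y = x**2 + rng.normal(0, 0.1, x.size)

# 60% / 20% / 20% split: train / validation / test.
idx = rng.permutation(x.size)
tr, va, te = idx[:120], idx[120:160], idx[160:]

def mse(deg, fit_idx, eval_idx):
    c = np.polyfit(x[fit_idx], y[fit_idx], deg)   # train only on fit_idx
    return np.mean((np.polyval(c, x[eval_idx]) - y[eval_idx]) ** 2)

val_errors = {d: mse(d, tr, va) for d in range(0, 11)}
best_degree = min(val_errors, key=val_errors.get)  # decision made on validation set
test_error = mse(best_degree, tr, te)              # reported once, at the very end
```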
The model complexity graph
Another alternative to avoiding overfitting: Regularization
Now it is clear that roofer 2 is the best one, which means that optimizing performance and
complexity at the same time yields good results that are also as simple as possible. This is what
regularization is about: measuring performance and complexity with two different error functions,
and adding them to get a more robust error function.
Regularization- Measuring how complex a model is: L1 and L2 norm
• In the roofer analogy, our goal was to find a roofer that provided both good quality
and low complexity. We did this by minimizing the sum of two numbers: the measure of quality
and the measure of complexity. Regularization consists of applying the same principle to our
machine learning model.
• Regression error: a measure of the quality of the model. In this case, it can be the absolute
or square errors.
• Regularization term: a measure of the complexity of the model. It can be the L1 or the L2
norm of the model.
• Error = Regression error + λ · Regularization term
• λ is the regularization hyperparameter.
• Lasso regression error = Regression error + λ · L1 norm
• Ridge regression error = Regression error + λ · L2 norm
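Ridge regression has a closed-form solution (a standard result, though not derived on the slides): minimizing ||Xw − y||² + λ||w||² gives w = (XᵀX + λI)⁻¹Xᵀy. A sketch with made-up data, showing that a larger λ shrinks the weights:

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(50, 5))
w_true = np.array([3.0, -2.0, 0.0, 0.0, 1.0])
y = X @ w_true + rng.normal(0, 0.1, 50)

def ridge(X, y, lam):
    # Closed-form minimizer of ||Xw - y||^2 + lam * ||w||^2.
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

w_small = ridge(X, y, 0.01)     # nearly ordinary least squares
w_large = ridge(X, y, 1000.0)   # heavy L2 penalty shrinks weights toward zero
```

Lasso (the L1 penalty) has no closed form, which is why it is usually solved iteratively, but the effect is analogous: it pushes small weights exactly to zero.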
Regularization- Effects of L1 and L2 regularization
2/22
Tapas Kumar Mishra Bias-Variance Tradeoff
As usual, we are given a dataset D = {(x1, y1), ..., (xn, yn)},
drawn i.i.d. from some distribution P(X, Y). Throughout this
lecture we assume a regression setting, i.e. y ∈ R.
In this lecture we will decompose the generalization error of a
classifier into three rather interpretable terms.
Before we do that, let us consider that for any given input x
there might not exist a unique label y.
For example, if your vector x describes features of a house (e.g.
#bedrooms, square footage, ...) and the label y its price, you
could imagine two houses with identical descriptions selling for
different prices.
So for any given feature vector x, there is a distribution over
possible labels. We therefore define the following, which will
come in useful later on:
Expected Label (given x ∈ R^d):
\[ \bar{y}(x) = E_{y|x}[Y] = \int_y y \, \Pr(y \mid x) \, \partial y. \]
The expected label denotes the label you would expect to obtain,
given a feature vector x.
We draw our training set D, consisting of n inputs, i.i.d. from the
distribution P. As a second step we typically call some machine
learning algorithm A on this data set to learn a hypothesis (aka
classifier). Formally, we denote this process as h_D = A(D).
For a given hD , learned on data set D with algorithm A, we can
compute the generalization error (as measured in squared loss) as
follows:
Expected Test Error (given h_D):
\[ E_{(x,y)\sim P}\big[(h_D(x) - y)^2\big] = \int_x \int_y (h_D(x) - y)^2 \, \Pr(x,y) \, \partial y \, \partial x. \]
The previous statement is true for a given training set D.
However, remember that D itself is drawn from P n , and is
therefore a random variable.
Further, hD is a function of D, and is therefore also a random
variable. And we can of course compute its expectation:
Expected Classifier (given A):
\[ \bar{h} = E_{D\sim P^n}[h_D] = \int_D h_D \, \Pr(D) \, \partial D \]
We can also use the fact that hD is a random variable to compute
the expected test error only given A, taking the expectation also
over D.
Expected Test Error (given A):
\[ E_{\substack{(x,y)\sim P \\ D\sim P^n}}\big[(h_D(x) - y)^2\big] = \int_D \int_x \int_y (h_D(x) - y)^2 \, P(x,y) \, P(D) \, \partial x \, \partial y \, \partial D \]
To be clear, D is our training points and the (x, y ) pairs are the
test points.
We are interested in exactly this expression, because it evaluates
the quality of a machine learning algorithm A with respect to a
data distribution P(X , Y ). In the following we will show that this
expression decomposes into three meaningful terms.
Decomposition of Expected Test Error
\begin{align}
E_{x,y,D}\big[(h_D(x) - y)^2\big]
&= E_{x,y,D}\Big[\big[(h_D(x) - \bar{h}(x)) + (\bar{h}(x) - y)\big]^2\Big] \nonumber\\
&= E_{x,D}\big[(h_D(x) - \bar{h}(x))^2\big]
 + 2\,\underbrace{E_{x,y,D}\big[(h_D(x) - \bar{h}(x))(\bar{h}(x) - y)\big]}_{0}
 + E_{x,y}\big[(\bar{h}(x) - y)^2\big] \tag{1}
\end{align}
The middle term of the above equation is 0, as we show below:
\begin{align*}
E_{x,y,D}\big[(h_D(x) - \bar{h}(x))(\bar{h}(x) - y)\big]
&= E_{x,y}\Big[E_D\big[h_D(x) - \bar{h}(x)\big]\,(\bar{h}(x) - y)\Big]\\
&= E_{x,y}\Big[\big(E_D[h_D(x)] - \bar{h}(x)\big)\,(\bar{h}(x) - y)\Big]\\
&= E_{x,y}\big[(\bar{h}(x) - \bar{h}(x))\,(\bar{h}(x) - y)\big]\\
&= E_{x,y}[0] = 0
\end{align*}
Returning to the earlier expression: with the middle term equal to zero, we are left with the variance and another term:
\[ E_{x,y,D}\big[(h_D(x) - y)^2\big] = \underbrace{E_{x,D}\big[(h_D(x) - \bar{h}(x))^2\big]}_{\text{Variance}} + E_{x,y}\big[(\bar{h}(x) - y)^2\big] \tag{3} \]
We can break down the second term in the above equation as follows:
\begin{align}
E_{x,y}\big[(\bar{h}(x) - y)^2\big]
&= E_{x,y}\Big[\big[(\bar{h}(x) - \bar{y}(x)) + (\bar{y}(x) - y)\big]^2\Big] \nonumber\\
&= \underbrace{E_{x,y}\big[(\bar{y}(x) - y)^2\big]}_{\text{Noise}}
 + \underbrace{E_x\big[(\bar{h}(x) - \bar{y}(x))^2\big]}_{\text{Bias}^2}
 + 2\,\underbrace{E_{x,y}\big[(\bar{h}(x) - \bar{y}(x))(\bar{y}(x) - y)\big]}_{0} \tag{4}
\end{align}
The third term in the equation above is 0, as we show below:
\begin{align*}
E_{x,y}\big[(\bar{h}(x) - \bar{y}(x))(\bar{y}(x) - y)\big]
&= E_x\Big[E_{y|x}\big[\bar{y}(x) - y\big]\,\big(\bar{h}(x) - \bar{y}(x)\big)\Big]\\
&= E_x\Big[\big(\bar{y}(x) - E_{y|x}[y]\big)\,\big(\bar{h}(x) - \bar{y}(x)\big)\Big]\\
&= E_x\big[(\bar{y}(x) - \bar{y}(x))\,\big(\bar{h}(x) - \bar{y}(x)\big)\big]\\
&= E_x[0] = 0
\end{align*}
This gives us the decomposition of expected test error as follows:
\[ \underbrace{E_{x,y,D}\big[(h_D(x) - y)^2\big]}_{\text{Expected Test Error}} = \underbrace{E_{x,D}\big[(h_D(x) - \bar{h}(x))^2\big]}_{\text{Variance}} + \underbrace{E_x\big[(\bar{h}(x) - \bar{y}(x))^2\big]}_{\text{Bias}^2} + \underbrace{E_{x,y}\big[(\bar{y}(x) - y)^2\big]}_{\text{Noise}} \]
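The decomposition can be checked numerically. A sketch (the setup is made up for illustration): the learner always predicts the mean of its training labels, a deliberately biased model, and each of the four terms is estimated by Monte Carlo over fresh training sets and test points.

```python
import numpy as np

rng = np.random.default_rng(0)
n_train, sigma, trials = 10, 0.1, 20000

# True regression function: ybar(x) = x; labels carry N(0, sigma^2) noise.
# Learner A: predict the mean of the training labels (deliberately biased).
preds, xs, ys = [], [], []
for _ in range(trials):
    x_tr = rng.uniform(0, 1, n_train)
    y_tr = x_tr + rng.normal(0, sigma, n_train)   # a fresh training set D
    x = rng.uniform()                             # one test point per trial
    y = x + rng.normal(0, sigma)                  # its noisy label
    preds.append(y_tr.mean())
    xs.append(x)
    ys.append(y)

preds, xs, ys = map(np.array, (preds, xs, ys))
h_bar = preds.mean()       # expected classifier (a constant for this learner)
y_bar = xs                 # expected label ybar(x) = x by construction

error    = np.mean((preds - ys) ** 2)     # expected test error
variance = np.mean((preds - h_bar) ** 2)
bias2    = np.mean((h_bar - y_bar) ** 2)
noise    = np.mean((ys - y_bar) ** 2)
# error ~ variance + bias2 + noise (the cross terms vanish in expectation)
```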
Variance:
\[ \underbrace{E_{x,D}\big[(h_D(x) - \bar{h}(x))^2\big]}_{\text{Variance}} \]
Captures how much your classifier changes if you train on a
different training set.
How ”over-specialized” is your classifier to a particular training set
(overfitting)?
If we have the best possible model for our training data, how far
off are we from the average classifier?
Bias:
\[ \underbrace{E_x\big[(\bar{h}(x) - \bar{y}(x))^2\big]}_{\text{Bias}^2} \]
What is the inherent error that you obtain from your classifier even
with infinite training data?
This is due to your classifier being ”biased” to a particular kind of
solution (e.g. linear classifier).
In other words, bias is inherent to your model.
Noise:
\[ \underbrace{E_{x,y}\big[(\bar{y}(x) - y)^2\big]}_{\text{Noise}} \]
How big is the data-intrinsic noise?
This error measures ambiguity due to your data distribution and
feature representation. You can never beat this; it is an aspect of
the data.
Figure : Graphical illustration of bias and variance.
Figure : The variation of Bias and Variance with the model complexity.
This is similar to the concept of overfitting and underfitting. More
complex models overfit while the simplest models underfit.
Detecting High Bias and High Variance
The graph above plots the training error and the test error and can
be divided into two overarching regimes. In the first regime (on the
left side of the graph), training error is below the desired error
threshold (denoted by ε), but test error is significantly higher.
Figure : Test and training error as the number of training instances
increases.
In the second regime (on the right side of the graph), test error is
remarkably close to training error, but both are above the desired
tolerance of ε.
Regime 1 (High Variance)
Symptoms:
Training error is much lower than test error
Training error is lower than ε
Test error is above ε
Remedies:
Add more training data
Reduce model complexity – complex models are prone to high
variance
Bagging (will be covered later in the course)
Regime 2 (High Bias): the model being used is not robust enough
to produce an accurate prediction
Symptoms:
Training error is higher than ε, but close to test error.
Remedies:
Use more complex model (e.g. kernelize, use non-linear
models)
Add features
Boosting (will be covered later in the course)
Model Selection
Performance estimation techniques
Always evaluate models as if they are predicting future data
We do not have access to future data, so we pretend that some data is hidden
Simplest way: the holdout (simple train-test split)
Randomly split data (and labels) into training, validation and test set (e.g. 60%-20%-20%)
Train (fit) a model on the training data, minimize error on the validation set and score on the test data
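The holdout split can be sketched in plain NumPy (with scikit-learn you would normally call train_test_split twice; the sizes here are illustrative):

```python
import numpy as np

rng = np.random.default_rng(4)
n = 1000
X, y = rng.normal(size=(n, 3)), rng.integers(0, 2, n)

idx = rng.permutation(n)                       # shuffle before splitting
train_idx = idx[:int(0.6 * n)]                 # 60% train
val_idx   = idx[int(0.6 * n):int(0.8 * n)]     # 20% validation
test_idx  = idx[int(0.8 * n):]                 # 20% test
# The three index sets are disjoint and together cover the whole dataset.
```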
K-fold Cross-validation
Each random split can yield very different models (and scores)
e.g. all easy (or hard) examples could end up in the test set
Split data into k equal-sized parts, called folds
Create k splits, each time using a different fold as the test set
Compute k evaluation scores, aggregate afterwards (e.g. take the mean)
Large k gives better estimates (more training data), but is expensive
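The fold construction can be sketched as follows (scikit-learn's KFold and cross_val_score do this for you; the fold count and the placeholder score are made up):

```python
import numpy as np

def kfold_indices(n, k, seed=0):
    # Shuffle once, then cut the indices into k (nearly) equal folds.
    idx = np.random.default_rng(seed).permutation(n)
    folds = np.array_split(idx, k)
    # Each split uses one fold as the test set and the rest as the training set.
    return [(np.concatenate(folds[:i] + folds[i + 1:]), folds[i])
            for i in range(k)]

splits = kfold_indices(100, 5)
scores = [len(te) for _, te in splits]   # placeholder for one evaluation score per fold
mean_score = np.mean(scores)             # aggregate the k scores afterwards
```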
Stratified K-Fold cross-validation
The bootstrap
Sample n (dataset size) data points, with replacement, as the training set (the bootstrap)
On average, bootstraps include 66% of all data points (some are duplicates)
Use the unsampled (out-of-bootstrap) samples as the test set
Repeat k times to obtain k scores
Repeated cross-validation
Cross-validation is still biased in that the initial split can be made in many ways
Repeated, or n-times-k-fold cross-validation:
Shuffle data randomly, do k-fold cross-validation
Repeat n times, yields n times k scores
Unbiased, very robust, but n times more expensive
Cross-validation with groups
Every new sample is evaluated only once, then added to the training set
Can also be done in batches (of n samples at a time)
TimeSeriesSplit
In the kth split, the first k folds are the train set and the (k+1)th fold is the validation set
Often, a maximum training set size (or window) is used
more robust against concept drift (change in data over time)
Choosing a performance estimation procedure
No strict rules, only guidelines:
Always use stratification for classification (sklearn does this by default)
Use holdout for very large datasets (e.g. >1,000,000 examples)
Or when learners don't always converge (e.g. deep learning)
Choose k depending on dataset size and resources
Use leave-one-out for very small datasets (e.g. <100 examples)
Use cross-validation otherwise
Most popular (and theoretically sound): 10-fold CV
Literature suggests 5x2-fold CV is better
Use grouping or leave-one-subject-out for grouped data
Use train-then-test for time series
Binary classification
https://en.wikipedia.org/wiki/Precision_and_recall
Multi-class classification
Train one model per class: one class viewed as positive, the other(s) as negative, then average
micro-averaging: count total TP, FP, TN, FN (every sample equally important)
micro-precision, micro-recall, micro-F1, and accuracy are all the same
\[ \text{micro-Precision} = \frac{\sum_{c=1}^{C} TP_c}{\sum_{c=1}^{C} TP_c + \sum_{c=1}^{C} FP_c} \;\xrightarrow{\;C=2\;}\; \frac{TP + TN}{TP + TN + FP + FN} \]
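Micro-averaging can be sketched directly from the counts (the toy labels below are made up; scikit-learn's precision_score with average='micro' computes the same quantity):

```python
import numpy as np

y_true = np.array([0, 0, 1, 1, 2])
y_pred = np.array([0, 1, 1, 1, 2])

classes = np.unique(np.concatenate([y_true, y_pred]))
TP = sum(np.sum((y_pred == c) & (y_true == c)) for c in classes)
FP = sum(np.sum((y_pred == c) & (y_true != c)) for c in classes)

micro_precision = TP / (TP + FP)
accuracy = np.mean(y_true == y_pred)
# With one prediction per sample, every wrong prediction is simultaneously a FP
# (for the predicted class) and a FN (for the true class), so
# micro-precision == micro-recall == accuracy.
```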
Other useful classification metrics
Cohen's Kappa
Measures 'agreement' between different models (aka inter-rater agreement)
To evaluate a single model, compare it against a model that does random guessing
Similar to accuracy, but taking into account the possibility of predicting the
right class by chance
Can be weighted: different misclassifications given different weights
1: perfect prediction, 0: random prediction, negative: worse than random
With p₀ = accuracy of the model, and pₑ = accuracy of the random classifier:
\[ \kappa = \frac{p_0 - p_e}{1 - p_e} \]
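Kappa can be computed straight from the definition (the toy labels are made up; scikit-learn's cohen_kappa_score gives the same result):

```python
import numpy as np

y_true = np.array([0, 0, 1, 1])
y_pred = np.array([0, 1, 1, 1])

p0 = np.mean(y_true == y_pred)    # observed accuracy
# Expected accuracy of a random classifier with the same class marginals:
classes = np.unique(np.concatenate([y_true, y_pred]))
pe = sum(np.mean(y_true == c) * np.mean(y_pred == c) for c in classes)
kappa = (p0 - pe) / (1 - pe)      # 1: perfect, 0: random, <0: worse than random
```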
The best trade-off between precision and recall depends on your application
You can have arbitrarily high recall, but you often want reasonable precision, too.
Plotting precision against recall for all possible thresholds yields a precision-recall curve
Change the threshold until you find a sweet spot in the precision-recall trade-off
Often jagged at high thresholds, when there are few positive examples left
Model selection
Plotting TPR against FPR for all possible thresholds yields a Receiver Operating
Characteristic (ROC) curve
Change the threshold until you find a sweet spot in the TPR-FPR trade-off
Lower thresholds yield higher TPR (recall), higher FPR, and vice versa
Visualization
Histograms show the number of points with a certain decision value (for each class)
TPR = TP / (TP + FN): can be seen from the positive predictions (top histogram)
FPR = FP / (FP + TN): can be seen from the negative predictions (bottom histogram)
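Sweeping the threshold and computing (TPR, FPR) at each point can be sketched as follows (the scores and thresholds are made up; scikit-learn's roc_curve automates this):

```python
import numpy as np

y_true = np.array([0, 0, 1, 1])
scores = np.array([0.1, 0.4, 0.35, 0.8])           # decision values from some model

def roc_point(threshold):
    y_pred = (scores >= threshold).astype(int)
    TP = np.sum((y_pred == 1) & (y_true == 1))
    FP = np.sum((y_pred == 1) & (y_true == 0))
    FN = np.sum((y_pred == 0) & (y_true == 1))
    TN = np.sum((y_pred == 0) & (y_true == 0))
    return TP / (TP + FN), FP / (FP + TN)          # (TPR, FPR)

curve = [roc_point(t) for t in (0.0, 0.36, 0.9)]
# Lower thresholds give higher TPR but also higher FPR.
```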
Model selection
R² (coefficient of determination): between 0 and 1, but negative if the model is worse than just predicting the mean
Easier to interpret (higher is better).
Decision tree learning
Inductive inference with decision trees
▪ Inductive reasoning is a method of reasoning in which
a body of observations is considered in order to derive a
general principle.
▪ Decision trees are one of the most widely used and
practical methods of inductive inference
▪ Features
▪ Method for approximating discrete-valued functions
(including boolean)
▪ Learned functions are represented as decision trees (or if-then-else rules)
▪ Expressive hypothesis space, including disjunction
Decision tree representation (PlayTennis)
When to use Decision Trees
▪ Problem characteristics:
▪ Instances can be described by attribute value pairs
▪ Target function is discrete valued
▪ Disjunctive hypothesis may be required
▪ Possibly noisy training data samples
▪ Robust to errors in training data
▪ Missing attribute values
▪ Different classification problems:
▪ Equipment or medical diagnosis
▪ Credit risk analysis
▪ Several tasks in natural language processing
Top-down induction of Decision Trees
▪ ID3 (Quinlan, 1986) is a basic algorithm for learning DT's
▪ Given a training set of examples, the algorithms for building DT
performs search in the space of decision trees
▪ The construction of the tree is top-down. The algorithm is greedy.
▪ The fundamental question is “which attribute should be tested next?
Which question gives us more information?”
▪ Select the best attribute
▪ A descendent node is then created for each possible value of this
attribute and examples are partitioned according to this value
▪ The process is repeated for each successor node until all the
examples are classified correctly or there are no attributes left
Which attribute is the best classifier?
ID3: algorithm
ID3(X, T, Attrs)   X: training examples,
                   T: target attribute (e.g. PlayTennis),
                   Attrs: other attributes, initially all attributes
Create Root node
If all X's are +, return Root with class +
If all X's are –, return Root with class –
If Attrs is empty, return Root with class = most common value of T in X
else
  A ← the best attribute; decision attribute for Root ← A
  For each possible value vi of A:
    - add a new branch below Root, for the test A = vi
    - Xi ← subset of X with A = vi
    - If Xi is empty then add a new leaf with class = the most common value of T in X
      else add the subtree generated by ID3(Xi, T, Attrs − {A})
return Root
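The "select the best attribute" step is scored by information gain, Gain(S, A) = H(S) − Σ_v (|S_v|/|S|) H(S_v). A minimal sketch, on a made-up toy dataset where one attribute is perfectly informative and the other is not:

```python
import math
from collections import Counter

def entropy(labels):
    """H(S) = -sum p_i log2 p_i over the class proportions in S."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(examples, labels, attr):
    """Gain(S, A) = H(S) - sum_v |S_v|/|S| * H(S_v)."""
    n = len(labels)
    remainder = 0.0
    for v in set(ex[attr] for ex in examples):
        subset = [lab for ex, lab in zip(examples, labels) if ex[attr] == v]
        remainder += len(subset) / n * entropy(subset)
    return entropy(labels) - remainder

# Toy data: Wind perfectly predicts the label, Humidity does not.
X = [{"Wind": "Weak",   "Humidity": "High"},
     {"Wind": "Weak",   "Humidity": "Normal"},
     {"Wind": "Strong", "Humidity": "High"},
     {"Wind": "Strong", "Humidity": "Normal"}]
y = ["Yes", "Yes", "No", "No"]

print(information_gain(X, y, "Wind"))      # 1.0 (perfect split)
print(information_gain(X, y, "Humidity"))  # 0.0 (uninformative)
```

ID3 would test Wind at this node, since it has the highest gain.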
Inductive bias in decision tree learning
(Outlook=Sunny) ∧ (Humidity=High) ⇒ (PlayTennis=No)
Why converting to rules?
▪ Each distinct path produces a different rule: a condition
removal may be based on a local (contextual) criterion. Node
pruning is global and affects all the rules
▪ In rule form, tests are not ordered and there is no book-
keeping involved when conditions (nodes) are removed
▪ Converting to rules improves readability for humans
Dealing with continuous-valued attributes
▪ So far discrete values for attributes and for outcome.
▪ Given a continuous-valued attribute A, dynamically create a
new attribute Ac
Ac = True if A < c, False otherwise
▪ How to determine threshold value c ?
▪ Example. Temperature in the PlayTennis example
▪ Sort the examples according to Temperature
Temperature   40   48  |  60   72   80  |  90
PlayTennis    No   No  |  Yes  Yes  Yes |  No
▪ Determine candidate thresholds by averaging consecutive values where
there is a change in classification: (48+60)/2=54 and (80+90)/2=85
▪ Evaluate candidate thresholds (attributes) according to information gain.
The best is Temperature > 54. The new attribute competes with the other
ones
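The candidate-threshold procedure above can be sketched directly (the temperature data is from the slide; the function name is ours):

```python
def candidate_thresholds(values, labels):
    """Average consecutive attribute values where the class label changes."""
    pairs = sorted(zip(values, labels))
    return [(a + b) / 2
            for (a, la), (b, lb) in zip(pairs, pairs[1:])
            if la != lb]

temps = [40, 48, 60, 72, 80, 90]
play  = ["No", "No", "Yes", "Yes", "Yes", "No"]
print(candidate_thresholds(temps, play))  # [54.0, 85.0]
```

Each returned threshold then becomes a boolean attribute evaluated by information gain like any other.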
Problems with information gain
▪ Natural bias of information gain: it favours attributes with
many possible values.
▪ Consider the attribute Date in the PlayTennis example.
▪ Date would have the highest information gain since it perfectly
separates the training data.
▪ It would be selected at the root resulting in a very broad tree
▪ Very good on the training data, this tree would perform poorly in predicting
unseen instances: overfitting.
▪ The problem is that the partition is too specific, too many small
classes are generated.
▪ We need to look at alternative measures …
An alternative measure: gain ratio
SplitInformation(S, A) = − Σᵢ₌₁ᶜ (|Sᵢ| / |S|) log₂ (|Sᵢ| / |S|)
▪ Si are the sets obtained by partitioning on value i of A
▪ SplitInformation measures the entropy of S with respect to the values of A. The
more uniformly dispersed the data the higher it is.
GainRatio(S, A) = Gain(S, A) / SplitInformation(S, A)
▪ GainRatio penalizes attributes that split examples in many small classes such as
Date. Let |S |=n, Date splits examples in n classes
▪ SplitInformation(S, Date) = −[(1/n) log₂ (1/n) + … + (1/n) log₂ (1/n)] = −log₂ (1/n) = log₂ n
▪ Compare with A, which splits data in two even classes:
▪ SplitInformation(S, A) = −[(1/2) log₂ (1/2) + (1/2) log₂ (1/2)] = −[−1/2 − 1/2] = 1
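The two SplitInformation computations above can be checked numerically (a minimal sketch; `split_information` takes the partition sizes |Sᵢ| as input):

```python
import math

def split_information(sizes):
    """Entropy of S with respect to the partition induced by an attribute."""
    n = sum(sizes)
    return -sum((s / n) * math.log2(s / n) for s in sizes if s > 0)

n = 8
print(split_information([1] * n))       # Date-like attribute: log2(8) = 3.0
print(split_information([n // 2] * 2))  # even binary split: 1.0
```

Dividing Gain(S, A) by this quantity (GainRatio) is what penalizes Date-like attributes.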
Adjusting gain-ratio
▪ Problem: SplitInformation(S, A) can be zero or very small
when |Si | ≈ |S | for some value i
▪ To mitigate this effect, the following heuristic has been used:
1. compute Gain for each attribute
2. apply GainRatio only to attributes with Gain above average
Handling incomplete training data
▪ How to cope with the problem that the value of some attribute
may be missing?
▪ Example: Blood-Test-Result in a medical diagnosis problem
▪ The strategy: use the other examples to guess the missing attribute value
1. Assign the value that is most common among the training examples at
the node
2. Assign a probability to each value, based on frequencies, and assign
values to missing attribute, according to this probability distribution
▪ Missing values in new instances to be classified are treated
accordingly, and the most probable classification is chosen
(C4.5)
Handling attributes with different costs
▪ Instance attributes may have an associated cost: we would
prefer decision trees that use low-cost attributes
▪ ID3 can be modified to take into account costs:
1. Tan and Schlimmer (1990): Gain²(S, A) / Cost(A)
2. Nunez (1988): (2^Gain(S, A) − 1) / (Cost(A) + 1)^w, with w ∈ [0, 1]
Gini (impurity) Index
▪ The Gini index is a measure of diversity in a dataset. In other
words, if we have a set in which all the elements are similar,
this set has a low Gini index, and if all the elements are
different, it has a large Gini index.
▪ For clarity, consider the following two sets of 10 colored balls
(where any two balls of the same color are indistinguishable):
▪ • Set 1: eight red balls, two blue balls
▪ • Set 2: four red balls, three blue balls, two yellow balls, one green
ball
▪ Set 1 looks more pure than set 2, because set 1 contains
mostly red balls and a couple of blue ones, whereas set 2 has
many different colors. Next, we devise a measure of impurity
that assigns a low value to set 1 and a high value to set 2.
Gini (impurity) Index
▪ If we pick two random elements of the set, what is the
probability that they have a different color ? The two elements
don’t need to be distinct; we are allowed to pick the same
element twice.
▪ P(picking two balls of different color) = 1 – P(picking two balls
of the same color)
▪ P(picking two balls of the same color) = P(both balls are color 1)
+ P(both balls are color 2) + … + P(both balls are color n)
▪ P(both balls are color i) = pᵢ²
▪ P(picking two balls of different colors) = 1 − p₁² − p₂² − … − pₙ²
Gini (impurity) Index
▪ Gini impurity index:
In a set with m elements and n classes, with ai elements belonging
to the i-th class, the Gini impurity index is
Gini = 1 − p₁² − p₂² − … − pₙ², where pᵢ = aᵢ / m
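Applying this formula to the two ball sets from the earlier slide (a minimal sketch):

```python
def gini(counts):
    """Gini impurity: 1 - sum p_i^2, with p_i = a_i / m."""
    m = sum(counts)
    return 1.0 - sum((a / m) ** 2 for a in counts)

print(gini([8, 2]))        # Set 1: 1 - 0.64 - 0.04 = 0.32
print(gini([4, 3, 2, 1]))  # Set 2: 1 - 0.16 - 0.09 - 0.04 - 0.01 = 0.70
```

As intended, the nearly-pure set 1 gets a low index and the diverse set 2 a high one.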
k-Nearest Neighbors
● Classification rule: for a test input x, find its k nearest training samples and assign x to the majority class among their labels.
● Example: with k = 3, if the labels of the 3 neighbors are 2×(+1) and 1×(−1), the classifier outputs +1.
Example in 2D
Similarly, μ̃₂ = vᵀμ₂
Fisher Linear Discriminant
How good is μ̃₁ − μ̃₂ as a measure of separation?
The larger |μ̃₁ − μ̃₂|, the better the expected separation.
(figure: two projections of the same class means μ₁, μ₂; with small within-class variance the projected means μ̃₁, μ̃₂ separate well, with large variance the projected classes overlap)
Fisher Linear Discriminant
We need to normalize µ~1 − µ~2 by a factor which is
proportional to variance
Given samples z₁, …, zₙ, the sample mean is μ_z = (1/n) Σᵢ₌₁ⁿ zᵢ

J(v) = (μ̃₁ − μ̃₂)² / (s̃₁² + s̃₂²)
S₁ = Σ_{xᵢ ∈ Class 1} (xᵢ − μ₁)(xᵢ − μ₁)ᵀ
S₂ = Σ_{xᵢ ∈ Class 2} (xᵢ − μ₂)(xᵢ − μ₂)ᵀ
Fisher Linear Discriminant Derivation
Now define the within the class scatter matrix
SW = S 1 + S 2
s̃₁² = Σ_{yᵢ ∈ Class 1} (vᵀxᵢ − vᵀμ₁)²
    = Σ_{yᵢ ∈ Class 1} (vᵀ(xᵢ − μ₁)) (vᵀ(xᵢ − μ₁))
    = Σ_{yᵢ ∈ Class 1} ((xᵢ − μ₁)ᵀv)ᵀ ((xᵢ − μ₁)ᵀv)
    = vᵀ [ Σ_{yᵢ ∈ Class 1} (xᵢ − μ₁)(xᵢ − μ₁)ᵀ ] v = vᵀS₁v
Fisher Linear Discriminant Derivation
Similarly, s̃₂² = vᵀS₂v
Therefore s̃₁² + s̃₂² = vᵀS₁v + vᵀS₂v = vᵀS_Wv
Define the between-class scatter matrix
S_B = (μ₁ − μ₂)(μ₁ − μ₂)ᵀ
Then
(μ̃₁ − μ̃₂)² = (vᵀμ₁ − vᵀμ₂)² = vᵀ(μ₁ − μ₂)(μ₁ − μ₂)ᵀ v = vᵀS_Bv
Fisher Linear Discriminant Derivation
Thus our objective function can be written:
J(v) = (μ̃₁ − μ̃₂)² / (s̃₁² + s̃₂²) = (vᵀS_Bv) / (vᵀS_Wv)
Maximize J(v) by taking the derivative w.r.t. v and
setting it to 0
d/dv J(v) = [ (d/dv vᵀS_Bv) · vᵀS_Wv − (d/dv vᵀS_Wv) · vᵀS_Bv ] / (vᵀS_Wv)²
          = [ (2S_Bv) vᵀS_Wv − (2S_Wv) vᵀS_Bv ] / (vᵀS_Wv)² = 0
Fisher Linear Discriminant Derivation
Need to solve  vᵀS_Wv (S_Bv) − vᵀS_Bv (S_Wv) = 0
Dividing by vᵀS_Wv:
S_Bv − (vᵀS_Bv / vᵀS_Wv) S_Wv = 0
Setting λ = vᵀS_Bv / vᵀS_Wv gives the generalized eigenvalue problem
S_Bv = λ S_Wv
Fisher Linear Discriminant Example
Data
Class 1 has 5 samples c1=[(1,2),(2,3),(3,3),(4,5),(5,5)]
Class 2 has 6 samples c2=[(1,0),(2,1),(3,1),(3,2),(5,3),(6,5)]
Arrange data in 2 separate matrices
       1 2              1 0
       2 3              2 1
c1 =   3 3        c2 =  3 1
       4 5              3 2
       5 5              5 3
                        6 5
Objective function: J(V) = det(VᵀS_BV) / det(VᵀS_WV)
The within-class scatter matrix S_W is
S_W = Σᵢ₌₁ᶜ Sᵢ = Σᵢ₌₁ᶜ Σ_{x_k ∈ class i} (x_k − μᵢ)(x_k − μᵢ)ᵀ
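Putting the derivation together on this example's data: for two classes, the generalized eigenproblem S_Bv = λS_Wv is solved (up to scale) by v = S_W⁻¹(μ₁ − μ₂), because S_Bv is always parallel to μ₁ − μ₂. A numpy sketch:

```python
import numpy as np

c1 = np.array([[1, 2], [2, 3], [3, 3], [4, 5], [5, 5]], dtype=float)
c2 = np.array([[1, 0], [2, 1], [3, 1], [3, 2], [5, 3], [6, 5]], dtype=float)

mu1, mu2 = c1.mean(axis=0), c2.mean(axis=0)
S1 = (c1 - mu1).T @ (c1 - mu1)   # within-class scatter of class 1
S2 = (c2 - mu2).T @ (c2 - mu2)   # within-class scatter of class 2
SW = S1 + S2

# Two-class closed form: v proportional to SW^{-1} (mu1 - mu2)
v = np.linalg.solve(SW, mu1 - mu2)

proj1, proj2 = c1 @ v, c2 @ v
print(proj1.min() > proj2.max())  # True: the classes separate on the projection
```

On this data the projected classes do not overlap at all, which is exactly what the Fisher criterion optimizes for.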
Tapas Kumar Mishra The Perceptron
Classifier: h(xi ) = sign(w⊤ xi + b)
b is the bias term (without the bias term, the hyperplane that
w defines would always have to go through the origin).
Dealing with b can be a pain, so we 'absorb' it into the weight
vector w by adding one additional constant dimension to the feature vector.
Under this convention,
xi becomes [xi; 1],   w becomes [w; b]
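A quick numerical check of the absorption trick (the toy numbers are made up):

```python
import numpy as np

w = np.array([2.0, -1.0])
b = 0.5
x = np.array([1.5, 3.0])

# Absorb the bias: append a constant 1 to x and append b to w.
x_aug = np.append(x, 1.0)
w_aug = np.append(w, b)

print(w @ x + b)      # 0.5
print(w_aug @ x_aug)  # 0.5 (identical)
```

The augmented inner product reproduces wᵀx + b exactly, so the hyperplane through the origin in d+1 dimensions encodes the original affine hyperplane.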
Perceptron Algorithm: obtaining w
Geometric Intuition
Perceptron Convergence
Perceptron Convergence: set up
Set up
Theorem
If all of the above holds, then the Perceptron algorithm makes at
most 1/γ 2 mistakes.
Perceptron Convergence: Proof of Theorem
(w + y x)⊤ w∗ = w⊤ w∗ + y (x⊤ w∗ ) ≥ w⊤ w∗ + γ
The inequality follows from the fact that, for w∗ , the distance from
the hyperplane defined by w∗ to x must be at least γ (i.e.
y (x⊤ w∗ ) = |x⊤ w∗ | ≥ γ).
This means that for each update
w⊤ w∗ grows by at least γ.
Perceptron Convergence: Proof of Theorem
Now we know that after M updates the following two inequalities
must hold:
1 w⊤ w∗ ≥ Mγ
2 w⊤ w ≤ M.
2 w⊤ w ≤ M.
2 w⊤ w ≤ M.
2 w⊤ w ≤ M.
2 w⊤ w ≤ M.
2 w⊤ w ≤ M.
2 w⊤ w ≤ M.
McCulloch & Pitts Neuron Model (1943)
Inputs x₁, …, xₘ with weights w₁, …, wₘ are combined into a weighted sum; a threshold is applied to produce a binary output signal.
Logical AND Gate (w₁ = 1, w₂ = 1, threshold t = 1.5)

x₁  x₂  Out
1   1   1
1   0   0
0   1   0
0   0   0
Logical OR Gate (w₁ = 1, w₂ = 1, threshold t = 0.5)

x₁  x₂  Out
1   1   1
1   0   1
0   1   1
0   0   0
Logical NOT Gate (w₁ = −1, threshold t = −0.5)

x₁  Out
1   0
0   1
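The gates above can all be reproduced with a single threshold-unit helper (a minimal sketch; the weights and thresholds are the ones from the slides, and the unit fires when the weighted sum reaches the threshold):

```python
def unit(weights, t):
    """McCulloch-Pitts unit: output 1 iff the weighted input sum reaches t."""
    return lambda *xs: int(sum(w * x for w, x in zip(weights, xs)) >= t)

AND = unit([1, 1], 1.5)
OR  = unit([1, 1], 0.5)
NOT = unit([-1], -0.5)

print([AND(a, b) for a in (0, 1) for b in (0, 1)])  # [0, 0, 0, 1]
print([OR(a, b)  for a in (0, 1) for b in (0, 1)])  # [0, 1, 1, 1]
print([NOT(a) for a in (0, 1)])                     # [1, 0]
```

No single choice of weights and threshold reproduces XOR, which is the classic limitation of one such unit.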
1
1
0
0
x1
1
0
1
0
x2
0
1
1
0
Out
Logical XOR Gate: inputs x1 and x2 feed the weighted-sum unit with weights w1 = ?, w2 = ? and threshold t = ? (Take-home exercise)
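The take-home exercise can be checked numerically. A minimal sketch (the helper function, weight grid, and step-activation convention are my own choices, not from the slides): the NOT gate weights from the slide work, while an exhaustive search over a grid of weights and thresholds finds nothing that realizes XOR.

```python
# A single threshold unit computes Out = 1 if w1*x1 + w2*x2 >= t else 0.
# It can realize NOT (w1 = -1, t = -0.5), but no (w1, w2, t) realizes XOR.

def threshold_unit(weights, t, inputs):
    """Fire (output 1) iff the weighted sum reaches the threshold t."""
    return 1 if sum(w * x for w, x in zip(weights, inputs)) >= t else 0

# NOT gate from the slide: w1 = -1, t = -0.5
assert [threshold_unit([-1], -0.5, [x]) for x in (0, 1)] == [1, 0]

# Exhaustive grid search for XOR: no linear threshold unit matches it.
xor_table = {(1, 1): 0, (1, 0): 1, (0, 1): 1, (0, 0): 0}
grid = [i / 2 for i in range(-8, 9)]  # weights and threshold in [-4, 4]

solutions = [
    (w1, w2, t)
    for w1 in grid for w2 in grid for t in grid
    if all(threshold_unit([w1, w2], t, k) == v for k, v in xor_table.items())
]
print(solutions)  # -> [] : XOR is not linearly separable
```

The empty result is not an artifact of the grid: XOR is provably not computable by any single linear threshold unit, which is the point of the Minsky–Papert observation later in these notes.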
SVM Example
Dan Ventura
March 12, 2009
Abstract
We give a simple example that demonstrates a linear SVM and then extend the example to a simple non-linear case to illustrate the use of mapping functions and kernels.
1 Introduction
Many learning models make use of the idea that any learning problem can be
made easy with the right set of features. The trick, of course, is discovering that
“right set of features”, which in general is a very difficult thing to do. SVMs are
another attempt at a model that does this. The idea behind SVMs is to make use
of a (nonlinear) mapping function Φ that transforms data in input space to data
in feature space in such a way as to render a problem linearly separable. The
SVM then automatically discovers the optimal separating hyperplane (which,
when mapped back into input space via Φ−1 , can be a complex decision surface).
SVMs are rather interesting in that they enjoy both a sound theoretical basis
as well as state-of-the-art success in real-world applications.
To illustrate the basic ideas, we will begin with a linear SVM (that is, a
model that assumes the data is linearly separable). We will then expand the
example to the nonlinear case to demonstrate the role of the mapping function
Φ, and finally we will explain the idea of a kernel and how it allows SVMs to
make use of high-dimensional feature spaces while remaining tractable.
and the following negatively labeled data points in R^2 (see Figure 1):

(1, 0), (0, 1), (0, -1), (-1, 0)
Figure 1: Sample data points in R^2. Blue diamonds are positive examples and red squares are negative examples.
In what follows we will use vectors augmented with a 1 as a bias input, and for clarity we will differentiate these with an over-tilde. So, if s1 = (1, 0), then s̃1 = (1, 0, 1). Figure 3 shows the SVM architecture, and our task is to find values for the αi such that
Figure 2: The three support vectors are marked as yellow circles.
2α1 + 4α2 + 4α3 = −1
4α1 + 11α2 + 9α3 = +1
4α1 + 9α2 + 11α3 = +1
w̃ = Σ_i αi s̃i
   = -3.5 (1, 0, 1) + 0.75 (3, 1, 1) + 0.75 (3, -1, 1)
   = (1, 0, -2)

Finally, remembering that our vectors are augmented with a bias, we can equate the last entry in w̃ as the hyperplane offset b and write the separating hyperplane equation y = wx + b with w = (1, 0) and b = -2. Plotting the line gives the expected decision surface (see Figure 4).
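The three constraint equations above can be solved mechanically. A small NumPy sketch (my own verification, using the augmented support vectors stated in the text) reproduces α = (-3.5, 0.75, 0.75) and w̃ = (1, 0, -2):

```python
import numpy as np

# Verify the worked linear-SVM example: solve the 3x3 system for the
# alphas and rebuild the augmented weight vector w~ = sum_i alpha_i s~_i.
s = np.array([[1.0, 0.0, 1.0],    # negative support vector, target -1
              [3.0, 1.0, 1.0],    # positive support vector, target +1
              [3.0, -1.0, 1.0]])  # positive support vector, target +1
targets = np.array([-1.0, 1.0, 1.0])

# The constraints sum_j alpha_j (s~_j . s~_i) = y_i give the Gram system.
G = s @ s.T                       # [[2,4,4],[4,11,9],[4,9,11]]
alphas = np.linalg.solve(G, targets)
print(alphas)                     # -> [-3.5   0.75  0.75]

w_tilde = alphas @ s              # sum_i alpha_i s~_i
print(w_tilde)                    # -> [ 1.  0. -2.]  i.e. w = (1, 0), b = -2
```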
and the following negatively labeled data points in R^2 (see Figure 5):

(1, 1), (1, -1), (-1, -1), (-1, 1)
Figure 4: The discriminating hyperplane corresponding to the values α1 =
−3.5, α2 = 0.75 and α3 = 0.75.
Figure 5: Nonlinearly separable sample data points in R^2. Blue diamonds are positive examples and red squares are negative examples.
Figure 6: The data represented in feature space.
Referring back to Figure 3, we can see how Φ transforms our data before
the dot products are performed. Therefore, we can rewrite the data in feature
space as
(2, 2), (6, 2), (6, 6), (2, 6)

for the positive examples and

(1, 1), (1, -1), (-1, -1), (-1, 1)

for the negative examples (see Figure 6). Now we can once again easily identify the support vectors (see Figure 7):

s1 = (1, 1), s2 = (2, 2)
We again use vectors augmented with a 1 as a bias input and will differentiate
them as before. Now given the [augmented] support vectors, we must again find
values for the αi . This time our constraints are
Figure 7: The two support vectors (in feature space) are marked as yellow
circles.
3α1 + 5α2 = −1
5α1 + 9α2 = +1
Figure 8: The discriminating hyperplane corresponding to the values α1 = −7
and α2 = 4
w̃ = -7 (1, 1, 1) + 4 (2, 2, 1)
   = (1, 1, -3)

giving us the separating hyperplane equation y = wx + b with w = (1, 1) and b = -3. Plotting the line gives the expected decision surface (see Figure 8).
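The same mechanical check works for the 2x2 system of the non-linear example, now in feature space (again my own verification, using the augmented support vectors from the text):

```python
import numpy as np

# Verify the non-linear example: two augmented support vectors (in
# feature space) and the 2x2 Gram system from the text.
s = np.array([[1.0, 1.0, 1.0],    # negative support vector, target -1
              [2.0, 2.0, 1.0]])   # positive support vector, target +1
targets = np.array([-1.0, 1.0])

G = s @ s.T                       # [[3, 5], [5, 9]]
alphas = np.linalg.solve(G, targets)
print(alphas)                     # -> [-7.  4.]

w_tilde = alphas @ s
print(w_tilde)                    # -> [ 1.  1. -3.]  i.e. w = (1, 1), b = -3
```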
where σ(z) returns the sign of z. For example, if we wanted to classify the point x = (4, 5) using the mapping function of Eq. 1,

f((4, 5)) = σ( -7 Φ1(1, 1, 1) · Φ1(4, 5, 1) + 4 Φ1(2, 2, 1) · Φ1(4, 5, 1) )
          = σ( -7 (1, 1, 1) · (0, 1, 1) + 4 (2, 2, 1) · (0, 1, 1) )
          = σ(-2)
Figure 9: The decision surface in input space corresponding to Φ1 . Note the
singularity.
and thus we would classify x = (4, 5) as negative. Looking again at the input
space, we might be tempted to think this is not a reasonable classification; how-
ever, it is what our model says, and our model is consistent with all the training
data. As always, there are no guarantees on generalization accuracy, and if we
are not happy about our generalization, the likely culprit is our choice of Φ.
Indeed, if we map our discriminating hyperplane (which lives in feature space)
back into input space, we can see the effective decision surface of our model
(see Figure 9). Of course, we may or may not be able to improve generalization
accuracy by choosing a different Φ; however, there is another reason to revisit
our choice of mapping function.
Figure 10: The decision surface in input space corresponding to Φ2.
(2, 2, 1), (2, -2, 1), (-2, -2, 1), (-2, 2, 1)

for the negative examples. With a little thought, we realize that in this case, all 8 of the examples will be support vectors, with αi = 1/46 for the positive support vectors and αi = -7/46 for the negative ones. Note that a consequence of this mapping is that we do not need to use augmented vectors (though it wouldn't hurt to do so) because the hyperplane in feature space goes through the origin, y = wx + b, where w = (0, 0, 1) and b = 0. Therefore, the discriminating feature is x3, and Eq. 2 reduces to f(x) = σ(x3).
Figure 10 shows the decision surface induced in the input space for this new
mapping function.
Kernel trick.
5 Conclusion
What kernel to use? Slack variables. Theory. Generalization. Dual problem.
QP.
Support vector Machines
Tapas Kumar Mishra

Intro
Figure 1: (Left:) Two different separating hyperplanes for the
same data set. (Right:) The maximum margin hyperplane. The margin,
γ, is the distance from the hyperplane (solid line) to the closest points in
either class (which touch the parallel dotted lines).
Typically, if a data set is linearly separable, there are infinitely
many separating hyperplanes. A natural question to ask is:
Question: What is the best separating hyperplane?
SVM Answer: The one that maximizes the distance to the closest
data points from both classes. We say it is the hyperplane with
maximum margin.
Margin and Distance
Consider some point x. Let d be the vector from H to x of minimum length. Let xP be the projection of x onto H. It follows then that: xP = x - d.

d is parallel to w, so d = αw for some α ∈ R.
xP ∈ H, which implies w^T xP + b = 0.
Therefore w^T xP + b = w^T (x - d) + b = w^T (x - αw) + b = 0,
which implies α = (w^T x + b) / (w^T w).

The length of d:
||d||_2 = sqrt(d^T d) = sqrt(α^2 w^T w) = |w^T x + b| / sqrt(w^T w) = |w^T x + b| / ||w||_2

Margin of H with respect to D:

γ(w, b) = min_{x ∈ D} |w^T x + b| / ||w||_2
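The margin formula is easy to evaluate numerically. A short sketch (the hyperplane and the data points below are illustrative choices, not from the slides):

```python
import numpy as np

# gamma(w, b) = min_x |w^T x + b| / ||w||_2, evaluated on a toy data set.
def margin(w, b, X):
    """Distance from the hyperplane {x : w^T x + b = 0} to the closest row of X."""
    return np.min(np.abs(X @ w + b)) / np.linalg.norm(w)

w = np.array([1.0, 0.0])
b = -2.0
X = np.array([[1.0, 0.0], [3.0, 1.0], [3.0, -1.0], [6.0, 1.0]])
print(margin(w, b, X))  # -> 1.0 (the closest points sit at distance 1)
```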
Note that if the hyperplane is such that γ is maximized, it must lie
right in the middle of the two classes. In other words, γ must be
the distance to the closest point within both classes.
(If not, you could move the hyperplane towards data points of the
class that is further away and increase γ, which contradicts that γ
is maximized.)
Max Margin Classifier
The new optimization problem becomes:

min_{w,b} w^T w    (3)

s.t.  ∀i, y_i (w^T x_i + b) ≥ 0    (2)
      min_i |w^T x_i + b| = 1
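The normalization min_i |w^T x_i + b| = 1 costs no generality: any separating (w, b) can be divided by its smallest activation without moving the hyperplane, after which the margin is simply 1/||w||_2. A numerical sketch of that rescaling argument (the data and initial hyperplane are made up for illustration):

```python
import numpy as np

X = np.array([[1.0, 0.0], [3.0, 1.0], [3.0, -1.0], [0.0, 1.0]])
w, b = np.array([2.0, 0.0]), -4.0           # some separating hyperplane

c = np.min(np.abs(X @ w + b))               # smallest activation magnitude
w_scaled, b_scaled = w / c, b / c           # same hyperplane, rescaled

# The constraint min_i |w^T x_i + b| = 1 now holds...
assert np.isclose(np.min(np.abs(X @ w_scaled + b_scaled)), 1.0)
# ...and the margin is just 1 / ||w||_2:
print(1.0 / np.linalg.norm(w_scaled))       # -> 1.0
```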
min_{w,b} w^T w    (5)
s.t.  ∀i, y_i (w^T x_i + b) ≥ 1
Support Vectors

For the optimal w, b pair, some training points will have tight constraints, i.e.

y_i (w^T x_i + b) = 1.

(This must be the case, because if for all training points we had a strict > inequality, it would be possible to scale down both parameters w, b until the constraints are tight and obtain an even lower objective value.)

We refer to these training points as support vectors.

Support vectors are special because they are the training points that define the maximum margin of the hyperplane to the data set, and they therefore determine the shape of the hyperplane. If you were to move one of them and retrain the SVM, the resulting hyperplane would change.
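The tightness condition can be checked on the worked linear example from the SVM notes earlier (w = (1, 0), b = -2, support vectors (1,0), (3,1), (3,-1)). A small sketch, with labels taken from that example:

```python
import numpy as np

# Support vectors satisfy y_i (w^T x_i + b) = 1 exactly at the optimum.
w, b = np.array([1.0, 0.0]), -2.0
support = np.array([[1.0, 0.0], [3.0, 1.0], [3.0, -1.0]])
y = np.array([-1.0, 1.0, 1.0])

activations = y * (support @ w + b)
print(activations)  # -> [1. 1. 1.] : all constraints are tight
```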
SVM with soft constraints

min_{w,b} w^T w + C Σ_{i=1}^{n} ξ_i
s.t.  ∀i, y_i (w^T x_i + b) ≥ 1 - ξ_i
      ∀i, ξ_i ≥ 0
soft SVM: Unconstrained Formulation:

For fixed w, b, the optimal slack variables have the closed form ξ_i = max[1 - y_i (w^T x_i + b), 0].
If we plug this closed form into the objective of our SVM optimization problem, we obtain the following unconstrained version as loss function and regularizer:

min_{w,b}  w^T w  +  C Σ_{i=1}^{n} max[1 - y_i (w^T x_i + b), 0]

where the first term is the l2-regularizer and the summand is the hinge-loss.
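The equivalence between the slack formulation and the hinge-loss objective is direct to sketch numerically (the toy data and parameter values below are my own illustrations, not from the slides):

```python
import numpy as np

# Optimal slack is xi_i = max(0, 1 - y_i (w^T x_i + b)), so the soft-margin
# objective equals "l2-regularizer + C * total hinge loss".
def hinge_objective(w, b, X, y, C):
    slack = np.maximum(0.0, 1.0 - y * (X @ w + b))  # per-point hinge loss
    return w @ w + C * slack.sum()

X = np.array([[1.0, 2.0], [2.0, -1.0], [-1.0, -1.0], [-2.0, 1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
w, b = np.array([0.5, 0.25]), 0.1

print(hinge_objective(w, b, X, y, C=1.0))  # regularizer 0.3125 + hinge 0.85
```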
Kernels
Linear classifiers are great, but what if there exists no linear
decision boundary? As it turns out, there is an elegant way to
incorporate non-linearities into most linear classifiers.
Tapas Kumar Mishra Kernels
Handcrafted Feature Expansion
Consider the following example: x = (x1, x2, ..., xd)^T, and define

ϕ(x) = (1, x1, ..., xd, x1x2, ..., xd-1 xd, ..., x1x2...xd)^T,

with one entry for the product of each subset of the coordinates of x.

Quiz: What is the dimensionality of ϕ(x)?

Solution: ϕ(x) contains (d choose 0) zero-degree terms, (d choose 1) first-degree terms, and so on up to (d choose d), so its dimensionality is Σ_{k=0}^{d} (d choose k) = 2^d.
This new representation, ϕ(x), is very expressive and allows for complicated non-linear decision boundaries, but the dimensionality is extremely high. This makes our algorithm unbearably (and quickly prohibitively) slow.
The Kernel Trick: Gradient Descent with Squared Loss
We will now show that we can express w as a linear combination of all input vectors,

w = Σ_{i=1}^{n} α_i x_i.    (3)

Since we initialize w^0 = 0, for this initial choice of w^0 the linear combination in w = Σ_i α_i x_i is trivially α_1 = ... = α_n = 0. We now show that throughout the entire gradient descent optimization such coefficients α_1, ..., α_n must always exist, as we can re-write the gradient updates entirely in terms of updating the α_i coefficients:
The Kernel Trick: Gradient Descent with Squared Loss

w^1 = w^0 - s Σ_{i=1}^{n} 2((w^0)^T x_i - y_i) x_i = Σ_{i=1}^{n} α_i^0 x_i - s Σ_{i=1}^{n} γ_i^0 x_i = Σ_{i=1}^{n} α_i^1 x_i
      (with α_i^1 = α_i^0 - s γ_i^0)

w^2 = w^1 - s Σ_{i=1}^{n} 2((w^1)^T x_i - y_i) x_i = Σ_{i=1}^{n} α_i^1 x_i - s Σ_{i=1}^{n} γ_i^1 x_i = Σ_{i=1}^{n} α_i^2 x_i
      (with α_i^2 = α_i^1 - s γ_i^1)

w^3 = w^2 - s Σ_{i=1}^{n} 2((w^2)^T x_i - y_i) x_i = Σ_{i=1}^{n} α_i^2 x_i - s Σ_{i=1}^{n} γ_i^2 x_i = Σ_{i=1}^{n} α_i^3 x_i
      (with α_i^3 = α_i^2 - s γ_i^2)

...    (4)

where γ_i^t = 2((w^t)^T x_i - y_i).
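The α-update bookkeeping above can be verified by running gradient descent twice, once in the primal on w and once purely on the coefficients. A sketch with randomly generated data (the step size and iteration count are arbitrary choices):

```python
import numpy as np

# Gradient descent on the squared loss: updating only alpha via
# gamma_i = 2 (w^T x_i - y_i) reproduces the primal iterate exactly.
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))   # 5 points, 3 features (rows are x_i^T)
y = rng.normal(size=5)
s = 0.01                      # step size

w = np.zeros(3)               # primal iterate, w^0 = 0
alpha = np.zeros(5)           # dual coefficients, w = sum_i alpha_i x_i

for _ in range(100):
    gamma = 2.0 * (X @ (alpha @ X) - y)    # gamma_i = 2 (w^T x_i - y_i)
    alpha -= s * gamma                     # alpha_i <- alpha_i - s gamma_i
    w -= s * (2.0 * (X @ w - y)) @ X       # plain primal update

print(np.allclose(w, alpha @ X))  # -> True
```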
The Kernel Trick: Gradient Descent with Squared Loss

Consequently, we can also re-write the squared loss ℓ(w) = Σ_{i=1}^{n} (w^T x_i - y_i)^2 entirely in terms of inner products between training inputs:

ℓ(α) = Σ_{i=1}^{n} ( Σ_{j=1}^{n} α_j x_j^T x_i - y_i )^2    (6)
The Kernel Trick: Inner-Product Computation

Let's go back to the previous example, ϕ(x) = (1, x1, ..., xd, x1x2, ..., xd-1 xd, ..., x1x2...xd)^T.

The inner product ϕ(x)^T ϕ(z) can be formulated as:

ϕ(x)^T ϕ(z) = 1·1 + x1 z1 + x2 z2 + ... + x1 x2 z1 z2 + ... + x1 ... xd z1 ... zd
            = Π_{k=1}^{d} (1 + x_k z_k).

We can compute the inner product from the above formula in time O(d) instead of O(2^d).
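Both the 2^d dimensionality and the identity ϕ(x)^T ϕ(z) = Π_k (1 + x_k z_k) are easy to confirm for small d. A sketch (the explicit subset-product construction of ϕ below is my own implementation of the map described above):

```python
import numpy as np
from itertools import combinations

# phi(x): one entry per subset of the coordinates (the empty subset gives 1).
def phi(x):
    d = len(x)
    feats = []
    for r in range(d + 1):
        for idx in combinations(range(d), r):
            feats.append(np.prod([x[i] for i in idx]))  # empty product = 1
    return np.array(feats)

x = np.array([1.0, 2.0, -1.0, 0.5])
z = np.array([0.5, -1.0, 2.0, 3.0])

explicit = phi(x) @ phi(z)           # O(2^d) work in feature space
fast = np.prod(1.0 + x * z)          # the product formula, O(d)

print(len(phi(x)))                   # -> 16, i.e. 2^4
print(np.isclose(explicit, fast))    # -> True
```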
The Kernel Trick: General Kernels
Exponential Kernel: K(x, z) = e^{-||x-z|| / (2σ^2)}
Laplacian Kernel: K(x, z) = e^{-|x-z| / σ}
Sigmoid Kernel: K(x, z) = tanh(a x^T z + c)
The Kernel Trick: Kernel functions
Kernels
Well-defined kernels
k(x, z) = x⊤z
k(x, z) = c k1(x, z)
k(x, z) = k1(x, z) + k2(x, z)
k(x, z) = g(k1(x, z))
k(x, z) = k1(x, z) k2(x, z)
k(x, z) = f(x) k1(x, z) f(z)
k(x, z) = e^(k1(x,z))
k(x, z) = x⊤Az
where k1, k2 are well-defined kernels, c ≥ 0, g is a polynomial
function with positive coefficients, f is any function, and A ⪰ 0 is
positive semi-definite.
Theorem
The RBF kernel k(x, z) = e^(−(x−z)²/σ²) is a well-defined kernel.
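A quick numerical sanity check of this theorem (using the vector form of the RBF kernel; the setup below is illustrative): a well-defined kernel must produce a symmetric positive semi-definite kernel matrix on any set of points, i.e. no negative eigenvalues beyond floating-point noise.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))   # 20 arbitrary points in R^3
sigma = 1.5

# RBF kernel matrix: K_ij = exp(-||x_i - x_j||^2 / sigma^2)
sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
K = np.exp(-sq / sigma**2)

# PSD check: all eigenvalues of the symmetric matrix K are (numerically) >= 0
eigvals = np.linalg.eigvalsh(K)
print(eigvals.min() >= -1e-10)
```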
Theorem
The following kernel is defined on any two sets S1, S2 ⊆ Ω.
List out all possible samples of Ω and arrange them into a sorted list.
We define a vector xS ∈ {0, 1}^|Ω|, where each of its elements
indicates whether the corresponding sample is included in the set S.
It is easy to prove that
k(S1, S2) = e^(x_{S1}⊤ x_{S2}).
Kernel Machines
Kernelized Linear Regression
Kernelization
Similarly, during testing a test point is only accessed through
inner-products with training inputs:
h(z) = w⊤z = ∑_{i=1}^{n} αi xi⊤z.   (4)
Theorem
Kernelized ordinary least squares has the solution α = K^(−1) y.
Proof:
Xα = w = (XX⊤)^(−1) X y              | multiply from the left by X⊤XX⊤
(X⊤X)(X⊤X) α = X⊤(XX⊤(XX⊤)^(−1)) X y | substitute K = X⊤X
K² α = K y                           | multiply from the left by (K^(−1))²
α = K^(−1) y
Kernel regression can be extended to the kernelized version of ridge
regression. The solution then becomes
α = (K + τ²I)^(−1) y.   (5)
Testing
h(z) = k∗⊤ (K + τ²I)^(−1) y,
where k∗ is the kernel vector of the test point with the training
points, i.e. its i-th dimension is [k∗]i = φ(z)⊤φ(xi),
the inner product between the test point z and the training point
xi after the mapping into feature space through φ.
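Equations (5) and the testing formula above fit in a few lines of numpy. A minimal sketch of kernelized ridge regression (the RBF bandwidth, data, and τ below are illustrative choices): train by solving α = (K + τ²I)⁻¹y, predict via h(z) = k∗⊤α.

```python
import numpy as np

def rbf(A, B, sigma=1.0):
    """RBF kernel matrix between rows of A and rows of B."""
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq / sigma**2)

rng = np.random.default_rng(1)
X = rng.uniform(-3, 3, size=(40, 1))   # training inputs
y = np.sin(X[:, 0])                    # training targets
tau = 0.1

K = rbf(X, X)                                              # kernel matrix on training data
alpha = np.linalg.solve(K + tau**2 * np.eye(len(X)), y)    # alpha = (K + tau^2 I)^{-1} y

def h(Z):
    """h(z) = k_*^T alpha, where [k_*]_i = k(z, x_i)."""
    return rbf(Z, X) @ alpha

print(h(np.array([[0.5], [1.0]])))   # roughly sin(0.5) and sin(1.0)
```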
Neural Networks
The Infamous XOR problem
• In 1969, Minsky and Papert showed that Perceptrons cannot solve
the XOR problem.
• We know that the problem is not linearly separable: therefore, no
linear classifier like the Perceptron can solve the problem.
• It is crucial to choose the correct form of the basis functions (like
AND(x1, ¬x2) and AND(¬x1, x2) for XOR) so that the problem becomes
easy to solve in the projected space.
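A small sketch of this projection: with the two AND-style basis features, the XOR truth table becomes linearly separable (a single weighted sum with a threshold solves it).

```python
import numpy as np

# XOR truth table
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([0, 1, 1, 0])

# Basis functions: h1 = AND(x1, NOT x2), h2 = AND(NOT x1, x2)
H = np.column_stack([X[:, 0] * (1 - X[:, 1]),
                     (1 - X[:, 0]) * X[:, 1]])

# In the projected space XOR is linear: XOR = h1 + h2
w = np.array([1, 1])
pred = (H @ w > 0.5).astype(int)
print(pred)   # [0 1 1 0]
```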
What is the basis function doing
e.g. x = -2, y = 5, z = -4
1. Forward pass: Compute outputs
2. Backward pass: Compute gradients using the Chain Rule
Want: the gradient of the output with respect to each input
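The values x = −2, y = 5, z = −4 match the classic computational-graph example f(x, y, z) = (x + y) · z; assuming that graph, the forward and backward passes look like this (each backward line is local gradient × upstream gradient):

```python
# Forward pass
x, y, z = -2.0, 5.0, -4.0
q = x + y          # q = 3
f = q * z          # f = -12

# Backward pass (chain rule): downstream = local * upstream
df_df = 1.0                # base case at the output
df_dq = z * df_df          # multiply node: local grad w.r.t. q is z -> -4
df_dz = q * df_df          # multiply node: local grad w.r.t. z is q ->  3
df_dx = 1.0 * df_dq        # add node: local grad is 1               -> -4
df_dy = 1.0 * df_dq        # add node: local grad is 1               -> -4
print(f, df_dx, df_dy, df_dz)   # -12.0 -4.0 -4.0 3.0
```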
Base Case: the upstream gradient at the output is 1.
At every node: Downstream Gradient = Local Gradient × Upstream Gradient
Sigmoid
• Computational graph is not unique: we can use primitives that have
simple local gradients
• Sigmoid local gradient: σ′(x) = σ(x)(1 − σ(x))
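The sigmoid's closed-form local gradient can be verified numerically; a quick check against a centered finite difference:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# Local gradient of the sigmoid primitive: sigma'(x) = sigma(x) * (1 - sigma(x))
x = 0.7
s = sigmoid(x)
analytic = s * (1 - s)

# Centered finite-difference approximation of the derivative
eps = 1e-6
numeric = (sigmoid(x + eps) - sigmoid(x - eps)) / (2 * eps)
print(abs(analytic - numeric) < 1e-8)   # True
```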
Backward pass: compute grads node by node — base case at the output, then each sigmoid, add, and multiply node applies its own local gradient.
Intra-cluster distances are minimized; inter-cluster distances are maximized.
Cluster 4 (stock movements): Louisiana-Land-UP, Phillips-Petro-UP, Unocal-UP, Schlumberger-UP → Oil-UP
Summarization
– Reduce the size of large data sets (e.g., clustering precipitation in Australia)
– Partitional Clustering
◆ A division of data objects into non-overlapping subsets (clusters)
– Hierarchical clustering
◆ A set of nested clusters organized as a hierarchical tree
(Figures: dendrogram and nested-cluster views over points p1–p4.)
Well-separated clusters
Prototype-based clusters
Contiguity-based clusters
Density-based clusters
Well-Separated Clusters:
– A cluster is a set of points such that any point in a cluster is
closer (or more similar) to every other point in the cluster than
to any point not in the cluster.
3 well-separated clusters
Prototype-based
– A cluster is a set of objects such that an object in a cluster is
closer (more similar) to the prototype or “center” of a cluster,
than to the center of any other cluster
– The center of a cluster is often a centroid, the average of all
the points in the cluster, or a medoid, the most “representative”
point of a cluster
4 center-based clusters
8 contiguous clusters
Density-based
– A cluster is a dense region of points, separated from other
high-density regions by low-density regions.
– Used when the clusters are irregular or intertwined, and when
noise and outliers are present.
6 density-based clusters
Hierarchical clustering
Density-based clustering
(Figures: K-means iterations. Panels plot y vs. x for the original points and for successive iterations, showing the centroids updating at each step.)
Depending on the choice of initial centroids, B and C may get merged or remain separate.
(Figure: starting with two initial centroids in one cluster of each pair of clusters.)
3/24/2021 Introduction to Data Mining, 2nd Edition 31
Tan, Steinbach, Karpatne, Kumar
10 Clusters Example
(Figures: Iterations 1–4 of K-means on the 10-cluster data, plotted as y vs. x.)
Starting with two initial centroids in one cluster of each pair of clusters
3/24/2021 Introduction to Data Mining, 2nd Edition 32
Tan, Steinbach, Karpatne, Kumar
10 Clusters Example
Starting with some pairs of clusters having three initial centroids, while others
have only one.
(Figures: Iterations 1–4 of K-means under this uneven initial centroid placement, plotted as y vs. x.)
Multiple runs
– Helps, but probability is not on your side
Use some strategy to select the k initial centroids
and then select among these initial centroids
– Select most widely separated
◆K-means++ is a robust way of doing this selection
– Use hierarchical clustering to determine initial
centroids
Bisecting K-means
– Not as susceptible to initialization issues
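The K-means++ selection strategy mentioned above can be sketched in a few lines (the data and function name below are illustrative): pick the first centroid uniformly at random, then sample each subsequent centroid with probability proportional to its squared distance from the nearest centroid chosen so far, which favors widely separated seeds.

```python
import numpy as np

def kmeanspp_init(X, k, rng):
    """K-means++ seeding: first centroid uniform at random, then each new
    centroid sampled with probability proportional to the squared distance
    to its nearest already-chosen centroid."""
    centroids = [X[rng.integers(len(X))]]
    for _ in range(k - 1):
        d2 = np.min([((X - c) ** 2).sum(axis=1) for c in centroids], axis=0)
        probs = d2 / d2.sum()
        centroids.append(X[rng.choice(len(X), p=probs)])
    return np.array(centroids)

rng = np.random.default_rng(0)
# Three tight blobs of 30 points each
X = np.vstack([rng.normal(m, 0.1, size=(30, 2)) for m in [(0, 0), (5, 5), (0, 5)]])
C = kmeanspp_init(X, 3, rng)
print(C.shape)   # (3, 2)
```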
Bisecting K-means
CLUTO: http://glaros.dtc.umn.edu/gkhome/cluto/cluto/overview
One solution is to find a large number of clusters such that each of them represents a part of a
natural cluster. But these small clusters need to be put together in a post-processing step.
(Figure: hierarchical clustering of six points and the corresponding dendrogram, leaf order 1 3 2 5 4 6.)
– Divisive:
◆ Start with one, all-inclusive cluster
◆ At each step, split a cluster until each cluster contains an individual
point (or there are k clusters)
(Figure: starting situation — individual points p1…p12 and their proximity matrix.)
(Figure: intermediate situation — clusters C1…C5 and the corresponding proximity matrix.)
We want to merge the two closest clusters (C2 and C5) and update the proximity matrix.
(Figure: after merging C2 and C5, the row and column for the new cluster C2 ∪ C5 in the proximity matrix are marked "?" — their proximities to C1, C3, and C4 must be recomputed.)
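The merge-and-update loop can be sketched naively in Python (this version uses MIN/single-link proximity; the function name and data are illustrative, and a real implementation would maintain the proximity matrix incrementally instead of recomputing it):

```python
import numpy as np

def single_link(points, num_clusters):
    """Naive agglomerative clustering: repeatedly merge the two closest
    clusters, where cluster proximity is the minimum pairwise distance
    between their members (MIN / single link)."""
    clusters = [[i] for i in range(len(points))]
    while len(clusters) > num_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = min(np.linalg.norm(points[i] - points[j])
                        for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a].extend(clusters[b])   # merge b into a
        del clusters[b]
    return clusters

pts = np.array([[0, 0], [0, 1], [5, 5], [5, 6]])
print(sorted(map(sorted, single_link(pts, 2))))   # [[0, 1], [2, 3]]
```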
p3
p4
p5
MIN
.
MAX
.
Group Average .
Proximity Matrix
Distance Between Centroids
Other methods driven by an objective
function
– Ward’s Method uses squared error
(Figure: MIN / single link — nested clusters and dendrogram over six points.)
(Figure: original points clustered by MIN into two and three clusters.)
• Limitation of MIN: sensitive to noise
Distance Matrix:
(Figure: MAX / complete link — nested clusters and dendrogram over the same six points.)
proximity(Cluster_i, Cluster_j) = ∑_{p_i∈Cluster_i, p_j∈Cluster_j} proximity(p_i, p_j) / (|Cluster_i| · |Cluster_j|)
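A direct sketch of this group-average formula (cluster contents below are illustrative), using Euclidean distance as the pairwise proximity:

```python
import numpy as np

def group_average(ci, cj):
    """proximity(Ci, Cj) = sum of all pairwise distances / (|Ci| * |Cj|)."""
    total = sum(np.linalg.norm(p - q) for p in ci for q in cj)
    return total / (len(ci) * len(cj))

ci = [np.array([0.0, 0.0]), np.array([0.0, 2.0])]
cj = [np.array([4.0, 0.0])]
# Pairwise distances are 4 and sqrt(20); their average is (4 + sqrt(20)) / 2
print(group_average(ci, cj))
```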
Distance Matrix:
(Figure: Group Average — nested clusters and dendrogram over the same six points.)
Strengths
– Less susceptible to noise
Limitations
– Biased towards globular clusters
(Figure: side-by-side comparison of MIN, MAX, Group Average, and Ward's Method on the same six points.)
(Figures: DBSCAN illustration with MinPts = 7, and results on the original points at (MinPts=4, Eps=9.92) and (MinPts=4, Eps=9.75).)
DBSCAN does not work well with:
• Varying densities
• High-dimensional data
DBSCAN: Determining EPS and MinPts
(Figure residue: y-vs-x scatter panels comparing K-means and Complete Link clusterings on sample data.)
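For determining Eps and MinPts, the usual heuristic is to sort every point's distance to its k-th nearest neighbor and look for the "knee" of the resulting curve; a sketch (data and function name are illustrative):

```python
import numpy as np

def k_dist(X, k):
    """Each point's distance to its k-th nearest neighbor, sorted increasingly.
    The sharp rise ('knee') of this curve suggests a good Eps for MinPts = k."""
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    kth = np.sort(d, axis=1)[:, k]   # column 0 is the point's distance to itself
    return np.sort(kth)

rng = np.random.default_rng(0)
# Two dense blobs: most k-dist values are small, noise-free knee is gentle
X = np.vstack([rng.normal(0, 0.2, (50, 2)), rng.normal(3, 0.2, (50, 2))])
curve = k_dist(X, 4)
print(curve[0] <= curve[-1])   # True: the sorted curve is nondecreasing
```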
Example: SSE and SSB
– SSB + SSE = constant (the total sum of squares)
(Figure: five points 1–5 on a line with overall mean m and cluster means m1, m2; cohesion is measured within clusters, separation between the cluster means and m.)
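The constant-sum identity is easy to verify on a small example; here, the five points and two-cluster split mirror the figure (the exact grouping is an assumption for illustration):

```python
import numpy as np

X = np.array([1.0, 2.0, 3.0, 4.0, 5.0])   # five points on a line
labels = np.array([0, 0, 0, 1, 1])        # two clusters: {1,2,3} and {4,5}

m = X.mean()   # overall mean

# Cohesion: SSE, squared error of points around their own cluster mean
sse = sum(((X[labels == c] - X[labels == c].mean()) ** 2).sum() for c in (0, 1))
# Separation: SSB, size-weighted squared distance of cluster means from m
ssb = sum((labels == c).sum() * (X[labels == c].mean() - m) ** 2 for c in (0, 1))
# Total sum of squares around the overall mean
tss = ((X - m) ** 2).sum()

print(np.isclose(sse + ssb, tss))   # True: SSB + SSE = constant (= TSS)
```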