ML Merged Endsem
STRONGLY RECOMMENDED:
• Linear algebra
– Matrices, vectors, systems of linear equations
– Eigenvectors, matrix rank
– Singular value decomposition
• Multivariable calculus
– Derivatives, integration, tangent planes
– Optimization, Lagrange multipliers
• Good programming skills: Python highly recommended
Source Materials
Example ML tasks:
• Spam vs. not spam
• Face recognition
• Temperature prediction (e.g., 72°F)
• Ranking: comparing items; web search
• Given an image, find similar images (http://www.tiltomo.com/)
• Collaborative filtering / recommendation systems (a machine learning competition with a $1 million prize)
• Clustering: a set of images [Goldberger et al.]; clustering web search results
• Embedding: visualizing data
– Embedding images: images have thousands or millions of pixels [Joseph Turian]
– Embedding words [Joseph Turian]
• Structured prediction: how do we choose the best one?
Occam’s Razor Principle
• William of Occam: a monk living in the 14th century
• Principle of parsimony: prefer the simplest hypothesis consistent with the data
[Samy Bengio]
Key Issues in Machine Learning
• How do we choose a hypothesis space?
– Often we use prior knowledge to guide this choice
• How can we gauge the accuracy of a hypothesis on unseen data?
– Occam’s razor: use the simplest hypothesis consistent with data! This will help us avoid overfitting.
– Learning theory will help us quantify our ability to generalize as a function of the amount of training data and the hypothesis space
• How do we find the best hypothesis?
– This is an algorithmic question, the main topic of computer science
• How do we model applications as machine learning problems? (engineering challenge)
Probability Theory refresher
Chapter 2
Bayesian Decision Theory
Key Principle
Bayes Theorem
Machine Learning Tapas Kumar Mishra 2
Bayes Theorem
X: the observed sample (also called evidence; e.g., the length of a fish)
H: the hypothesis (e.g., the fish belongs to the “salmon” category)
P(H): the prior probability that H holds (e.g., the probability of catching a salmon)
P(X|H): the likelihood of observing X given that H holds (e.g., the probability of observing a 3-inch length fish which is salmon)
P(X): the evidence probability that X is observed (e.g., the probability of observing a fish with 3-inch length)
P(H|X): the posterior probability that H holds given X (e.g., the probability of X being salmon given its length is 3 inches)
Bayes’ theorem: P(H|X) = P(X|H) P(H) / P(X)
[Thomas Bayes (1702–1761)]
The observation X for a test example (e.g., fish lightness) is used to convert the prior probability into the posterior probability.
Question
If a positive test result is returned for some person, does he/she have this kind of cancer or not?
No cancer!
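The slide’s numeric inputs are not shown above, so the values in this sketch are assumed for illustration (a rare disease with prior P(cancer) = 0.008, test sensitivity 0.98, false-positive rate 0.03). With these assumptions, Bayes’ theorem yields a posterior below 0.5, matching the “No cancer!” conclusion:

```python
# ASSUMED illustration values (not from the slide): P(cancer), P(+|cancer), P(+|no cancer)
p_cancer = 0.008
p_pos_given_cancer = 0.98
p_pos_given_no_cancer = 0.03

# Evidence: P(+) = P(+|cancer)P(cancer) + P(+|no cancer)P(no cancer)
p_pos = p_pos_given_cancer * p_cancer + p_pos_given_no_cancer * (1 - p_cancer)

# Bayes theorem: P(cancer|+) = P(+|cancer)P(cancer) / P(+)
p_cancer_given_pos = p_pos_given_cancer * p_cancer / p_pos

print(round(p_cancer_given_pos, 3))  # posterior ≈ 0.209 < 0.5, so decide "no cancer"
```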
Quantities to know: estimate the prior probabilities and the likelihoods by counting relative frequencies from collected samples (e.g., counting cars in sets of images).
Decision rule: decide ω₁ if P(ω₁|x) > P(ω₂|x); otherwise decide ω₂. The error rate is the probability that the chosen action is wrong.
Discriminant functions
Various discriminant functions can yield identical classification results: replacing gᵢ(x) by f(gᵢ(x)) for any monotonically increasing f leaves the decision unchanged.
Expected value
Discrete case: E[x] = Σᵢ xᵢ P(xᵢ)
Continuous case: E[x] = ∫ x f(x) dx
Variance
Discrete case: Var(x) = Σᵢ (xᵢ − μ)² P(xᵢ)
Continuous case: Var(x) = ∫ (x − μ)² f(x) dx
Expected vector: μ = E[x], where each component μᵢ = E[xᵢ] is computed from the marginal pdf on the i-th component.
Notation: the covariance matrix Σ is symmetric and positive semidefinite.
For Gaussian class-conditional densities, the discriminant functions can be written with a weight vector and a threshold/bias term: in the equal-covariance case gᵢ(x) = wᵢ⊤x + w_{i0}, involving the squared Mahalanobis distance (x − μᵢ)⊤Σ⁻¹(x − μᵢ); in the general case gᵢ(x) = x⊤Wᵢx + wᵢ⊤x + w_{i0}, with a quadratic matrix Wᵢ in addition to the weight vector and threshold/bias.
◼ Gaussian/Normal density: f(x) = (1/√(2πσ²)) exp(−(x − μ)²/(2σ²))
Complementary Event
P(A) = 1 − P(not A)
Joint Probability
Joint probability p(A ∩ B): the probability of two events in conjunction, i.e., of both events occurring together.
For the union of two events: p(A ∪ B) = p(A) + p(B) − p(A ∩ B)
Independent Events
Two events A and B are independent if
p(A ∩ B) = p(A) p(B)
Example on Independence
Three balls; E1: drawing ball 1, E2: drawing ball 2, E3: drawing ball 3, with P(E1) = P(E2) = P(E3) = 1/3.
Conditional probability: p(A|B) = p(A ∩ B) / p(B); independence means p(A ∩ B) = p(A) p(B).

Case 1: drawing with replacement of the ball.
The second draw is independent of the first draw:
p(E1|E2) = p(E1 ∩ E2) / p(E2) = (1/3 × 1/3) / (1/3) = 1/3

Case 2: drawing without replacement of the ball.
The second draw is dependent on the first draw:
p(E1|E2) = p(E1 ∩ E2) / p(E2) = (1/3 × 1/2) / (1/3) = 1/2
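The two cases above can be checked exactly with rational arithmetic, a minimal sketch:

```python
from fractions import Fraction

p_e2 = Fraction(1, 3)

# Case 1 (with replacement): the joint probability factorizes, draws are independent.
p_e1_and_e2 = Fraction(1, 3) * Fraction(1, 3)
assert p_e1_and_e2 / p_e2 == Fraction(1, 3)   # p(E1|E2) = 1/3 = p(E1)

# Case 2 (without replacement): the first draw removes one ball from the urn.
p_e1_and_e2 = Fraction(1, 3) * Fraction(1, 2)
assert p_e1_and_e2 / p_e2 == Fraction(1, 2)   # p(E1|E2) = 1/2 ≠ p(E1)
```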
Bayes’ Rule
We know that p(A ∩ B) = p(B ∩ A).
Using the conditional probability definition, we have
p(A|B) p(B) = p(B|A) p(A), hence p(A|B) = p(B|A) p(A) / p(B).
Law of Total Probability
p(A) = Σᵢ p(A|Bᵢ) p(Bᵢ)   (example: event A = {3})
A random variable is a function X : Ω → S that maps each event Eᵢ to a value xᵢ.
Random Variable Types
► Discrete Random Variable: possible values are discrete (countable sample space, e.g., integer values):
X : Ω → {1, 2, 3, 4, ...}, Eᵢ ↦ xᵢ
► Continuous Random Variable: possible values are continuous (uncountable space, real values), e.g.:
X : Ω → [1.4, 32.3], Eᵢ ↦ xᵢ
Discrete Random Variable
The probability distribution of a discrete random variable is called the Probability Mass Function (PMF):
p(xᵢ) = P(X = xᵢ)
Properties of the PMF: 0 ≤ p(xᵢ) ≤ 1 and Σᵢ p(xᵢ) = 1
Cumulative Distribution Function (CDF): P(X ≤ x) = Σ_{xᵢ ≤ x} p(xᵢ)
Mean value: μ_X = E(X) = Σᵢ₌₁ⁿ xᵢ p(xᵢ)
Discrete Random Variable
Mean (expected) value: μ_X = E(X) = Σᵢ₌₁ⁿ xᵢ p(xᵢ)
Variance (general equation):
V(X) = σ_X² = E[(X − E(X))²] = E(X²) − E(X)²
For a discrete RV:
V(X) = Σᵢ₌₁ⁿ (xᵢ − μ_X)² p(xᵢ) = Σᵢ₌₁ⁿ xᵢ² p(xᵢ) − (Σᵢ₌₁ⁿ xᵢ p(xᵢ))²
Discrete Random Variable: Example
Random Variable: Grades of the students
Student ID 1 2 3 4 5 6 7 8 9 10
Grade 3 2 3 1 2 3 1 3 2 2
p(2) = P(X = 2) = 4/10 = 0.4, satisfying 0 ≤ p(2) ≤ 1
p(3) = P(X = 3) = 4/10 = 0.4, satisfying 0 ≤ p(3) ≤ 1
Discrete Random Variable: Example
Random Variable: Grades of the students
Student ID 1 2 3 4 5 6 7 8 9 10
Grade 3 2 3 1 2 3 1 3 2 2
Σᵢ p(xᵢ) = p(1) + p(2) + p(3) = 1
P(X ≤ 3) = Σ_{xᵢ ≤ 3} p(xᵢ) = p(1) + p(2) + p(3) = 1
Discrete Random Variable: Example
Random Variable: Grades of the students
Student ID 1 2 3 4 5 6 7 8 9 10
Grade 3 2 3 1 2 3 1 3 2 2
μ_X = E(X) = Σᵢ xᵢ p(xᵢ) = 1 × 0.2 + 2 × 0.4 + 3 × 0.4 = 2.2
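The PMF, mean, and variance for the grade data above can be computed in a few lines:

```python
from collections import Counter

grades = [3, 2, 3, 1, 2, 3, 1, 3, 2, 2]  # the ten student grades from the table
n = len(grades)

# PMF: relative frequency of each grade
pmf = {g: c / n for g, c in Counter(grades).items()}

# Mean and variance via E(X) and E(X^2) - E(X)^2
mean = sum(g * p for g, p in pmf.items())
var = sum(g ** 2 * p for g, p in pmf.items()) - mean ** 2

print(pmf[2], pmf[3], round(mean, 1))  # → 0.4 0.4 2.2
```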
Continuous Random Variable
The probability distribution of a continuous random variable is called the Probability Density Function (PDF), f(x).
The probability of any single value is always 0: P(X = xᵢ) = 0, since the sample space is infinite.
For a continuous random variable, we instead compute P(a ≤ X ≤ b).
Properties of the PDF:
1. f(x) ≥ 0 for all x in R_X
2. ∫_{R_X} f(x) dx = 1
3. f(x) = 0 if x is not in R_X
Continuous Random Variable
Cumulative Distribution Function (CDF):
F(x) = P(X ≤ x) = ∫₋∞ˣ f(t) dt
P(a ≤ X ≤ b) = ∫ₐᵇ f(x) dx
Mean/expected value: μ_X = E(X) = ∫₋∞⁺∞ x f(x) dx
Variance: V(X) = ∫₋∞⁺∞ (x − μ_X)² f(x) dx = ∫₋∞⁺∞ x² f(x) dx − μ_X²
Discrete versus Continuous Random Variables
Discrete: P(X ≤ x) = Σ_{xᵢ ≤ x} p(xᵢ)
Continuous: F(x) = ∫₋∞ˣ f(t) dt; P(a ≤ X ≤ b) = ∫ₐᵇ f(x) dx; f(x) = 0 if x is not in R_X
Continuous Random Variables: Example
Exponential distribution Exp(μ):
f(x) = (1/μ) exp(−x/μ) for x ≥ 0, and 0 otherwise
Continuous Random Variables: Example
Customer waiting time modeled as Exp(μ = 2):
f(x) = (1/2) e^{−x/2} for x ≥ 0, and 0 otherwise
Continuous Random Variables: Example
Probability that the customer waits exactly 3 minutes:
P(X = 3) = P(3 ≤ X ≤ 3) = ∫₃³ (1/2) e^{−x/2} dx = 0
Probability that the customer waits between 2 and 3 minutes:
P(2 ≤ X ≤ 3) = ∫₂³ (1/2) e^{−x/2} dx = 0.145
Using the CDF: P(2 ≤ X ≤ 3) = F(3) − F(2) = (1 − e^{−3/2}) − (1 − e^{−1}) = 0.145
Continuous Random Variables: Example
E(X) = ∫₀⁺∞ x (1/2) e^{−x/2} dx = [−x e^{−x/2}]₀⁺∞ + ∫₀⁺∞ e^{−x/2} dx = 2
σ = √V(X) = 2
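The waiting-time probabilities above follow directly from the CDF F(x) = 1 − e^{−x/2}, a quick sketch:

```python
import math

# CDF of the Exp(mu = 2) waiting-time model from the example
def F(x):
    return 1 - math.exp(-x / 2)

p_2_to_3 = F(3) - F(2)      # P(2 <= X <= 3)
p_exactly_3 = F(3) - F(3)   # P(X = 3) is 0 for a continuous RV
mean = 2                    # E(X) = mu for the exponential distribution

print(round(p_2_to_3, 3), p_exactly_3)  # → 0.145 0.0
```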
Variance
Standard deviation: σ_X = √(σ_X²) = √V(X)
Coefficient of Variation
CV(X) = √V(X) / E(X) = σ_X / μ_X
Discrete Probability Distribution
Probability Mass Function (PMF)
Formally, the probability distribution or probability mass function (PMF) of a discrete random variable X is a function that gives the probability p(xᵢ) that the random variable equals xᵢ, for each value xᵢ:
p(xᵢ) = P(X = xᵢ)
It satisfies the following conditions:
0 ≤ p(xᵢ) ≤ 1
Σᵢ p(xᵢ) = 1
Continuous Random Variable
Probability Density Function (PDF)
For continuous variables, we do not ask what the probability of a single value like "1/6" is, because the answer is always 0. Rather, we ask what the probability is that the value falls in an interval (a, b).
So for continuous variables, we care about the derivative of the distribution function at a point (that's the derivative of an integral). This is called a probability density function (PDF).
The probability that a random variable has a value in a set A is the integral of the p.d.f. over that set A.
Probability Density Function (PDF)
The Probability Density Function (PDF) of a continuous random variable is a function that can be integrated to obtain the probability that the random variable takes a value in a given interval.
More formally, the probability density function f(x) of a continuous random variable X is the derivative of the cumulative distribution function F(x):
f(x) = (d/dx) F(x)
Since F(x) = P(X ≤ x), it follows that:
F(b) − F(a) = P(a ≤ X ≤ b) = ∫ₐᵇ f(x) dx
Cumulative Distribution Function (CDF)
For −∞ < x < +∞: F(x) = P(X ≤ x)
Cumulative Distribution Function (CDF)
Discrete case: F(x) = P(X ≤ x) = Σ_{xᵢ ≤ x} P(X = xᵢ) = Σ_{xᵢ ≤ x} p(xᵢ)
Continuous case: F(b) − F(a) = P(a ≤ X ≤ b) = ∫ₐᵇ f(x) dx
Cumulative Distribution Function (CDF)
► Example
► Discrete case: Suppose a random variable X has the
following probability mass function p(xi):
xi 0 1 2 3 4 5
p(xi) 1/32 5/32 10/32 10/32 5/32 1/32
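The CDF of this PMF is just the running sum of the table entries (the table is a Binomial(5, 1/2) distribution), a minimal sketch:

```python
from fractions import Fraction
from itertools import accumulate

# PMF from the table above, for x_i = 0, 1, ..., 5
pmf = [Fraction(c, 32) for c in (1, 5, 10, 10, 5, 1)]

# CDF: F(x) = P(X <= x), the cumulative sums of the PMF
cdf = list(accumulate(pmf))

print(cdf[2], cdf[-1])  # → 1/2 1
```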
Mean or Expected Value
Variance
V(X) = σ_X² = E[(X − E(X))²] = E(X²) − E(X)²
Standard deviation: σ_X = √(σ_X²) = √V(X)
Sampling Distributions, Confidence Intervals, Hypothesis Testing
Sampling Distribution of the Means
• Central Limit Theorem: if X̄ is the mean of a random sample of size n taken from a population with mean μ and finite variance σ², then, as n → ∞, the limiting form of the distribution of
z = (X̄ − μ) / (σ/√n)
is the standard normal distribution.
Theorem 2:
Example:
• Let X̄₁, X̄₂, …, X̄ₙ be the sample means of samples S₁, S₂, …, Sₙ drawn from an independent and identically distributed population with mean μ and standard deviation σ. From the central limit theorem we know that the sample means X̄ᵢ follow a normal distribution with mean μ and standard deviation σ/√n, and the variable Z = (X̄ᵢ − μ)/(σ/√n) follows the standard normal distribution. Hence
P(X̄ − Z_{α/2} σ/√n ≤ μ ≤ X̄ + Z_{α/2} σ/√n) = 1 − α
9/11/2023 @TKMISHRA ML NITRKL 14
CI for Different Significance Values
• That is, the probability that the population mean takes a value between X̄ − Z_{α/2} σ/√n and X̄ + Z_{α/2} σ/√n is 1 − α.
• The absolute values of Z_{α/2} for various values of α, together with the confidence interval for the population mean when the population standard deviation is known, are shown below:
α      |Z_{α/2}|   Confidence interval
0.10   1.64        X̄ ± 1.64 σ/√n
0.05   1.96        X̄ ± 1.96 σ/√n
0.02   2.33        X̄ ± 2.33 σ/√n
0.01   2.58        X̄ ± 2.58 σ/√n
(a) Calculate the 95% confidence interval for the population mean.
(b) What is the probability that the population mean is greater than
4.73 days?
Note that 4.73 is the upper limit of the 95% confidence interval from part (a), thus the probability
that the population mean is greater than 4.73 is approximately 0.025.
• William Gosset (“Student”, 1908) proved that if the population follows a normal distribution and the standard deviation is calculated from the sample, then the statistic below follows a t-distribution with (n − 1) degrees of freedom:
t = (X̄ − μ) / (S / √n)
• Here S is the standard deviation estimated from the sample (S/√n is the standard error). The t-distribution is very similar to the standard normal distribution; it has a bell shape and its mean, median, and mode are equal to zero, as in the case of the standard normal distribution.
The major difference between the t-distribution and the standard normal distribution is that the t-distribution has broader tails. However, as the degrees of freedom increase, the t-distribution converges to the standard normal distribution.
• In the above equation, t_{α/2, n−1} is the value of t under the t-distribution for which the cumulative probability F(t) = α/2, when the degrees of freedom is (n − 1).
• An online grocery store is interested in estimating the basket size (number of items
ordered by the customer) of its customers so that it can optimize its size of crates used
for delivering the grocery items. From a sample of 70 customers, the average basket size
was estimated as 24 and the standard deviation estimated from the sample was 3.8.
Calculate the 95% confidence interval for the basket size of the customer order.
Solution
We know that X̄ = 24, n = 70, S = 3.8, and t_{0.025, 69} = 1.995.
Thus the 95% confidence interval for the size of the basket is X̄ ± t_{0.025, 69} S/√n = 24 ± 1.995 × 3.8/√70, i.e., (23.09, 24.91).
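The basket-size interval can be reproduced with a short sketch (the critical value t_{0.025, 69} = 1.995 is taken from the slide rather than recomputed):

```python
import math

# Sample summary from the example: mean basket size, sample size, sample std. dev.
xbar, n, s = 24, 70, 3.8
t = 1.995  # t_{0.025, 69}, as given in the slide

half_width = t * s / math.sqrt(n)
lo, hi = xbar - half_width, xbar + half_width

print(round(lo, 2), round(hi, 2))  # → 23.09 24.91
```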
HYPOTHESIS TESTING
INTRODUCTION TO HYPOTHESIS TESTING
3) Identify the test statistic to be used for testing the validity of the null
hypothesis. Test statistic will enable us to calculate the evidence in
support of null hypothesis. The test statistic will depend on the
probability distribution of the sampling distribution; for example, if the
test is for mean value and the mean is calculated from a large sample
and if the population standard deviation is known, then the sampling
distribution will be a normal distribution and the test statistic will be a Z-
statistic (standard normal statistic).
HYPOTHESIS TESTING STEPS
4. Decide the criteria for rejection and retention of the null hypothesis.
This is called the significance value, traditionally denoted by the symbol α.
The value of α will depend on the context; usually 0.1, 0.05, and 0.01 are used.
6. Take the decision to reject or retain the null hypothesis based on the p-value and the significance value α. The null hypothesis is rejected when the p-value is less than α, and retained when the p-value is greater than or equal to α.
H0: μ ≤ 100,000
HA: μ > 100,000
Z-statistic = (X̄ − μ) / (σ / √n)
• The critical value in this case will depend on the significance value α and whether it is a one-tailed or two-tailed test:
α      One-tailed          Two-tailed
0.10   −1.28 or 1.28       −1.64 and 1.64
0.05   −1.64 or 1.64       −1.96 and 1.96
0.01   −2.33 or 2.33       −2.58 and 2.58
Z = (X̄ − μ) / (σ/√n) = (4250 − 4200) / (3200/√40000) = 3.125
16 16 30 37 25 22 19 35 27 32
34 28 24 35 24 21 32 29 24 35
28 29 18 31 28 33 32 24 25 22
21 27 41 23 23 16 24 38 26 28
σ/√n = 12.5/√40 = 1.9764
Solution Continued…
• The critical value of the left-tailed test for α = 0.05 is −1.644.
• Since the critical value is less than the Z-statistic value, we fail to reject the null hypothesis. The p-value for Z = −1.4926 is 0.06777, which is greater than the value of α.
Z = (X̄ − μ) / (σ/√n) = (84 − 82) / (11.03/√100) = 1.8132
601 627 330 364 562 353 583 254 528 470
408 601 593 729 402 530 708 599 439 762
292 636 444 286 636 667 252 335 457 632
t-statistic = (X̄ − μ) / (S/√n) = (429.55 − 500) / (195.0337/√40) = −2.2845
t = (X̄ − μ) / (S/√n) = (19.5 − 16.8) / (6.6/√50) = 2.8927
t = (d̄ − μ_D) / (S_d/√n) = (11.5 − 0) / (95.6757/√20) = 0.5375
Z = [(X̄₁ − X̄₂) − (μ₁ − μ₂)] / √(σ₁²/n₁ + σ₂²/n₂)
(Table: specialization, sample size, estimated mean salary in rupees, and population standard deviation for each group.)
Since the Z-statistic value is higher than the Z-critical value, we reject the null hypothesis.
where S_p² is the pooled variance of the two samples, given by
S_p² = [(n₁ − 1)S₁² + (n₂ − 1)S₂²] / (n₁ + n₂ − 2)
t = [(X̄₁ − X̄₂) − (μ₁ − μ₂)] / √(S_p² (1/n₁ + 1/n₂))
Group                       Sample Size   Increase in Height (in cm)   Standard Deviation
Drink health drink              80               7.6 cm                     1.1 cm
Do not drink health drink       80               6.3 cm                     1.3 cm
S₁ = 1.1 and S₂ = 1.3. The null and alternative hypotheses are
H0: μ₁ − μ₂ ≤ 1.2
HA: μ₁ − μ₂ > 1.2
Pooled variance:
S_p² = [(n₁ − 1)S₁² + (n₂ − 1)S₂²] / (n₁ + n₂ − 2) = (79 × 1.1² + 79 × 1.3²) / (80 + 80 − 2) = 1.45
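The pooled variance and the resulting two-sample t-statistic for the health-drink data can be sketched as:

```python
import math

# Health-drink example: sample sizes, means, and standard deviations
n1, x1, s1 = 80, 7.6, 1.1
n2, x2, s2 = 80, 6.3, 1.3
delta0 = 1.2  # hypothesized difference under H0

# Pooled variance of the two samples
sp2 = ((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / (n1 + n2 - 2)

# Pooled two-sample t-statistic
t = ((x1 - x2) - delta0) / math.sqrt(sp2 * (1 / n1 + 1 / n2))

print(round(sp2, 2), round(t, 3))  # → 1.45 0.525
```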
When the population standard deviations are unknown and unequal, the standard error of the difference is
S_u = √(S₁²/n₁ + S₂²/n₂)
and the test statistic is
t = [(X̄₁ − X̄₂) − (μ₁ − μ₂)] / √(S₁²/n₁ + S₂²/n₂)
Group                    Sample Size   Mean          Standard Deviation
Couples with no degree       120       10.1 years        2.4 years
Couples with degree          100        9.5 years        3.1 years

S_u = √(S₁²/n₁ + S₂²/n₂) = √(2.4²/120 + 3.1²/100)
August 3, 2022
Bayes Optimal classifier
Suppose you find a coin and it’s ancient and very valuable.
Naturally, you ask yourself, ”What is the probability that this coin
comes up heads when I toss it?”
You toss it n = 10 times and obtain the following sequence of
outcomes: D = {H, T , T , H, H, H, T , T , T , T }. Based on these
samples, how would you estimate P(H)?
P(D | θ) = ((nH + nT) choose nH) · θ^{nH} (1 − θ)^{nT}   (1)
Assume you have a hunch that θ is close to 0.5. But your sample size is small, so you don’t trust your estimate.
Simple fix: add 2m imaginary throws that would result in θ₀ (e.g., θ₀ = 0.5). Add m heads and m tails to your data:
θ̂ = (nH + m) / (nH + nT + 2m)
P(θ) = θ^{α−1} (1 − θ)^{β−1} / B(α, β)   (8)
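For the coin data above (nH = 4, nT = 6), the MLE and the smoothed estimate can be checked exactly, a minimal sketch:

```python
from fractions import Fraction

# Coin data from the slides: D = {H,T,T,H,H,H,T,T,T,T} -> nH = 4, nT = 6
n_h, n_t = 4, 6

# MLE: maximizes P(D | theta) = C(nH+nT, nH) theta^nH (1-theta)^nT
theta_mle = Fraction(n_h, n_h + n_t)

# Smoothed estimate: add 2m imaginary throws (m heads, m tails), here m = 1
m = 1
theta_smooth = Fraction(n_h + m, n_h + n_t + 2 * m)

print(theta_mle, theta_smooth)  # → 2/5 5/12
```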
August 7, 2022
Supervised ML Setup
D = {(x1 , y1 ), . . . , (xn , yn )} ⊆ Rd × C
where:
Rᵈ is the d-dimensional feature space
xᵢ is the input vector of the i-th sample
yᵢ is the label of the i-th sample
C is the label space
Tapas Kumar Mishra Bayes Classifier and Naive Bayes
The data points (xi , yi ) are drawn from some (unknown)
distribution P(X , Y ). Ultimately we would like to learn a function
h such that for a new pair (x, y ) ∼ P, we have h(x) = y with high
probability (or h(x) ≈ y ).
Our training consists of the set D = {(x1 , y1 ), . . . , (xn , yn )}
drawn from some unknown distribution P(X , Y ).
Because all pairs are sampled i.i.d., we obtain P(D) = ∏ᵢ₌₁ⁿ P(xᵢ, yᵢ).
Bayes Optimal Classifier
The Bayes optimal classifier predicts the most likely label: h_opt(x) = argmax_y P(y | x).
Assume for example an email x can either be classified as spam (+1) or ham (−1). For the same email x, suppose the conditional class probabilities are P(+1 | x) = 0.8 and P(−1 | x) = 0.2.
In this case the Bayes optimal classifier would predict the label y* = +1 as it is most likely, and its error rate would be ε_BayesOpt = 0.2.
So how can we estimate P̂(y|x)?
Previously we have derived that P̂(y) = Σᵢ₌₁ⁿ I(yᵢ = y) / n.
Similarly, P̂(x) = Σᵢ₌₁ⁿ I(xᵢ = x) / n and P̂(y, x) = Σᵢ₌₁ⁿ I(xᵢ = x ∧ yᵢ = y) / n.
We can put these two together:
P̂(y|x) = P̂(y, x) / P̂(x) = Σᵢ₌₁ⁿ I(xᵢ = x ∧ yᵢ = y) / Σᵢ₌₁ⁿ I(xᵢ = x)
The Venn diagram illustrates that the MLE method estimates
P̂(y|x) = |C| / |B|
where B is the set of training points with xᵢ = x, and C is its subset with yᵢ = y.
Problem: There is a big problem with this method. The MLE estimate is only good if there are many training vectors with exactly the same features as x!
In high-dimensional spaces (or with continuous x), this never happens! So |B| → 0 and |C| → 0.
E.g., for P(y = yes | x1 = dallas, x2 = female, x3 = 5), the matching set is always empty!
Naive Bayes
By Bayes’ rule,
P(y|x) = P(x|y) P(y) / P(x)
Estimating P(y) is easy. For example, if Y takes on discrete binary values, estimating P(Y) reduces to coin tossing: we simply need to count how many times we observe each outcome (in this case each class):
P(y = c) = Σᵢ₌₁ⁿ I(yᵢ = c) / n = π̂_c
Estimating P(x|y), however, is not easy! The additional assumption that we make is the Naive Bayes assumption.
Naive Bayes assumption
P(x|y) = ∏_{α=1}^{d} P(x_α | y), where x_α = [x]_α is the value for feature α
MLE
So, for now, let’s pretend the Naive Bayes assumption holds. Then the Bayes classifier can be defined as
h(x) = argmax_y P(y) ∏_{α=1}^{d} P(x_α | y)
Now that we know how to use our assumption to make the estimation of P(y|x) tractable, there are 3 notable cases in which we can use our naive Bayes classifier:
Case #1: Categorical features.
Case #2: Multinomial features.
Case #3: Continuous features (Gaussian Naive Bayes).
Case #1: Categorical features
Features: [x]_α ∈ {f₁, f₂, ⋯, f_{Kα}}
Each feature α falls into one of Kα categories. (Note that the case with binary features is just a specific case of this, where Kα = 2.) An example of such a setting may be medical data where one feature could be gender (male / female) or marital status (single / married / widowed).
Case #1: Categorical features
Model P(x_α | y):
P(x_α = j | y = c) = [θ_jc]_α, and Σ_{j=1}^{Kα} [θ_jc]_α = 1
Case #1: Categorical features
Parameter estimation (with smoothing parameter l):
[θ̂_jc]_α = (Σᵢ₌₁ⁿ I(yᵢ = c) I(x_{iα} = j) + l) / (Σᵢ₌₁ⁿ I(yᵢ = c) + l·Kα)   (1)
Case #1: Categorical features
Prediction:
argmax_y P(y = c | x) ∝ argmax_y π̂_c ∏_{α=1}^{d} [θ̂_jc]_α
= argmax_y (Σᵢ₌₁ⁿ I(yᵢ = c) / n) ∏_{α=1}^{d} (Σᵢ₌₁ⁿ I(yᵢ = c) I(x_{iα} = j) + l) / (Σᵢ₌₁ⁿ I(yᵢ = c) + l·Kα)
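The estimation and prediction rules for categorical features can be sketched end-to-end; the tiny dataset below is hypothetical, used only to exercise eq. (1) and the prediction rule:

```python
import numpy as np

# Hypothetical data: rows are samples, columns are categorical features in {0, ..., K_alpha - 1}
X = np.array([[0, 1], [0, 0], [1, 1], [1, 0], [0, 1]])
y = np.array([+1, +1, -1, -1, +1])
l = 1  # smoothing parameter

classes = np.unique(y)
K = X.max(axis=0) + 1  # number of categories per feature

# Priors: pi_c = (1/n) sum_i I(y_i = c)
priors = {c: np.mean(y == c) for c in classes}

# Eq. (1): theta[c][alpha][j] = (sum_i I(y_i=c) I(x_{i,alpha}=j) + l) / (sum_i I(y_i=c) + l*K_alpha)
theta = {
    c: [
        (np.bincount(X[y == c, a], minlength=K[a]) + l) / ((y == c).sum() + l * K[a])
        for a in range(X.shape[1])
    ]
    for c in classes
}

def predict(x):
    # argmax_c of log pi_c + sum_alpha log theta (logs avoid underflow)
    scores = {
        c: np.log(priors[c]) + sum(np.log(theta[c][a][x[a]]) for a in range(len(x)))
        for c in classes
    }
    return max(scores, key=scores.get)

print(predict([0, 1]))
```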
Example
Case #2: Multinomial features
Here each feature value x_α is a count (e.g., the number of occurrences of word α in a document of length m = Σ_α x_α).
Case #2: Multinomial features
Parameter estimation:
θ̂_αc = (Σᵢ₌₁ⁿ I(yᵢ = c) x_{iα} + l) / (Σᵢ₌₁ⁿ I(yᵢ = c) mᵢ + l·d)   (3)
Case #2: Multinomial features
Prediction:
argmax_c P(y = c | x) ∝ argmax_c π̂_c ∏_{α=1}^{d} (θ̂_αc)^{x_α}
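The multinomial case, eq. (3) plus the prediction rule, can be sketched with hypothetical word-count data (rows are documents, columns are vocabulary words):

```python
import numpy as np

# Hypothetical word counts: 4 documents, vocabulary of d = 3 words
X = np.array([[3, 0, 1], [2, 1, 0], [0, 3, 2], [1, 2, 3]])
y = np.array([+1, +1, -1, -1])
l, d = 1, 3  # smoothing parameter, number of features

classes = np.unique(y)
priors = {c: np.mean(y == c) for c in classes}

# Eq. (3): theta_{alpha c}; note m_i = sum_alpha x_{i alpha} is the length of document i,
# so the denominator is the total word count of class c plus l*d
theta = {c: (X[y == c].sum(axis=0) + l) / (X[y == c].sum() + l * d) for c in classes}

def predict(x):
    # log of pi_c * prod_alpha theta_{alpha c}^{x_alpha}
    scores = {c: np.log(priors[c]) + float(np.dot(x, np.log(theta[c]))) for c in classes}
    return max(scores, key=scores.get)

print(predict([3, 0, 0]))
```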
Case #3: Continuous features (Gaussian Naive Bayes)
Features: [x]_α ∈ ℝ (real-valued features).
Model each P(x_α | y = c) as a univariate Gaussian N(μ_αc, σ²_αc), with μ_αc and σ²_αc estimated from the training samples of class c.
Prediction: argmax_c π̂_c ∏_{α=1}^{d} N(x_α; μ̂_αc, σ̂²_αc)
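A minimal Gaussian Naive Bayes sketch on hypothetical one-dimensional data, fitting a Gaussian per class and predicting by the highest prior-weighted density:

```python
import math

# Hypothetical 1-D training data, grouped by class label
data = {+1: [1.0, 1.2, 0.8], -1: [3.0, 3.5, 2.5]}

def gaussian(x, mu, var):
    # Univariate Gaussian density N(x; mu, var)
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

# Estimate mean and variance per class from the class's samples
params = {}
for c, xs in data.items():
    mu = sum(xs) / len(xs)
    var = sum((v - mu) ** 2 for v in xs) / len(xs)
    params[c] = (mu, var)

n = sum(len(xs) for xs in data.values())
priors = {c: len(xs) / n for c, xs in data.items()}

def predict(x):
    # argmax_c pi_c * N(x; mu_c, var_c)
    scores = {c: priors[c] * gaussian(x, *params[c]) for c in priors}
    return max(scores, key=scores.get)

print(predict(1.1))
```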
Example
Naive Bayes is a linear classifier
That is, w⊤x + b > 0 ⟺ h(x) = +1.
Naive Bayes is a linear classifier
As before, we define P(x_α | y = +1) ∝ θ_{α+}^{x_α} and P(Y = +1) = π₊.
Naive Bayes is a linear classifier
Simplifying this further leads to

w⊤x + b > 0
⟺ Σ_{α=1}^{d} [x]_α (log(θ_{α+}) − log(θ_{α−})) + log(π₊) − log(π₋) > 0
   (plugging in the definitions [w]_α = log(θ_{α+}) − log(θ_{α−}) and b = log(π₊) − log(π₋))
⟺ exp( Σ_{α=1}^{d} [x]_α (log(θ_{α+}) − log(θ_{α−})) + log(π₊) − log(π₋) ) > 1
   (exponentiating both sides)
⟺ ∏_{α=1}^{d} exp(log θ_{α+}^{[x]_α} + log(π₊)) / exp(log θ_{α−}^{[x]_α} + log(π₋)) > 1
   (because a·log(b) = log(b^a) and exp(a − b) = e^a / e^b)
⟺ ∏_{α=1}^{d} (θ_{α+}^{[x]_α} π₊) / (θ_{α−}^{[x]_α} π₋) > 1
   (because exp(log(a)) = a and e^{a+b} = e^a e^b)
⟺ (∏_{α=1}^{d} P([x]_α | Y = +1) π₊) / (∏_{α=1}^{d} P([x]_α | Y = −1) π₋) > 1
   (because P([x]_α | Y = −1) = θ_{α−}^{x_α})
⟺ P(x | Y = +1) π₊ / (P(x | Y = −1) π₋) > 1   (by the naive Bayes assumption)
⟺ P(Y = +1 | x) / P(Y = −1 | x) > 1   (by Bayes’ rule, with π₊ = P(Y = +1))
⟺ P(Y = +1 | x) > P(Y = −1 | x)
⟺ argmax_y P(Y = y | x) = +1
i.e., the point x lies on the positive side of the hyperplane iff Naive Bayes predicts +1.
Naive Bayes is a linear classifier
44/45
Tapas Kumar Mishra Bayes Classifier and Naive Bayes
Validation of the Linear Regression Model
Validation of the Simple Linear Regression Model
The above measures and tests are essential, but not exhaustive.
Coefficient of Determination (R-Square or R2)
The simple linear regression model is

Yi = β0 + β1 Xi + εi

The variation in Y has two components: the variation in Y explained by the model and the variation in Y not explained by the model. In the absence of a predictive model for Yi, users would fall back on the mean value of Y; the total variation is therefore measured as the difference between Yi and the mean value of Y (i.e., Yi − Ȳ).
Description of total variation, explained variation and unexplained variation:
• Total variation (SST): Yi − Ȳ. Total variation is the difference between the actual value and the mean value.
• Variation explained by the model (SSR): Ŷi − Ȳ. Variation explained by the model is the difference between the estimated value of Yi and the mean value of Y.
• Variation not explained by the model (SSE): Yi − Ŷi. Variation not explained by the model is the difference between the actual value and the predicted value of Yi (the error in prediction).
The relationship between the total variation, explained variation and the unexplained variation is given as follows:

Yi − Ȳ = (Ŷi − Ȳ) + (Yi − Ŷi)
(total variation in Y) = (variation in Y explained by the model) + (variation in Y not explained by the model)

Squaring and summing over all observations gives SST = SSR + SSE, where SST is the sum of squares of total variation, SSR is the sum of squares of variation explained by the regression model and SSE is the sum of squares of errors or unexplained variation.
Coefficient of Determination or R-Square
The coefficient of determination (R²) is given by

R² = Explained variation / Total variation = SSR / SST = Σ_{i=1}^n (Ŷi − Ȳ)² / Σ_{i=1}^n (Yi − Ȳ)²

Equivalently,

R² = 1 − SSE/SST = 1 − Σ_{i=1}^n (Yi − Ŷi)² / Σ_{i=1}^n (Yi − Ȳ)²
Thus, R² is the proportion of variation in the response variable Y explained by the regression model. R² always lies between 0 and 1; a value close to 1 indicates that the model explains most of the variation in Y.
Facebook users versus helium poisoning in UK

Year | Facebook users | Deaths due to helium poisoning (UK)
2004 | 1    | 2
2005 | 6    | 2
2006 | 12   | 2
2007 | 58   | 2
2008 | 145  | 11
2009 | 360  | 21
2010 | 608  | 31
2011 | 845  | 40
2012 | 1056 | 51
The R-square value for regression model between the number of deaths due to
helium poisoning in UK and the number of Facebook users is 0.9928. That is,
99.28% variation in the number of deaths due to helium poisoning in UK is
explained by the number of Facebook users.
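The R-square figure quoted above can be reproduced directly from the table, using SST, SSR and SSE exactly as defined earlier:

```python
import numpy as np

# Facebook users (x) and helium-poisoning deaths (y), 2004-2012,
# from the table on the slide.
x = np.array([1, 6, 12, 58, 145, 360, 608, 845, 1056], dtype=float)
y = np.array([2, 2, 2, 2, 11, 21, 31, 40, 51], dtype=float)

# Least-squares slope and intercept for the SLR model Y = b0 + b1*X
b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()
y_hat = b0 + b1 * x

sst = np.sum((y - y.mean()) ** 2)       # total variation
sse = np.sum((y - y_hat) ** 2)          # unexplained variation
ssr = np.sum((y_hat - y.mean()) ** 2)   # explained variation

r2 = 1 - sse / sst
print(round(r2, 4))   # ≈ 0.993: high R² despite a nonsense relationship
```

The point of the example survives the computation: a very high R² says nothing about whether the relationship is causal or even meaningful.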
Hypothesis Test for Regression Co-efficient (t-Test)
In the above equation, Se is the standard error of estimate (or standard error of the residuals) that measures the accuracy of prediction and is given by

Se = √[ Σ_{i=1}^n (Yi − Ŷi)² / (n − 2) ]

The standard error of the estimate of β1 is

Se(β̂1) = Se / √[ Σ_{i=1}^n (Xi − X̄)² ]
The null and alternative hypotheses for the SLR model can be
stated as follows:
H0: There is no relationship between X and Y
HA: There is a relationship between X and Y
• β1 = 0 would imply that there is no linear relationship between the response variable Y and the explanatory variable X. Thus, the null and alternative hypotheses can be restated as follows:
H0: β1 = 0
HA: β1 ≠ 0
• The corresponding t-statistic is given as

t = (β̂1 − β1) / Se(β̂1) = (β̂1 − 0) / Se(β̂1) = β̂1 / Se(β̂1)
Confidence Interval for Regression Coefficients β0 and β1
The standard errors of the estimates of β0 and β1 are given by

Se(β̂0) = Se × √[ Σ_{i=1}^n Xi² / (n × SSX) ]
Se(β̂1) = Se / √SSX

where Se = √[ Σ_{i=1}^n (Yi − Ŷi)² / (n − 2) ] is the standard error of residuals and SSX = Σ_{i=1}^n (Xi − X̄)².

The interval estimate or (1 − α)100% confidence intervals for β0 and β1 are given by

β̂0 ∓ t_{α/2, n−2} Se(β̂0)
β̂1 ∓ t_{α/2, n−2} Se(β̂1)
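A minimal sketch of the t-test and confidence interval for β1. The x, y values are made up for illustration, and the hard-coded t critical value is an assumption taken from a standard t-table:

```python
import numpy as np

# Illustrative SLR data (roughly y = 1 + 2x plus small noise; made up).
x = np.array([2.0, 4.0, 5.0, 7.0, 8.0, 10.0, 12.0, 13.0, 15.0])
y = np.array([5.1, 8.9, 11.2, 14.8, 17.1, 21.0, 24.9, 27.2, 31.1])
n = len(x)

b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()
residuals = y - (b0 + b1 * x)

se = np.sqrt(np.sum(residuals ** 2) / (n - 2))   # std. error of residuals
ssx = np.sum((x - x.mean()) ** 2)
se_b1 = se / np.sqrt(ssx)                        # Se(beta1_hat)

t_stat = b1 / se_b1                              # tests H0: beta1 = 0
t_crit = 2.365                                   # t_{0.025, n-2} for n = 9
ci = (b1 - t_crit * se_b1, b1 + t_crit * se_b1)
print(t_stat, ci)
```

A |t| far above the critical value, as here, leads to rejecting H0: β1 = 0.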
Multiple Linear Regression
• Multiple linear regression means linear in the regression parameters (the beta values). The following are examples of multiple linear regression models:

Y = β0 + β1 x1 + β2 x2 + ... + βk xk + ε
Y = β0 + β1 x1 + β2 x2 + β3 x1 x2 + β4 x2² + ... + βk xk + ε
R² = 1 − SSE/SST = 1 − Σ_{i=1}^n (Yi − Ŷi)² / Σ_{i=1}^n (Yi − Ȳ)²

• SSE is the sum of squares of errors and SST is the sum of squares of total deviation. In the case of MLR, SSE will decrease as the number of explanatory variables increases, while SST remains constant. To correct for this, the adjusted R-square penalizes the number of variables:

Adjusted R-Square = 1 − [SSE/(n − k − 1)] / [SST/(n − 1)]
Statistical Significance of Individual Variables in MLR – t-test
The estimated regression coefficients are β̂ = (XᵀX)⁻¹XᵀY. For each individual coefficient:
• H0: βi = 0
• HA: βi ≠ 0
The corresponding test statistic is given by

t = (β̂i − 0) / Se(β̂i) = β̂i / Se(β̂i)
Validation of Overall Regression Model – F-test
H0: β1 = β2 = β3 = … = βk = 0
H1: Not all βs are zero.

F = [(SST − SSE)/k] / [SSE/(n − k − 1)] ~ F(k, n−k−1)
F-test for the overall fit of the model
• Where the critical value F(, k, n-k-1) can be found from an F-table.
• The existence of a regression relation by itself does not assure that
useful prediction can be made by using it.
• Note that when k = 1, this test reduces to the F-test for testing in simple linear regression whether or not β1 = 0.
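A sketch of the overall F-test and adjusted R-square for an MLR model. The data are synthetic (n, k and the coefficients are arbitrary illustration values):

```python
import numpy as np

# Synthetic MLR data with k = 2 explanatory variables (made up).
rng = np.random.default_rng(1)
n, k = 50, 2
X = rng.normal(size=(n, k))
y = 3.0 + 2.0 * X[:, 0] - 1.0 * X[:, 1] + rng.normal(scale=0.5, size=n)

Xd = np.column_stack([np.ones(n), X])          # design matrix with intercept
beta = np.linalg.solve(Xd.T @ Xd, Xd.T @ y)    # beta = (X'X)^{-1} X'y
y_hat = Xd @ beta

sst = np.sum((y - y.mean()) ** 2)
sse = np.sum((y - y_hat) ** 2)

r2 = 1 - sse / sst
adj_r2 = 1 - (sse / (n - k - 1)) / (sst / (n - 1))
F = ((sst - sse) / k) / (sse / (n - k - 1))    # ~ F(k, n-k-1) under H0
print(r2, adj_r2, F)
```

A large F relative to the F(k, n−k−1) critical value rejects H0 that all slope coefficients are zero; the adjusted R² is always below the plain R².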
Linear Regression
Supervised learning
Let's start by talking about a few examples of supervised learning problems. Suppose we have a dataset giving the living areas and prices of 47 houses from Delhi, India:
Tapas Kumar Mishra Linear Regression
Given data like this, how can we learn to predict the prices of other
houses in Delhi, as a function of the size of their living areas?
To establish notation for future use, we'll use
x^(i) to denote the input variables (living area in this example), also called input features, and
y^(i) to denote the output or target variable that we are trying to predict (price).
A pair (x^(i), y^(i)) is called a training example, and the dataset that we'll be using to learn, a list of m training examples {(x^(i), y^(i)); i = 1, ..., m}, is called a training set.
We will also use X to denote the space of input values, and Y the space of output values. In this example, X = Y = R.
To describe the supervised learning problem slightly more formally, our goal is, given a training set, to learn a function h : X → Y so that h(x) is a good predictor for the corresponding value of y. For historical reasons, this function h is called a hypothesis.
When the target variable that we're trying to predict is continuous, such as in our housing example, we call the learning problem a regression problem. When y can take on only a small number of discrete values (such as if, given the living area, we wanted to predict whether a dwelling is a house or an apartment, say), we call it a classification problem.
Linear Regression
To make our housing example more interesting, let's consider a slightly richer dataset in which we also know the number of bedrooms in each house:
Here, the x's are two-dimensional vectors in R². For instance, x1^(i) is the living area of the i-th house in the training set, and x2^(i) is its number of bedrooms.
To perform supervised learning, we must decide how we're going to represent functions/hypotheses h in a computer. As an initial choice, let's say we decide to approximate y as a linear function of x:

hθ(x) = θ0 + θ1 x1 + θ2 x2   (1)
Here, the θi's are the parameters (also called weights) parameterizing the space of linear functions mapping from X → Y. When there is no risk of confusion, we will drop the θ subscript in hθ(x), and write it more simply as h(x). To simplify our notation, we also introduce the convention of letting x0 = 1 (this is the intercept term), so that

h(x) = Σ_{j=0}^n θj xj = θᵀx.   (2)
Now, given a training set, how do we pick, or learn, the parameters θ? One reasonable method seems to be to make h(x) close to y, at least for the training examples we have. To formalize this, we will define a function that measures, for each value of the θ's, how close the h(x^(i))'s are to the corresponding y^(i)'s. We define the cost function:

J(θ) = (1/2) Σ_{i=1}^m (h(x^(i)) − y^(i))².   (3)
LMS Algorithm: least mean square
LMS for a single instance
Let's first work it out for the case where we have only one training example (x, y), so that we can neglect the sum in the definition of J. We have:
For a single training example, this gives the update rule:

θj := θj + α (y^(i) − h(x^(i))) xj^(i)   (LMS update rule) (5)
LMS for the full dataset
The reader can easily verify that the quantity in the summation in the update rule above is just ∂J(θ)/∂θj (for the original definition of J). So, this is simply gradient descent on the original cost function J. This method looks at every example in the entire training set on every step, and is called batch gradient descent.
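A minimal batch gradient descent sketch for J(θ), on a made-up 1-D dataset with the x0 = 1 intercept convention. The learning rate, iteration count, and the scaling of the gradient by m are illustrative choices, not prescribed by the slides:

```python
import numpy as np

# Made-up data: y ≈ 4 + 3*x1 plus small noise.
rng = np.random.default_rng(2)
m = 100
x1 = rng.uniform(0, 10, size=m)
y = 4.0 + 3.0 * x1 + rng.normal(scale=0.1, size=m)
X = np.column_stack([np.ones(m), x1])   # x0 = 1 convention

theta = np.zeros(2)
alpha = 0.005                           # learning rate (illustrative)
for _ in range(20000):
    grad = X.T @ (X @ theta - y)        # dJ/dtheta, summed over all m examples
    theta -= alpha * grad / m           # scaled by m to keep the step stable
print(theta)                            # close to [4, 3]
```

Every update touches all m examples, which is exactly what makes batch gradient descent costly on large training sets.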
J(θ) = (1/2) Σ_{i=1}^m (h(x^(i)) − y^(i))².   (6)
The ellipses shown above are the contours of a quadratic function. Also shown is the trajectory taken by gradient descent, which was initialized at (48, 30). The x's in the figure (joined by straight lines) mark the successive values of θ that gradient descent went through.
When we run batch gradient descent to fit θ on our previous dataset, to learn to predict housing price as a function of living area, we obtain θ0 = 71.27, θ1 = 0.1345. If we plot hθ(x) as a function of x (area), along with the training data, we obtain the following figure:
Stochastic Gradient Descent
Whereas batch gradient descent has to scan through the entire training set before taking a single step (a costly operation if m is large), stochastic gradient descent can start making progress right away, and continues to make progress with each example it looks at. Often, stochastic gradient descent gets θ close to the minimum much faster than batch gradient descent. But it may never converge.
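A sketch of stochastic gradient descent with the LMS update applied one example at a time. The dataset, learning rate and epoch count are made up for illustration:

```python
import numpy as np

# Made-up data: y ≈ 1 + 2*x1 plus small noise.
rng = np.random.default_rng(3)
m = 200
x1 = rng.uniform(0, 5, size=m)
y = 1.0 + 2.0 * x1 + rng.normal(scale=0.05, size=m)
X = np.column_stack([np.ones(m), x1])

theta = np.zeros(2)
alpha = 0.01
for epoch in range(50):
    for i in rng.permutation(m):            # visit examples in random order
        error = y[i] - X[i] @ theta
        theta = theta + alpha * error * X[i]  # one LMS step per example
print(theta)   # hovers near [1, 2] but, with constant alpha, never settles
```

With a constant step size θ keeps jittering around the minimum, which is the "may never converge" behavior noted above; decaying alpha over time is the usual remedy.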
Normal Equation
We need a θ such that Xθ = y⃗. So, we need to solve for θ:

Xθ = y⃗
⟹ XᵀXθ = Xᵀy⃗
⟹ (XᵀX)⁻¹(XᵀX)θ = (XᵀX)⁻¹Xᵀy⃗
⟹ θ = (XᵀX)⁻¹Xᵀy⃗
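The closed-form solution can be sketched as follows, on illustrative made-up data. Solving the linear system with np.linalg.solve, rather than forming the inverse explicitly, is a numerical-practice choice, not something the derivation requires:

```python
import numpy as np

# Made-up data: y ≈ -2 + 0.5*x1 plus small noise.
rng = np.random.default_rng(4)
m = 100
x1 = rng.uniform(0, 10, size=m)
y = -2.0 + 0.5 * x1 + rng.normal(scale=0.1, size=m)
X = np.column_stack([np.ones(m), x1])

# Normal equation: solve (X^T X) theta = X^T y
theta = np.linalg.solve(X.T @ X, X.T @ y)
print(theta)   # close to [-2, 0.5]
```

One linear solve replaces the whole iterative descent loop, at the cost of a d×d system that must be non-singular.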
Probabilistic Interpretation
In words, we assume that the data is drawn from a "line" wᵀx through the origin (one can always add a bias/offset through an additional dimension). For each data point with features x^(i), the label y is drawn from a Gaussian with mean wᵀx^(i) and variance σ². Our task is to estimate the slope w from the data.
Estimating with MLE
Estimating with MAP
w = argmax_w [ ∏_{i=1}^m P(y^(i) | x^(i), w) P(x^(i)) ] P(w)
  = argmax_w [ ∏_{i=1}^m P(y^(i) | x^(i), w) ] P(w)
  = argmax_w Σ_{i=1}^m log P(y^(i) | x^(i), w) + log P(w)
  = argmin_w (1/(2σ²)) Σ_{i=1}^m (wᵀx^(i) − y^(i))² + (1/(2τ²)) wᵀw
  = argmin_w (1/m) Σ_{i=1}^m (wᵀx^(i) − y^(i))² + λ‖w‖₂²,   where λ = σ²/(mτ²)
w = argmin_w (1/m) Σ_{i=1}^m (wᵀx^(i) − y^(i))² + λ‖w‖₂²,   where λ = σ²/(mτ²)
Ordinary Least Squares:
• min_w (1/m) Σ_{i=1}^m (xiᵀw − yi)²
• Squared loss.
• No regularization.
• Closed form: w = (XᵀX)⁻¹Xᵀy⃗

Ridge Regression:
• min_w (1/m) Σ_{i=1}^m (xiᵀw − yi)² + λ‖w‖₂²
• Squared loss.
• l2-regularization.
• Closed form: w = (XᵀX + λI)⁻¹Xᵀy⃗
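The two closed forms side by side, on synthetic data (w_true, λ and the dimensions are arbitrary illustration values):

```python
import numpy as np

# Synthetic data: y = X w_true + noise (all values made up).
rng = np.random.default_rng(5)
m, d = 60, 4
X = rng.normal(size=(m, d))
w_true = np.array([1.0, -2.0, 0.5, 3.0])
y = X @ w_true + rng.normal(scale=0.1, size=m)

# OLS closed form: w = (X'X)^{-1} X'y
w_ols = np.linalg.solve(X.T @ X, X.T @ y)

# Ridge closed form: w = (X'X + lam*I)^{-1} X'y
lam = 1.0
w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

print(w_ols)
print(w_ridge)   # shrunk toward zero relative to w_ols
```

Adding λI both regularizes the weights and makes the system solvable even when XᵀX is singular, which is one practical reason to prefer ridge.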
Locally weighted linear regression
Consider the problem of predicting y from x ∈ R. The leftmost figure below shows the result of fitting a line y = θ0 + θ1 x to a dataset. We see that the data doesn't really lie on a straight line, and so the fit is not very good. This is Underfitting: the structure of the data is not captured by the model.
Instead, if we had added an extra feature x², and fit y = θ0 + θ1 x + θ2 x², then we obtain a slightly better fit to the data. Naively, it might seem that the more features we add, the better.
However, there is a danger of adding too many features. The figure below is the result of fitting a 5th-order polynomial y = Σ_{j=0}^5 θj x^j. Even though the fitted curve passes through the data perfectly, it is not a good predictor of y (housing prices) for different x (living area). This is Overfitting.
In the original linear regression algorithm, to make a prediction at a query point x (i.e., to evaluate h(x)), we would:
1. Fit θ to minimize Σ_i (y^(i) − θᵀx^(i))².
2. Output θᵀx.
In contrast, the locally weighted linear regression algorithm does the following:
1. Fit θ to minimize Σ_i z^(i) (y^(i) − θᵀx^(i))².
2. Output θᵀx.
Here the z^(i) are non-negative weights.
If z^(i) is large for a particular value of i, then in picking θ, we will try hard to make (y^(i) − θᵀx^(i))² small.
If z^(i) is small for a particular value of i, then (y^(i) − θᵀx^(i))² is ignored in the fit.
A fairly standard choice for the weights is

z^(i) = exp( −(x^(i) − x)² / (2τ²) )

Note that the weights depend on the query point x:
If |x^(i) − x| is small, z^(i) is close to 1, and if |x^(i) − x| is large, z^(i) is small.
Hence, θ is chosen giving a much higher weight to the training examples close to the query point x.
τ is the bandwidth parameter.
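A sketch of locally weighted regression at a single query point, using the Gaussian weights above. The sin-shaped dataset and τ = 0.3 are illustrative assumptions:

```python
import numpy as np

# Clearly non-linear made-up data: y = sin(x) plus small noise.
rng = np.random.default_rng(7)
m = 200
x = np.sort(rng.uniform(-3, 3, size=m))
y = np.sin(x) + rng.normal(scale=0.05, size=m)
X = np.column_stack([np.ones(m), x])

def lwr_predict(x_query, tau=0.3):
    # z_i = exp(-(x_i - x)^2 / (2 tau^2)): weights centered on the query
    z = np.exp(-(x - x_query) ** 2 / (2 * tau ** 2))
    W = np.diag(z)
    # Weighted least squares: theta = (X'WX)^{-1} X'Wy, refit per query
    theta = np.linalg.solve(X.T @ W @ X, X.T @ W @ y)
    return np.array([1.0, x_query]) @ theta

print(lwr_predict(0.0))   # tracks sin(x) near the query point
```

Because θ is refit for every query point, LWR keeps the whole training set around at prediction time, which is exactly what makes it a non-parametric method.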
Parametric vs non-Parametric
Logistic Regression
Classification problem
Tapas Kumar Mishra Logistic Regression
Logistic regression

hθ(x) = g(θᵀx) = 1 / (1 + e^(−θᵀx))

where g(a) = 1/(1 + e^(−a)) is the logistic/sigmoid function.
Note that g(z) → 1 as z → ∞ and g(z) → 0 as z → −∞. Moreover, g(z) (and hence h(x)) is always bounded between 0 and 1. We always set x0 = 1, so that θᵀx = θ0 + Σ_{j=1}^n θj xj.
Useful property: g′(z) = g(z)(1 − g(z)).
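The derivative identity g′(z) = g(z)(1 − g(z)) can be verified numerically; a small sketch, not part of the slides:

```python
import numpy as np

def g(z):
    # logistic/sigmoid function
    return 1.0 / (1.0 + np.exp(-z))

z = np.linspace(-6, 6, 101)
eps = 1e-6
# Central finite difference approximation of g'(z)
numeric_grad = (g(z + eps) - g(z - eps)) / (2 * eps)
print(np.max(np.abs(numeric_grad - g(z) * (1 - g(z)))))   # tiny

print(g(np.array([-100.0, 0.0, 100.0])))   # close to [0, 0.5, 1]
```

This identity is what makes the gradient of the log-likelihood come out so cleanly in the derivation that follows.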
Let us assume that
P(y = 1|x; θ) = hθ(x);
P(y = 0|x; θ) = 1 − hθ(x).
This can be written compactly as

P(y|x; θ) = (hθ(x))^y (1 − hθ(x))^(1−y)
Our job is to maximize L(θ).
To maximize the likelihood, we use gradient ascent:

θ := θ + α ∇θ ℓ(θ).

We start by taking just one training example (x, y) and take derivatives to derive the stochastic gradient ascent rule.
This gives the stochastic ascent rule

θj := θj + α (y^(i) − hθ(x^(i))) xj^(i).

This looks just like the LMS update rule, even though here hθ(x^(i)) is the non-linear sigmoid of θᵀx^(i)!
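A minimal sketch of logistic regression trained with this ascent rule, applied here in batch form over a made-up dataset; all data and hyperparameters are illustrative:

```python
import numpy as np

# Made-up 1-D problem: true P(y=1|x) = sigmoid(2x - 1), i.e. theta = [-1, 2]
# once the x0 = 1 intercept column is included.
rng = np.random.default_rng(6)
m = 300
x1 = rng.normal(size=m)
p = 1.0 / (1.0 + np.exp(-(2.0 * x1 - 1.0)))
y = (rng.uniform(size=m) < p).astype(float)
X = np.column_stack([np.ones(m), x1])

theta = np.zeros(2)
alpha = 0.1
for _ in range(2000):
    h = 1.0 / (1.0 + np.exp(-(X @ theta)))   # h_theta(x) for every example
    theta += alpha * X.T @ (y - h) / m       # batch form of the ascent rule
print(theta)   # roughly recovers [-1, 2], up to sampling noise
```

The update is ascent (a plus sign) because we are maximizing the log-likelihood rather than minimizing a cost.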
Logistic Regression is the discriminative counterpart to Naive Bayes.
In Naive Bayes, we first model P(x|y) for each label y, and then obtain the decision boundary that best discriminates between these two distributions.
In Logistic Regression we do not attempt to model the data distribution P(x|y); instead, we model P(y|x) directly.
The fact that we don't make any assumption about P(x|y) allows logistic regression to be more flexible, but such flexibility also requires more data to avoid overfitting.
Typically, in scenarios with little data and if the modeling assumption is appropriate, Naive Bayes tends to outperform Logistic Regression. However, as data sets become large, logistic regression often outperforms Naive Bayes, which suffers from the fact that the assumptions made on P(x|y) are probably not exactly correct. If the assumptions hold exactly, i.e. the data is truly drawn from the distribution that we assumed in Naive Bayes, then Logistic Regression and Naive Bayes converge to the exact same result in the limit.
Optimizing the training process: Underfitting, overfitting, testing, and regularization
• Let’s say that we have to study for a test.
• Several things could go wrong during our study process.
• Maybe we didn’t study enough. There’s no way to fix that, and we’ll likely
perform poorly on our test. ---------- Underfitting
• What if we studied a lot but in the wrong way? For example, instead of focusing
on learning, we decided to memorize the entire textbook word for word. Will
we do well on our test? It’s likely that we won’t, because we simply memorized
everything without learning. ---------- Overfitting
• The best option, of course, would be to study for the exam properly and in a
way that enables us to answer new questions that we haven’t seen before on
the topic. -----------Generalization
• Notice that model 1 is too simple, because it is a line trying to fit a quadratic dataset. There is no way
we’ll find a good line to fit this dataset, because the dataset simply does not look like a line.
Therefore, model 1 is a clear example of underfitting.
• Model 2, in contrast, fits the data pretty well. This model neither overfits nor underfits.
• Model 3 fits the data extremely well, but it completely misses the point. The data is meant to look
like a parabola with a bit of noise, and the model draws a very complicated polynomial of degree 10
that manages to go through each one of the points but doesn’t capture the essence of the data.
Model 3 is a clear example of overfitting.
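The three regimes can be reproduced in a few lines. A sketch (the data and noise level are made up): fit degree-1, degree-2, and degree-10 polynomials to a noisy parabola and compare training errors; the degree-10 fit chases the noise.

```python
import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(-1, 1, 12)
y = x**2 + rng.normal(0, 0.05, x.size)      # quadratic data with a bit of noise

def train_error(degree):
    coeffs = np.polyfit(x, y, degree)       # least-squares polynomial fit
    return np.mean((np.polyval(coeffs, x) - y) ** 2)

errors = {d: train_error(d) for d in (1, 2, 10)}
# Degree 1 underfits (large error); degree 10 drives training error toward 0,
# but a tiny training error does not mean it generalizes better.
```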
How do we get the computer to pick the right
model?
By testing
• Testing a model consists of picking a small set of the points in the dataset and choosing to use
them not for training the model but for testing the model’s performance. This set of points is
called the testing set.
• The remaining set of points (the majority), which we use for training the
model, is called the training set.
• Once we’ve trained the model on the training set, we use the
testing set to evaluate the model.
• In this way, we make sure that the model is good at generalizing
to unseen data, as opposed to memorizing the training set.
• Going back to the exam analogy, let’s imagine training and testing this way.
• Let’s say that the book we are studying for in the exam has
100 questions at the end.
• We pick 80 of them to train, which means we study them carefully, look
up the answers, and learn them.
• Then we use the remaining 20 questions to test ourselves—we
try to answer them without looking at the book, as in an exam setting.
• Looking at the top row we can see that model 1 has a large training
error, model 2 has a small training error, and model 3 has a tiny
training error (zero, in fact). Thus, model 3 does the best job on the
training set.
• Model 1 still has a large testing error, meaning that this is simply a bad
model, underperforming on both the training and the testing set: it
underfits.
Can we use our testing data for training the model? No.
• We broke the golden rule in the previous example.
• Recall that we had three polynomial regression models: one of degree
1, one of degree 2, and one of degree 10, and we didn’t know which one
to pick.
• We used our training data to train the three models, and then we used
the testing data to decide which model to pick.
• We are not supposed to use the testing data to train our model or to
make any decisions on the model or its hyperparameters.
Solution: Validation Set
We break our dataset into the following three sets:
• Training set: for training all our models
• Validation set: for making decisions on which model to use
• Testing set: for checking how well our model did
• Imagine that we have a different and much more complex dataset, and we are trying to build a
polynomial regression model to fit it. We want to decide the degree of our model among the
numbers between 0 and 10 (inclusive).
• the way to decide which model to use is to pick the one that has the smallest validation error.
• However, plotting the training and validation errors can give us some valuable information and
help us examine trends.
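The degree-selection procedure above can be sketched in plain NumPy (in practice one would use scikit-learn's train_test_split; the sizes and seed here are made up): train every candidate degree on the training set, pick the one with the smallest validation error, and only then score on the test set.

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.uniform(-1, 1, 200)
y = x**2 + rng.normal(0, 0.1, x.size)

# 60% / 20% / 20% split: train / validation / test.
idx = rng.permutation(x.size)
tr, va, te = idx[:120], idx[120:160], idx[160:]

def mse(deg, fit_idx, eval_idx):
    c = np.polyfit(x[fit_idx], y[fit_idx], deg)   # train only on fit_idx
    return np.mean((np.polyval(c, x[eval_idx]) - y[eval_idx]) ** 2)

val_errors = {d: mse(d, tr, va) for d in range(0, 11)}
best_degree = min(val_errors, key=val_errors.get)  # decision made on validation set
test_error = mse(best_degree, tr, te)              # reported once, at the very end
```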
The model complexity graph
Another alternative to avoiding overfitting: Regularization
Now it is clear that roofer 2 is the best one, which means that optimizing performance and
complexity at the same time yields good results that are also as simple as possible. This is what
regularization is about: measuring performance and complexity with two different error functions,
and adding them to get a more robust error function.
Regularization- Measuring how complex a model is: L1 and L2 norm
• In the roofer analogy, our goal was to find a roofer that provided both good quality
and low complexity. We did this by minimizing the sum of two numbers: the measure of quality
and the measure of complexity. Regularization consists of applying the same principle to our
machine learning model.
• Regression error: a measure of the quality of the model. In this case, it can be the absolute
or square errors.
• Regularization term: a measure of the complexity of the model. It can be the L1 or the L2
norm of the model.
• Error = Regression error + λ · Regularization term
• λ is the regularization hyperparameter.
• Lasso regression error = Regression error + λ · L1 norm
• Ridge regression error = Regression error + λ · L2 norm
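Ridge regression has a closed-form solution (a standard result, though not derived on the slides): minimizing ||Xw − y||² + λ||w||² gives w = (XᵀX + λI)⁻¹Xᵀy. A sketch with made-up data, showing that a larger λ shrinks the weights:

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(50, 5))
w_true = np.array([3.0, -2.0, 0.0, 0.0, 1.0])
y = X @ w_true + rng.normal(0, 0.1, 50)

def ridge(X, y, lam):
    # Closed-form minimizer of ||Xw - y||^2 + lam * ||w||^2.
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

w_small = ridge(X, y, 0.01)     # nearly ordinary least squares
w_large = ridge(X, y, 1000.0)   # heavy L2 penalty shrinks weights toward zero
```

Lasso (the L1 penalty) has no closed form, which is why it is usually solved iteratively, but the effect is analogous: it pushes small weights exactly to zero.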
Regularization- Effects of L1 and L2 regularization
2/22
Tapas Kumar Mishra Bias-Variance Tradeoff
As usual, we are given a dataset D = {(x1, y1), ..., (xn, yn)},
drawn i.i.d. from some distribution P(X, Y). Throughout this
lecture we assume a regression setting, i.e. y ∈ R.
In this lecture we will decompose the generalization error of a
classifier into three rather interpretable terms.
Before we do that, let us consider that for any given input x
there might not exist a unique label y.
For example, if your vector x describes features of a house (e.g.
#bedrooms, square footage, ...) and the label y its price, you
could imagine two houses with identical descriptions selling for
different prices.
So for any given feature vector x, there is a distribution over
possible labels. We therefore define the following, which will
come in useful later on:
Expected Label (given x ∈ R^d):
\[ \bar{y}(x) = E_{y|x}[Y] = \int_y y \, \Pr(y \mid x) \, \partial y. \]
The expected label denotes the label you would expect to obtain,
given a feature vector x.
We draw our training set D, consisting of n inputs, i.i.d. from the
distribution P. As a second step we typically call some machine
learning algorithm A on this data set to learn a hypothesis (aka
classifier). Formally, we denote this process as h_D = A(D).
For a given hD , learned on data set D with algorithm A, we can
compute the generalization error (as measured in squared loss) as
follows:
Expected Test Error (given h_D):
\[ E_{(x,y)\sim P}\big[(h_D(x) - y)^2\big] = \int_x \int_y (h_D(x) - y)^2 \, \Pr(x,y) \, \partial y \, \partial x. \]
The previous statement is true for a given training set D.
However, remember that D itself is drawn from P n , and is
therefore a random variable.
Further, hD is a function of D, and is therefore also a random
variable. And we can of course compute its expectation:
Expected Classifier (given A):
\[ \bar{h} = E_{D\sim P^n}[h_D] = \int_D h_D \, \Pr(D) \, \partial D \]
We can also use the fact that hD is a random variable to compute
the expected test error only given A, taking the expectation also
over D.
Expected Test Error (given A):
\[ E_{\substack{(x,y)\sim P \\ D\sim P^n}}\big[(h_D(x) - y)^2\big] = \int_D \int_x \int_y (h_D(x) - y)^2 \, P(x,y) \, P(D) \, \partial x \, \partial y \, \partial D \]
To be clear, D is our training points and the (x, y ) pairs are the
test points.
We are interested in exactly this expression, because it evaluates
the quality of a machine learning algorithm A with respect to a
data distribution P(X , Y ). In the following we will show that this
expression decomposes into three meaningful terms.
Decomposition of Expected Test Error
\begin{align}
E_{x,y,D}\big[(h_D(x) - y)^2\big]
&= E_{x,y,D}\Big[\big[(h_D(x) - \bar{h}(x)) + (\bar{h}(x) - y)\big]^2\Big] \nonumber\\
&= E_{x,D}\big[(h_D(x) - \bar{h}(x))^2\big]
 + 2\,\underbrace{E_{x,y,D}\big[(h_D(x) - \bar{h}(x))(\bar{h}(x) - y)\big]}_{0}
 + E_{x,y}\big[(\bar{h}(x) - y)^2\big] \tag{1}
\end{align}
The middle term of the above equation is 0, as we show below:
\begin{align*}
E_{x,y,D}\big[(h_D(x) - \bar{h}(x))(\bar{h}(x) - y)\big]
&= E_{x,y}\Big[E_D\big[h_D(x) - \bar{h}(x)\big]\,(\bar{h}(x) - y)\Big]\\
&= E_{x,y}\Big[\big(E_D[h_D(x)] - \bar{h}(x)\big)\,(\bar{h}(x) - y)\Big]\\
&= E_{x,y}\big[(\bar{h}(x) - \bar{h}(x))\,(\bar{h}(x) - y)\big]\\
&= E_{x,y}[0] = 0
\end{align*}
Returning to the earlier expression: with the middle term equal to zero, we are left with the variance and another term:
\[ E_{x,y,D}\big[(h_D(x) - y)^2\big] = \underbrace{E_{x,D}\big[(h_D(x) - \bar{h}(x))^2\big]}_{\text{Variance}} + E_{x,y}\big[(\bar{h}(x) - y)^2\big] \tag{3} \]
We can break down the second term in the above equation as follows:
\begin{align}
E_{x,y}\big[(\bar{h}(x) - y)^2\big]
&= E_{x,y}\Big[\big[(\bar{h}(x) - \bar{y}(x)) + (\bar{y}(x) - y)\big]^2\Big] \nonumber\\
&= \underbrace{E_{x,y}\big[(\bar{y}(x) - y)^2\big]}_{\text{Noise}}
 + \underbrace{E_x\big[(\bar{h}(x) - \bar{y}(x))^2\big]}_{\text{Bias}^2}
 + 2\,\underbrace{E_{x,y}\big[(\bar{h}(x) - \bar{y}(x))(\bar{y}(x) - y)\big]}_{0} \tag{4}
\end{align}
The third term in the equation above is 0, as we show below:
\begin{align*}
E_{x,y}\big[(\bar{h}(x) - \bar{y}(x))(\bar{y}(x) - y)\big]
&= E_x\Big[E_{y|x}\big[\bar{y}(x) - y\big]\,\big(\bar{h}(x) - \bar{y}(x)\big)\Big]\\
&= E_x\Big[\big(\bar{y}(x) - E_{y|x}[y]\big)\,\big(\bar{h}(x) - \bar{y}(x)\big)\Big]\\
&= E_x\big[(\bar{y}(x) - \bar{y}(x))\,\big(\bar{h}(x) - \bar{y}(x)\big)\big]\\
&= E_x[0] = 0
\end{align*}
This gives us the decomposition of expected test error as follows:
\[ \underbrace{E_{x,y,D}\big[(h_D(x) - y)^2\big]}_{\text{Expected Test Error}} = \underbrace{E_{x,D}\big[(h_D(x) - \bar{h}(x))^2\big]}_{\text{Variance}} + \underbrace{E_x\big[(\bar{h}(x) - \bar{y}(x))^2\big]}_{\text{Bias}^2} + \underbrace{E_{x,y}\big[(\bar{y}(x) - y)^2\big]}_{\text{Noise}} \]
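The decomposition can be checked numerically. A sketch (the setup is made up for illustration): the learner always predicts the mean of its training labels, a deliberately biased model, and each of the four terms is estimated by Monte Carlo over fresh training sets and test points.

```python
import numpy as np

rng = np.random.default_rng(0)
n_train, sigma, trials = 10, 0.1, 20000

# True regression function: ybar(x) = x; labels carry N(0, sigma^2) noise.
# Learner A: predict the mean of the training labels (deliberately biased).
preds, xs, ys = [], [], []
for _ in range(trials):
    x_tr = rng.uniform(0, 1, n_train)
    y_tr = x_tr + rng.normal(0, sigma, n_train)   # a fresh training set D
    x = rng.uniform()                             # one test point per trial
    y = x + rng.normal(0, sigma)                  # its noisy label
    preds.append(y_tr.mean())
    xs.append(x)
    ys.append(y)

preds, xs, ys = map(np.array, (preds, xs, ys))
h_bar = preds.mean()       # expected classifier (a constant for this learner)
y_bar = xs                 # expected label ybar(x) = x by construction

error    = np.mean((preds - ys) ** 2)     # expected test error
variance = np.mean((preds - h_bar) ** 2)
bias2    = np.mean((h_bar - y_bar) ** 2)
noise    = np.mean((ys - y_bar) ** 2)
# error ~ variance + bias2 + noise (the cross terms vanish in expectation)
```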
Variance:
\[ \underbrace{E_{x,D}\big[(h_D(x) - \bar{h}(x))^2\big]}_{\text{Variance}} \]
Captures how much your classifier changes if you train on a
different training set.
How ”over-specialized” is your classifier to a particular training set
(overfitting)?
If we have the best possible model for our training data, how far
off are we from the average classifier?
Bias:
\[ \underbrace{E_x\big[(\bar{h}(x) - \bar{y}(x))^2\big]}_{\text{Bias}^2} \]
What is the inherent error that you obtain from your classifier even
with infinite training data?
This is due to your classifier being ”biased” to a particular kind of
solution (e.g. linear classifier).
In other words, bias is inherent to your model.
Noise:
\[ \underbrace{E_{x,y}\big[(\bar{y}(x) - y)^2\big]}_{\text{Noise}} \]
How big is the data-intrinsic noise?
This error measures ambiguity due to your data distribution and
feature representation. You can never beat this; it is an aspect of
the data.
Figure : Graphical illustration of bias and variance.
Figure : The variation of Bias and Variance with the model complexity.
This is similar to the concept of overfitting and underfitting. More
complex models overfit while the simplest models underfit.
Detecting High Bias and High Variance
The graph above plots the training error and the test error and can
be divided into two overarching regimes. In the first regime (on the
left side of the graph), training error is below the desired error
threshold (denoted by ε), but test error is significantly higher.
Figure : Test and training error as the number of training instances
increases.
In the second regime (on the right side of the graph), test error is
remarkably close to training error, but both are above the desired
tolerance of ε.
Regime 1 (High Variance)
Symptoms:
Training error is much lower than test error
Training error is lower than ε
Test error is above ε
Remedies:
Add more training data
Reduce model complexity – complex models are prone to high
variance
Bagging (will be covered later in the course)
Regime 2 (High Bias): the model being used is not robust enough
to produce an accurate prediction
Symptoms:
Training error is higher than ε, but close to test error.
Remedies:
Use more complex model (e.g. kernelize, use non-linear
models)
Add features
Boosting (will be covered later in the course)
Model Selection
Performance estimation techniques
Always evaluate models as if they are predicting future data
We do not have access to future data, so we pretend that some data is hidden
Simplest way: the holdout (simple train-test split)
Randomly split data (and labels) into training, validation and test set (e.g. 60%-20%-20%)
Train (fit) a model on the training data, minimize error on the validation set and score on the test data
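The holdout split can be sketched in plain NumPy (with scikit-learn you would normally call train_test_split twice; the sizes here are illustrative):

```python
import numpy as np

rng = np.random.default_rng(4)
n = 1000
X, y = rng.normal(size=(n, 3)), rng.integers(0, 2, n)

idx = rng.permutation(n)                       # shuffle before splitting
train_idx = idx[:int(0.6 * n)]                 # 60% train
val_idx   = idx[int(0.6 * n):int(0.8 * n)]     # 20% validation
test_idx  = idx[int(0.8 * n):]                 # 20% test
# The three index sets are disjoint and together cover the whole dataset.
```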
K-fold Cross-validation
Each random split can yield very different models (and scores)
e.g. all easy (or hard) examples could end up in the test set
Split data into k equal-sized parts, called folds
Create k splits, each time using a different fold as the test set
Compute k evaluation scores, aggregate afterwards (e.g. take the mean)
Large k gives better estimates (more training data), but is expensive
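The fold construction can be sketched as follows (scikit-learn's KFold and cross_val_score do this for you; the fold count and the placeholder score are made up):

```python
import numpy as np

def kfold_indices(n, k, seed=0):
    # Shuffle once, then cut the indices into k (nearly) equal folds.
    idx = np.random.default_rng(seed).permutation(n)
    folds = np.array_split(idx, k)
    # Each split uses one fold as the test set and the rest as the training set.
    return [(np.concatenate(folds[:i] + folds[i + 1:]), folds[i])
            for i in range(k)]

splits = kfold_indices(100, 5)
scores = [len(te) for _, te in splits]   # placeholder for one evaluation score per fold
mean_score = np.mean(scores)             # aggregate the k scores afterwards
```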
Stratified K-Fold cross-validation
The bootstrap
Sample n (dataset size) data points, with replacement, as the training set (the bootstrap)
On average, bootstraps include 66% of all data points (some are duplicates)
Use the unsampled (out-of-bootstrap) samples as the test set
Repeat k times to obtain k scores
Repeated cross-validation
Cross-validation is still biased in that the initial split can be made in many ways
Repeated, or n-times-k-fold cross-validation:
Shuffle data randomly, do k-fold cross-validation
Repeat n times, yields n times k scores
Unbiased, very robust, but n times more expensive
Cross-validation with groups
Every new sample is evaluated only once, then added to the training set
Can also be done in batches (of n samples at a time)
TimeSeriesSplit
In the kth split, the first k folds are the train set and the (k+1)th fold is the validation set
Often, a maximum training set size (or window) is used
more robust against concept drift (change in data over time)
Choosing a performance estimation procedure
No strict rules, only guidelines:
Always use stratification for classification (sklearn does this by default)
Use holdout for very large datasets (e.g. >1,000,000 examples)
Or when learners don't always converge (e.g. deep learning)
Choose k depending on dataset size and resources
Use leave-one-out for very small datasets (e.g. <100 examples)
Use cross-validation otherwise
Most popular (and theoretically sound): 10-fold CV
Literature suggests 5x2-fold CV is better
Use grouping or leave-one-subject-out for grouped data
Use train-then-test for time series
Binary classification
https://en.wikipedia.org/wiki/Precision_and_recall
Multi-class classification
Train one model per class: one class viewed as positive, the other(s) as negative, then average
micro-averaging: count total TP, FP, TN, FN (every sample equally important)
micro-precision, micro-recall, micro-F1, and accuracy are all the same
\[ \text{micro-Precision} = \frac{\sum_{c=1}^{C} TP_c}{\sum_{c=1}^{C} TP_c + \sum_{c=1}^{C} FP_c} \;\xrightarrow{\;C=2\;}\; \frac{TP + TN}{TP + TN + FP + FN} \]
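Micro-averaging can be sketched directly from the counts (the toy labels below are made up; scikit-learn's precision_score with average='micro' computes the same quantity):

```python
import numpy as np

y_true = np.array([0, 0, 1, 1, 2])
y_pred = np.array([0, 1, 1, 1, 2])

classes = np.unique(np.concatenate([y_true, y_pred]))
TP = sum(np.sum((y_pred == c) & (y_true == c)) for c in classes)
FP = sum(np.sum((y_pred == c) & (y_true != c)) for c in classes)

micro_precision = TP / (TP + FP)
accuracy = np.mean(y_true == y_pred)
# With one prediction per sample, every wrong prediction is simultaneously a FP
# (for the predicted class) and a FN (for the true class), so
# micro-precision == micro-recall == accuracy.
```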
Other useful classification metrics
Cohen's Kappa
Measures 'agreement' between different models (aka inter-rater agreement)
To evaluate a single model, compare it against a model that does random guessing
Similar to accuracy, but taking into account the possibility of predicting the
right class by chance
Can be weighted: different misclassifications given different weights
1: perfect prediction, 0: random prediction, negative: worse than random
With p₀ = accuracy of the model, and pₑ = accuracy of the random classifier:
\[ \kappa = \frac{p_0 - p_e}{1 - p_e} \]
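Kappa can be computed straight from the definition (the toy labels are made up; scikit-learn's cohen_kappa_score gives the same result):

```python
import numpy as np

y_true = np.array([0, 0, 1, 1])
y_pred = np.array([0, 1, 1, 1])

p0 = np.mean(y_true == y_pred)    # observed accuracy
# Expected accuracy of a random classifier with the same class marginals:
classes = np.unique(np.concatenate([y_true, y_pred]))
pe = sum(np.mean(y_true == c) * np.mean(y_pred == c) for c in classes)
kappa = (p0 - pe) / (1 - pe)      # 1: perfect, 0: random, <0: worse than random
```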
The best trade-off between precision and recall depends on your application
You can have arbitrarily high recall, but you often want reasonable precision, too.
Plotting precision against recall for all possible thresholds yields a precision-recall curve
Change the threshold until you find a sweet spot in the precision-recall trade-off
Often jagged at high thresholds, when there are few positive examples left
Model selection
Plotting TPR against FPR for all possible thresholds yields a Receiver Operating
Characteristic (ROC) curve
Change the threshold until you find a sweet spot in the TPR-FPR trade-off
Lower thresholds yield higher TPR (recall), higher FPR, and vice versa
Visualization
Histograms show the number of points with a certain decision value (for each class)
TPR = TP / (TP + FN): can be seen from the positive predictions (top histogram)
FPR = FP / (FP + TN): can be seen from the negative predictions (bottom histogram)
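Sweeping the threshold and computing (TPR, FPR) at each point can be sketched as follows (the scores and thresholds are made up; scikit-learn's roc_curve automates this):

```python
import numpy as np

y_true = np.array([0, 0, 1, 1])
scores = np.array([0.1, 0.4, 0.35, 0.8])           # decision values from some model

def roc_point(threshold):
    y_pred = (scores >= threshold).astype(int)
    TP = np.sum((y_pred == 1) & (y_true == 1))
    FP = np.sum((y_pred == 1) & (y_true == 0))
    FN = np.sum((y_pred == 0) & (y_true == 1))
    TN = np.sum((y_pred == 0) & (y_true == 0))
    return TP / (TP + FN), FP / (FP + TN)          # (TPR, FPR)

curve = [roc_point(t) for t in (0.0, 0.36, 0.9)]
# Lower thresholds give higher TPR but also higher FPR.
```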
Model selection
R² (coefficient of determination): between 0 and 1, but negative if the model is worse than just predicting the mean
Easier to interpret (higher is better).
Decision tree learning
Inductive inference with decision trees
▪ Inductive reasoning is a method of reasoning in which
a body of observations is considered in order to derive a
general principle.
▪ Decision trees are one of the most widely used and
practical methods of inductive inference
▪ Features
▪ Method for approximating discrete-valued functions
(including boolean)
▪ Learned functions are represented as decision trees (or if-then-else rules)
▪ Expressive hypothesis space, including disjunction
Decision tree representation (PlayTennis)
When to use Decision Trees
▪ Problem characteristics:
▪ Instances can be described by attribute value pairs
▪ Target function is discrete valued
▪ Disjunctive hypothesis may be required
▪ Possibly noisy training data samples
▪ Robust to errors in training data
▪ Missing attribute values
▪ Different classification problems:
▪ Equipment or medical diagnosis
▪ Credit risk analysis
▪ Several tasks in natural language processing
Top-down induction of Decision Trees
▪ ID3 (Quinlan, 1986) is a basic algorithm for learning DT's
▪ Given a training set of examples, the algorithms for building DT
performs search in the space of decision trees
▪ The construction of the tree is top-down. The algorithm is greedy.
▪ The fundamental question is “which attribute should be tested next?
Which question gives us more information?”
▪ Select the best attribute
▪ A descendent node is then created for each possible value of this
attribute and examples are partitioned according to this value
▪ The process is repeated for each successor node until all the
examples are classified correctly or there are no attributes left
Which attribute is the best classifier?
ID3: algorithm
ID3(X, T, Attrs)   X: training examples,
                   T: target attribute (e.g. PlayTennis),
                   Attrs: other attributes, initially all attributes
Create Root node
If all X's are +, return Root with class +
If all X's are –, return Root with class –
If Attrs is empty, return Root with class = most common value of T in X
else
  A ← the best attribute; decision attribute for Root ← A
  For each possible value vi of A:
    - add a new branch below Root, for the test A = vi
    - Xi ← subset of X with A = vi
    - If Xi is empty then add a new leaf with class = the most common value of T in X
      else add the subtree generated by ID3(Xi, T, Attrs − {A})
return Root
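The "select the best attribute" step is scored by information gain, Gain(S, A) = H(S) − Σ_v (|S_v|/|S|) H(S_v). A minimal sketch, on a made-up toy dataset where one attribute is perfectly informative and the other is not:

```python
import math
from collections import Counter

def entropy(labels):
    """H(S) = -sum p_i log2 p_i over the class proportions in S."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(examples, labels, attr):
    """Gain(S, A) = H(S) - sum_v |S_v|/|S| * H(S_v)."""
    n = len(labels)
    remainder = 0.0
    for v in set(ex[attr] for ex in examples):
        subset = [lab for ex, lab in zip(examples, labels) if ex[attr] == v]
        remainder += len(subset) / n * entropy(subset)
    return entropy(labels) - remainder

# Toy data: Wind perfectly predicts the label, Humidity does not.
X = [{"Wind": "Weak",   "Humidity": "High"},
     {"Wind": "Weak",   "Humidity": "Normal"},
     {"Wind": "Strong", "Humidity": "High"},
     {"Wind": "Strong", "Humidity": "Normal"}]
y = ["Yes", "Yes", "No", "No"]

print(information_gain(X, y, "Wind"))      # 1.0 (perfect split)
print(information_gain(X, y, "Humidity"))  # 0.0 (uninformative)
```

ID3 would test Wind at this node, since it has the highest gain.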
Inductive bias in decision tree learning
(Outlook=Sunny) ∧ (Humidity=High) ⇒ (PlayTennis=No)
Why converting to rules?
▪ Each distinct path produces a different rule: a condition
removal may be based on a local (contextual) criterion. Node
pruning is global and affects all the rules
▪ In rule form, tests are not ordered and there is no book-
keeping involved when conditions (nodes) are removed
▪ Converting to rules improves readability for humans
Dealing with continuous-valued attributes
▪ So far discrete values for attributes and for outcome.
▪ Given a continuous-valued attribute A, dynamically create a
new attribute Ac
Ac = True if A < c, False otherwise
▪ How to determine threshold value c ?
▪ Example. Temperature in the PlayTennis example
▪ Sort the examples according to Temperature
Temperature   40   48  |  60   72   80  |  90
PlayTennis    No   No  |  Yes  Yes  Yes |  No
▪ Determine candidate thresholds by averaging consecutive values where
there is a change in classification: (48+60)/2=54 and (80+90)/2=85
▪ Evaluate candidate thresholds (attributes) according to information gain.
The best is Temperature > 54. The new attribute competes with the other
ones
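The candidate-threshold procedure above can be sketched directly (the temperature data is from the slide; the function name is ours):

```python
def candidate_thresholds(values, labels):
    """Average consecutive attribute values where the class label changes."""
    pairs = sorted(zip(values, labels))
    return [(a + b) / 2
            for (a, la), (b, lb) in zip(pairs, pairs[1:])
            if la != lb]

temps = [40, 48, 60, 72, 80, 90]
play  = ["No", "No", "Yes", "Yes", "Yes", "No"]
print(candidate_thresholds(temps, play))  # [54.0, 85.0]
```

Each returned threshold then becomes a boolean attribute evaluated by information gain like any other.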
Problems with information gain
▪ Natural bias of information gain: it favours attributes with
many possible values.
▪ Consider the attribute Date in the PlayTennis example.
▪ Date would have the highest information gain since it perfectly
separates the training data.
▪ It would be selected at the root resulting in a very broad tree
▪ Very good on the training data, this tree would perform poorly in predicting
unseen instances: overfitting.
▪ The problem is that the partition is too specific, too many small
classes are generated.
▪ We need to look at alternative measures …
An alternative measure: gain ratio
SplitInformation(S, A) = − Σᵢ₌₁ᶜ (|Sᵢ| / |S|) log₂ (|Sᵢ| / |S|)
▪ Si are the sets obtained by partitioning on value i of A
▪ SplitInformation measures the entropy of S with respect to the values of A. The
more uniformly dispersed the data the higher it is.
GainRatio(S, A) = Gain(S, A) / SplitInformation(S, A)
▪ GainRatio penalizes attributes that split examples in many small classes such as
Date. Let |S |=n, Date splits examples in n classes
▪ SplitInformation(S, Date) = −[(1/n) log₂ (1/n) + … + (1/n) log₂ (1/n)] = −log₂ (1/n) = log₂ n
▪ Compare with A, which splits data in two even classes:
▪ SplitInformation(S, A) = −[(1/2) log₂ (1/2) + (1/2) log₂ (1/2)] = −[−1/2 − 1/2] = 1
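The two SplitInformation computations above can be checked numerically (a minimal sketch; `split_information` takes the partition sizes |Sᵢ| as input):

```python
import math

def split_information(sizes):
    """Entropy of S with respect to the partition induced by an attribute."""
    n = sum(sizes)
    return -sum((s / n) * math.log2(s / n) for s in sizes if s > 0)

n = 8
print(split_information([1] * n))       # Date-like attribute: log2(8) = 3.0
print(split_information([n // 2] * 2))  # even binary split: 1.0
```

Dividing Gain(S, A) by this quantity (GainRatio) is what penalizes Date-like attributes.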
Adjusting gain-ratio
▪ Problem: SplitInformation(S, A) can be zero or very small
when |Si | ≈ |S | for some value i
▪ To mitigate this effect, the following heuristic has been used:
1. compute Gain for each attribute
2. apply GainRatio only to attributes with Gain above average
Handling incomplete training data
▪ How to cope with the problem that the value of some attribute
may be missing?
▪ Example: Blood-Test-Result in a medical diagnosis problem
▪ The strategy: use the other examples to guess the missing attribute value
1. Assign the value that is most common among the training examples at
the node
2. Assign a probability to each value, based on frequencies, and assign
values to missing attribute, according to this probability distribution
▪ Missing values in new instances to be classified are treated
accordingly, and the most probable classification is chosen
(C4.5)
Handling attributes with different costs
▪ Instance attributes may have an associated cost: we would
prefer decision trees that use low-cost attributes
▪ ID3 can be modified to take into account costs:
1. Tan and Schlimmer (1990): Gain²(S, A) / Cost(A)
2. Nunez (1988): (2^Gain(S, A) − 1) / (Cost(A) + 1)^w, with w ∈ [0, 1]
Gini (impurity) Index
▪ The Gini index is a measure of diversity in a dataset. In other
words, if we have a set in which all the elements are similar,
this set has a low Gini index, and if all the elements are
different, it has a large Gini index.
▪ For clarity, consider the following two sets of 10 colored balls
(where any two balls of the same color are indistinguishable):
▪ • Set 1: eight red balls, two blue balls
▪ • Set 2: four red balls, three blue balls, two yellow balls, one green
ball
▪ Set 1 looks more pure than set 2, because set 1 contains
mostly red balls and a couple of blue ones, whereas set 2 has
many different colors. Next, we devise a measure of impurity
that assigns a low value to set 1 and a high value to set 2.
Gini (impurity) Index
▪ If we pick two random elements of the set, what is the
probability that they have a different color ? The two elements
don’t need to be distinct; we are allowed to pick the same
element twice.
▪ P(picking two balls of different color) = 1 – P(picking two balls
of the same color)
▪ P(picking two balls of the same color) = P(both balls are color 1)
+ P(both balls are color 2) + … + P(both balls are color n)
▪ P(both balls are color i) = pᵢ²
▪ P(picking two balls of different colors) = 1 − p₁² − p₂² − … − pₙ²
Gini (impurity) Index
▪ Gini impurity index:
In a set with m elements and n classes, with ai elements belonging
to the i-th class, the Gini impurity index is
Gini = 1 − p₁² − p₂² − … − pₙ², where pᵢ = aᵢ / m
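Applying this formula to the two ball sets from the earlier slide (a minimal sketch):

```python
def gini(counts):
    """Gini impurity: 1 - sum p_i^2, with p_i = a_i / m."""
    m = sum(counts)
    return 1.0 - sum((a / m) ** 2 for a in counts)

print(gini([8, 2]))        # Set 1: 1 - 0.64 - 0.04 = 0.32
print(gini([4, 3, 2, 1]))  # Set 2: 1 - 0.16 - 0.09 - 0.04 - 0.01 = 0.70
```

As intended, the nearly-pure set 1 gets a low index and the diverse set 2 a high one.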
k-Nearest Neighbors
● Classification rule: for a test input x, find its k nearest training samples and assign x to the majority class among their labels.
● Example: with k = 3, if the labels of the 3 neighbors are 2×(+1) and 1×(−1), the classifier outputs +1.
Example in 2D
Similarly, μ̃₂ = vᵀμ₂
Fisher Linear Discriminant
How good is μ̃₁ − μ̃₂ as a measure of separation?
The larger |μ̃₁ − μ̃₂|, the better the expected separation.
(figure: two projections of the same class means μ₁, μ₂; with small within-class variance the projected means μ̃₁, μ̃₂ separate well, with large variance the projected classes overlap)
Fisher Linear Discriminant
We need to normalize µ~1 − µ~2 by a factor which is
proportional to variance
Given samples z₁, …, zₙ, the sample mean is μ_z = (1/n) Σᵢ₌₁ⁿ zᵢ

J(v) = (μ̃₁ − μ̃₂)² / (s̃₁² + s̃₂²)
S₁ = Σ_{xᵢ ∈ Class 1} (xᵢ − μ₁)(xᵢ − μ₁)ᵀ
S₂ = Σ_{xᵢ ∈ Class 2} (xᵢ − μ₂)(xᵢ − μ₂)ᵀ
Fisher Linear Discriminant Derivation
Now define the within the class scatter matrix
SW = S 1 + S 2
s̃₁² = Σ_{yᵢ ∈ Class 1} (vᵀxᵢ − vᵀμ₁)²
    = Σ_{yᵢ ∈ Class 1} (vᵀ(xᵢ − μ₁)) (vᵀ(xᵢ − μ₁))
    = Σ_{yᵢ ∈ Class 1} ((xᵢ − μ₁)ᵀv)ᵀ ((xᵢ − μ₁)ᵀv)
    = vᵀ [ Σ_{yᵢ ∈ Class 1} (xᵢ − μ₁)(xᵢ − μ₁)ᵀ ] v = vᵀS₁v
Fisher Linear Discriminant Derivation
Similarly, s̃₂² = vᵀS₂v
Therefore s̃₁² + s̃₂² = vᵀS₁v + vᵀS₂v = vᵀS_Wv
Define the between-class scatter matrix
S_B = (μ₁ − μ₂)(μ₁ − μ₂)ᵀ
Then
(μ̃₁ − μ̃₂)² = (vᵀμ₁ − vᵀμ₂)² = vᵀ(μ₁ − μ₂)(μ₁ − μ₂)ᵀ v = vᵀS_Bv
Fisher Linear Discriminant Derivation
Thus our objective function can be written:
J(v) = (μ̃₁ − μ̃₂)² / (s̃₁² + s̃₂²) = (vᵀS_Bv) / (vᵀS_Wv)
Maximize J(v) by taking the derivative w.r.t. v and
setting it to 0
d/dv J(v) = [ (d/dv vᵀS_Bv) · vᵀS_Wv − (d/dv vᵀS_Wv) · vᵀS_Bv ] / (vᵀS_Wv)²
          = [ (2S_Bv) vᵀS_Wv − (2S_Wv) vᵀS_Bv ] / (vᵀS_Wv)² = 0
Fisher Linear Discriminant Derivation
Need to solve  vᵀS_Wv (S_Bv) − vᵀS_Bv (S_Wv) = 0
Dividing by vᵀS_Wv:
S_Bv − (vᵀS_Bv / vᵀS_Wv) S_Wv = 0
Setting λ = vᵀS_Bv / vᵀS_Wv gives the generalized eigenvalue problem
S_Bv = λ S_Wv
Fisher Linear Discriminant Example
Data
Class 1 has 5 samples c1=[(1,2),(2,3),(3,3),(4,5),(5,5)]
Class 2 has 6 samples c2=[(1,0),(2,1),(3,1),(3,2),(5,3),(6,5)]
Arrange data in 2 separate matrices
       1 2              1 0
       2 3              2 1
c1 =   3 3        c2 =  3 1
       4 5              3 2
       5 5              5 3
                        6 5
Objective function: J(V) = det(VᵀS_BV) / det(VᵀS_WV)
The within-class scatter matrix S_W is
S_W = Σᵢ₌₁ᶜ Sᵢ = Σᵢ₌₁ᶜ Σ_{x_k ∈ class i} (x_k − μᵢ)(x_k − μᵢ)ᵀ
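Putting the derivation together on this example's data: for two classes, the generalized eigenproblem S_Bv = λS_Wv is solved (up to scale) by v = S_W⁻¹(μ₁ − μ₂), because S_Bv is always parallel to μ₁ − μ₂. A numpy sketch:

```python
import numpy as np

c1 = np.array([[1, 2], [2, 3], [3, 3], [4, 5], [5, 5]], dtype=float)
c2 = np.array([[1, 0], [2, 1], [3, 1], [3, 2], [5, 3], [6, 5]], dtype=float)

mu1, mu2 = c1.mean(axis=0), c2.mean(axis=0)
S1 = (c1 - mu1).T @ (c1 - mu1)   # within-class scatter of class 1
S2 = (c2 - mu2).T @ (c2 - mu2)   # within-class scatter of class 2
SW = S1 + S2

# Two-class closed form: v proportional to SW^{-1} (mu1 - mu2)
v = np.linalg.solve(SW, mu1 - mu2)

proj1, proj2 = c1 @ v, c2 @ v
print(proj1.min() > proj2.max())  # True: the classes separate on the projection
```

On this data the projected classes do not overlap at all, which is exactly what the Fisher criterion optimizes for.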
Tapas Kumar Mishra The Perceptron
Classifier: h(xi ) = sign(w⊤ xi + b)
b is the bias term (without the bias term, the hyperplane that
w defines would always have to go through the origin).
Dealing with b can be a pain, so we 'absorb' it into the weight
vector w by adding one additional constant dimension to the feature vector.
Under this convention,
xi becomes [xi; 1],   w becomes [w; b]
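A quick numerical check of the absorption trick (the toy numbers are made up):

```python
import numpy as np

w = np.array([2.0, -1.0])
b = 0.5
x = np.array([1.5, 3.0])

# Absorb the bias: append a constant 1 to x and append b to w.
x_aug = np.append(x, 1.0)
w_aug = np.append(w, b)

print(w @ x + b)      # 0.5
print(w_aug @ x_aug)  # 0.5 (identical)
```

The augmented inner product reproduces wᵀx + b exactly, so the hyperplane through the origin in d+1 dimensions encodes the original affine hyperplane.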
Perceptron Algorithm: obtaining w
Geometric Intuition
Perceptron Convergence
Perceptron Convergence: set up
Set up
Theorem
If all of the above holds, then the Perceptron algorithm makes at
most 1/γ 2 mistakes.
Perceptron Convergence: Proof of Theorem
(w + y x)⊤ w∗ = w⊤ w∗ + y (x⊤ w∗ ) ≥ w⊤ w∗ + γ
The inequality follows from the fact that, for w∗ , the distance from
the hyperplane defined by w∗ to x must be at least γ (i.e.
y (x⊤ w∗ ) = |x⊤ w∗ | ≥ γ).
This means that for each update
w⊤ w∗ grows by at least γ.
Perceptron Convergence: Proof of Theorem
Now we know that after M updates the following two inequalities
must hold:
1 w⊤ w∗ ≥ Mγ
2 w⊤ w ≤ M.
2 w⊤ w ≤ M.
2 w⊤ w ≤ M.
2 w⊤ w ≤ M.
2 w⊤ w ≤ M.
2 w⊤ w ≤ M.
2 w⊤ w ≤ M.
McCulloch & Pitts Neuron Model (1943)
Inputs x₁, …, xₘ with weights w₁, …, wₘ are combined into a weighted sum; a threshold is applied to produce a binary output signal.
Logical AND Gate (w₁ = 1, w₂ = 1, threshold t = 1.5)

x₁  x₂  Out
1   1   1
1   0   0
0   1   0
0   0   0
Logical OR Gate (w₁ = 1, w₂ = 1, threshold t = 0.5)

x₁  x₂  Out
1   1   1
1   0   1
0   1   1
0   0   0
Logical NOT Gate (w₁ = −1, threshold t = −0.5)

x₁  Out
1   0
0   1
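The gates above can all be reproduced with a single threshold-unit helper (a minimal sketch; the weights and thresholds are the ones from the slides, and the unit fires when the weighted sum reaches the threshold):

```python
def unit(weights, t):
    """McCulloch-Pitts unit: output 1 iff the weighted input sum reaches t."""
    return lambda *xs: int(sum(w * x for w, x in zip(weights, xs)) >= t)

AND = unit([1, 1], 1.5)
OR  = unit([1, 1], 0.5)
NOT = unit([-1], -0.5)

print([AND(a, b) for a in (0, 1) for b in (0, 1)])  # [0, 0, 0, 1]
print([OR(a, b)  for a in (0, 1) for b in (0, 1)])  # [0, 1, 1, 1]
print([NOT(a) for a in (0, 1)])                     # [1, 0]
```

No single choice of weights and threshold reproduces XOR, which is the classic limitation of one such unit.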
1
1
0
0
x1
1
0
1
0
x2
0
1
1
0
Out
Logical XOR Gate: inputs x1 and x2 feed the weighted-sum unit with weights w1 = ?, w2 = ? and threshold t = ? (Take-home exercise)
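The take-home exercise can be checked numerically. A minimal sketch (the helper function, weight grid, and step-activation convention are my own choices, not from the slides): the NOT gate weights from the slide work, while an exhaustive search over a grid of weights and thresholds finds nothing that realizes XOR.

```python
# A single threshold unit computes Out = 1 if w1*x1 + w2*x2 >= t else 0.
# It can realize NOT (w1 = -1, t = -0.5), but no (w1, w2, t) realizes XOR.

def threshold_unit(weights, t, inputs):
    """Fire (output 1) iff the weighted sum reaches the threshold t."""
    return 1 if sum(w * x for w, x in zip(weights, inputs)) >= t else 0

# NOT gate from the slide: w1 = -1, t = -0.5
assert [threshold_unit([-1], -0.5, [x]) for x in (0, 1)] == [1, 0]

# Exhaustive grid search for XOR: no linear threshold unit matches it.
xor_table = {(1, 1): 0, (1, 0): 1, (0, 1): 1, (0, 0): 0}
grid = [i / 2 for i in range(-8, 9)]  # weights and threshold in [-4, 4]

solutions = [
    (w1, w2, t)
    for w1 in grid for w2 in grid for t in grid
    if all(threshold_unit([w1, w2], t, k) == v for k, v in xor_table.items())
]
print(solutions)  # -> [] : XOR is not linearly separable
```

The empty result is not an artifact of the grid: XOR is provably not computable by any single linear threshold unit, which is the point of the Minsky–Papert observation later in these notes.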
SVM Example
Dan Ventura
March 12, 2009
Abstract
We give a simple example that demonstrates a linear SVM and then extend the example to a simple non-linear case to illustrate the use of mapping functions and kernels.
1 Introduction
Many learning models make use of the idea that any learning problem can be
made easy with the right set of features. The trick, of course, is discovering that
“right set of features”, which in general is a very difficult thing to do. SVMs are
another attempt at a model that does this. The idea behind SVMs is to make use
of a (nonlinear) mapping function Φ that transforms data in input space to data
in feature space in such a way as to render a problem linearly separable. The
SVM then automatically discovers the optimal separating hyperplane (which,
when mapped back into input space via Φ−1 , can be a complex decision surface).
SVMs are rather interesting in that they enjoy both a sound theoretical basis
as well as state-of-the-art success in real-world applications.
To illustrate the basic ideas, we will begin with a linear SVM (that is, a
model that assumes the data is linearly separable). We will then expand the
example to the nonlinear case to demonstrate the role of the mapping function
Φ, and finally we will explain the idea of a kernel and how it allows SVMs to
make use of high-dimensional feature spaces while remaining tractable.
and the following negatively labeled data points in R^2 (see Figure 1):

(1, 0), (0, 1), (0, -1), (-1, 0)
Figure 1: Sample data points in R^2. Blue diamonds are positive examples and red squares are negative examples.
In what follows we will use vectors augmented with a 1 as a bias input, and for clarity we will differentiate these with an over-tilde. So, if s1 = (1, 0), then s̃1 = (1, 0, 1). Figure 3 shows the SVM architecture, and our task is to find values for the αi such that
Figure 2: The three support vectors are marked as yellow circles.
2α1 + 4α2 + 4α3 = −1
4α1 + 11α2 + 9α3 = +1
4α1 + 9α2 + 11α3 = +1
w̃ = Σ_i αi s̃i
   = -3.5 (1, 0, 1) + 0.75 (3, 1, 1) + 0.75 (3, -1, 1)
   = (1, 0, -2)

Finally, remembering that our vectors are augmented with a bias, we can equate the last entry in w̃ as the hyperplane offset b and write the separating hyperplane equation y = wx + b with w = (1, 0) and b = -2. Plotting the line gives the expected decision surface (see Figure 4).
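The three constraint equations above can be solved mechanically. A small NumPy sketch (my own verification, using the augmented support vectors stated in the text) reproduces α = (-3.5, 0.75, 0.75) and w̃ = (1, 0, -2):

```python
import numpy as np

# Verify the worked linear-SVM example: solve the 3x3 system for the
# alphas and rebuild the augmented weight vector w~ = sum_i alpha_i s~_i.
s = np.array([[1.0, 0.0, 1.0],    # negative support vector, target -1
              [3.0, 1.0, 1.0],    # positive support vector, target +1
              [3.0, -1.0, 1.0]])  # positive support vector, target +1
targets = np.array([-1.0, 1.0, 1.0])

# The constraints sum_j alpha_j (s~_j . s~_i) = y_i give the Gram system.
G = s @ s.T                       # [[2,4,4],[4,11,9],[4,9,11]]
alphas = np.linalg.solve(G, targets)
print(alphas)                     # -> [-3.5   0.75  0.75]

w_tilde = alphas @ s              # sum_i alpha_i s~_i
print(w_tilde)                    # -> [ 1.  0. -2.]  i.e. w = (1, 0), b = -2
```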
and the following negatively labeled data points in R^2 (see Figure 5):

(1, 1), (1, -1), (-1, -1), (-1, 1)
Figure 4: The discriminating hyperplane corresponding to the values α1 =
−3.5, α2 = 0.75 and α3 = 0.75.
Figure 5: Nonlinearly separable sample data points in R^2. Blue diamonds are positive examples and red squares are negative examples.
Figure 6: The data represented in feature space.
Referring back to Figure 3, we can see how Φ transforms our data before
the dot products are performed. Therefore, we can rewrite the data in feature
space as
(2, 2), (6, 2), (6, 6), (2, 6)

for the positive examples and

(1, 1), (1, -1), (-1, -1), (-1, 1)

for the negative examples (see Figure 6). Now we can once again easily identify the support vectors (see Figure 7):

s1 = (1, 1), s2 = (2, 2)
We again use vectors augmented with a 1 as a bias input and will differentiate
them as before. Now given the [augmented] support vectors, we must again find
values for the αi . This time our constraints are
Figure 7: The two support vectors (in feature space) are marked as yellow
circles.
3α1 + 5α2 = −1
5α1 + 9α2 = +1
Figure 8: The discriminating hyperplane corresponding to the values α1 = −7
and α2 = 4
w̃ = -7 (1, 1, 1) + 4 (2, 2, 1)
   = (1, 1, -3)

giving us the separating hyperplane equation y = wx + b with w = (1, 1) and b = -3. Plotting the line gives the expected decision surface (see Figure 8).
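The same mechanical check works for the 2x2 system of the non-linear example, now in feature space (again my own verification, using the augmented support vectors from the text):

```python
import numpy as np

# Verify the non-linear example: two augmented support vectors (in
# feature space) and the 2x2 Gram system from the text.
s = np.array([[1.0, 1.0, 1.0],    # negative support vector, target -1
              [2.0, 2.0, 1.0]])   # positive support vector, target +1
targets = np.array([-1.0, 1.0])

G = s @ s.T                       # [[3, 5], [5, 9]]
alphas = np.linalg.solve(G, targets)
print(alphas)                     # -> [-7.  4.]

w_tilde = alphas @ s
print(w_tilde)                    # -> [ 1.  1. -3.]  i.e. w = (1, 1), b = -3
```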
where σ(z) returns the sign of z. For example, if we wanted to classify the point x = (4, 5) using the mapping function of Eq. 1,

f((4, 5)) = σ( -7 Φ1(1, 1, 1) · Φ1(4, 5, 1) + 4 Φ1(2, 2, 1) · Φ1(4, 5, 1) )
          = σ( -7 (1, 1, 1) · (0, 1, 1) + 4 (2, 2, 1) · (0, 1, 1) )
          = σ(-2)
Figure 9: The decision surface in input space corresponding to Φ1 . Note the
singularity.
and thus we would classify x = (4, 5) as negative. Looking again at the input
space, we might be tempted to think this is not a reasonable classification; how-
ever, it is what our model says, and our model is consistent with all the training
data. As always, there are no guarantees on generalization accuracy, and if we
are not happy about our generalization, the likely culprit is our choice of Φ.
Indeed, if we map our discriminating hyperplane (which lives in feature space)
back into input space, we can see the effective decision surface of our model
(see Figure 9). Of course, we may or may not be able to improve generalization
accuracy by choosing a different Φ; however, there is another reason to revisit
our choice of mapping function.
Figure 10: The decision surface in input space corresponding to Φ2.
(2, 2, 1), (2, -2, 1), (-2, -2, 1), (-2, 2, 1)

for the negative examples. With a little thought, we realize that in this case, all 8 of the examples will be support vectors, with αi = 1/46 for the positive support vectors and αi = -7/46 for the negative ones. Note that a consequence of this mapping is that we do not need to use augmented vectors (though it wouldn't hurt to do so) because the hyperplane in feature space goes through the origin, y = wx + b, where w = (0, 0, 1) and b = 0. Therefore, the discriminating feature is x3, and Eq. 2 reduces to f(x) = σ(x3).
Figure 10 shows the decision surface induced in the input space for this new
mapping function.
Kernel trick.
5 Conclusion
What kernel to use? Slack variables. Theory. Generalization. Dual problem.
QP.
Support vector Machines
Tapas Kumar Mishra

Intro
Figure 1: (Left:) Two different separating hyperplanes for the
same data set. (Right:) The maximum margin hyperplane. The margin,
γ, is the distance from the hyperplane (solid line) to the closest points in
either class (which touch the parallel dotted lines).
Typically, if a data set is linearly separable, there are infinitely
many separating hyperplanes. A natural question to ask is:
Question: What is the best separating hyperplane?
SVM Answer: The one that maximizes the distance to the closest
data points from both classes. We say it is the hyperplane with
maximum margin.
Margin and Distance
Consider some point x. Let d be the vector from H to x of minimum length. Let xP be the projection of x onto H. It follows then that: xP = x - d.

d is parallel to w, so d = αw for some α ∈ R.
xP ∈ H, which implies w^T xP + b = 0.
Therefore w^T xP + b = w^T (x - d) + b = w^T (x - αw) + b = 0,
which implies α = (w^T x + b) / (w^T w).

The length of d:
||d||_2 = sqrt(d^T d) = sqrt(α^2 w^T w) = |w^T x + b| / sqrt(w^T w) = |w^T x + b| / ||w||_2

Margin of H with respect to D:

γ(w, b) = min_{x ∈ D} |w^T x + b| / ||w||_2
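The margin formula is easy to evaluate numerically. A short sketch (the hyperplane and the data points below are illustrative choices, not from the slides):

```python
import numpy as np

# gamma(w, b) = min_x |w^T x + b| / ||w||_2, evaluated on a toy data set.
def margin(w, b, X):
    """Distance from the hyperplane {x : w^T x + b = 0} to the closest row of X."""
    return np.min(np.abs(X @ w + b)) / np.linalg.norm(w)

w = np.array([1.0, 0.0])
b = -2.0
X = np.array([[1.0, 0.0], [3.0, 1.0], [3.0, -1.0], [6.0, 1.0]])
print(margin(w, b, X))  # -> 1.0 (the closest points sit at distance 1)
```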
Note that if the hyperplane is such that γ is maximized, it must lie
right in the middle of the two classes. In other words, γ must be
the distance to the closest point within both classes.
(If not, you could move the hyperplane towards data points of the
class that is further away and increase γ, which contradicts that γ
is maximized.)
Max Margin Classifier
The new optimization problem becomes:

min_{w,b} w^T w    (3)

s.t.  ∀i, y_i (w^T x_i + b) ≥ 0    (2)
      min_i |w^T x_i + b| = 1
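The normalization min_i |w^T x_i + b| = 1 costs no generality: any separating (w, b) can be divided by its smallest activation without moving the hyperplane, after which the margin is simply 1/||w||_2. A numerical sketch of that rescaling argument (the data and initial hyperplane are made up for illustration):

```python
import numpy as np

X = np.array([[1.0, 0.0], [3.0, 1.0], [3.0, -1.0], [0.0, 1.0]])
w, b = np.array([2.0, 0.0]), -4.0           # some separating hyperplane

c = np.min(np.abs(X @ w + b))               # smallest activation magnitude
w_scaled, b_scaled = w / c, b / c           # same hyperplane, rescaled

# The constraint min_i |w^T x_i + b| = 1 now holds...
assert np.isclose(np.min(np.abs(X @ w_scaled + b_scaled)), 1.0)
# ...and the margin is just 1 / ||w||_2:
print(1.0 / np.linalg.norm(w_scaled))       # -> 1.0
```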
min_{w,b} w^T w    (5)
s.t.  ∀i, y_i (w^T x_i + b) ≥ 1
Support Vectors

For the optimal w, b pair, some training points will have tight constraints, i.e.

y_i (w^T x_i + b) = 1.

(This must be the case, because if for all training points we had a strict > inequality, it would be possible to scale down both parameters w, b until the constraints are tight and obtain an even lower objective value.)

We refer to these training points as support vectors.

Support vectors are special because they are the training points that define the maximum margin of the hyperplane to the data set, and they therefore determine the shape of the hyperplane. If you were to move one of them and retrain the SVM, the resulting hyperplane would change.
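The tightness condition can be checked on the worked linear example from the SVM notes earlier (w = (1, 0), b = -2, support vectors (1,0), (3,1), (3,-1)). A small sketch, with labels taken from that example:

```python
import numpy as np

# Support vectors satisfy y_i (w^T x_i + b) = 1 exactly at the optimum.
w, b = np.array([1.0, 0.0]), -2.0
support = np.array([[1.0, 0.0], [3.0, 1.0], [3.0, -1.0]])
y = np.array([-1.0, 1.0, 1.0])

activations = y * (support @ w + b)
print(activations)  # -> [1. 1. 1.] : all constraints are tight
```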
SVM with soft constraints

min_{w,b} w^T w + C Σ_{i=1}^{n} ξ_i
s.t.  ∀i, y_i (w^T x_i + b) ≥ 1 - ξ_i
      ∀i, ξ_i ≥ 0
soft SVM: Unconstrained Formulation:

For fixed w, b, the optimal slack variables have the closed form ξ_i = max[1 - y_i (w^T x_i + b), 0].
If we plug this closed form into the objective of our SVM optimization problem, we obtain the following unconstrained version as loss function and regularizer:

min_{w,b}  w^T w  +  C Σ_{i=1}^{n} max[1 - y_i (w^T x_i + b), 0]

where the first term is the l2-regularizer and the summand is the hinge-loss.
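The equivalence between the slack formulation and the hinge-loss objective is direct to sketch numerically (the toy data and parameter values below are my own illustrations, not from the slides):

```python
import numpy as np

# Optimal slack is xi_i = max(0, 1 - y_i (w^T x_i + b)), so the soft-margin
# objective equals "l2-regularizer + C * total hinge loss".
def hinge_objective(w, b, X, y, C):
    slack = np.maximum(0.0, 1.0 - y * (X @ w + b))  # per-point hinge loss
    return w @ w + C * slack.sum()

X = np.array([[1.0, 2.0], [2.0, -1.0], [-1.0, -1.0], [-2.0, 1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
w, b = np.array([0.5, 0.25]), 0.1

print(hinge_objective(w, b, X, y, C=1.0))  # regularizer 0.3125 + hinge 0.85
```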
Kernels
Linear classifiers are great, but what if there exists no linear
decision boundary? As it turns out, there is an elegant way to
incorporate non-linearities into most linear classifiers.
Tapas Kumar Mishra Kernels
Handcrafted Feature Expansion
Consider the following example: x = (x1, x2, ..., xd)^T, and define

ϕ(x) = (1, x1, ..., xd, x1x2, ..., xd-1 xd, ..., x1x2...xd)^T,

with one entry for the product of each subset of the coordinates of x.

Quiz: What is the dimensionality of ϕ(x)?

Solution: ϕ(x) contains (d choose 0) zero-degree terms, (d choose 1) first-degree terms, and so on up to (d choose d), so its dimensionality is Σ_{k=0}^{d} (d choose k) = 2^d.
This new representation, ϕ(x), is very expressive and allows for complicated non-linear decision boundaries, but the dimensionality is extremely high. This makes our algorithm unbearably (and quickly prohibitively) slow.
The Kernel Trick: Gradient Descent with Squared Loss
We will now show that we can express w as a linear combination of all input vectors,

w = Σ_{i=1}^{n} α_i x_i.    (3)

Since we initialize w^0 = 0, for this initial choice of w^0 the linear combination in w = Σ_i α_i x_i is trivially α_1 = ... = α_n = 0. We now show that throughout the entire gradient descent optimization such coefficients α_1, ..., α_n must always exist, as we can re-write the gradient updates entirely in terms of updating the α_i coefficients:
The Kernel Trick: Gradient Descent with Squared Loss

w^1 = w^0 - s Σ_{i=1}^{n} 2((w^0)^T x_i - y_i) x_i = Σ_{i=1}^{n} α_i^0 x_i - s Σ_{i=1}^{n} γ_i^0 x_i = Σ_{i=1}^{n} α_i^1 x_i
      (with α_i^1 = α_i^0 - s γ_i^0)

w^2 = w^1 - s Σ_{i=1}^{n} 2((w^1)^T x_i - y_i) x_i = Σ_{i=1}^{n} α_i^1 x_i - s Σ_{i=1}^{n} γ_i^1 x_i = Σ_{i=1}^{n} α_i^2 x_i
      (with α_i^2 = α_i^1 - s γ_i^1)

w^3 = w^2 - s Σ_{i=1}^{n} 2((w^2)^T x_i - y_i) x_i = Σ_{i=1}^{n} α_i^2 x_i - s Σ_{i=1}^{n} γ_i^2 x_i = Σ_{i=1}^{n} α_i^3 x_i
      (with α_i^3 = α_i^2 - s γ_i^2)

...    (4)

where γ_i^t = 2((w^t)^T x_i - y_i).
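The α-update bookkeeping above can be verified by running gradient descent twice, once in the primal on w and once purely on the coefficients. A sketch with randomly generated data (the step size and iteration count are arbitrary choices):

```python
import numpy as np

# Gradient descent on the squared loss: updating only alpha via
# gamma_i = 2 (w^T x_i - y_i) reproduces the primal iterate exactly.
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))   # 5 points, 3 features (rows are x_i^T)
y = rng.normal(size=5)
s = 0.01                      # step size

w = np.zeros(3)               # primal iterate, w^0 = 0
alpha = np.zeros(5)           # dual coefficients, w = sum_i alpha_i x_i

for _ in range(100):
    gamma = 2.0 * (X @ (alpha @ X) - y)    # gamma_i = 2 (w^T x_i - y_i)
    alpha -= s * gamma                     # alpha_i <- alpha_i - s gamma_i
    w -= s * (2.0 * (X @ w - y)) @ X       # plain primal update

print(np.allclose(w, alpha @ X))  # -> True
```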
The Kernel Trick: Gradient Descent with Squared Loss

Consequently, we can also re-write the squared loss ℓ(w) = Σ_{i=1}^{n} (w^T x_i - y_i)^2 entirely in terms of inner products between training inputs:

ℓ(α) = Σ_{i=1}^{n} ( Σ_{j=1}^{n} α_j x_j^T x_i - y_i )^2    (6)
The Kernel Trick: Inner-Product Computation

Let's go back to the previous example, ϕ(x) = (1, x1, ..., xd, x1x2, ..., xd-1 xd, ..., x1x2...xd)^T.

The inner product ϕ(x)^T ϕ(z) can be formulated as:

ϕ(x)^T ϕ(z) = 1·1 + x1 z1 + x2 z2 + ... + x1 x2 z1 z2 + ... + x1 ... xd z1 ... zd
            = Π_{k=1}^{d} (1 + x_k z_k).

We can compute the inner product from the above formula in time O(d) instead of O(2^d).
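Both the 2^d dimensionality and the identity ϕ(x)^T ϕ(z) = Π_k (1 + x_k z_k) are easy to confirm for small d. A sketch (the explicit subset-product construction of ϕ below is my own implementation of the map described above):

```python
import numpy as np
from itertools import combinations

# phi(x): one entry per subset of the coordinates (the empty subset gives 1).
def phi(x):
    d = len(x)
    feats = []
    for r in range(d + 1):
        for idx in combinations(range(d), r):
            feats.append(np.prod([x[i] for i in idx]))  # empty product = 1
    return np.array(feats)

x = np.array([1.0, 2.0, -1.0, 0.5])
z = np.array([0.5, -1.0, 2.0, 3.0])

explicit = phi(x) @ phi(z)           # O(2^d) work in feature space
fast = np.prod(1.0 + x * z)          # the product formula, O(d)

print(len(phi(x)))                   # -> 16, i.e. 2^4
print(np.isclose(explicit, fast))    # -> True
```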
The Kernel Trick: General Kernels
Exponential Kernel: K(x, z) = e^{-||x-z|| / (2σ^2)}
Laplacian Kernel: K(x, z) = e^{-|x-z| / σ}
Sigmoid Kernel: K(x, z) = tanh(a x^T z + c)
The Kernel Trick: Kernel functions
Kernels
Well-defined kernels
k(x, z) = x⊤z
k(x, z) = c k1(x, z)
k(x, z) = k1(x, z) + k2(x, z)
k(x, z) = g(k1(x, z))
k(x, z) = k1(x, z) k2(x, z)
k(x, z) = f(x) k1(x, z) f(z)
k(x, z) = e^(k1(x,z))
k(x, z) = x⊤Az
where k1, k2 are well-defined kernels, c ≥ 0, g is a polynomial
function with positive coefficients, f is any function, and A ⪰ 0 is
positive semi-definite.
Theorem
The RBF kernel k(x, z) = e^(−(x−z)²/σ²) is a well-defined kernel.
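A quick numerical sanity check of this theorem (using the vector form of the RBF kernel; the setup below is illustrative): a well-defined kernel must produce a symmetric positive semi-definite kernel matrix on any set of points, i.e. no negative eigenvalues beyond floating-point noise.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))   # 20 arbitrary points in R^3
sigma = 1.5

# RBF kernel matrix: K_ij = exp(-||x_i - x_j||^2 / sigma^2)
sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
K = np.exp(-sq / sigma**2)

# PSD check: all eigenvalues of the symmetric matrix K are (numerically) >= 0
eigvals = np.linalg.eigvalsh(K)
print(eigvals.min() >= -1e-10)
```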
Theorem
The following kernel is defined on any two sets S1, S2 ⊆ Ω.
List out all possible samples of Ω and arrange them into a sorted list.
We define a vector xS ∈ {0, 1}^|Ω|, where each of its elements
indicates whether the corresponding sample is included in the set S.
It is easy to prove that
k(S1, S2) = e^(x_{S1}⊤ x_{S2}).
Kernel Machines
Kernelized Linear Regression
Kernelization
Similarly, during testing a test point is only accessed through
inner-products with training inputs:
h(z) = w⊤z = ∑_{i=1}^{n} αi xi⊤z.   (4)
Theorem
Kernelized ordinary least squares has the solution α = K^(−1) y.
Proof:
Xα = w = (XX⊤)^(−1) X y              | multiply from the left by X⊤XX⊤
(X⊤X)(X⊤X) α = X⊤(XX⊤(XX⊤)^(−1)) X y | substitute K = X⊤X
K² α = K y                           | multiply from the left by (K^(−1))²
α = K^(−1) y
Kernel regression can be extended to the kernelized version of ridge
regression. The solution then becomes
α = (K + τ²I)^(−1) y.   (5)
Testing
h(z) = k∗⊤ (K + τ²I)^(−1) y,
where k∗ is the kernel vector of the test point with the training
points, i.e. its i-th dimension is [k∗]i = φ(z)⊤φ(xi),
the inner product between the test point z and the training point
xi after the mapping into feature space through φ.
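Equations (5) and the testing formula above fit in a few lines of numpy. A minimal sketch of kernelized ridge regression (the RBF bandwidth, data, and τ below are illustrative choices): train by solving α = (K + τ²I)⁻¹y, predict via h(z) = k∗⊤α.

```python
import numpy as np

def rbf(A, B, sigma=1.0):
    """RBF kernel matrix between rows of A and rows of B."""
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq / sigma**2)

rng = np.random.default_rng(1)
X = rng.uniform(-3, 3, size=(40, 1))   # training inputs
y = np.sin(X[:, 0])                    # training targets
tau = 0.1

K = rbf(X, X)                                              # kernel matrix on training data
alpha = np.linalg.solve(K + tau**2 * np.eye(len(X)), y)    # alpha = (K + tau^2 I)^{-1} y

def h(Z):
    """h(z) = k_*^T alpha, where [k_*]_i = k(z, x_i)."""
    return rbf(Z, X) @ alpha

print(h(np.array([[0.5], [1.0]])))   # roughly sin(0.5) and sin(1.0)
```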
Neural Networks
The Infamous XOR problem
• In 1969, Minsky and Papert showed that Perceptrons cannot solve
the XOR problem.
• We know that the problem is not linearly separable: therefore, no
linear classifier like the Perceptron can solve the problem.
• It is crucial to choose the correct form of the basis functions (like
AND(x1, ¬x2) and AND(¬x1, x2) for XOR) so that the problem becomes
easy to solve in the projected space.
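A small sketch of this projection: with the two AND-style basis features, the XOR truth table becomes linearly separable (a single weighted sum with a threshold solves it).

```python
import numpy as np

# XOR truth table
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([0, 1, 1, 0])

# Basis functions: h1 = AND(x1, NOT x2), h2 = AND(NOT x1, x2)
H = np.column_stack([X[:, 0] * (1 - X[:, 1]),
                     (1 - X[:, 0]) * X[:, 1]])

# In the projected space XOR is linear: XOR = h1 + h2
w = np.array([1, 1])
pred = (H @ w > 0.5).astype(int)
print(pred)   # [0 1 1 0]
```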
What is the basis function doing
e.g. x = -2, y = 5, z = -4
1. Forward pass: Compute outputs
2. Backward pass: Compute gradients using the Chain Rule
Want: the gradient of the output with respect to each input
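The values x = −2, y = 5, z = −4 match the classic computational-graph example f(x, y, z) = (x + y) · z; assuming that graph, the forward and backward passes look like this (each backward line is local gradient × upstream gradient):

```python
# Forward pass
x, y, z = -2.0, 5.0, -4.0
q = x + y          # q = 3
f = q * z          # f = -12

# Backward pass (chain rule): downstream = local * upstream
df_df = 1.0                # base case at the output
df_dq = z * df_df          # multiply node: local grad w.r.t. q is z -> -4
df_dz = q * df_df          # multiply node: local grad w.r.t. z is q ->  3
df_dx = 1.0 * df_dq        # add node: local grad is 1               -> -4
df_dy = 1.0 * df_dq        # add node: local grad is 1               -> -4
print(f, df_dx, df_dy, df_dz)   # -12.0 -4.0 -4.0 3.0
```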
Base Case: the upstream gradient at the output is 1.
At every node: Downstream Gradient = Local Gradient × Upstream Gradient
Sigmoid
• Computational graph is not unique: we can use primitives that have
simple local gradients
• Sigmoid local gradient: σ′(x) = σ(x)(1 − σ(x))
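The sigmoid's closed-form local gradient can be verified numerically; a quick check against a centered finite difference:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# Local gradient of the sigmoid primitive: sigma'(x) = sigma(x) * (1 - sigma(x))
x = 0.7
s = sigmoid(x)
analytic = s * (1 - s)

# Centered finite-difference approximation of the derivative
eps = 1e-6
numeric = (sigmoid(x + eps) - sigmoid(x - eps)) / (2 * eps)
print(abs(analytic - numeric) < 1e-8)   # True
```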
Backward pass: compute grads node by node — base case at the output, then each sigmoid, add, and multiply node applies its own local gradient.
Intra-cluster distances are minimized; inter-cluster distances are maximized.
Cluster 4 (stock movements): Louisiana-Land-UP, Phillips-Petro-UP, Unocal-UP, Schlumberger-UP → Oil-UP
Summarization
– Reduce the size of large data sets (e.g., clustering precipitation in Australia)
– Partitional Clustering
◆ A division of data objects into non-overlapping subsets (clusters)
– Hierarchical clustering
◆ A set of nested clusters organized as a hierarchical tree
(Figures: dendrogram and nested-cluster views over points p1–p4.)
Well-separated clusters
Prototype-based clusters
Contiguity-based clusters
Density-based clusters
Well-Separated Clusters:
– A cluster is a set of points such that any point in a cluster is
closer (or more similar) to every other point in the cluster than
to any point not in the cluster.
3 well-separated clusters
Prototype-based
– A cluster is a set of objects such that an object in a cluster is
closer (more similar) to the prototype or “center” of a cluster,
than to the center of any other cluster
– The center of a cluster is often a centroid, the average of all
the points in the cluster, or a medoid, the most “representative”
point of a cluster
4 center-based clusters
8 contiguous clusters
Density-based
– A cluster is a dense region of points, separated from other
high-density regions by low-density regions.
– Used when the clusters are irregular or intertwined, and when
noise and outliers are present.
6 density-based clusters
Hierarchical clustering
Density-based clustering
(Figures: K-means iterations. Panels plot y vs. x for the original points and for successive iterations, showing the centroids updating at each step.)
Depending on the choice of initial centroids, B and C may get merged or remain separate.
(Figure: starting with two initial centroids in one cluster of each pair of clusters.)
3/24/2021 Introduction to Data Mining, 2nd Edition 31
Tan, Steinbach, Karpatne, Kumar
10 Clusters Example
(Figures: Iterations 1–4 of K-means on the 10-cluster data, plotted as y vs. x.)
Starting with two initial centroids in one cluster of each pair of clusters
3/24/2021 Introduction to Data Mining, 2nd Edition 32
Tan, Steinbach, Karpatne, Kumar
10 Clusters Example
Starting with some pairs of clusters having three initial centroids, while others
have only one.
(Figures: Iterations 1–4 of K-means under this uneven initial centroid placement, plotted as y vs. x.)
Multiple runs
– Helps, but probability is not on your side
Use some strategy to select the k initial centroids
and then select among these initial centroids
– Select most widely separated
◆K-means++ is a robust way of doing this selection
– Use hierarchical clustering to determine initial
centroids
Bisecting K-means
– Not as susceptible to initialization issues
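The K-means++ selection strategy mentioned above can be sketched in a few lines (the data and function name below are illustrative): pick the first centroid uniformly at random, then sample each subsequent centroid with probability proportional to its squared distance from the nearest centroid chosen so far, which favors widely separated seeds.

```python
import numpy as np

def kmeanspp_init(X, k, rng):
    """K-means++ seeding: first centroid uniform at random, then each new
    centroid sampled with probability proportional to the squared distance
    to its nearest already-chosen centroid."""
    centroids = [X[rng.integers(len(X))]]
    for _ in range(k - 1):
        d2 = np.min([((X - c) ** 2).sum(axis=1) for c in centroids], axis=0)
        probs = d2 / d2.sum()
        centroids.append(X[rng.choice(len(X), p=probs)])
    return np.array(centroids)

rng = np.random.default_rng(0)
# Three tight blobs of 30 points each
X = np.vstack([rng.normal(m, 0.1, size=(30, 2)) for m in [(0, 0), (5, 5), (0, 5)]])
C = kmeanspp_init(X, 3, rng)
print(C.shape)   # (3, 2)
```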
Bisecting K-means
CLUTO: http://glaros.dtc.umn.edu/gkhome/cluto/cluto/overview
One solution is to find a large number of clusters such that each of them represents a part of a
natural cluster. But these small clusters need to be put together in a post-processing step.
(Figure: hierarchical clustering of six points and the corresponding dendrogram, leaf order 1 3 2 5 4 6.)
– Divisive:
◆ Start with one, all-inclusive cluster
◆ At each step, split a cluster until each cluster contains an individual
point (or there are k clusters)
(Figure: starting situation — individual points p1…p12 and their proximity matrix.)
(Figure: intermediate situation — clusters C1…C5 and the corresponding proximity matrix.)
We want to merge the two closest clusters (C2 and C5) and update the proximity matrix.
(Figure: after merging C2 and C5, the row and column for the new cluster C2 ∪ C5 in the proximity matrix are marked "?" — their proximities to C1, C3, and C4 must be recomputed.)
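The merge-and-update loop can be sketched naively in Python (this version uses MIN/single-link proximity; the function name and data are illustrative, and a real implementation would maintain the proximity matrix incrementally instead of recomputing it):

```python
import numpy as np

def single_link(points, num_clusters):
    """Naive agglomerative clustering: repeatedly merge the two closest
    clusters, where cluster proximity is the minimum pairwise distance
    between their members (MIN / single link)."""
    clusters = [[i] for i in range(len(points))]
    while len(clusters) > num_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = min(np.linalg.norm(points[i] - points[j])
                        for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a].extend(clusters[b])   # merge b into a
        del clusters[b]
    return clusters

pts = np.array([[0, 0], [0, 1], [5, 5], [5, 6]])
print(sorted(map(sorted, single_link(pts, 2))))   # [[0, 1], [2, 3]]
```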
p3
p4
p5
MIN
.
MAX
.
Group Average .
Proximity Matrix
Distance Between Centroids
Other methods driven by an objective
function
– Ward’s Method uses squared error
(Figure: MIN / single link — nested clusters and dendrogram over six points.)
(Figure: original points clustered by MIN into two and three clusters.)
• Limitation of MIN: sensitive to noise
Distance Matrix:
(Figure: MAX / complete link — nested clusters and dendrogram over the same six points.)
proximity(Cluster_i, Cluster_j) = ∑_{p_i∈Cluster_i, p_j∈Cluster_j} proximity(p_i, p_j) / (|Cluster_i| · |Cluster_j|)
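A direct sketch of this group-average formula (cluster contents below are illustrative), using Euclidean distance as the pairwise proximity:

```python
import numpy as np

def group_average(ci, cj):
    """proximity(Ci, Cj) = sum of all pairwise distances / (|Ci| * |Cj|)."""
    total = sum(np.linalg.norm(p - q) for p in ci for q in cj)
    return total / (len(ci) * len(cj))

ci = [np.array([0.0, 0.0]), np.array([0.0, 2.0])]
cj = [np.array([4.0, 0.0])]
# Pairwise distances are 4 and sqrt(20); their average is (4 + sqrt(20)) / 2
print(group_average(ci, cj))
```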
Distance Matrix:
(Figure: Group Average — nested clusters and dendrogram over the same six points.)
Strengths
– Less susceptible to noise
Limitations
– Biased towards globular clusters
(Figure: side-by-side comparison of MIN, MAX, Group Average, and Ward's Method on the same six points.)
(Figures: DBSCAN illustration with MinPts = 7, and results on the original points at (MinPts=4, Eps=9.92) and (MinPts=4, Eps=9.75).)
DBSCAN does not work well with:
• Varying densities
• High-dimensional data
DBSCAN: Determining EPS and MinPts
(Figure residue: y-vs-x scatter panels comparing K-means and Complete Link clusterings on sample data.)
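For determining Eps and MinPts, the usual heuristic is to sort every point's distance to its k-th nearest neighbor and look for the "knee" of the resulting curve; a sketch (data and function name are illustrative):

```python
import numpy as np

def k_dist(X, k):
    """Each point's distance to its k-th nearest neighbor, sorted increasingly.
    The sharp rise ('knee') of this curve suggests a good Eps for MinPts = k."""
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    kth = np.sort(d, axis=1)[:, k]   # column 0 is the point's distance to itself
    return np.sort(kth)

rng = np.random.default_rng(0)
# Two dense blobs: most k-dist values are small, noise-free knee is gentle
X = np.vstack([rng.normal(0, 0.2, (50, 2)), rng.normal(3, 0.2, (50, 2))])
curve = k_dist(X, 4)
print(curve[0] <= curve[-1])   # True: the sorted curve is nondecreasing
```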
Example: SSE and SSB
– SSB + SSE = constant (the total sum of squares)
(Figure: five points 1–5 on a line with overall mean m and cluster means m1, m2; cohesion is measured within clusters, separation between the cluster means and m.)
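The constant-sum identity is easy to verify on a small example; here, the five points and two-cluster split mirror the figure (the exact grouping is an assumption for illustration):

```python
import numpy as np

X = np.array([1.0, 2.0, 3.0, 4.0, 5.0])   # five points on a line
labels = np.array([0, 0, 0, 1, 1])        # two clusters: {1,2,3} and {4,5}

m = X.mean()   # overall mean

# Cohesion: SSE, squared error of points around their own cluster mean
sse = sum(((X[labels == c] - X[labels == c].mean()) ** 2).sum() for c in (0, 1))
# Separation: SSB, size-weighted squared distance of cluster means from m
ssb = sum((labels == c).sum() * (X[labels == c].mean() - m) ** 2 for c in (0, 1))
# Total sum of squares around the overall mean
tss = ((X - m) ** 2).sum()

print(np.isclose(sse + ssb, tss))   # True: SSB + SSE = constant (= TSS)
```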