
CS60010: Deep Learning

Spring 2021
Sudeshna Sarkar and Abir Das

Module 2 Part 2
Linear Models for Classification
Sudeshna Sarkar

18 Jan 2021
Announcements

• Class Test 1 on 19th Jan 2021 (tomorrow)

• Assignment 1 due 22nd Jan 12 pm


ML Background and Linear Models
Logistic Regression
Classification
A binary classifier is a mapping from $\mathbb{R}^d \to \{-1, +1\}$

$$x \;\to\; h \;\to\; y$$

Training data set

$$\mathcal{D}_m = \left\{ \left(x^{(1)}, y^{(1)}\right), \ldots, \left(x^{(m)}, y^{(m)}\right) \right\}$$
• Assume that each $x^{(i)}$ is a $d \times 1$ column vector
• Given a training set $\mathcal{D}_m$ and a classifier $h$, we can define the training error of $h$ to be
$$\varepsilon_m(h) = \frac{1}{m} \sum_{i=1}^{m} \begin{cases} 1 & \text{if } h\left(x^{(i)}\right) \neq y^{(i)} \\ 0 & \text{otherwise} \end{cases}$$
• For now, we will try to find a classifier with small training error (and hope that it generalizes well to new data, i.e., has a small test error).
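The training error is just the fraction of misclassified examples. Below is a minimal NumPy sketch; the toy data and the helper name training_error are illustrative, not from the slides:

import numpy as np

def training_error(h, X, y):
    # Fraction of the m training examples (rows of X) that h misclassifies
    predictions = np.array([h(x) for x in X])
    return np.mean(predictions != y)

# Toy data: m x d matrix of inputs, labels in {-1, +1}
X = np.array([[1.0, 2.0], [-2.0, -1.0], [0.5, 1.5], [-0.5, -2.0]])
y = np.array([+1, -1, +1, -1])
h = lambda x: 1 if x.sum() > 0 else -1   # an arbitrary classifier
print(training_error(h, X, y))           # 0.0 for this toy data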
Learning algorithm

• A hypothesis class $\mathcal{H}$ is a set (finite or infinite) of possible classifiers,
each of which represents a mapping from $\mathbb{R}^d \to \{-1, +1\}$
• A learning algorithm is a procedure that takes a data set $\mathcal{D}_m$ as input
and returns an element $h \in \mathcal{H}$

$$\mathcal{D}_m \;\to\; \text{learning algorithm}(\mathcal{H}) \;\to\; h$$

• Choice of ℋ so as to get low test error


Hypothesis class: Linear classifiers
• A linear classifier in $d$ dimensions is defined by
• a vector of parameters $\theta \in \mathbb{R}^d$ and a scalar $\theta_0 \in \mathbb{R}$
(assume $x$ is a $d \times 1$ column vector)

$$h(x; \theta, \theta_0) = \operatorname{sign}\left(\theta^T x + \theta_0\right) = \begin{cases} +1 & \text{if } \theta^T x + \theta_0 > 0 \\ -1 & \text{otherwise} \end{cases}$$

$\theta, \theta_0$ specifies a hyperplane (decision boundary) that divides the instance space into two half-spaces.
Linear classifiers
$$h(x; \theta, \theta_0) = \operatorname{sign}\left(\theta^T x + \theta_0\right) = \begin{cases} +1 & \text{if } \theta^T x + \theta_0 > 0 \\ -1 & \text{otherwise} \end{cases}$$
• $\theta, \theta_0$ specifies a hyperplane that divides the instance space into two half-spaces.
• The half-space on the same side as the normal vector $\theta$ is the positive half-space, and we classify all points in that space as positive.
• The half-space on the other side is negative, and all points in it are classified as negative.
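Such a classifier is a one-liner in NumPy. A minimal sketch (variable names are illustrative), using the separator $x_2 = 1.7x_1 - 4.9$ from the next slide:

import numpy as np

def linear_classify(X, theta, theta0):
    # h(x; theta, theta0) = sign(theta^T x + theta0), applied to each row of X
    scores = X @ theta + theta0
    return np.where(scores > 0, 1, -1)   # points exactly on the boundary get -1

theta = np.array([1.7, -1.0])   # the hyperplane x2 = 1.7*x1 - 4.9
theta0 = -4.9
X = np.array([[3.0, -2.0],      # 1.7*3 + 2 - 4.9 =  2.2 -> +1
              [3.0,  2.0]])     # 1.7*3 - 2 - 4.9 = -1.8 -> -1
print(linear_classify(X, theta, theta0))   # [ 1 -1]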
Linear Classifier with Hard Threshold
• The linear separator in the associated figure is given by
$$x_2 = 1.7 x_1 - 4.9$$
$$\Rightarrow\; -4.9 + 1.7 x_1 - x_2 = 0$$
$$\Rightarrow\; \begin{bmatrix} -4.9 & 1.7 & -1 \end{bmatrix} \begin{bmatrix} x_0 \\ x_1 \\ x_2 \end{bmatrix} = 0, \quad \text{i.e. } \boldsymbol{\theta}^T \mathbf{x} = 0 \text{ (with } x_0 = 1\text{)}$$
Classification Rule:
$$y(x) = \begin{cases} +1 & \text{if } \theta^T x > 0 \\ -1 & \text{otherwise} \end{cases}$$
Linear Classifier with Hard Threshold
Classification Rule:
$$y(x) = \begin{cases} +1 & \text{if } \theta^T x > 0 \\ -1 & \text{otherwise} \end{cases}$$
We can think of $y$ as the result of passing the linear function $\theta^T x$ through a threshold function.
• Find the $\theta$ which minimizes classification error on the training set.
• We cannot use gradient descent with the above threshold function: its derivative is zero everywhere it is defined, so it provides no useful gradient signal.
Perceptron
$$\boldsymbol{\theta} = \begin{bmatrix} \theta_1 & \theta_2 & \ldots & \theta_d \end{bmatrix}^T \quad \text{and} \quad \boldsymbol{x} = \begin{bmatrix} x_1 & x_2 & \ldots & x_d \end{bmatrix}^T$$

$$z(\boldsymbol{x}) = \theta_0 + \sum_{i=1}^{d} \theta_i x_i = \boldsymbol{\theta}^T \boldsymbol{x} + b, \qquad y = g\left(z(\boldsymbol{x})\right), \qquad \theta_0 = b$$

[Figure: a single perceptron unit. Inputs $x_1, \ldots, x_d$ are weighted by $\theta_1, \ldots, \theta_d$, a bias $b$ is added, the sum $z(\boldsymbol{x})$ is passed through the activation $g(z)$, and the output is $y$.]

Terminologies
• $\boldsymbol{x}$: input, $\boldsymbol{\theta}$: weights, $b$: bias
• $z$: pre-activation (input activation)
• $g$: activation function
• $y$: activation (output activation)
Perceptron
$\boldsymbol{x} \in \mathbb{R}^d$ and $y \in \{0, 1\}$ for binary classification (Rosenblatt, 1957):
$$g(z) = \begin{cases} 1, & z \geq 0 \\ 0, & z < 0 \end{cases}$$
Or, the response may be taken as $y \in \{-1, 1\}$:
$$g(z) = \begin{cases} 1, & z \geq 0 \\ -1, & z < 0 \end{cases}$$

[Figure: the step activation function $g(z)$ plotted against $z$.]

The perceptron classification rule, thus, translates to
$$y = \begin{cases} 1, & \boldsymbol{\theta}^T \boldsymbol{x} + b \geq 0 \\ -1, & \boldsymbol{\theta}^T \boldsymbol{x} + b < 0 \end{cases}$$
$\boldsymbol{\theta}^T \boldsymbol{x} + b = 0$ represents a hyperplane.
Perceptron (Geometrically)
$$h_{\boldsymbol{\theta}}(\mathbf{x}) = \operatorname{sign}\left(\boldsymbol{\theta}^T \mathbf{x}\right)$$

• $\boldsymbol{\theta}^T \mathbf{x}$ is the (signed) distance of point $\mathbf{x}$ to the hyperplane, up to the scaling factor $\|\boldsymbol{\theta}\|$

• Example: http://mathinsight.org/distance_point_plane
Perceptron Learning Algorithm

Training Set: 𝑥𝑥 (1) , 𝑦𝑦 (1) , … , 𝑥𝑥 (𝑚𝑚) , 𝑦𝑦 (𝑚𝑚)


𝛉𝛉 + 𝐲𝐲𝐲𝐲
𝑥𝑥 (𝑖𝑖) ∈ 𝑅𝑅𝑑𝑑 ; 𝑦𝑦 (𝑖𝑖) ∈ −1, +1
𝛉𝛉
1. 𝑡𝑡 ← 1; 𝜃𝜃 (𝑡𝑡) = 𝟎𝟎
𝛉𝛉 +x pushes vector 𝛉𝛉 towards x
2. // Loop until all examples are correctly classified
While exists 𝑖𝑖 such that 𝑥𝑥 (𝑚𝑚) is not correctly classified 𝛉𝛉
Pick a 𝑗𝑗 s.t. 𝑦𝑦 (𝑗𝑗) . 𝜃𝜃 (𝑡𝑡) , 𝑥𝑥 (𝑗𝑗) ≤ 0) then
𝛉𝛉(𝑡𝑡+1) = 𝛉𝛉(𝑡𝑡) + 𝑦𝑦 (𝑗𝑗) 𝑥𝑥 (𝑗𝑗)
𝑡𝑡 ← 𝑡𝑡 + 1
𝛉𝛉 −x pushes vector 𝛉𝛉 away from x
3. Output 𝛉𝛉(𝑡𝑡)
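A direct translation of the algorithm above into NumPy (the bias is handled by appending a constant feature; the helper name perceptron and the toy data are illustrative). On linearly separable data the loop terminates with a separating theta:

import numpy as np

def perceptron(X, y, max_iters=1000):
    # Perceptron learning: X is m x d, y in {-1, +1}. Returns theta (bias as last entry).
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])   # append 1 for the bias term
    theta = np.zeros(Xb.shape[1])                   # step 1: theta = 0
    for _ in range(max_iters):
        margins = y * (Xb @ theta)                  # y^(j) <theta, x^(j)> for every example
        mistakes = np.where(margins <= 0)[0]        # misclassified (or on the boundary)
        if len(mistakes) == 0:                      # all examples correctly classified
            return theta
        j = mistakes[0]                             # pick one such j
        theta = theta + y[j] * Xb[j]                # theta <- theta + y^(j) x^(j)
    return theta

# Linearly separable toy data
X = np.array([[2.0, 1.0], [1.0, 3.0], [-1.0, -2.0], [-2.0, -1.0]])
y = np.array([+1, +1, -1, -1])
theta = perceptron(X, y)
print(np.sign(np.hstack([X, np.ones((4, 1))]) @ theta))   # matches y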
Perceptron Learning Algorithm
1. $t \leftarrow 1$; $\theta^{(t)} = \mathbf{0}$
2. While there exists $i$ such that $x^{(i)}$ is not correctly classified:
       Pick a $j$ s.t. $y^{(j)} \langle \theta^{(t)}, x^{(j)} \rangle \leq 0$
       $\boldsymbol{\theta}^{(t+1)} = \boldsymbol{\theta}^{(t)} + y^{(j)} x^{(j)}$
       $t \leftarrow t + 1$
3. Output $\boldsymbol{\theta}^{(t)}$

[Figure: linearly separable data with margin $\gamma$ and maximum data-point norm $\beta$.]

If such a separating hyperplane exists, then the data is said to be linearly separable.

Convergence Theorem
For a finite and linearly separable set of data, the perceptron learning algorithm will find a linear separator in at most $\dfrac{\beta^2}{\gamma^2}$ iterations, where $\beta$ is the maximum length of any data point and $\gamma$ is the maximum margin of the linear separators.

Perceptron Rule

• The Perceptron Learning Rule can find a linear separator, provided the data are linearly separable.
• For data that are not linearly separable, the Perceptron algorithm fails to converge.
Linear Classifiers by Gradient Descent
For a gradient-based optimization approach, we need to approximate the hard threshold function with something smooth, such as the logistic (sigmoid) function $\sigma(z)$.
• Logistic regression classifier:

$$\sigma(z) = \frac{1}{1 + e^{-z}}$$
$$y = \sigma\left(h_\theta(\mathbf{x})\right) = \sigma\left(\boldsymbol{\theta}^T \mathbf{x}\right), \qquad z = \boldsymbol{\theta}^T \mathbf{x}$$
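A minimal sketch of the logistic function in NumPy; the numerically stable split by sign is an implementation detail, not something from the slides:

import numpy as np

def sigmoid(z):
    # sigma(z) = 1 / (1 + exp(-z)), computed stably for large |z|
    z = np.asarray(z, dtype=float)
    out = np.empty_like(z)
    pos = z >= 0
    out[pos] = 1.0 / (1.0 + np.exp(-z[pos]))
    expz = np.exp(z[~pos])            # avoids overflow of exp(-z) for very negative z
    out[~pos] = expz / (1.0 + expz)
    return out

print(sigmoid(np.array([-5.0, 0.0, 5.0])))   # approx [0.0067, 0.5, 0.9933]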
Likelihood Function for Logistic Regression

• The probability that an example belongs to class 1 is
$$P\left(y^{(i)} = 1 \mid x^{(i)}; \boldsymbol{\theta}\right) = \sigma\left(\boldsymbol{\theta}^T \mathbf{x}^{(i)}\right)$$
Thus
$$P\left(y^{(i)} = 0 \mid x^{(i)}; \boldsymbol{\theta}\right) = 1 - \sigma\left(\boldsymbol{\theta}^T \mathbf{x}^{(i)}\right)$$
Writing $z^{(i)} = \theta^T x^{(i)}$, the two cases combine into a single expression:
$$P\left(y^{(i)} \mid x^{(i)}; \boldsymbol{\theta}\right) = \sigma\left(\boldsymbol{\theta}^T \mathbf{x}^{(i)}\right)^{y^{(i)}} \left(1 - \sigma\left(\boldsymbol{\theta}^T \mathbf{x}^{(i)}\right)\right)^{1 - y^{(i)}} = \sigma\left(z^{(i)}\right)^{y^{(i)}} \left(1 - \sigma\left(z^{(i)}\right)\right)^{1 - y^{(i)}}$$
Maximum Likelihood Estimation of Logistic Regression

The probability that an example belongs to class 1 is $P\left(y^{(i)} = 1 \mid x^{(i)}; \boldsymbol{\theta}\right) = \sigma\left(\boldsymbol{\theta}^T \mathbf{x}^{(i)}\right)$
Thus $P\left(y^{(i)} = 0 \mid x^{(i)}; \boldsymbol{\theta}\right) = 1 - \sigma\left(\boldsymbol{\theta}^T \mathbf{x}^{(i)}\right)$
Thus $P\left(y^{(i)} \mid x^{(i)}; \boldsymbol{\theta}\right) = \sigma\left(\boldsymbol{\theta}^T \mathbf{x}^{(i)}\right)^{y^{(i)}} \left(1 - \sigma\left(\boldsymbol{\theta}^T \mathbf{x}^{(i)}\right)\right)^{1 - y^{(i)}}$

The joint probability of all the labels is
$$\prod_{i=1}^{m} \sigma\left(\boldsymbol{\theta}^T \mathbf{x}^{(i)}\right)^{y^{(i)}} \left(1 - \sigma\left(\boldsymbol{\theta}^T \mathbf{x}^{(i)}\right)\right)^{1 - y^{(i)}}$$

So the log likelihood for logistic regression is given by
$$l(\theta) = \sum_{i=1}^{m} \left[ y^{(i)} \log \sigma\left(\boldsymbol{\theta}^T \mathbf{x}^{(i)}\right) + \left(1 - y^{(i)}\right) \log\left(1 - \sigma\left(\boldsymbol{\theta}^T \mathbf{x}^{(i)}\right)\right) \right]$$
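The log likelihood can be computed for the whole training set in a few NumPy lines. A sketch, assuming labels in {0, 1} and illustrative variable names:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def log_likelihood(theta, X, y):
    # l(theta) = sum_i [ y_i log sigma(theta^T x_i) + (1 - y_i) log(1 - sigma(theta^T x_i)) ]
    p = sigmoid(X @ theta)           # predicted P(y=1 | x) for each row of X
    eps = 1e-12                      # guard against log(0)
    return np.sum(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps))

X = np.array([[1.0, 2.0], [2.0, -1.0], [-1.0, -1.5]])
y = np.array([1, 0, 0])
print(log_likelihood(np.zeros(2), X, y))   # 3 * log(0.5), approx -2.079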
Maximum Likelihood Estimation of Logistic Regression

Derivative of the log likelihood w.r.t. one component of $\boldsymbol{\theta}$:
$$\frac{\partial l(\theta)}{\partial \theta_j} = \sum_{i=1}^{m} \frac{\partial}{\partial \theta_j}\left[ y^{(i)} \log \sigma\left(\theta^T x^{(i)}\right) + \left(1 - y^{(i)}\right) \log\left(1 - \sigma\left(\theta^T x^{(i)}\right)\right) \right] \quad \text{(derivative of sum of terms)}$$
$$= \sum_{i=1}^{m} \left[ \frac{y^{(i)}}{\sigma\left(\theta^T x^{(i)}\right)} - \frac{1 - y^{(i)}}{1 - \sigma\left(\theta^T x^{(i)}\right)} \right] \frac{\partial}{\partial \theta_j} \sigma\left(\theta^T x^{(i)}\right) \quad \text{(derivative of } \log f(x)\text{)}$$
$$= \sum_{i=1}^{m} \left[ \frac{y^{(i)}}{\sigma\left(\theta^T x^{(i)}\right)} - \frac{1 - y^{(i)}}{1 - \sigma\left(\theta^T x^{(i)}\right)} \right] \sigma\left(\theta^T x^{(i)}\right)\left(1 - \sigma\left(\theta^T x^{(i)}\right)\right) x_j^{(i)} \quad \text{(chain rule + derivative of } \sigma\text{)}$$
$$= \sum_{i=1}^{m} \left( y^{(i)} - \sigma\left(\theta^T x^{(i)}\right) \right) x_j^{(i)}$$
Calculating derivatives
• Since the likelihood function is a sum over all of the data, and the derivative of a sum is the sum of derivatives, we can focus on computing the derivative for one example. The gradient with respect to $\theta$ is simply the sum of this term over all training data points.
• The derivative for one data point $(x, y)$, with $p = \sigma(\theta^T x)$ and $z = \theta^T x$:
$$\frac{\partial l(\theta)}{\partial \theta_j} = \frac{\partial l(\theta)}{\partial p} \cdot \frac{\partial p}{\partial \theta_j} = \frac{\partial l(\theta)}{\partial p} \cdot \frac{\partial p}{\partial z} \cdot \frac{\partial z}{\partial \theta_j}$$
Derivative of logistic function

$$\frac{\partial}{\partial z} \sigma(z) = \frac{\partial}{\partial z} \left( \frac{1}{1 + e^{-z}} \right) = \frac{-1}{\left(1 + e^{-z}\right)^2} \cdot e^{-z} \cdot (-1)$$
$$= \frac{1}{1 + e^{-z}} \cdot \frac{e^{-z}}{1 + e^{-z}} = \sigma(z)\left(1 - \sigma(z)\right) = y(1 - y)$$
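As a quick sanity check of the identity sigma'(z) = sigma(z)(1 - sigma(z)), one can compare it with a finite-difference approximation. A sketch, not from the slides:

import numpy as np

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

z = 0.7
analytic = sigmoid(z) * (1.0 - sigmoid(z))               # sigma(z)(1 - sigma(z))
h = 1e-6
numeric = (sigmoid(z + h) - sigmoid(z - h)) / (2 * h)    # central finite difference
print(analytic, numeric)                                 # both approx 0.2217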
Calculating derivatives
$$\frac{\partial l(\theta)}{\partial \theta_j} = \frac{\partial l(\theta)}{\partial p} \cdot \frac{\partial p}{\partial z} \cdot \frac{\partial z}{\partial \theta_j}$$
With
$$l(\theta) = y \log p + (1 - y) \log(1 - p)$$
$$\frac{\partial l(\theta)}{\partial p} = \frac{y}{p} - \frac{1 - y}{1 - p}, \qquad \frac{\partial p}{\partial z} = \sigma(z)\left(1 - \sigma(z)\right), \qquad \frac{\partial z}{\partial \theta_j} = x_j$$
where $z = \theta^T x$ and $p = \sigma(z) = \sigma(\theta^T x)$. Putting the pieces together:
$$\frac{\partial l(\theta)}{\partial \theta_j} = \left( \frac{y}{p} - \frac{1 - y}{1 - p} \right) \sigma(z)\left(1 - \sigma(z)\right) x_j = \left( \frac{y}{p} - \frac{1 - y}{1 - p} \right) p(1 - p)\, x_j$$
$$= \left( y(1 - p) - p(1 - y) \right) x_j = (y - p)\, x_j = \left( y - \sigma(\theta^T x) \right) x_j$$
Gradient of Log Likelihood
$$\frac{\partial l(\theta)}{\partial \theta_j} = \sum_{i=1}^{m} \left( y^{(i)} - \sigma\left(\theta^T x^{(i)}\right) \right) x_j^{(i)}$$
• We need to choose the values of $\theta$ that maximize the log-likelihood.
• Unfortunately, if we try just setting the derivative equal to zero, there is no closed form for the maximum.
• However, we can find the best values of $\theta$ by using an optimization algorithm.
Gradient Ascent Optimization
$$\theta_j^{\text{new}} = \theta_j^{\text{old}} + \eta \cdot \frac{\partial l\left(\theta^{\text{old}}\right)}{\partial \theta_j} = \theta_j^{\text{old}} + \eta \sum_{i=1}^{m} \left( y^{(i)} - \sigma\left(\theta^T x^{(i)}\right) \right) x_j^{(i)}$$
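Putting the gradient and the update rule together gives a complete, if simplistic, training loop. A sketch in NumPy; the learning rate, iteration count, and toy data are chosen arbitrarily for illustration:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic_regression(X, y, eta=0.1, n_iters=1000):
    # Batch gradient ascent on the log likelihood. X is m x d, y in {0, 1}.
    theta = np.zeros(X.shape[1])
    for _ in range(n_iters):
        grad = X.T @ (y - sigmoid(X @ theta))   # dl/dtheta_j = sum_i (y_i - sigma(theta^T x_i)) x_ij
        theta = theta + eta * grad              # theta_new = theta_old + eta * gradient
    return theta

# Toy 1-D data (with a bias column): y tends to be 1 when the feature is positive
X = np.array([[-2.0, 1], [-1.0, 1], [-0.5, 1], [0.5, 1], [1.0, 1], [2.0, 1]])
y = np.array([0, 0, 0, 1, 1, 1])
theta = fit_logistic_regression(X, y)
print(np.round(sigmoid(X @ theta), 2))   # predicted probabilities approach the labels y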
Cross-Entropy Loss Function
• We need a loss function $L(\hat{y}, y)$ that expresses, for an observation $x$, how close the classifier output $\hat{y}$ is to the correct output $y$ (which is 0 or 1).
• We want a loss function that prefers the correct class labels of the training examples to be more likely. This is called conditional maximum likelihood estimation: we choose the parameters that maximize the log probability of the true $y$ labels in the training data given the observations $x$.
• The resulting loss function is the negative log likelihood loss, generally called the cross-entropy loss.
• Minimizing the negative log likelihood corresponds to maximizing the likelihood. This error function $L(\hat{y}, y)$ is typically known as the cross-entropy error function (also known as log-loss):
Cross entropy loss

$$p(y \mid x) = \hat{y}^{\,y} \left(1 - \hat{y}\right)^{1-y}$$

log likelihood:
$$\log p(y \mid x) = y \log \hat{y} + (1 - y) \log(1 - \hat{y})$$

cross-entropy loss:
$$L_{CE}(\hat{y}, y) = -\log p(y \mid x) = -\left[ y \log \hat{y} + (1 - y) \log(1 - \hat{y}) \right]$$
$$L_{CE}(\hat{y}, y) = -\left[ y \log \sigma\left(\theta^T x\right) + (1 - y) \log\left(1 - \sigma\left(\theta^T x\right)\right) \right]$$
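The per-example cross-entropy loss is simply the negative of the per-example log likelihood. A minimal sketch (the clipping constant eps is an implementation detail, not from the slides):

import numpy as np

def cross_entropy_loss(y_hat, y, eps=1e-12):
    # L_CE(y_hat, y) = -[ y log y_hat + (1 - y) log(1 - y_hat) ] for y in {0, 1}
    y_hat = np.clip(y_hat, eps, 1.0 - eps)   # avoid log(0)
    return -(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

print(cross_entropy_loss(0.9, 1))   # approx 0.105 (confident and correct: small loss)
print(cross_entropy_loss(0.9, 0))   # approx 2.303 (confident and wrong: large loss)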
Cross Entropy vs. Square Error

[Figure: total loss plotted as a function of the parameters $w_1, w_2$, comparing the cross-entropy loss surface with the square-error loss surface.]

http://jmlr.org/proceedings/papers/v9/glorot10a/glorot10a.pdf
Gradient Descent
Multiclass Classification
Multi-class Classification
$$C_1: \theta_1, b_1 \qquad z_1 = \theta_1 \cdot x + b_1$$
$$C_2: \theta_2, b_2 \qquad z_2 = \theta_2 \cdot x + b_2$$
$$C_3: \theta_3, b_3 \qquad z_3 = \theta_3 \cdot x + b_3$$
Probability: $0 < \hat{y}_i < 1$, $\;\sum_i \hat{y}_i = 1$, $\;\hat{y}_i = P(C_i \mid x)$

Softmax
$$\hat{y}_i = e^{z_i} \Big/ \sum_{j=1}^{3} e^{z_j}$$
Example: $z_1 = 3 \Rightarrow e^{z_1} \approx 20 \Rightarrow \hat{y}_1 \approx 0.88$; $\;z_2 = 1 \Rightarrow e^{z_2} \approx 2.7 \Rightarrow \hat{y}_2 \approx 0.12$; $\;z_3 = -3 \Rightarrow e^{z_3} \approx 0.05 \Rightarrow \hat{y}_3 \approx 0$
[Bishop, P209-210]
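A sketch of the softmax on the slide's example logits; subtracting the maximum before exponentiating is a standard numerical-stability trick, not part of the slide:

import numpy as np

def softmax(z):
    # y_hat_i = exp(z_i) / sum_j exp(z_j), computed stably by shifting by max(z)
    e = np.exp(z - np.max(z))   # the shift does not change the result
    return e / e.sum()

z = np.array([3.0, 1.0, -3.0])   # the logits used in the slide's example
print(np.round(softmax(z), 2))   # approx [0.88, 0.12, 0.00]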
Multi-class Classification
$$z_1 = \theta_1 \cdot x + b_1, \qquad z_2 = \theta_2 \cdot x + b_2, \qquad z_3 = \theta_3 \cdot x + b_3$$
The scores $z_1, z_2, z_3$ are passed through a softmax to give the predicted probabilities $\hat{y}_1, \hat{y}_2, \hat{y}_3$, which are compared against the target $y$ using the cross entropy
$$-\sum_{i=1}^{3} y_i \ln \hat{y}_i$$

Target (one-hot):
If $x \in$ class 1: $y = [1, 0, 0]^T$ and the loss is $-\ln \hat{y}_1$
If $x \in$ class 2: $y = [0, 1, 0]^T$ and the loss is $-\ln \hat{y}_2$
If $x \in$ class 3: $y = [0, 0, 1]^T$ and the loss is $-\ln \hat{y}_3$
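Combining softmax with the one-hot cross entropy above gives the multi-class loss. A sketch with illustrative logits and a one-hot target:

import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

def multiclass_cross_entropy(z, y_onehot):
    # - sum_i y_i ln(y_hat_i), where y_hat = softmax(z) and y_onehot is a one-hot target
    y_hat = softmax(z)
    return -np.sum(y_onehot * np.log(y_hat + 1e-12))

z = np.array([3.0, 1.0, -3.0])
print(multiclass_cross_entropy(z, np.array([1, 0, 0])))   # small: true class gets prob ~0.88
print(multiclass_cross_entropy(z, np.array([0, 0, 1])))   # large: true class gets prob ~0.002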
