
CS60010: Deep Learning

Spring 2021
Sudeshna Sarkar and Abir Das

Module 2 Part 2
Linear Models for Classification
Sudeshna Sarkar

18 Jan 2021
Announcements

• Class Test 1 on 19th Jan 2021 (tomorrow)

• Assignment 1 due 22nd Jan 12 pm


ML Background and Linear Models
Logistic Regression
Classification
A binary classifier is a mapping from $\mathbb{R}^d \to \{-1, +1\}$

$$x \;\to\; h \;\to\; y$$

Training data set

$$\mathcal{D}_m = \left\{ \left(x^{(1)}, y^{(1)}\right), \ldots, \left(x^{(m)}, y^{(m)}\right) \right\}$$
• Assume that each $x^{(i)}$ is a $d \times 1$ column vector
• Given a training set $\mathcal{D}_m$ and a classifier $h$, we can define the training error of $h$ to be
$$\varepsilon_m(h) = \frac{1}{m} \sum_{i=1}^{m} \begin{cases} 1 & \text{if } h\left(x^{(i)}\right) \neq y^{(i)} \\ 0 & \text{otherwise} \end{cases}$$
• For now, we will try to find a classifier with small training error (and hope that it generalizes well to new data, i.e., has a small test error).
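The training error is just the fraction of misclassified examples. Below is a minimal NumPy sketch; the toy data and the helper name training_error are illustrative, not from the slides:

import numpy as np

def training_error(h, X, y):
    # Fraction of the m training examples (rows of X) that h misclassifies
    predictions = np.array([h(x) for x in X])
    return np.mean(predictions != y)

# Toy data: m x d matrix of inputs, labels in {-1, +1}
X = np.array([[1.0, 2.0], [-2.0, -1.0], [0.5, 1.5], [-0.5, -2.0]])
y = np.array([+1, -1, +1, -1])
h = lambda x: 1 if x.sum() > 0 else -1   # an arbitrary classifier
print(training_error(h, X, y))           # 0.0 for this toy data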
Learning algorithm

• A hypothesis class $\mathcal{H}$ is a set (finite or infinite) of possible classifiers,
each of which represents a mapping from $\mathbb{R}^d \to \{-1, +1\}$
• A learning algorithm is a procedure that takes a data set $\mathcal{D}_m$ as input
and returns an element $h \in \mathcal{H}$

$$\mathcal{D}_m \;\to\; \text{learning algorithm}(\mathcal{H}) \;\to\; h$$

• Choice of ℋ so as to get low test error


Hypothesis class: Linear classifiers
• A linear classifier in $d$ dimensions is defined by
• a vector of parameters $\theta \in \mathbb{R}^d$ and a scalar $\theta_0 \in \mathbb{R}$
(assume $x$ is a $d \times 1$ column vector)

$$h(x; \theta, \theta_0) = \operatorname{sign}\left(\theta^T x + \theta_0\right) = \begin{cases} +1 & \text{if } \theta^T x + \theta_0 > 0 \\ -1 & \text{otherwise} \end{cases}$$

$\theta, \theta_0$ specifies a hyperplane (decision boundary) that divides the instance space into two half-spaces.
Linear classifiers
$$h(x; \theta, \theta_0) = \operatorname{sign}\left(\theta^T x + \theta_0\right) = \begin{cases} +1 & \text{if } \theta^T x + \theta_0 > 0 \\ -1 & \text{otherwise} \end{cases}$$
• $\theta, \theta_0$ specifies a hyperplane that divides the instance space into two half-spaces.
• The half-space on the same side as the normal vector $\theta$ is the positive half-space, and we classify all points in that space as positive.
• The half-space on the other side is negative, and all points in it are classified as negative.
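Such a classifier is a one-liner in NumPy. A minimal sketch (variable names are illustrative), using the separator $x_2 = 1.7x_1 - 4.9$ from the next slide:

import numpy as np

def linear_classify(X, theta, theta0):
    # h(x; theta, theta0) = sign(theta^T x + theta0), applied to each row of X
    scores = X @ theta + theta0
    return np.where(scores > 0, 1, -1)   # points exactly on the boundary get -1

theta = np.array([1.7, -1.0])   # the hyperplane x2 = 1.7*x1 - 4.9
theta0 = -4.9
X = np.array([[3.0, -2.0],      # 1.7*3 + 2 - 4.9 =  2.2 -> +1
              [3.0,  2.0]])     # 1.7*3 - 2 - 4.9 = -1.8 -> -1
print(linear_classify(X, theta, theta0))   # [ 1 -1]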
Linear Classifier with Hard Threshold
• The linear separator in the associated figure is given by
$$x_2 = 1.7 x_1 - 4.9$$
$$\Rightarrow\; -4.9 + 1.7 x_1 - x_2 = 0$$
$$\Rightarrow\; \begin{bmatrix} -4.9 & 1.7 & -1 \end{bmatrix} \begin{bmatrix} x_0 \\ x_1 \\ x_2 \end{bmatrix} = 0, \quad \text{i.e. } \boldsymbol{\theta}^T \mathbf{x} = 0 \text{ (with } x_0 = 1\text{)}$$
Classification Rule:
$$y(x) = \begin{cases} +1 & \text{if } \theta^T x > 0 \\ -1 & \text{otherwise} \end{cases}$$
Linear Classifier with Hard Threshold
Classification Rule:
$$y(x) = \begin{cases} +1 & \text{if } \theta^T x > 0 \\ -1 & \text{otherwise} \end{cases}$$
We can think of $y$ as the result of passing the linear function $\theta^T x$ through a threshold function.
• Find the $\theta$ which minimizes classification error on the training set.
• We cannot use gradient descent with the above threshold function: its derivative is zero everywhere it is defined, so it provides no useful gradient signal.
Perceptron
$$\boldsymbol{\theta} = \begin{bmatrix} \theta_1 & \theta_2 & \ldots & \theta_d \end{bmatrix}^T \quad \text{and} \quad \boldsymbol{x} = \begin{bmatrix} x_1 & x_2 & \ldots & x_d \end{bmatrix}^T$$

$$z(\boldsymbol{x}) = \theta_0 + \sum_{i=1}^{d} \theta_i x_i = \boldsymbol{\theta}^T \boldsymbol{x} + b, \qquad y = g\left(z(\boldsymbol{x})\right), \qquad \theta_0 = b$$

[Figure: a single perceptron unit. Inputs $x_1, \ldots, x_d$ are weighted by $\theta_1, \ldots, \theta_d$, a bias $b$ is added, the sum $z(\boldsymbol{x})$ is passed through the activation $g(z)$, and the output is $y$.]

Terminologies
• $\boldsymbol{x}$: input, $\boldsymbol{\theta}$: weights, $b$: bias
• $z$: pre-activation (input activation)
• $g$: activation function
• $y$: activation (output activation)
Perceptron
$\boldsymbol{x} \in \mathbb{R}^d$ and $y \in \{0, 1\}$ for binary classification (Rosenblatt, 1957):
$$g(z) = \begin{cases} 1, & z \geq 0 \\ 0, & z < 0 \end{cases}$$
Or, the response may be taken as $y \in \{-1, 1\}$:
$$g(z) = \begin{cases} 1, & z \geq 0 \\ -1, & z < 0 \end{cases}$$

[Figure: the step activation function $g(z)$ plotted against $z$.]

The perceptron classification rule, thus, translates to
$$y = \begin{cases} 1, & \boldsymbol{\theta}^T \boldsymbol{x} + b \geq 0 \\ -1, & \boldsymbol{\theta}^T \boldsymbol{x} + b < 0 \end{cases}$$
$\boldsymbol{\theta}^T \boldsymbol{x} + b = 0$ represents a hyperplane.
Perceptron (Geometrically)
$$h_{\boldsymbol{\theta}}(\mathbf{x}) = \operatorname{sign}\left(\boldsymbol{\theta}^T \mathbf{x}\right)$$

• $\boldsymbol{\theta}^T \mathbf{x}$ is the (signed) distance of point $\mathbf{x}$ to the hyperplane, up to the scaling factor $\|\boldsymbol{\theta}\|$

• Example: http://mathinsight.org/distance_point_plane
Perceptron Learning Algorithm

Training Set: 𝑥𝑥 (1) , 𝑦𝑦 (1) , … , 𝑥𝑥 (𝑚𝑚) , 𝑦𝑦 (𝑚𝑚)


𝛉𝛉 + 𝐲𝐲𝐲𝐲
𝑥𝑥 (𝑖𝑖) ∈ 𝑅𝑅𝑑𝑑 ; 𝑦𝑦 (𝑖𝑖) ∈ −1, +1
𝛉𝛉
1. 𝑡𝑡 ← 1; 𝜃𝜃 (𝑡𝑡) = 𝟎𝟎
𝛉𝛉 +x pushes vector 𝛉𝛉 towards x
2. // Loop until all examples are correctly classified
While exists 𝑖𝑖 such that 𝑥𝑥 (𝑚𝑚) is not correctly classified 𝛉𝛉
Pick a 𝑗𝑗 s.t. 𝑦𝑦 (𝑗𝑗) . 𝜃𝜃 (𝑡𝑡) , 𝑥𝑥 (𝑗𝑗) ≤ 0) then
𝛉𝛉(𝑡𝑡+1) = 𝛉𝛉(𝑡𝑡) + 𝑦𝑦 (𝑗𝑗) 𝑥𝑥 (𝑗𝑗)
𝑡𝑡 ← 𝑡𝑡 + 1
𝛉𝛉 −x pushes vector 𝛉𝛉 away from x
3. Output 𝛉𝛉(𝑡𝑡)
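A direct translation of the algorithm above into NumPy (the bias is handled by appending a constant feature; the helper name perceptron and the toy data are illustrative). On linearly separable data the loop terminates with a separating theta:

import numpy as np

def perceptron(X, y, max_iters=1000):
    # Perceptron learning: X is m x d, y in {-1, +1}. Returns theta (bias as last entry).
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])   # append 1 for the bias term
    theta = np.zeros(Xb.shape[1])                   # step 1: theta = 0
    for _ in range(max_iters):
        margins = y * (Xb @ theta)                  # y^(j) <theta, x^(j)> for every example
        mistakes = np.where(margins <= 0)[0]        # misclassified (or on the boundary)
        if len(mistakes) == 0:                      # all examples correctly classified
            return theta
        j = mistakes[0]                             # pick one such j
        theta = theta + y[j] * Xb[j]                # theta <- theta + y^(j) x^(j)
    return theta

# Linearly separable toy data
X = np.array([[2.0, 1.0], [1.0, 3.0], [-1.0, -2.0], [-2.0, -1.0]])
y = np.array([+1, +1, -1, -1])
theta = perceptron(X, y)
print(np.sign(np.hstack([X, np.ones((4, 1))]) @ theta))   # matches y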
Perceptron Learning Algorithm
1. $t \leftarrow 1$; $\theta^{(t)} = \mathbf{0}$
2. While there exists $i$ such that $x^{(i)}$ is not correctly classified:
       Pick a $j$ s.t. $y^{(j)} \langle \theta^{(t)}, x^{(j)} \rangle \leq 0$
       $\boldsymbol{\theta}^{(t+1)} = \boldsymbol{\theta}^{(t)} + y^{(j)} x^{(j)}$
       $t \leftarrow t + 1$
3. Output $\boldsymbol{\theta}^{(t)}$

[Figure: linearly separable data with margin $\gamma$ and maximum data-point norm $\beta$.]

If such a separating hyperplane exists, then the data is said to be linearly separable.

Convergence Theorem
For a finite and linearly separable set of data, the perceptron learning algorithm will find a linear separator in at most $\dfrac{\beta^2}{\gamma^2}$ iterations, where $\beta$ is the maximum length of any data point and $\gamma$ is the maximum margin of the linear separators.

Perceptron Rule

• The Perceptron Learning Rule can find a linear separator, provided the data are linearly separable.
• For data that are not linearly separable, the Perceptron algorithm fails to converge.
Linear Classifiers by Gradient Descent
For a gradient-based optimization approach, we need to approximate the hard threshold function with something smooth, such as the logistic (sigmoid) function $\sigma(z)$.
• Logistic regression classifier:

$$\sigma(z) = \frac{1}{1 + e^{-z}}$$
$$y = \sigma\left(h_\theta(\mathbf{x})\right) = \sigma\left(\boldsymbol{\theta}^T \mathbf{x}\right), \qquad z = \boldsymbol{\theta}^T \mathbf{x}$$
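A minimal sketch of the logistic function in NumPy; the numerically stable split by sign is an implementation detail, not something from the slides:

import numpy as np

def sigmoid(z):
    # sigma(z) = 1 / (1 + exp(-z)), computed stably for large |z|
    z = np.asarray(z, dtype=float)
    out = np.empty_like(z)
    pos = z >= 0
    out[pos] = 1.0 / (1.0 + np.exp(-z[pos]))
    expz = np.exp(z[~pos])            # avoids overflow of exp(-z) for very negative z
    out[~pos] = expz / (1.0 + expz)
    return out

print(sigmoid(np.array([-5.0, 0.0, 5.0])))   # approx [0.0067, 0.5, 0.9933]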
Likelihood Function for Logistic Regression

• The probability that an example belongs to class 1 is
$$P\left(y^{(i)} = 1 \mid x^{(i)}; \boldsymbol{\theta}\right) = \sigma\left(\boldsymbol{\theta}^T \mathbf{x}^{(i)}\right)$$
Thus
$$P\left(y^{(i)} = 0 \mid x^{(i)}; \boldsymbol{\theta}\right) = 1 - \sigma\left(\boldsymbol{\theta}^T \mathbf{x}^{(i)}\right)$$
Writing $z^{(i)} = \theta^T x^{(i)}$, the two cases combine into a single expression:
$$P\left(y^{(i)} \mid x^{(i)}; \boldsymbol{\theta}\right) = \sigma\left(\boldsymbol{\theta}^T \mathbf{x}^{(i)}\right)^{y^{(i)}} \left(1 - \sigma\left(\boldsymbol{\theta}^T \mathbf{x}^{(i)}\right)\right)^{1 - y^{(i)}} = \sigma\left(z^{(i)}\right)^{y^{(i)}} \left(1 - \sigma\left(z^{(i)}\right)\right)^{1 - y^{(i)}}$$
Maximum Likelihood Estimation of Logistic Regression

The probability that an example belongs to class 1 is $P\left(y^{(i)} = 1 \mid x^{(i)}; \boldsymbol{\theta}\right) = \sigma\left(\boldsymbol{\theta}^T \mathbf{x}^{(i)}\right)$
Thus $P\left(y^{(i)} = 0 \mid x^{(i)}; \boldsymbol{\theta}\right) = 1 - \sigma\left(\boldsymbol{\theta}^T \mathbf{x}^{(i)}\right)$
Thus $P\left(y^{(i)} \mid x^{(i)}; \boldsymbol{\theta}\right) = \sigma\left(\boldsymbol{\theta}^T \mathbf{x}^{(i)}\right)^{y^{(i)}} \left(1 - \sigma\left(\boldsymbol{\theta}^T \mathbf{x}^{(i)}\right)\right)^{1 - y^{(i)}}$

The joint probability of all the labels is
$$\prod_{i=1}^{m} \sigma\left(\boldsymbol{\theta}^T \mathbf{x}^{(i)}\right)^{y^{(i)}} \left(1 - \sigma\left(\boldsymbol{\theta}^T \mathbf{x}^{(i)}\right)\right)^{1 - y^{(i)}}$$

So the log likelihood for logistic regression is given by
$$l(\theta) = \sum_{i=1}^{m} \left[ y^{(i)} \log \sigma\left(\boldsymbol{\theta}^T \mathbf{x}^{(i)}\right) + \left(1 - y^{(i)}\right) \log\left(1 - \sigma\left(\boldsymbol{\theta}^T \mathbf{x}^{(i)}\right)\right) \right]$$
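The log likelihood can be computed for the whole training set in a few NumPy lines. A sketch, assuming labels in {0, 1} and illustrative variable names:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def log_likelihood(theta, X, y):
    # l(theta) = sum_i [ y_i log sigma(theta^T x_i) + (1 - y_i) log(1 - sigma(theta^T x_i)) ]
    p = sigmoid(X @ theta)           # predicted P(y=1 | x) for each row of X
    eps = 1e-12                      # guard against log(0)
    return np.sum(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps))

X = np.array([[1.0, 2.0], [2.0, -1.0], [-1.0, -1.5]])
y = np.array([1, 0, 0])
print(log_likelihood(np.zeros(2), X, y))   # 3 * log(0.5), approx -2.079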
Maximum Likelihood Estimation of Logistic Regression

Derivative of the log likelihood w.r.t. one component of $\boldsymbol{\theta}$:
$$\frac{\partial l(\theta)}{\partial \theta_j} = \sum_{i=1}^{m} \frac{\partial}{\partial \theta_j}\left[ y^{(i)} \log \sigma\left(\theta^T x^{(i)}\right) + \left(1 - y^{(i)}\right) \log\left(1 - \sigma\left(\theta^T x^{(i)}\right)\right) \right] \quad \text{(derivative of sum of terms)}$$
$$= \sum_{i=1}^{m} \left[ \frac{y^{(i)}}{\sigma\left(\theta^T x^{(i)}\right)} - \frac{1 - y^{(i)}}{1 - \sigma\left(\theta^T x^{(i)}\right)} \right] \frac{\partial}{\partial \theta_j} \sigma\left(\theta^T x^{(i)}\right) \quad \text{(derivative of } \log f(x)\text{)}$$
$$= \sum_{i=1}^{m} \left[ \frac{y^{(i)}}{\sigma\left(\theta^T x^{(i)}\right)} - \frac{1 - y^{(i)}}{1 - \sigma\left(\theta^T x^{(i)}\right)} \right] \sigma\left(\theta^T x^{(i)}\right)\left(1 - \sigma\left(\theta^T x^{(i)}\right)\right) x_j^{(i)} \quad \text{(chain rule + derivative of } \sigma\text{)}$$
$$= \sum_{i=1}^{m} \left( y^{(i)} - \sigma\left(\theta^T x^{(i)}\right) \right) x_j^{(i)}$$
Calculating derivatives
• Since the likelihood function is a sum over all of the data, and the derivative of a sum is the sum of derivatives, we can focus on computing the derivative for one example. The gradient with respect to $\theta$ is simply the sum of this term over all training data points.
• The derivative for one data point $(x, y)$, with $p = \sigma(\theta^T x)$ and $z = \theta^T x$:
$$\frac{\partial l(\theta)}{\partial \theta_j} = \frac{\partial l(\theta)}{\partial p} \cdot \frac{\partial p}{\partial \theta_j} = \frac{\partial l(\theta)}{\partial p} \cdot \frac{\partial p}{\partial z} \cdot \frac{\partial z}{\partial \theta_j}$$
Derivative of logistic function

$$\frac{\partial}{\partial z} \sigma(z) = \frac{\partial}{\partial z} \left( \frac{1}{1 + e^{-z}} \right) = \frac{-1}{\left(1 + e^{-z}\right)^2} \cdot e^{-z} \cdot (-1)$$
$$= \frac{1}{1 + e^{-z}} \cdot \frac{e^{-z}}{1 + e^{-z}} = \sigma(z)\left(1 - \sigma(z)\right) = y(1 - y)$$
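As a quick sanity check of the identity sigma'(z) = sigma(z)(1 - sigma(z)), one can compare it with a finite-difference approximation. A sketch, not from the slides:

import numpy as np

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

z = 0.7
analytic = sigmoid(z) * (1.0 - sigmoid(z))               # sigma(z)(1 - sigma(z))
h = 1e-6
numeric = (sigmoid(z + h) - sigmoid(z - h)) / (2 * h)    # central finite difference
print(analytic, numeric)                                 # both approx 0.2217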
Calculating derivatives
$$\frac{\partial l(\theta)}{\partial \theta_j} = \frac{\partial l(\theta)}{\partial p} \cdot \frac{\partial p}{\partial z} \cdot \frac{\partial z}{\partial \theta_j}$$
With
$$l(\theta) = y \log p + (1 - y) \log(1 - p)$$
$$\frac{\partial l(\theta)}{\partial p} = \frac{y}{p} - \frac{1 - y}{1 - p}, \qquad \frac{\partial p}{\partial z} = \sigma(z)\left(1 - \sigma(z)\right), \qquad \frac{\partial z}{\partial \theta_j} = x_j$$
where $z = \theta^T x$ and $p = \sigma(z) = \sigma(\theta^T x)$. Putting the pieces together:
$$\frac{\partial l(\theta)}{\partial \theta_j} = \left( \frac{y}{p} - \frac{1 - y}{1 - p} \right) \sigma(z)\left(1 - \sigma(z)\right) x_j = \left( \frac{y}{p} - \frac{1 - y}{1 - p} \right) p(1 - p)\, x_j$$
$$= \left( y(1 - p) - p(1 - y) \right) x_j = (y - p)\, x_j = \left( y - \sigma(\theta^T x) \right) x_j$$
Gradient of Log Likelihood
$$\frac{\partial l(\theta)}{\partial \theta_j} = \sum_{i=1}^{m} \left( y^{(i)} - \sigma\left(\theta^T x^{(i)}\right) \right) x_j^{(i)}$$
• We need to choose the values of $\theta$ that maximize the log-likelihood.
• Unfortunately, if we try just setting the derivative equal to zero, there is no closed form for the maximum.
• However, we can find the best values of $\theta$ by using an optimization algorithm.
Gradient Ascent Optimization
$$\theta_j^{\text{new}} = \theta_j^{\text{old}} + \eta \cdot \frac{\partial l\left(\theta^{\text{old}}\right)}{\partial \theta_j} = \theta_j^{\text{old}} + \eta \sum_{i=1}^{m} \left( y^{(i)} - \sigma\left(\theta^T x^{(i)}\right) \right) x_j^{(i)}$$
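Putting the gradient and the update rule together gives a complete, if simplistic, training loop. A sketch in NumPy; the learning rate, iteration count, and toy data are chosen arbitrarily for illustration:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic_regression(X, y, eta=0.1, n_iters=1000):
    # Batch gradient ascent on the log likelihood. X is m x d, y in {0, 1}.
    theta = np.zeros(X.shape[1])
    for _ in range(n_iters):
        grad = X.T @ (y - sigmoid(X @ theta))   # dl/dtheta_j = sum_i (y_i - sigma(theta^T x_i)) x_ij
        theta = theta + eta * grad              # theta_new = theta_old + eta * gradient
    return theta

# Toy 1-D data (with a bias column): y tends to be 1 when the feature is positive
X = np.array([[-2.0, 1], [-1.0, 1], [-0.5, 1], [0.5, 1], [1.0, 1], [2.0, 1]])
y = np.array([0, 0, 0, 1, 1, 1])
theta = fit_logistic_regression(X, y)
print(np.round(sigmoid(X @ theta), 2))   # predicted probabilities approach the labels y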
Cross-Entropy Loss Function
• We need a loss function $L(\hat{y}, y)$ that expresses, for an observation $x$, how close the classifier output $\hat{y}$ is to the correct output $y$ (which is 0 or 1).
• We want a loss function that prefers the correct class labels of the training examples to be more likely. This is called conditional maximum likelihood estimation: we choose the parameters that maximize the log probability of the true $y$ labels in the training data given the observations $x$.
• The resulting loss function is the negative log likelihood loss, generally called the cross-entropy loss.
• Minimizing the negative log likelihood corresponds to maximizing the likelihood. This error function $L(\hat{y}, y)$ is typically known as the cross-entropy error function (also known as log-loss):
Cross entropy loss

$$p(y \mid x) = \hat{y}^{\,y} \left(1 - \hat{y}\right)^{1-y}$$

log likelihood:
$$\log p(y \mid x) = y \log \hat{y} + (1 - y) \log(1 - \hat{y})$$

cross-entropy loss:
$$L_{CE}(\hat{y}, y) = -\log p(y \mid x) = -\left[ y \log \hat{y} + (1 - y) \log(1 - \hat{y}) \right]$$
$$L_{CE}(\hat{y}, y) = -\left[ y \log \sigma\left(\theta^T x\right) + (1 - y) \log\left(1 - \sigma\left(\theta^T x\right)\right) \right]$$
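The per-example cross-entropy loss is simply the negative of the per-example log likelihood. A minimal sketch (the clipping constant eps is an implementation detail, not from the slides):

import numpy as np

def cross_entropy_loss(y_hat, y, eps=1e-12):
    # L_CE(y_hat, y) = -[ y log y_hat + (1 - y) log(1 - y_hat) ] for y in {0, 1}
    y_hat = np.clip(y_hat, eps, 1.0 - eps)   # avoid log(0)
    return -(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

print(cross_entropy_loss(0.9, 1))   # approx 0.105 (confident and correct: small loss)
print(cross_entropy_loss(0.9, 0))   # approx 2.303 (confident and wrong: large loss)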
Cross Entropy vs. Square Error

[Figure: total loss plotted as a function of the parameters $w_1, w_2$, comparing the cross-entropy loss surface with the square-error loss surface.]

http://jmlr.org/proceedings/papers/v9/glorot10a/glorot10a.pdf
Gradient Descent
Multiclass Classification
Multi-class Classification
$$C_1: \theta_1, b_1 \qquad z_1 = \theta_1 \cdot x + b_1$$
$$C_2: \theta_2, b_2 \qquad z_2 = \theta_2 \cdot x + b_2$$
$$C_3: \theta_3, b_3 \qquad z_3 = \theta_3 \cdot x + b_3$$
Probability: $0 < \hat{y}_i < 1$, $\;\sum_i \hat{y}_i = 1$, $\;\hat{y}_i = P(C_i \mid x)$

Softmax
$$\hat{y}_i = e^{z_i} \Big/ \sum_{j=1}^{3} e^{z_j}$$
Example: $z_1 = 3 \Rightarrow e^{z_1} \approx 20 \Rightarrow \hat{y}_1 \approx 0.88$; $\;z_2 = 1 \Rightarrow e^{z_2} \approx 2.7 \Rightarrow \hat{y}_2 \approx 0.12$; $\;z_3 = -3 \Rightarrow e^{z_3} \approx 0.05 \Rightarrow \hat{y}_3 \approx 0$
[Bishop, P209-210]
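A sketch of the softmax on the slide's example logits; subtracting the maximum before exponentiating is a standard numerical-stability trick, not part of the slide:

import numpy as np

def softmax(z):
    # y_hat_i = exp(z_i) / sum_j exp(z_j), computed stably by shifting by max(z)
    e = np.exp(z - np.max(z))   # the shift does not change the result
    return e / e.sum()

z = np.array([3.0, 1.0, -3.0])   # the logits used in the slide's example
print(np.round(softmax(z), 2))   # approx [0.88, 0.12, 0.00]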
Multi-class Classification
$$z_1 = \theta_1 \cdot x + b_1, \qquad z_2 = \theta_2 \cdot x + b_2, \qquad z_3 = \theta_3 \cdot x + b_3$$
The scores $z_1, z_2, z_3$ are passed through a softmax to give the predicted probabilities $\hat{y}_1, \hat{y}_2, \hat{y}_3$, which are compared against the target $y$ using the cross entropy
$$-\sum_{i=1}^{3} y_i \ln \hat{y}_i$$

Target (one-hot):
If $x \in$ class 1: $y = [1, 0, 0]^T$ and the loss is $-\ln \hat{y}_1$
If $x \in$ class 2: $y = [0, 1, 0]^T$ and the loss is $-\ln \hat{y}_2$
If $x \in$ class 3: $y = [0, 0, 1]^T$ and the loss is $-\ln \hat{y}_3$
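Combining softmax with the one-hot cross entropy above gives the multi-class loss. A sketch with illustrative logits and a one-hot target:

import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

def multiclass_cross_entropy(z, y_onehot):
    # - sum_i y_i ln(y_hat_i), where y_hat = softmax(z) and y_onehot is a one-hot target
    y_hat = softmax(z)
    return -np.sum(y_onehot * np.log(y_hat + 1e-12))

z = np.array([3.0, 1.0, -3.0])
print(multiclass_cross_entropy(z, np.array([1, 0, 0])))   # small: true class gets prob ~0.88
print(multiclass_cross_entropy(z, np.array([0, 0, 1])))   # large: true class gets prob ~0.002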
