
CSD456 Deep Learning

Linear Classification
Classification

• Classification is a type of supervised learning used to predict discrete outcomes.
• It estimates the relationship between input features and a categorical target.
• Common in various fields like finance, biology, and economics.

Examples
o Does this email belong in the spam folder or the inbox?
o Is this customer more likely to sign up or not to sign up for a
subscription service?
o Does this image depict a donkey, a dog, a cat, or a rooster?
Classification
• A simple image classification problem:
o Each input consists of a 2 × 2 grayscale image.
o We represent each pixel value with a single scalar, giving us four features x1, x2, x3, x4.
o We assume that each image belongs to one of the categories “cat”, “chicken”, and “dog”.

[Figure: a linear decision boundary in the (x1, x2) feature plane with weight vector W; the boundary intersects the x1-axis at −b/w1.]
Representation of Labels

o If the categories have some natural ordering among them:
• Cast the problem as regression and keep the labels in their numeric format.
o If the categories have no natural ordering among them:
• Use one-hot encoding: a vector with as many components as we have categories.
• The component corresponding to the instance’s category is set to 1 and all other components are set to 0.
• In our case, a label y would be a three-dimensional vector, with (1,0,0) = “cat”, (0,1,0) = “chicken”, and (0,0,1) = “dog”:

y ∈ {(1,0,0), (0,1,0), (0,0,1)}
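
A minimal NumPy sketch of one-hot encoding for the cat/chicken/dog example (the raw labels below are made up for illustration):

import numpy as np

classes = ["cat", "chicken", "dog"]          # fixed category order
labels  = ["dog", "cat", "cat", "chicken"]   # example raw labels (assumed data)

# Map each label to its index, then pick the matching row of the identity matrix.
indices = np.array([classes.index(lbl) for lbl in labels])
one_hot = np.eye(len(classes))[indices]      # shape (4, 3)

print(one_hot)
# [[0. 0. 1.]
#  [1. 0. 0.]
#  [1. 0. 0.]
#  [0. 1. 0.]]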


Classification as a Regression problem
• Deep models learn a continuous function y = F(x; w).
• But in classification we need a discrete output.
• How do we get a discrete output from a continuous output?
• One possible solution is to constrain the continuous output.
• For binary classification, this constraint can be the sigmoid function.
• For multi-class classification, it is the softmax.
• That is why multi-class classification is also called softmax regression.
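
A rough NumPy illustration of the binary case (the scores are made up): the sigmoid squashes an unbounded score into (0, 1), which we can then threshold to obtain a discrete class; the softmax version appears later in these slides.

import numpy as np

def sigmoid(z):
    # Maps any real-valued score to the interval (0, 1).
    return 1.0 / (1.0 + np.exp(-z))

scores = np.array([-2.0, 0.3, 4.1])      # hypothetical continuous outputs F(x; w)
probs  = sigmoid(scores)                  # approx. [0.12, 0.57, 0.98]
preds  = (probs >= 0.5).astype(int)       # discrete labels: [0, 1, 1]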
Classification Network Architecture

• To estimate the conditional probabilities associated with all the possible classes,
we need a model with multiple outputs, one per class.
• To address classification with linear models, we will need as many affine functions as
we have outputs.
o Each output will correspond to its own affine function.
• In our case, since we have 4 features and 3 possible output categories, we will need
o 12 scalars to represent the weights (𝑤 with subscripts),
o 3 scalars to represent the biases (𝑏 with subscripts).
• We compute these three logits, o1, o2, and o3, for each input:

o1 = x1 w11 + x2 w12 + x3 w13 + x4 w14 + b1
o2 = x1 w21 + x2 w22 + x3 w23 + x4 w24 + b2
o3 = x1 w31 + x2 w32 + x3 w33 + x4 w34 + b3    -- (3.4.2)
Classification Network Architecture
• We can depict the calculation in (3.4.2) with the neural network diagram shown in
Fig. 3.4.1.
o Softmax regression is a single-layer neural network.
o The output layer of softmax regression can be described as a fully-connected layer.

• We can use linear algebra notation: o = Wx + b

o This form is better suited both for mathematics and for writing code.
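
A minimal NumPy sketch of o = Wx + b for the 4-feature, 3-class case (the weights and inputs are made up):

import numpy as np

x = np.array([0.2, 0.5, 0.1, 0.9])      # four pixel features x1..x4
W = np.random.randn(3, 4) * 0.01        # one row of weights per output class
b = np.zeros(3)                          # one bias per output class

o = W @ x + b                            # three logits o1, o2, o3, as in (3.4.2)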
Softmax Operation

• To interpret our outputs as probabilities,


o We must guarantee that, even on new data, they will be nonnegative and sum up to 1.
o We need a training objective that encourages the model to estimate probabilities faithfully.
• To transform our logits to become nonnegative and sum to 1, while requiring that the
model remains differentiable,
o We first exponentiate each logit (ensuring non-negativity)
o Then divide by their sum (ensuring that they sum to 1)

ŷ = softmax(o), where ŷ_j = exp(o_j) / Σ_k exp(o_k)    --(3.4.3)

o It is easy to see ŷ1 + ŷ2 + ŷ3 = 1 with 0 ≤ ŷ_j ≤ 1 for all j.
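
A minimal NumPy sketch of (3.4.3); subtracting max(o) before exponentiating is a standard numerical-stability trick (an assumption beyond the slide) and does not change the result:

import numpy as np

def softmax(o):
    # Exponentiate (nonnegative), then normalize so the entries sum to 1.
    exp_o = np.exp(o - np.max(o))        # shift by max(o) to avoid overflow
    return exp_o / exp_o.sum()

o = np.array([2.0, 1.0, 0.1])             # hypothetical logits
y_hat = softmax(o)                         # approx. [0.66, 0.24, 0.10], sums to 1
assert np.isclose(y_hat.sum(), 1.0)
assert np.argmax(y_hat) == np.argmax(o)    # softmax preserves the ordering of the logits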


Softmax Operation

• The softmax operation does not change the ordering among the logits 𝑜.
o Therefore, during prediction we can still pick out the most likely class by
argmax_j ŷ_j = argmax_j o_j    --(3.4.4)

• Although softmax is a nonlinear function, the outputs of softmax regression are still determined by an
affine transformation of input features; thus, softmax regression is a linear model.
Vectorization of Minibatches
• Assume we are given a minibatch 𝑿 of examples with feature dimensionality (number of
inputs) 𝑑 and batch size 𝑛.
• Moreover, assume that we have 𝑞 categories in the output. Then we have:
o The minibatch features 𝑿 are in ℝ𝑛×𝑑 .
o The weights 𝑾 ∈ ℝ𝑑×𝑞 .
o The bias satisfies 𝑏 ∈ ℝ1×𝑞 .
O = XW + b
Ŷ = softmax(O)    -- (3.4.5)
• This turns the dominant operation into a single matrix-matrix product XW, rather than the many matrix-vector products we would execute if we processed one example at a time.
• Since each row in X represents a data example, the softmax operation itself can be computed row-wise:
o For each row of O, exponentiate all entries and then normalize them by the sum.
• The bias b is added to XW by broadcasting; both the minibatch logits O and the output probabilities Ŷ are n × q matrices.
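
A minimal sketch of (3.4.5), assuming NumPy and made-up sizes n = 2, d = 4, q = 3:

import numpy as np

n, d, q = 2, 4, 3
X = np.random.rand(n, d)                  # minibatch of features, shape (n, d)
W = np.random.randn(d, q) * 0.01          # weights, shape (d, q)
b = np.zeros((1, q))                       # bias row, broadcast over the batch

O = X @ W + b                              # logits, shape (n, q)
exp_O = np.exp(O - O.max(axis=1, keepdims=True))
Y_hat = exp_O / exp_O.sum(axis=1, keepdims=True)   # row-wise softmax, shape (n, q)

assert np.allclose(Y_hat.sum(axis=1), 1.0)          # each row sums to 1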
Loss function

• Next, we need a loss function to measure the quality of our predicted probabilities.
• We will rely on maximum likelihood estimation
o The very same concept that we encountered when providing a probabilistic
justification for the mean squared error objective in linear regression
• The softmax function gives us a vector ŷ, which we can interpret as estimated conditional probabilities of each class given any input x, e.g.

ŷ1 = P(y = cat | x).


Loss function
• Suppose that the entire dataset {X, Y} has n examples, where the example indexed by i consists of a feature vector x^{(i)} and a one-hot label vector y^{(i)}.
• We can compare the estimates with reality by checking how probable the actual classes are according to our model, given the features:

P(Y | X) = Π_{i=1}^{n} P(y^{(i)} | x^{(i)})    -- (3.4.6)

• According to maximum likelihood estimation, we maximize P(Y | X), which is equivalent to minimizing the negative log-likelihood:

−log P(Y | X) = Σ_{i=1}^{n} −log P(y^{(i)} | x^{(i)}) = Σ_{i=1}^{n} l(y^{(i)}, ŷ^{(i)})    -- (3.4.7)

where for any pair of label y and model prediction ŷ over q classes, the loss function l is

l(y, ŷ) = − Σ_{j=1}^{q} y_j log ŷ_j    -- (3.4.8)
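
A minimal NumPy sketch of (3.4.8) for a single one-hot label; the probability values and the eps guard against log(0) are assumptions for illustration:

import numpy as np

def cross_entropy(y, y_hat, eps=1e-12):
    # l(y, y_hat) = - sum_j y_j * log(y_hat_j); eps guards against log(0).
    return -np.sum(y * np.log(y_hat + eps))

y     = np.array([1.0, 0.0, 0.0])          # one-hot label: "cat"
y_hat = np.array([0.7, 0.2, 0.1])          # predicted probabilities
loss  = cross_entropy(y, y_hat)            # = -log(0.7), approx. 0.357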
Loss function : Log Likelihood

• The loss function in (3.4.8) is commonly called the cross-entropy loss.


• Since 𝑦 is a one-hot vector of length 𝑞, the sum over all its coordinates 𝑗 vanishes for
all but one term.

• Since all ŷ_j are predicted probabilities, their logarithm is never larger than 0.
o Consequently, the loss function cannot be minimized any further if we correctly predict the actual label with certainty, i.e., if the predicted probability P(y | x) = 1 for the actual label y.
o Note that this is often impossible. For example, there might be label noise in
the dataset.
o It may also not be possible when the input features are not sufficiently
informative to classify every example perfectly.
Loss function: Softmax derivatives
Plugging (3.4.3) into the definition of the loss in (3.4.8) and using the definition of the
softmax we obtain:
l(y, ŷ) = − Σ_{j=1}^{q} y_j log ( exp(o_j) / Σ_{k=1}^{q} exp(o_k) )

        = Σ_{j=1}^{q} y_j log Σ_{k=1}^{q} exp(o_k) − Σ_{j=1}^{q} y_j o_j

        = log Σ_{k=1}^{q} exp(o_k) − Σ_{j=1}^{q} y_j o_j    --(3.4.9)
• Consider the derivative with respect to any logit o_j. We get

∂l(y, ŷ)/∂o_j = exp(o_j) / Σ_{k=1}^{q} exp(o_k) − y_j = softmax(o)_j − y_j    --(3.4.10)
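
A rough NumPy sketch of (3.4.9) and (3.4.10), checking the analytic gradient softmax(o) − y against a finite-difference estimate (the logits and label are made up):

import numpy as np

def softmax(o):
    exp_o = np.exp(o - np.max(o))
    return exp_o / exp_o.sum()

def loss(o, y):
    # l(y, softmax(o)) = log(sum_k exp(o_k)) - sum_j y_j * o_j, as in (3.4.9)
    return np.log(np.sum(np.exp(o))) - np.dot(y, o)

o = np.array([2.0, 1.0, 0.1])
y = np.array([0.0, 1.0, 0.0])

grad = softmax(o) - y                      # analytic gradient from (3.4.10)

# Finite-difference check on the first logit.
eps = 1e-6
e0 = np.zeros_like(o); e0[0] = eps
numeric = (loss(o + e0, y) - loss(o - e0, y)) / (2 * eps)
assert np.isclose(grad[0], numeric, atol=1e-5)

This "prediction minus one-hot label" form is why the gradient of softmax regression is so simple to compute.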
Cross Entropy Loss
• Consider the case where we observe not just a single outcome but an entire
distribution over outcomes.
o We can use the same representation as before for the label 𝑦.
o The math that we used previously to define the loss l in (3.4.8) still works out fine, just that the interpretation is slightly more general.
o The only difference is that rather than a vector containing only binary entries, say (0, 0, 1), we now have a generic probability vector, say (0.1, 0.2, 0.7).

• The expected value of the loss for a distribution over labels is called the cross-entropy loss, and it is one of the most commonly used losses for classification problems.
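
The same loss expression from (3.4.8) applies unchanged when the label is a generic probability vector; a small sketch with the (0.1, 0.2, 0.7) example (the predicted probabilities are made up):

import numpy as np

y     = np.array([0.1, 0.2, 0.7])          # a distribution over labels
y_hat = np.array([0.2, 0.2, 0.6])          # predicted probabilities
loss  = -np.sum(y * np.log(y_hat))         # expected surprisal under y, approx. 0.840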
Information Theory Basics: Entropy
• The central idea in information theory is to quantify the information content in data.
o This quantity places a hard limit on our ability to compress the data.
o This quantity is called the entropy of a distribution P, and it is captured by the following equation:

H[P] = Σ_j −P(j) log P(j)    -- (3.4.11)


• One of the fundamental theorems of information theory states that in order to encode data drawn randomly from the distribution P, we need at least H[P] “nats” to encode it.
o A “nat” is the equivalent of a bit, but when using a code with base e rather than one with base 2.
o Thus, one nat is 1/log 2 ≈ 1.44 bits.
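
A minimal NumPy sketch of (3.4.11) with a made-up distribution, including the nat-to-bit conversion:

import numpy as np

def entropy(P):
    # H[P] = sum_j -P(j) * log P(j), measured in nats (natural log)
    P = np.asarray(P)
    return -np.sum(P * np.log(P))

H = entropy([0.1, 0.2, 0.7])               # approx. 0.802 nats
H_bits = H / np.log(2)                      # approx. 1.157 bits (1 nat is about 1.44 bits)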
Information Theory Basics: Surprisal

• What does compression have to do with prediction? Imagine that we have a stream of data that we want to compress.
o If it is always easy for us to predict the next token, then this data is easy to compress!
o If we cannot perfectly predict every event, then we might sometimes be surprised. Our surprise is greater when we assigned an event a lower probability.

https://medium.com/udacity/shannon-entropy-information-gain-and-picking-balls-from-buckets-5810d35d54b4

• Claude Shannon settled on log(1/P(j)) = −log P(j) to quantify one’s surprisal at observing an event j having assigned it a (subjective) probability P(j).
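
A small sketch (assumed probabilities) tying surprisal to entropy: H[P] from (3.4.11) is exactly the expected surprisal under P:

import numpy as np

P = np.array([0.1, 0.2, 0.7])
surprisal = -np.log(P)                     # surprise at each event j: approx. [2.30, 1.61, 0.36]
H = np.sum(P * surprisal)                  # expected surprisal, approx. 0.802 nats = H[P]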
