
CSD456 Deep Learning

Linear Classification
Classification

• Classification is a type of supervised learning used to predict discrete outcomes.
• It estimates the relationship between input features and a categorical target.
• Common in various fields like finance, biology, and economics.

Examples
o Does this email belong in the spam folder or the inbox?
o Is this customer more likely to sign up or not to sign up for a
subscription service?
o Does this image depict a donkey, a dog, a cat, or a rooster?
Classification
• A simple image classification problem:
o Each input consists of a 2 × 2 grayscale image.
o We represent each pixel value with a single scalar, giving us four features x1, x2, x3, x4.
o We assume that each image belongs to one of the categories “cat”, “chicken”, and “dog”.

[Figure: a linear decision boundary in the (x1, x2) feature plane with weight vector W; the boundary intersects the x1-axis at −b/w1.]
Representation of Labels

o If the categories have some natural ordering among them:
• Cast the problem as regression and keep the labels in their numeric format.
o If the categories have no natural ordering among them:
• Use one-hot encoding: a vector with as many components as we have categories.
• The component corresponding to the instance’s category is set to 1 and all other components are set to 0.
• In our case, a label y would be a three-dimensional vector, with (1,0,0) = “cat”, (0,1,0) = “chicken”, and (0,0,1) = “dog”:

y ∈ {(1,0,0), (0,1,0), (0,0,1)}
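
A minimal NumPy sketch of one-hot encoding for the cat/chicken/dog example (the raw labels below are made up for illustration):

import numpy as np

classes = ["cat", "chicken", "dog"]          # fixed category order
labels  = ["dog", "cat", "cat", "chicken"]   # example raw labels (assumed data)

# Map each label to its index, then pick the matching row of the identity matrix.
indices = np.array([classes.index(lbl) for lbl in labels])
one_hot = np.eye(len(classes))[indices]      # shape (4, 3)

print(one_hot)
# [[0. 0. 1.]
#  [1. 0. 0.]
#  [1. 0. 0.]
#  [0. 1. 0.]]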


Classification as a Regression problem
• Deep models learn a continuous function y = F(x; w).
• But in classification we need a discrete output.
• How do we get a discrete output from a continuous output?
• One possible solution is to constrain the continuous output.
• For binary classification, this constraint can be the sigmoid function.
• For multi-class classification, it is the softmax.
• That is why multi-class classification is also called softmax regression.
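
A rough NumPy illustration of the binary case (the scores are made up): the sigmoid squashes an unbounded score into (0, 1), which we can then threshold to obtain a discrete class; the softmax version appears later in these slides.

import numpy as np

def sigmoid(z):
    # Maps any real-valued score to the interval (0, 1).
    return 1.0 / (1.0 + np.exp(-z))

scores = np.array([-2.0, 0.3, 4.1])      # hypothetical continuous outputs F(x; w)
probs  = sigmoid(scores)                  # approx. [0.12, 0.57, 0.98]
preds  = (probs >= 0.5).astype(int)       # discrete labels: [0, 1, 1]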
Classification Network Architecture

• To estimate the conditional probabilities associated with all the possible classes,
we need a model with multiple outputs, one per class.
• To address classification with linear models, we will need as many affine functions as
we have outputs.
o Each output will correspond to its own affine function.
• In our case, since we have 4 features and 3 possible output categories, we will need
o 12 scalars to represent the weights (𝑤 with subscripts),
o 3 scalars to represent the biases (𝑏 with subscripts).
• We compute these three logits, o1, o2, and o3, for each input:

o1 = x1 w11 + x2 w12 + x3 w13 + x4 w14 + b1
o2 = x1 w21 + x2 w22 + x3 w23 + x4 w24 + b2
o3 = x1 w31 + x2 w32 + x3 w33 + x4 w34 + b3    -- (3.4.2)
Classification Network Architecture
• We can depict the calculation in (3.4.2) with the neural network diagram shown in
Fig. 3.4.1.
o Softmax regression is a single-layer neural network.
o The output layer of softmax regression can be described as a fully-connected layer.

• We can use linear algebra notation: o = Wx + b

o This form is better suited both for mathematics and for writing code.
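
A minimal NumPy sketch of o = Wx + b for the 4-feature, 3-class case (the weights and inputs are made up):

import numpy as np

x = np.array([0.2, 0.5, 0.1, 0.9])      # four pixel features x1..x4
W = np.random.randn(3, 4) * 0.01        # one row of weights per output class
b = np.zeros(3)                          # one bias per output class

o = W @ x + b                            # three logits o1, o2, o3, as in (3.4.2)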
Softmax Operation

• To interpret our outputs as probabilities,


o We must guarantee that, even on new data, they will be nonnegative and sum up to 1.
o We need a training objective that encourages the model to estimate probabilities faithfully.
• To transform our logits to become nonnegative and sum to 1, while requiring that the
model remains differentiable,
o We first exponentiate each logit (ensuring non-negativity)
o Then divide by their sum (ensuring that they sum to 1)

ŷ = softmax(o), where ŷ_j = exp(o_j) / Σ_k exp(o_k)    --(3.4.3)

o It is easy to see ŷ1 + ŷ2 + ŷ3 = 1 with 0 ≤ ŷ_j ≤ 1 for all j.
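
A minimal NumPy sketch of (3.4.3); subtracting max(o) before exponentiating is a standard numerical-stability trick (an assumption beyond the slide) and does not change the result:

import numpy as np

def softmax(o):
    # Exponentiate (nonnegative), then normalize so the entries sum to 1.
    exp_o = np.exp(o - np.max(o))        # shift by max(o) to avoid overflow
    return exp_o / exp_o.sum()

o = np.array([2.0, 1.0, 0.1])             # hypothetical logits
y_hat = softmax(o)                         # approx. [0.66, 0.24, 0.10], sums to 1
assert np.isclose(y_hat.sum(), 1.0)
assert np.argmax(y_hat) == np.argmax(o)    # softmax preserves the ordering of the logits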


Softmax Operation

• The softmax operation does not change the ordering among the logits 𝑜.
o Therefore, during prediction we can still pick out the most likely class by
argmax_j ŷ_j = argmax_j o_j    --(3.4.4)

• Although softmax is a nonlinear function, the outputs of softmax regression are still determined by an
affine transformation of input features; thus, softmax regression is a linear model.
Vectorization of Minibatches
• Assume we are given a minibatch 𝑿 of examples with feature dimensionality (number of
inputs) 𝑑 and batch size 𝑛.
• Moreover, assume that we have 𝑞 categories in the output. Then we have:
o The minibatch features 𝑿 are in ℝ𝑛×𝑑 .
o The weights 𝑾 ∈ ℝ𝑑×𝑞 .
o The bias satisfies 𝑏 ∈ ℝ1×𝑞 .
O = XW + b
Ŷ = softmax(O)    -- (3.4.5)
• This turns the dominant operation into a single matrix-matrix product XW, rather than the many matrix-vector products we would execute if we processed one example at a time.
• Since each row in X represents a data example, the softmax operation itself can be computed row-wise:
o For each row of O, exponentiate all entries and then normalize them by the sum.
• The bias b is added to XW by broadcasting; both the minibatch logits O and the output probabilities Ŷ are n × q matrices.
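
A minimal sketch of (3.4.5), assuming NumPy and made-up sizes n = 2, d = 4, q = 3:

import numpy as np

n, d, q = 2, 4, 3
X = np.random.rand(n, d)                  # minibatch of features, shape (n, d)
W = np.random.randn(d, q) * 0.01          # weights, shape (d, q)
b = np.zeros((1, q))                       # bias row, broadcast over the batch

O = X @ W + b                              # logits, shape (n, q)
exp_O = np.exp(O - O.max(axis=1, keepdims=True))
Y_hat = exp_O / exp_O.sum(axis=1, keepdims=True)   # row-wise softmax, shape (n, q)

assert np.allclose(Y_hat.sum(axis=1), 1.0)          # each row sums to 1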
Loss function

• Next, we need a loss function to measure the quality of our predicted probabilities.
• We will rely on maximum likelihood estimation
o The very same concept that we encountered when providing a probabilistic
justification for the mean squared error objective in linear regression
• The softmax function gives us a vector ŷ, which we can interpret as estimated conditional probabilities of each class given any input x, e.g.

ŷ1 = P(y = cat | x).


Loss function
• Suppose that the entire dataset {X, Y} has n examples, where the example indexed by i consists of a feature vector x^{(i)} and a one-hot label vector y^{(i)}.
• We can compare the estimates with reality by checking how probable the actual classes are according to our model, given the features:

P(Y | X) = Π_{i=1}^{n} P(y^{(i)} | x^{(i)})    -- (3.4.6)

• According to maximum likelihood estimation, we maximize P(Y | X), which is equivalent to minimizing the negative log-likelihood:

−log P(Y | X) = Σ_{i=1}^{n} −log P(y^{(i)} | x^{(i)}) = Σ_{i=1}^{n} l(y^{(i)}, ŷ^{(i)})    -- (3.4.7)

where for any pair of label y and model prediction ŷ over q classes, the loss function l is

l(y, ŷ) = − Σ_{j=1}^{q} y_j log ŷ_j    -- (3.4.8)
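
A minimal NumPy sketch of (3.4.8) for a single one-hot label; the probability values and the eps guard against log(0) are assumptions for illustration:

import numpy as np

def cross_entropy(y, y_hat, eps=1e-12):
    # l(y, y_hat) = - sum_j y_j * log(y_hat_j); eps guards against log(0).
    return -np.sum(y * np.log(y_hat + eps))

y     = np.array([1.0, 0.0, 0.0])          # one-hot label: "cat"
y_hat = np.array([0.7, 0.2, 0.1])          # predicted probabilities
loss  = cross_entropy(y, y_hat)            # = -log(0.7), approx. 0.357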
Loss function : Log Likelihood

• The loss function in (3.4.8) is commonly called the cross-entropy loss.


• Since 𝑦 is a one-hot vector of length 𝑞, the sum over all its coordinates 𝑗 vanishes for
all but one term.

• Since all ŷ_j are predicted probabilities, their logarithm is never larger than 0.
o Consequently, the loss function cannot be minimized any further if we correctly predict the actual label with certainty, i.e., if the predicted probability P(y | x) = 1 for the actual label y.
o Note that this is often impossible. For example, there might be label noise in
the dataset.
o It may also not be possible when the input features are not sufficiently
informative to classify every example perfectly.
Loss function: Softmax derivatives
Plugging (3.4.3) into the definition of the loss in (3.4.8) and using the definition of the
softmax we obtain:
l(y, ŷ) = − Σ_{j=1}^{q} y_j log ( exp(o_j) / Σ_{k=1}^{q} exp(o_k) )

        = Σ_{j=1}^{q} y_j log Σ_{k=1}^{q} exp(o_k) − Σ_{j=1}^{q} y_j o_j

        = log Σ_{k=1}^{q} exp(o_k) − Σ_{j=1}^{q} y_j o_j    --(3.4.9)
• Consider the derivative with respect to any logit o_j. We get

∂l(y, ŷ)/∂o_j = exp(o_j) / Σ_{k=1}^{q} exp(o_k) − y_j = softmax(o)_j − y_j    --(3.4.10)
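
A rough NumPy sketch of (3.4.9) and (3.4.10), checking the analytic gradient softmax(o) − y against a finite-difference estimate (the logits and label are made up):

import numpy as np

def softmax(o):
    exp_o = np.exp(o - np.max(o))
    return exp_o / exp_o.sum()

def loss(o, y):
    # l(y, softmax(o)) = log(sum_k exp(o_k)) - sum_j y_j * o_j, as in (3.4.9)
    return np.log(np.sum(np.exp(o))) - np.dot(y, o)

o = np.array([2.0, 1.0, 0.1])
y = np.array([0.0, 1.0, 0.0])

grad = softmax(o) - y                      # analytic gradient from (3.4.10)

# Finite-difference check on the first logit.
eps = 1e-6
e0 = np.zeros_like(o); e0[0] = eps
numeric = (loss(o + e0, y) - loss(o - e0, y)) / (2 * eps)
assert np.isclose(grad[0], numeric, atol=1e-5)

This "prediction minus one-hot label" form is why the gradient of softmax regression is so simple to compute.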
Cross Entropy Loss
• Consider the case where we observe not just a single outcome but an entire
distribution over outcomes.
o We can use the same representation as before for the label 𝑦.
o The math that we used previously to define the loss l in (3.4.8) still works out fine, just that the interpretation is slightly more general.
o The only difference is that rather than a vector containing only binary entries, say (0, 0, 1), we now have a generic probability vector, say (0.1, 0.2, 0.7).

• The expected value of the loss for a distribution over labels is called the cross-entropy loss, and it is one of the most commonly used losses for classification problems.
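
The same loss expression from (3.4.8) applies unchanged when the label is a generic probability vector; a small sketch with the (0.1, 0.2, 0.7) example (the predicted probabilities are made up):

import numpy as np

y     = np.array([0.1, 0.2, 0.7])          # a distribution over labels
y_hat = np.array([0.2, 0.2, 0.6])          # predicted probabilities
loss  = -np.sum(y * np.log(y_hat))         # expected surprisal under y, approx. 0.840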
Information Theory Basics: Entropy
• The central idea in information theory is to quantify the information content in data.
o This quantity places a hard limit on our ability to compress the data.
o This quantity is called the entropy of a distribution P, and it is captured by the following equation:

H[P] = Σ_j −P(j) log P(j)    -- (3.4.11)


• One of the fundamental theorems of information theory states that in order to encode data drawn randomly from the distribution P, we need at least H[P] “nats” to encode it.
o A “nat” is the equivalent of a bit, but when using a code with base e rather than one with base 2.
o Thus, one nat is 1/log 2 ≈ 1.44 bits.
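
A minimal NumPy sketch of (3.4.11) with a made-up distribution, including the nat-to-bit conversion:

import numpy as np

def entropy(P):
    # H[P] = sum_j -P(j) * log P(j), measured in nats (natural log)
    P = np.asarray(P)
    return -np.sum(P * np.log(P))

H = entropy([0.1, 0.2, 0.7])               # approx. 0.802 nats
H_bits = H / np.log(2)                      # approx. 1.157 bits (1 nat is about 1.44 bits)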
Information Theory Basics: Surprisal

• What does compression have to do with prediction? Imagine that we have a stream of data that we want to compress.
o If it is always easy for us to predict the next token, then this data is easy to compress!
o If we cannot perfectly predict every event, then we might sometimes be surprised. Our surprise is greater when we assigned an event a lower probability.

https://medium.com/udacity/shannon-entropy-information-gain-and-picking-balls-from-buckets-5810d35d54b4

• Claude Shannon settled on log(1/P(j)) = −log P(j) to quantify one’s surprisal at observing an event j having assigned it a (subjective) probability P(j).
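
A small sketch (assumed probabilities) tying surprisal to entropy: H[P] from (3.4.11) is exactly the expected surprisal under P:

import numpy as np

P = np.array([0.1, 0.2, 0.7])
surprisal = -np.log(P)                     # surprise at each event j: approx. [2.30, 1.61, 0.36]
H = np.sum(P * surprisal)                  # expected surprisal, approx. 0.802 nats = H[P]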
