03-Linear Classification
Deep Learning
Linear Classification
Classification
Examples
o Does this email belong in the spam folder or the inbox?
o Is this customer more likely to sign up or not to sign up for a
subscription service?
o Does this image depict a donkey, a dog, a cat, or a rooster?
Classification
• A simple image classification problem:
o Each input consists of a 2 × 2 grayscale image.
o We represent each pixel value with a single scalar, giving us four features
𝑥1 , 𝑥2 , 𝑥3 , 𝑥4 .
o We assume that each image belongs to one among the categories “cat”,
“chicken”, and “dog”.
[Figure: a linear decision boundary in the (x1, x2) plane, with weight vector W and x1-intercept −b/w1.]
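To make the setup concrete, here is a minimal NumPy sketch (not from the slides; the pixel values and the chosen label are made up) of flattening a 2 × 2 grayscale image into the four features and encoding its class as a one-hot vector:

```python
import numpy as np

# Hypothetical 2x2 grayscale image; the pixel values are placeholders.
image = np.array([[0.2, 0.9],
                  [0.4, 0.7]])

# Flatten into the four features x1, x2, x3, x4.
x = image.reshape(-1)
print(x)                                   # [0.2 0.9 0.4 0.7]

categories = ["cat", "chicken", "dog"]
y = np.array([1.0, 0.0, 0.0])              # one-hot label: this image is a "cat"
```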
Representation of Labels
• To estimate the conditional probabilities associated with all the possible classes,
we need a model with multiple outputs, one per class.
• To address classification with linear models, we will need as many affine functions as
we have outputs.
o Each output will correspond to its own affine function.
• In our case, since we have 4 features and 3 possible output categories, we will need
o 12 scalars to represent the weights (𝑤 with subscripts),
o 3 scalars to represent the biases (𝑏 with subscripts).
• We compute these three logits, $o_1$, $o_2$, and $o_3$, for each input as affine functions of the four features; the softmax then maps the logits to probabilities:

  $\hat{\mathbf{y}} = \mathrm{softmax}(\mathbf{o})$, where $\hat{y}_j = \dfrac{\exp(o_j)}{\sum_k \exp(o_k)}$  --(3.4.3)
• The softmax operation does not change the ordering among the logits 𝑜.
o Therefore, during prediction we can still pick out the most likely class by
  $\operatorname{argmax}_j \hat{y}_j = \operatorname{argmax}_j o_j$  --(3.4.4)
• Although softmax is a nonlinear function, the outputs of softmax regression are still determined by an
affine transformation of input features; thus, softmax regression is a linear model.
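A minimal NumPy sketch of these ideas, assuming made-up weight, bias, and input values (illustrative only, not from the slides): it computes the three logits as affine functions of the four features, applies the softmax of (3.4.3), and checks the order-preservation claim of (3.4.4).

```python
import numpy as np

# Hypothetical weights (3 classes x 4 features) and biases (3 classes); values are placeholders.
W = np.array([[ 0.1, -0.2,  0.3,  0.0],
              [ 0.2,  0.1, -0.1,  0.4],
              [-0.3,  0.2,  0.1,  0.1]])
b = np.array([0.1, -0.1, 0.0])

x = np.array([0.2, 0.9, 0.4, 0.7])      # one 2x2 image flattened to four features

o = W @ x + b                           # three logits, one affine function per class

def softmax(o):
    """Map logits to probabilities: exp(o_j) / sum_k exp(o_k), as in (3.4.3)."""
    exp_o = np.exp(o - o.max())         # subtracting the max is for numerical stability only
    return exp_o / exp_o.sum()

y_hat = softmax(o)
print(y_hat, y_hat.sum())               # predicted probabilities, summing to 1
print(y_hat.argmax() == o.argmax())     # True: softmax preserves the ordering of the logits (3.4.4)
```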
Vectorization of Minibatches
• Assume we are given a minibatch 𝑿 of examples with feature dimensionality (number of
inputs) 𝑑 and batch size 𝑛.
• Moreover, assume that we have 𝑞 categories in the output. Then we have:
o The minibatch features 𝑿 are in ℝ𝑛×𝑑 .
o The weights 𝑾 ∈ ℝ𝑑×𝑞 .
o The bias satisfies 𝑏 ∈ ℝ1×𝑞 .
  $\mathbf{O} = \mathbf{X}\mathbf{W} + \mathbf{b}$,  $\hat{\mathbf{Y}} = \mathrm{softmax}(\mathbf{O})$  -- (3.4.5)
• This accelerates the dominant operation into a matrix-matrix product 𝑿𝑾 vs. the matrix-
vector products we would be executing if we processed one example at a time.
• Since each row in 𝑿 represents a data example, the softmax operation itself can be computed row-wise:
o for each row of 𝑶, exponentiate all entries and then normalize them by the sum.
• The summation 𝑿𝑾 + 𝒃 triggers broadcasting of the bias across the 𝑛 rows; both the minibatch
logits 𝑶 and the output probabilities $\hat{\mathbf{Y}}$ are 𝑛 × 𝑞 matrices.
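A small NumPy sketch of the vectorized computation in (3.4.5), assuming illustrative sizes n = 5, d = 4, q = 3 and random data; the bias of shape (1, q) is broadcast across the n rows, and the softmax is applied row-wise.

```python
import numpy as np

n, d, q = 5, 4, 3                    # batch size, feature dimension, number of classes (illustrative)
rng = np.random.default_rng(0)

X = rng.random((n, d))               # minibatch features, shape (n, d)
W = rng.normal(size=(d, q))          # weights, shape (d, q)
b = np.zeros((1, q))                 # bias, shape (1, q); broadcast over the n rows

O = X @ W + b                        # logits, shape (n, q); one matrix-matrix product (3.4.5)

# Row-wise softmax: exponentiate each row of O and normalize by its row sum.
O_stable = O - O.max(axis=1, keepdims=True)   # for numerical stability; does not change the result
Y_hat = np.exp(O_stable) / np.exp(O_stable).sum(axis=1, keepdims=True)

print(O.shape, Y_hat.shape)          # both (n, q)
print(Y_hat.sum(axis=1))             # each row sums to 1
```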
Loss function
• Next, we need a loss function to measure the quality of our predicted probabilities.
• We will rely on maximum likelihood estimation
o The very same concept that we encountered when providing a probabilistic
justification for the mean squared error objective in linear regression
• The softmax function gives us a vector $\hat{\mathbf{y}}$, which we can interpret as the estimated
conditional probabilities of each class given any input $\mathbf{x}$, e.g., $\hat{y}_1 = P(y = \text{cat} \mid \mathbf{x})$.
• Maximizing the likelihood of the observed labels is equivalent to minimizing the negative
log-likelihood, which decomposes into a sum of per-example losses, where for any pair of
label $\mathbf{y}$ and model prediction $\hat{\mathbf{y}}$ over $q$ classes, the loss function $l$ is

  $l(\mathbf{y}, \hat{\mathbf{y}}) = -\sum_{j=1}^{q} y_j \log \hat{y}_j$  -- (3.4.8)
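A minimal sketch of the loss in (3.4.8), assuming a one-hot label and made-up predicted probabilities; the small eps guard is an implementation convenience, not part of the definition.

```python
import numpy as np

def cross_entropy(y, y_hat, eps=1e-12):
    """Per-example loss l(y, y_hat) = -sum_j y_j * log(y_hat_j), as in (3.4.8).

    y and y_hat are length-q vectors; y is one-hot (or, more generally, a probability vector).
    """
    return -np.sum(y * np.log(y_hat + eps))   # eps guards against log(0)

y = np.array([0.0, 0.0, 1.0])          # one-hot label: the true class is the third category
y_hat = np.array([0.1, 0.2, 0.7])      # predicted probabilities from softmax

print(cross_entropy(y, y_hat))         # -log(0.7) ≈ 0.357
print(cross_entropy(y, y))             # ≈ 0 when the true class is predicted with certainty
```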
Loss function: Log Likelihood
• Since all $\hat{y}_j$ are predicted probabilities, their logarithm is never larger than 0.
o Consequently, the loss function cannot be minimized any further if we correctly
predict the actual label with certainty, i.e., if the predicted probability 𝑃(𝒚 ∣ 𝒙)
= 1 for the actual label 𝒚.
o Note that this is often impossible. For example, there might be label noise in
the dataset.
o It may also not be possible when the input features are not sufficiently
informative to classify every example perfectly.
Loss function: Softmax derivatives
Plugging (3.4.3) into the definition of the loss in (3.4.8) and using the definition of the
softmax we obtain:
  $l(\mathbf{y}, \hat{\mathbf{y}}) = -\sum_{j=1}^{q} y_j \log \dfrac{\exp(o_j)}{\sum_{k=1}^{q} \exp(o_k)}$
  $= \sum_{j=1}^{q} y_j \log \sum_{k=1}^{q} \exp(o_k) - \sum_{j=1}^{q} y_j o_j$
  $= \log \sum_{k=1}^{q} \exp(o_k) - \sum_{j=1}^{q} y_j o_j$   (using $\sum_j y_j = 1$)  --(3.4.9)
• Consider the derivative with respect to any logit 𝑜𝑗 . We get
  $\partial_{o_j} l(\mathbf{y}, \hat{\mathbf{y}}) = \dfrac{\exp(o_j)}{\sum_{k=1}^{q} \exp(o_k)} - y_j = \mathrm{softmax}(\mathbf{o})_j - y_j$  --(3.4.10)
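A quick numerical check of (3.4.9) and (3.4.10), with made-up logits and a one-hot label: the finite-difference gradient of the logit-space loss matches softmax(o) − y.

```python
import numpy as np

def softmax(o):
    exp_o = np.exp(o - o.max())
    return exp_o / exp_o.sum()

def loss(o, y):
    """The loss written in terms of logits, as in (3.4.9)."""
    return np.log(np.exp(o).sum()) - np.dot(y, o)

o = np.array([1.0, -0.5, 2.0])        # arbitrary logits (illustrative values)
y = np.array([0.0, 1.0, 0.0])         # one-hot label

# Analytic gradient from (3.4.10): softmax(o) - y
analytic = softmax(o) - y

# Central finite-difference approximation of the same gradient
eps = 1e-6
numeric = np.array([
    (loss(o + eps * np.eye(3)[j], y) - loss(o - eps * np.eye(3)[j], y)) / (2 * eps)
    for j in range(3)
])

print(np.allclose(analytic, numeric, atol=1e-6))   # True: the gradient is softmax(o)_j - y_j
```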
Cross Entropy Loss
• Consider the case where we observe not just a single outcome but an entire
distribution over outcomes.
o We can use the same representation as before for the label 𝑦.
o The math that we used previously to define the loss 𝑙 in (3.4.8) still works out
fine,
just that the interpretation is slightly more general.
o The only difference is that rather than a vector containing only binary entries,
say (0,0,1),
we now have a generic probability vector, say (0.1,0.2,0.7) .
• The expected value of the loss for a distribution over labels is called the cross-entropy
loss, and it is one of the most commonly used losses for classification problems.
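A short sketch, with made-up numbers, showing that the same formula (3.4.8) applied to a generic probability vector as the label equals the expected one-hot loss under that label distribution:

```python
import numpy as np

def cross_entropy(y, y_hat):
    return -np.sum(y * np.log(y_hat))

y_soft = np.array([0.1, 0.2, 0.7])     # a distribution over labels instead of a one-hot vector
y_hat = np.array([0.2, 0.3, 0.5])      # model's predicted probabilities (illustrative)

# Same formula as (3.4.8), now with a generic probability vector as the label.
direct = cross_entropy(y_soft, y_hat)

# Equivalent view: expected one-hot loss under the label distribution.
one_hots = np.eye(3)
expected = sum(p * cross_entropy(one_hots[j], y_hat) for j, p in enumerate(y_soft))

print(np.isclose(direct, expected))    # True
```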
Information Theory Basics: Entropy
• The central idea in information theory is to quantify the information content in data.
o This quantity places a hard limit on our ability to compress the data.
o This quantity is called the entropy of a distribution 𝑃, and it is captured by the
following equation:  $H[P] = \sum_j -P(j) \log P(j)$
• What does compression have to do with prediction? Imagine that we have a stream of data
that we want to compress.
o If it is always easy for us to predict the next token, then this data is easy to
compress!
o If we cannot perfectly predict every event, then we might sometimes be
surprised. Our surprise is greater when we assigned an event lower probability.
https://medium.com/udacity/shannon-entropy-information-gain-and-picking-balls-from-buckets-5810d35d54b4
• Claude Shannon settled on $\log \frac{1}{P(j)} = -\log P(j)$ to quantify one's surprisal at
observing an event $j$ having assigned it a (subjective) probability $P(j)$.
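A small sketch of surprisal and entropy, using the natural logarithm (so the units are nats) and an illustrative distribution:

```python
import numpy as np

def surprisal(p):
    """Shannon's surprisal -log P(j) for an event assigned probability p."""
    return -np.log(p)

def entropy(P):
    """Entropy H[P] = sum_j -P(j) log P(j): the expected surprisal under P."""
    P = np.asarray(P)
    return np.sum(-P * np.log(P))

P = np.array([0.1, 0.2, 0.7])               # an example distribution over three classes

print(surprisal(0.7), surprisal(0.1))       # the rarer event is more surprising
print(entropy(P))                           # expected surprisal of the distribution
print(entropy(np.array([1/3, 1/3, 1/3])))   # uniform distribution: the hardest to predict
```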