02 - Linear Models - D (Multiclass Classification)
[Figure: softmax example with one-hot label y1 = 0, y2 = 1, y3 = 0]
It is easy to see that ŷ1 + ŷ2 + ŷ3 = 1 with 0 ≤ ŷi ≤ 1 for all i. Thus, ŷ is a proper probability distribution and the values of ŷ can be interpreted accordingly. Note that the softmax operation does not change the ordering among the logits, and thus we can still pick out the most likely class by:

$$\hat{\imath}(\mathbf{o}) = \operatorname*{argmax}_i \, o_i = \operatorname*{argmax}_i \, \hat{y}_i .$$
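As an illustrative sketch (not part of the original slides), the following NumPy snippet computes the softmax of a logit vector and checks the two properties stated above: the outputs form a probability distribution, and the argmax of the logits equals the argmax of the softmax outputs. The function name and the example logits are made up for the demonstration.

```python
import numpy as np

def softmax(o):
    """Map a logit vector o to a probability distribution."""
    # Subtracting the max logit improves numerical stability and does not change the result.
    z = np.exp(o - o.max())
    return z / z.sum()

o = np.array([1.0, 3.0, 0.5])                  # example logits
y_hat = softmax(o)

print(y_hat.sum())                             # 1.0: a proper probability distribution
print(np.all((0 <= y_hat) & (y_hat <= 1)))     # True: each entry lies in [0, 1]
print(np.argmax(o) == np.argmax(y_hat))        # True: softmax preserves the ordering of the logits
```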
*Why it is called "softmax"

$$\hat{y}_i = \frac{\exp(o_i/\tau)}{\sum_j \exp(o_j/\tau)}$$

With temperature τ = 1 this is the standard softmax. As τ → 0 the distribution puts essentially all of its mass on the largest logit, recovering a "hard" argmax, while larger τ makes the distribution more uniform; softmax is thus a smooth ("soft") approximation of the maximum.
Cross Entropy

[Figure: cross-entropy loss for the one-hot label y = (0, 1, 0) and the softmax prediction ŷ = (0.12, 0.64, 0.24); minimizing the loss pushes the softmax output toward the label.]

For reasons explained later on, this loss function is commonly called the cross-entropy loss. Here, we used that by construction ŷ is a discrete probability distribution and that the vector y is a one-hot vector. Hence the sum over all coordinates j vanishes for all but one term. Since all ŷj are probabilities, their logarithm is never larger than 0. Consequently, the loss function cannot be minimized any further if we correctly predict y with certainty, i.e., if P(y | x) = 1 for the correct label. Note that this is often not possible. For example, there might be label noise in the dataset (some examples may be mislabeled). It may also not be possible when the input features are not sufficiently informative to classify every example perfectly.
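To make the argument concrete, here is a minimal sketch (using the label and prediction values from the figure above) showing that only the term for the correct class contributes to the loss, and that the loss approaches 0 only when that class is predicted with probability close to 1.

```python
import numpy as np

def cross_entropy(y_hat, y):
    """Cross-entropy loss −Σ_j y_j log ŷ_j for a one-hot label y."""
    return -np.sum(y * np.log(y_hat))

y     = np.array([0.0, 1.0, 0.0])       # one-hot label: class 2 is correct
y_hat = np.array([0.12, 0.64, 0.24])    # softmax prediction from the figure

print(cross_entropy(y_hat, y))                             # -log(0.64) ≈ 0.446
print(cross_entropy(np.array([0.001, 0.998, 0.001]), y))   # ≈ 0.002: loss → 0 as the correct class → probability 1
```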
The gradient of the average cross-entropy loss with respect to the weight vector of class j has a particularly simple form: it is the average, over the N training examples, of the input x^(i) weighted by the prediction error for class j:

$$\nabla_{\mathbf{w}_j} L(\mathbf{w}) = \frac{1}{N}\sum_{i=1}^{N}\left(\hat{y}_j^{(i)} - y_j^{(i)}\right)\mathbf{x}^{(i)}$$
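Below is a short sketch of how this gradient can be computed for all classes at once on a toy dataset; the array names, sizes, and random data are illustrative assumptions, not part of the slides. Column j of grad_W implements the formula above.

```python
import numpy as np

def softmax(O):
    """Row-wise softmax of a logit matrix O of shape (N, C)."""
    Z = np.exp(O - O.max(axis=1, keepdims=True))
    return Z / Z.sum(axis=1, keepdims=True)

rng = np.random.default_rng(0)
N, d, C = 8, 4, 3                              # examples, features, classes (toy sizes)
X = rng.normal(size=(N, d))                    # inputs x^(i)
Y = np.eye(C)[rng.integers(0, C, size=N)]      # one-hot labels y^(i)
W = np.zeros((d, C))                           # column j holds the weight vector w_j of class j

Y_hat = softmax(X @ W)                         # predictions ŷ^(i)

# Column j of grad_W is (1/N) Σ_i (ŷ_j^(i) − y_j^(i)) x^(i), i.e. the formula above.
grad_W = X.T @ (Y_hat - Y) / N

W -= 0.1 * grad_W                              # one gradient-descent step
```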
Decision Boundaries