
Chapter 1

Classification using Softmax Regression

Softmax regression, or multinomial logistic regression, is a widely used algorithm for multiclass
classification. This chapter provides an introduction to the basic concepts of softmax regression.

Figure 1.1: Logistic Regression vs Softmax Regression

1.1. Softmax Classifier


When given an instance x, the Softmax Regression model first computes a score sk (x) for each
class k, then estimates the probability of each class by applying the softmax function (also called
the normalized exponential) to the scores [1]:

$$s_k(x) = x^{\top}\theta_k$$

Each class has its own dedicated parameter vector θk . All these vectors are typically stored as
rows in a parameter matrix Θ.
Once the score of every class for the instance x is calculated, we can estimate the probability p̂k
that the instance belongs to class k by running the scores through the softmax function [1]:

$$\hat{p}_k = \sigma_k(s(x)) = \frac{\exp(s_k(x))}{\sum_{j=1}^{K} \exp(s_j(x))}$$

In the above equation:


• K is the number of classes.

• s(x) is a vector containing the scores of each class for the instance x.
• σk (s(x)) is the estimated probability that the instance x belongs to class k, given the scores
of each class for that instance.
Just like the Logistic Regression, the Softmax Regression predicts the class with the highest
estimated probability (which is simply the class with the highest score), as shown in the equation
below [1]:
$$\hat{y} = \operatorname*{arg\,max}_{k}\, \sigma_k(s(x)) = \operatorname*{arg\,max}_{k}\, s_k(x)$$
If the dataset is two dimensional, Softmax Regression partitions the plane into multiple polygonal
regions using lines, as depicted in figure 1.1.
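
To make these steps concrete, here is a minimal NumPy sketch (not part of the original text; the parameter matrix Theta and the instance x below are made-up placeholders) that computes the class scores, applies the softmax function, and picks the class with the highest estimated probability:

import numpy as np

def softmax(scores):
    """Normalized exponential: map a score vector to class probabilities."""
    exp_s = np.exp(scores - scores.max())  # subtract max for numerical stability
    return exp_s / exp_s.sum()

# Made-up parameters: 3 classes (one row of Theta per class) and 4 features.
Theta = np.array([[ 0.2, -0.5,  1.0,  0.1],
                  [ 0.7,  0.3, -0.2,  0.4],
                  [-0.1,  0.8,  0.5, -0.6]])
x = np.array([1.0, 2.0, 0.5, -1.0])

scores = Theta @ x             # s_k(x) = x^T theta_k for every class k
probs = softmax(scores)        # estimated probabilities p_hat_k
y_hat = int(np.argmax(probs))  # predicted class

print(scores, probs, y_hat)

Because the exponential is monotonic, taking the argmax of the probabilities returns the same class as taking the argmax of the raw scores, as stated in the equation above.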

1.2. Loss function


When dealing with probabilities, the function f(p) = −log(p) is commonly used to quantify how close a probability is to 1: it is near 0 when p is close to 1 and grows large as p approaches 0 (for example, −log(0.99) ≈ 0.01 while −log(0.1) ≈ 2.3).

Figure 1.2: − log(x) graph

The objective of training is to have a model that estimates a high probability for the target
class (and consequently a low probability for the other classes). Minimizing the cost function
below, called the cross entropy, should lead to this objective because it penalizes the model when
it estimates a low probability for a target class. Cross entropy is frequently used to measure how
well a set of estimated class probabilities matches the target classes [1].
$$J(\Theta) = -\frac{1}{m}\sum_{i=1}^{m}\sum_{k=1}^{K} y_k^{(i)} \log\left(\hat{p}_k^{(i)}\right)$$

In the above equation, $y_k^{(i)}$ is the target probability that the $i$th instance belongs to class $k$. In
general, it is either equal to 1 or 0, depending on whether the instance belongs to the class or
not. When there are just two classes (K = 2), this cost function is equivalent to the Logistic
Regression’s cost function (logloss).
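As an illustration, here is a minimal NumPy sketch of the cross-entropy cost, assuming one-hot target vectors and an already-computed matrix of estimated probabilities (both arrays below are made-up examples):

import numpy as np

def cross_entropy(Y, P_hat):
    """Cross-entropy J(Theta) for one-hot targets Y and estimated probabilities P_hat.

    Both arrays have shape (m, K): m instances, K classes.
    """
    m = Y.shape[0]
    eps = 1e-12  # guard against log(0)
    return -np.sum(Y * np.log(P_hat + eps)) / m

# Made-up example with 3 instances and 3 classes.
Y = np.array([[1, 0, 0],
              [0, 1, 0],
              [0, 0, 1]])
P_hat = np.array([[0.7, 0.2, 0.1],
                  [0.1, 0.8, 0.1],
                  [0.3, 0.3, 0.4]])

print(cross_entropy(Y, P_hat))  # small when each true class gets high probability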
In practice, it is common to apply regularization to the loss function in order to penalize large values of θ. This helps prevent overfitting, so the model generalizes better to new data. One such technique is the Elastic Net, which attempts to minimize:
$$L(\Theta) = J(\Theta) + \lambda_1 \sum_{i=1}^{K} \left\|\theta^{(i)}\right\|_1 + \lambda_2 \sum_{i=1}^{K} \left\|\theta^{(i)}\right\|_2^2$$

Since J(Θ) is a convex function [2], L(Θ) is also convex, and therefore Gradient Descent (or any other suitable optimization algorithm) is guaranteed to converge toward the global minimum.
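
For reference, scikit-learn's LogisticRegression can fit a multinomial (softmax) model with an Elastic Net penalty. The sketch below assumes scikit-learn is installed and uses purely illustrative hyperparameter values; note that C is the inverse regularization strength and l1_ratio sets the L1/L2 mix, loosely playing the roles of λ1 and λ2 above:

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)

# Softmax regression with an Elastic Net penalty (requires the "saga" solver).
clf = LogisticRegression(penalty="elasticnet", solver="saga",
                         l1_ratio=0.5, C=10.0, max_iter=5000)
clf.fit(X, y)

print(clf.predict_proba(X[:3]))  # estimated class probabilities
print(clf.predict(X[:3]))        # predicted classes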

1.3. F1 score
The F1 score is often considered a better metric than normal accuracy score to measure model
classification performance, especially in scenarios where class imbalance exists. Accuracy is a
metric that measures the overall correctness of predictions by dividing the number of correct
predictions by the total number of predictions. As a consequence, if a dataset has 95% of
instances belonging to class A and only 5% belonging to class B, a naive classifier that always
predicts class A would achieve 95% accuracy. This classifier completely fails to identify instances
from class B, which may be the class of interest. The F1 score, on the other hand, takes into
account both precision and recall. In a binary classification problem, the F1 score is calculated from the numbers of true positives (TP), false negatives (FN), and false positives (FP):

$$\text{Precision} = \frac{TP}{TP + FP}$$
$$\text{Recall} = \frac{TP}{TP + FN}$$
$$\text{F1 score} = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}$$
In multi-class classification, the F1 score is calculated for each class separately in a one-vs-rest manner [3].
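
The sketch below (plain NumPy, with hypothetical labels) computes precision, recall, and F1 for each class in this one-vs-rest fashion:

import numpy as np

def per_class_scores(y_true, y_pred, num_classes):
    """Precision, recall, and F1 for each class, computed one-vs-rest."""
    scores = []
    for k in range(num_classes):
        tp = np.sum((y_pred == k) & (y_true == k))
        fp = np.sum((y_pred == k) & (y_true != k))
        fn = np.sum((y_pred != k) & (y_true == k))
        precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
        recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if (precision + recall) > 0 else 0.0)
        scores.append((precision, recall, f1))
    return scores

# Made-up labels for a 3-class problem.
y_true = np.array([0, 0, 1, 1, 2, 2, 2, 0])
y_pred = np.array([0, 1, 1, 1, 2, 0, 2, 0])
for k, (p, r, f1) in enumerate(per_class_scores(y_true, y_pred, 3)):
    print(f"class {k}: precision={p:.2f}  recall={r:.2f}  F1={f1:.2f}")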

References

[1] A. G. Sager, Hands-On Machine Learning with Python and TensorFlow. O’Reilly Media.
[2] J. Soch, “Proof: Convexity of the cross-entropy,” The Book of Statistical Proofs, 2020. https://statproofbook.github.io/P/entcross-conv.html
[3] Baeldung, “F-1 Score for Multi-Class Classification,” Baeldung on Computer Science, 2022. https://www.baeldung.com/cs/multi-class-f1-score (accessed February 14, 2024).
