Detailed Sigmoid and Softmax Activation Function
$$g(z) = \frac{1}{1 + e^{-z}} \tag{1}$$
where $z = mx + b$.
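As a quick sanity check on equation (1), here is a minimal NumPy sketch of the sigmoid (the function name and example values are illustrative, not part of the recitation):

```python
import numpy as np

def sigmoid(z):
    """Logistic sigmoid: g(z) = 1 / (1 + e^{-z})."""
    return 1.0 / (1.0 + np.exp(-z))

# g(0) = 0.5; g saturates toward 0 and 1 as |z| grows
print(sigmoid(0.0))                      # 0.5
print(sigmoid(np.array([-8.0, 8.0])))    # ~[0.000335, 0.999665]
```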
(a) Let’s do a quick concept check. In what situation would we use logistic regression instead of linear
regression?
Linear regression assumes the output is a linear function of the inputs, while logistic regression models the data using a sigmoid function. We use logistic regression as a classification technique (when the labels are binary), whereas we use linear regression when we are predicting a real-valued output from our data.
(b) We see that $g(z)$ falls strictly within the interval $(0, 1)$. Given what we have learned in class and discussed so far,
what probability distribution does this graph represent?
(c) Now, let's consider an $\mathbb{R}^3$ space. For weight vector $\theta = [1, 4, 3]^T$, define (i) some $x$ such that $\theta^T x > 0$. What is the resulting $g(z)$? Now, (ii) some $x$ such that $\theta^T x = 0$. What is the resulting $g(z)$? Explain the overall relationship between $g(z)$ and $\theta^T x$.
There are multiple correct values of $x$. We consider one specific example for each case and verify that it works.
(i) $x = [1, 1, 1]^T$. So $\theta^T x = 1 + 4 + 3 = 8 > 0$. The resulting $g(z)$ is $\frac{1}{1 + e^{-8}}$. This value is extremely close to 1, but still less than 1 (approximately 0.9997).
(ii) $x = [7, -1, -1]^T$. So $\theta^T x = 7 - 4 - 3 = 0$. The resulting $g(z)$ is $\frac{1}{1 + e^{0}} = \frac{1}{1 + 1} = 0.5$.
Overall, we see that the value of $g(z)$ depends on whether $z > 0$, $z < 0$, or $z = 0$: if $z > 0$, then $g(z) > 0.5$; if $z < 0$, then $g(z) < 0.5$; and if $z = 0$, then $g(z) = 0.5$. Based on the value of $g(z)$, we choose the appropriate binary class. So, because $z = \theta^T x$, we see that $\theta^T x = 0$ represents our decision boundary, and when $\theta^T x = 0$, $g(z) = 0.5$.
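A minimal NumPy check of the two examples above, using the same $\theta$ and $x$ values (variable names are illustrative):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

theta = np.array([1.0, 4.0, 3.0])
x_pos = np.array([1.0, 1.0, 1.0])    # theta^T x = 8 > 0
x_bnd = np.array([7.0, -1.0, -1.0])  # theta^T x = 0, i.e., on the decision boundary

print(theta @ x_pos, sigmoid(theta @ x_pos))  # 8.0, ~0.99966
print(theta @ x_bnd, sigmoid(theta @ x_bnd))  # 0.0, 0.5
```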
In a $K$-class classification setting, we have training set $D = \{(x^{(i)}, y^{(i)}) \mid i = 1, \ldots, n\}$ where $x^{(i)} \in \mathbb{R}^M$ is a feature vector and $y^{(i)} \in \{1, 2, \ldots, K\}$.
2.1 One-versus-All
Using a one-vs-all classifier (with logistic regression) to predict the label for a new data point involves the following two steps (a short code sketch follows the note below):
1. Learn one binary classifier for each class. For each $1 \le k \le K$, treat samples of class $k$ as positive examples and samples from all other classes as negative samples. Perform logistic regression on this dataset to learn:
$$p(y = k \mid x; W, b) = \frac{1}{1 + e^{-(w_k^T x + b_k)}}$$
2. Majority Vote: predict the class whose binary classifier assigns the highest probability,
$$\hat{y} = \underset{k}{\arg\max}\; p(y = k \mid x; W, b)$$
Note: this method can be used with any binary classifier (including binary logistic regression, binary SVM classifiers, etc.).
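A minimal sketch of the one-vs-all procedure, assuming scikit-learn's LogisticRegression as the binary classifier; the toy dataset and 0-indexed labels are illustrative only:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def ova_fit(X, y, K):
    """Fit one binary logistic-regression classifier per class (class k vs. the rest)."""
    classifiers = []
    for k in range(K):
        clf = LogisticRegression()
        clf.fit(X, (y == k).astype(int))
        classifiers.append(clf)
    return classifiers

def ova_predict(classifiers, X):
    """Predict the class whose binary classifier assigns the highest probability."""
    # column k holds the k-th classifier's probability that x belongs to class k
    scores = np.column_stack([clf.predict_proba(X)[:, 1] for clf in classifiers])
    return np.argmax(scores, axis=1)

# Toy usage: three well-separated Gaussian blobs
rng = np.random.default_rng(0)
X = rng.normal(size=(90, 2)) + np.repeat(np.array([[0, 0], [3, 0], [0, 3]]), 30, axis=0)
y = np.repeat(np.arange(3), 30)
print(ova_predict(ova_fit(X, y, K=3), X[:5]))   # mostly class 0 for the first blob
```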
For a vector $z = [z_1, z_2, \ldots, z_K]^T \in \mathbb{R}^K$, the softmax function outputs a vector of the same dimension, $\mathrm{softmax}(z) \in \mathbb{R}^K$, where each of its entries is defined as:
$$\mathrm{softmax}(z)_k = \frac{e^{z_k}}{\sum_{c=1}^{K} e^{z_c}}, \quad \text{for all } k = 1, 2, \ldots, K$$
Therefore, the softmax function is useful for converting a vector of arbitrary real numbers into a discrete probability distribution of $K$ probabilities proportional to the exponentials of the input vector components. In particular, larger input components correspond to larger probabilities.
Softmax is often used as the last layer of a neural network, to map the non-normalized outputs of the network to a probability distribution over the predicted output classes.
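A minimal NumPy sketch of the softmax function; the max-subtraction step is a standard numerical-stability trick and does not change the result:

```python
import numpy as np

def softmax(z):
    """softmax(z)_k = exp(z_k) / sum_c exp(z_c)."""
    z = np.asarray(z, dtype=float)
    expz = np.exp(z - z.max())   # subtracting the max leaves the ratios unchanged
    return expz / expz.sum()

p = softmax([2.0, 1.0, 0.1])
print(p, p.sum())   # larger inputs get larger probabilities; entries sum to 1
```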
Softmax Regression
For K-class classification, Softmax Regression has a parametric model of the form:
$$p(y^{(i)} = k \mid x^{(i)}; W, b) = \frac{\exp(w_k^T x^{(i)} + b_k)}{\sum_{c=1}^{K} \exp(w_c^T x^{(i)} + b_c)} \tag{2}$$
Therefore, the output of the softmax model is $\hat{y} = \underset{k}{\arg\max}\; p(y^{(i)} = k \mid x^{(i)}; W, b)$.
The intermediate result (a vector) output by the softmax function is $\left[\, p(y^{(i)} = 1 \mid x^{(i)}; W, b), \ldots, p(y^{(i)} = K \mid x^{(i)}; W, b) \,\right]^T$.
Note: now $W$ is a matrix! Let $W$ be the $M \times K$ matrix obtained by concatenating $w_1, w_2, \ldots, w_K$, where each $w_i \in \mathbb{R}^M$.
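A minimal NumPy sketch of the softmax-regression forward pass under the conventions above ($W$ is $M \times K$, $b \in \mathbb{R}^K$); the random parameters and function name are illustrative:

```python
import numpy as np

def softmax_probs(x, W, b):
    """Return the length-K vector of p(y = k | x; W, b) for one example x of length M."""
    scores = W.T @ x + b                   # entry k is w_k^T x + b_k
    expz = np.exp(scores - scores.max())   # stabilize; probabilities are unchanged
    return expz / expz.sum()

M, K = 4, 3
rng = np.random.default_rng(0)
W, b, x = rng.normal(size=(M, K)), rng.normal(size=K), rng.normal(size=M)

p = softmax_probs(x, W, b)
print(p, p.sum())      # probabilities over the K classes, summing to 1
print(np.argmax(p))    # predicted class (0-indexed here)
```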
For $K = 2$, the softmax regression model becomes:
$$p(y^{(i)} = k \mid x^{(i)}; W, b) = \frac{\exp(w_k^T x^{(i)} + b_k)}{\sum_{c=1}^{2} \exp(w_c^T x^{(i)} + b_c)} \tag{3}$$
$$= \begin{cases} \dfrac{1}{1 + \exp(-(w_\alpha^T x^{(i)} + \beta))} & \text{if } k = 1 \\[2ex] 1 - \dfrac{1}{1 + \exp(-(w_\alpha^T x^{(i)} + \beta))} = \dfrac{\exp(-(w_\alpha^T x^{(i)} + \beta))}{1 + \exp(-(w_\alpha^T x^{(i)} + \beta))} & \text{if } k = 2 \end{cases} \tag{4}$$
where $w_\alpha = w_1 - w_2$ and $\beta = b_1 - b_2$, as derived below.
Note that the softmax model contains two "sets" of weights ($w_1, b_1$ and $w_2, b_2$), whereas the logistic regression output only contains one "set" of weights ($w_\alpha$ and $\beta$). Therefore, this simple example with $K = 2$ not only shows that softmax regression is a generalization of logistic regression, but also shows that softmax regression has a "redundant" set of parameters.
Plugging $K = 2$ into (2) gives
$$p(y^{(i)} = 1 \mid x^{(i)}; W, b) = \frac{\exp(w_1^T x^{(i)} + b_1)}{\exp(w_1^T x^{(i)} + b_1) + \exp(w_2^T x^{(i)} + b_2)}, \qquad
p(y^{(i)} = 2 \mid x^{(i)}; W, b) = \frac{\exp(w_2^T x^{(i)} + b_2)}{\exp(w_1^T x^{(i)} + b_1) + \exp(w_2^T x^{(i)} + b_2)}.$$

For the first class, divide the numerator and the denominator by $\exp(w_1^T x^{(i)} + b_1)$:
$$\begin{aligned}
p(y^{(i)} = 1 \mid x^{(i)}; W, b) &= \frac{\exp(w_1^T x^{(i)} + b_1)}{\exp(w_1^T x^{(i)} + b_1) + \exp(w_2^T x^{(i)} + b_2)} \\
&= \frac{\exp(w_1^T x^{(i)} + b_1)/\exp(w_1^T x^{(i)} + b_1)}{\left(\exp(w_1^T x^{(i)} + b_1) + \exp(w_2^T x^{(i)} + b_2)\right)/\exp(w_1^T x^{(i)} + b_1)} \\
&= \frac{\exp((w_1 - w_1)^T x^{(i)} + (b_1 - b_1))}{\exp((w_1 - w_1)^T x^{(i)} + (b_1 - b_1)) + \exp((w_2 - w_1)^T x^{(i)} + (b_2 - b_1))} \\
&= \frac{1}{1 + \exp((w_2 - w_1)^T x^{(i)} + (b_2 - b_1))} \\
&= \frac{1}{1 + \exp(-(w_\alpha^T x^{(i)} + \beta))}, \quad \text{where } w_\alpha = -(w_2 - w_1),\ \beta = -(b_2 - b_1).
\end{aligned}$$

For the second class:
$$\begin{aligned}
p(y^{(i)} = 2 \mid x^{(i)}; W, b) &= 1 - p(y^{(i)} = 1 \mid x^{(i)}; W, b) \\
&= 1 - \frac{1}{1 + \exp(-(w_\alpha^T x^{(i)} + \beta))} \\
&= \frac{\exp(-(w_\alpha^T x^{(i)} + \beta))}{1 + \exp(-(w_\alpha^T x^{(i)} + \beta))}, \quad \text{where } w_\alpha = -(w_2 - w_1),\ \beta = -(b_2 - b_1).
\end{aligned}$$
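A quick NumPy check of this equivalence on random parameters (all names illustrative): the two-class softmax probability for class 1 matches the logistic-regression form with $w_\alpha = w_1 - w_2$ and $\beta = b_1 - b_2$.

```python
import numpy as np

rng = np.random.default_rng(0)
M = 4
w1, w2 = rng.normal(size=M), rng.normal(size=M)
b1, b2 = rng.normal(), rng.normal()
x = rng.normal(size=M)

# Two-class softmax probability of class 1
s1, s2 = np.exp(w1 @ x + b1), np.exp(w2 @ x + b2)
p1_softmax = s1 / (s1 + s2)

# Equivalent logistic-regression form
w_alpha, beta = w1 - w2, b1 - b2
p1_logistic = 1.0 / (1.0 + np.exp(-(w_alpha @ x + beta)))

print(np.isclose(p1_softmax, p1_logistic))   # True: the two parameterizations agree
```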