
10-315: Introduction to Machine Learning Recitation 3

1 Sigmoid Function and Logistic Regression


In lecture, we discussed sigmoid functions, a class of functions characterized by an "S"-shaped curve. This "S" shape is especially useful in machine learning, where we are constantly computing gradients and relying on our functions being differentiable! Often, as discussed in class, our choice of sigmoid function is the logistic function:

g(z) = \frac{1}{1 + e^{-z}} \qquad (1)

where z = mx + b
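As a quick numerical sanity check, here is a minimal NumPy sketch of the logistic function (the function name `logistic` and the sample inputs are our own, not part of the recitation):

```python
import numpy as np

def logistic(z):
    """Logistic (sigmoid) function g(z) = 1 / (1 + e^{-z})."""
    return 1.0 / (1.0 + np.exp(-z))

# The output always lies strictly between 0 and 1.
print(logistic(0.0))   # 0.5
print(logistic(8.0))   # ~0.9997
print(logistic(-8.0))  # ~0.0003
```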

(a) Let’s do a quick concept check. In what situation would we use logistic regression instead of linear
regression?

Linear regression assumes the data follows a linear function, while logistic regression models the data using a sigmoid function. We also use logistic regression as a classification technique (when labels are binary), whereas we use linear regression when we are predicting a real-valued output modeled as a linear function of our data.

(b) We see that g(z) lies strictly within (0, 1). Given what we have learned in class and discussed so far, what probability distribution does this graph represent?

P (y = 1 | x), where y represents the output class.


So, P(y = 1 | x) = \frac{1}{1 + e^{-z}}, and P(y = 0 | x) = 1 - \frac{1}{1 + e^{-z}} = \frac{e^{-z}}{1 + e^{-z}}.

 
(c) Now, let's consider an R^3 space. For the weight vector θ = [1, 4, 3]^T, define (i) some x such that θ^T x > 0. What is the resulting g(z)? Now, (ii) some x such that θ^T x = 0. What is the resulting g(z)? Explain the overall relationship between g(z) and θ^T x.

There are multiple correct values of x. We consider one specific example and show that we are correct.

(i) x = [1, 1, 1]^T. So, θ^T x = 8 > 0. The resulting g(z) is \frac{1}{1 + e^{-8}}. This value is extremely close to 1, but still less than 1 (≈ 0.99967).

(ii) x = [7, −1, −1]^T. So, θ^T x = 0. The resulting g(z) is \frac{1}{1 + 1} = 0.5.

Overall, we see that the value of g(z) depends on whether z is greater than, less than, or equal to 0. If z > 0, then g(z) > 0.5. If z < 0, then g(z) < 0.5. If z = 0, then g(z) = 0.5. Based on the value of g(z), we choose the appropriate binary class. So, because z = θ^T x, we see that θ^T x = 0 represents our decision boundary, and when θ^T x = 0, g(z) = 0.5.
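A quick numerical check of both cases (a minimal sketch; the `logistic` helper simply mirrors equation (1) above):

```python
import numpy as np

def logistic(z):
    return 1.0 / (1.0 + np.exp(-z))

theta = np.array([1.0, 4.0, 3.0])

# Case (i): theta^T x = 8 > 0, so g(z) is close to (but strictly below) 1.
x_pos = np.array([1.0, 1.0, 1.0])
print(logistic(theta @ x_pos))   # ~0.9997

# Case (ii): theta^T x = 0, so x lies on the decision boundary and g(z) = 0.5.
x_bnd = np.array([7.0, -1.0, -1.0])
print(logistic(theta @ x_bnd))   # 0.5
```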


2 Multinomial Logistic Regression: Multi-class Classification


So far, we have seen how to use logistic regression to model a binary variable, such as pass/fail, healthy/sick, etc. Now, what if we have K classes? How can we learn such a classifier?

In a K-class classification setting, we have a training set D = {(x^{(i)}, y^{(i)}) | i = 1, ..., n}, where x^{(i)} ∈ R^M is a feature vector and y^{(i)} ∈ {1, 2, ..., K}.

2.1 One-versus-All
Using a one-vs-all classifier (with logistic regression) to predict the label for a new data point involves the following two steps:

1. Learn one binary classifier for each class. For each 1 ≤ k ≤ K, treat samples of class k as positive
examples and samples from all other classes as negative samples. Perform logistic regression on this
dataset to learn:
   p(y = k | x; W, b) = \frac{1}{1 + e^{-(w_k^T x + b_k)}}

2. Prediction: choose the class whose binary classifier assigns the highest probability:

   ŷ = argmax_k p(y = k | x; W, b)

Note: this method can be used with any binary classifier (including binary logistic regression, binary SVM
classifier, etc).
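A rough sketch of the prediction step (step 2 above), assuming the K weight vectors and biases were already learned in step 1; the function name `one_vs_all_predict` and the toy parameters are our own:

```python
import numpy as np

def logistic(z):
    return 1.0 / (1.0 + np.exp(-z))

def one_vs_all_predict(X, W, b):
    """Predict with K already-trained binary logistic classifiers.

    X: (n, M) feature matrix; W: (M, K) matrix whose k-th column is w_k; b: (K,) biases.
    Returns, for each row of X, the index of the most confident classifier.
    """
    probs = logistic(X @ W + b)       # (n, K); entry [i, k] is classifier k's probability for x^(i)
    return np.argmax(probs, axis=1)   # pick the class with the highest probability

# Toy usage with hypothetical, already-learned parameters (M = 2 features, K = 3 classes).
W = np.array([[ 2.0, -1.0,  0.0],
              [ 0.0,  2.0, -2.0]])
b = np.array([0.0, -1.0, 1.0])
X = np.array([[1.5, 0.0],
              [0.0, 2.0]])
print(one_vs_all_predict(X, W, b))    # [0 1]
```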

2.2 Generalization of Logistic Regression: Softmax Regression


The Softmax Function

For a vector z = [z_1, z_2, ..., z_K]^T ∈ R^K, the softmax function outputs a vector of the same dimension, softmax(z) ∈ R^K, where each of its entries is defined as:

softmax(z)_k = \frac{e^{z_k}}{\sum_{c=1}^{K} e^{z_c}}, \quad \text{for all } k = 1, 2, ..., K

which guarantees two things:


• Each entry of the resulting vector softmax(z) is a value in the range (0, 1)

• \sum_{k=1}^{K} softmax(z)_k = 1

Therefore, the softmax function is useful for converting a vector of arbitrary real numbers into a discrete probability distribution consisting of K probabilities proportional to the exponentials of the input vector components. Note that larger input components correspond to larger probabilities.

Softmax is often used as the last layer of a neural network, to map the non-normalized output of the network to a probability distribution over the predicted output classes.
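A minimal NumPy sketch of the softmax function (subtracting the maximum before exponentiating is our addition for numerical stability; it does not change the result):

```python
import numpy as np

def softmax(z):
    """Map a vector of K arbitrary real numbers to a probability distribution over K classes."""
    z = z - np.max(z)          # optional stabilization; leaves the output unchanged
    exp_z = np.exp(z)
    return exp_z / np.sum(exp_z)

z = np.array([1.0, 2.0, 3.0])
p = softmax(z)
print(p)          # every entry is in (0, 1); larger inputs get larger probabilities
print(p.sum())    # 1.0
```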


Figure 1: Procedure of Softmax Regression with 3-dimensional Features

Softmax Regression

For K-class classification, Softmax Regression has a parametric model of the form:

p(y^{(i)} = k | x^{(i)}; W, b) = \frac{\exp(w_k^T x^{(i)} + b_k)}{\sum_{c=1}^{K} \exp(w_c^T x^{(i)} + b_c)}. \qquad (2)

Therefore, the output of the softmax model looks like: ŷ = argmax_k p(y^{(i)} = k | x^{(i)}; W, b)
The intermediate result (a vector) outputted by the softmax function is:

\begin{bmatrix} p(y^{(i)} = 1 | x^{(i)}; W, b) \\ p(y^{(i)} = 2 | x^{(i)}; W, b) \\ \vdots \\ p(y^{(i)} = K | x^{(i)}; W, b) \end{bmatrix} = \frac{1}{\sum_{k=1}^{K} \exp(w_k^T x^{(i)} + b_k)} \begin{bmatrix} \exp(w_1^T x^{(i)} + b_1) \\ \exp(w_2^T x^{(i)} + b_2) \\ \vdots \\ \exp(w_K^T x^{(i)} + b_K) \end{bmatrix}

Note: now W is a matrix! Let W be the M × K matrix obtained by concatenating w_1, w_2, ..., w_K, where each w_i ∈ R^M.
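A minimal sketch of prediction with a trained softmax regression model, following equation (2); the function name and the toy parameters are our own, and learning W, b (e.g. by minimizing the cross-entropy loss) is not shown:

```python
import numpy as np

def softmax(z):
    exp_z = np.exp(z - np.max(z))   # stabilized softmax
    return exp_z / np.sum(exp_z)

def softmax_regression_predict(x, W, b):
    """x: (M,) features; W: (M, K) matrix with columns w_1, ..., w_K; b: (K,) biases.

    Returns the class probabilities and the prediction ŷ = argmax_k p(y = k | x; W, b).
    """
    probs = softmax(W.T @ x + b)    # entry k is p(y = k | x; W, b) from equation (2)
    return probs, int(np.argmax(probs))

# Toy usage with hypothetical parameters: M = 3 features, K = 4 classes.
W = np.random.default_rng(0).normal(size=(3, 4))
b = np.zeros(4)
x = np.array([1.0, -0.5, 2.0])
probs, y_hat = softmax_regression_predict(x, W, b)
print(probs, y_hat)
```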


1. Exercise: Relationship to Logistic Regression


In the special case where K = 2, one can show that softmax regression reduces to logistic regression. This shows that softmax regression is a generalization of logistic regression. Specifically, show the equivalence between equation (3), which is equation (2) with K = 2, and the logistic regression form in equation (4).

p(y^{(i)} = k | x^{(i)}; W, b) = \frac{\exp(w_k^T x^{(i)} + b_k)}{\sum_{c=1}^{2} \exp(w_c^T x^{(i)} + b_c)} \qquad (3)

= \begin{cases} \dfrac{1}{1 + \exp(-(w_1^T x^{(i)} + b_1))}, & \text{if } k = 1 \\[6pt] 1 - \dfrac{1}{1 + \exp(-(w_1^T x^{(i)} + b_1))} = \dfrac{\exp(-(w_1^T x^{(i)} + b_1))}{1 + \exp(-(w_1^T x^{(i)} + b_1))}, & \text{if } k = 2 \end{cases} \qquad (4)

Note that the softmax model contains two "sets" of weights, whereas the logistic regression output only contains one "set" of weights. Therefore, this simple example with K = 2 not only shows that softmax regression is a generalization of logistic regression, but also shows that softmax regression has a "redundant" set of parameters.

Note: the intermediate output vector of the softmax function is:

\begin{bmatrix} p(y^{(i)} = 1 | x^{(i)}; W, b) \\ p(y^{(i)} = 2 | x^{(i)}; W, b) \end{bmatrix} = \frac{1}{\exp(w_1^T x^{(i)} + b_1) + \exp(w_2^T x^{(i)} + b_2)} \begin{bmatrix} \exp(w_1^T x^{(i)} + b_1) \\ \exp(w_2^T x^{(i)} + b_2) \end{bmatrix}

\begin{aligned}
p(y^{(i)} = 1 | x^{(i)}; W, b) &= \frac{\exp(w_1^T x^{(i)} + b_1)}{\exp(w_1^T x^{(i)} + b_1) + \exp(w_2^T x^{(i)} + b_2)} \\
&= \frac{\exp(w_1^T x^{(i)} + b_1) / \exp(w_1^T x^{(i)} + b_1)}{(\exp(w_1^T x^{(i)} + b_1) + \exp(w_2^T x^{(i)} + b_2)) / \exp(w_1^T x^{(i)} + b_1)} \\
&= \frac{\exp((w_1^T - w_1^T) x^{(i)} + (b_1 - b_1))}{\exp((w_1^T - w_1^T) x^{(i)} + (b_1 - b_1)) + \exp((w_2^T - w_1^T) x^{(i)} + (b_2 - b_1))} \\
&= \frac{1}{1 + \exp((w_2^T - w_1^T) x^{(i)} + (b_2 - b_1))} \\
&= \frac{1}{1 + \exp(-(w_\alpha^T x^{(i)} + \beta))}, \quad \text{where } w_\alpha = -(w_2 - w_1),\ \beta = -(b_2 - b_1)
\end{aligned}

\begin{aligned}
p(y^{(i)} = 2 | x^{(i)}; W, b) &= 1 - p(y^{(i)} = 1 | x^{(i)}; W, b) \\
&= 1 - \frac{1}{1 + \exp(-(w_\alpha^T x^{(i)} + \beta))} \\
&= \frac{\exp(-(w_\alpha^T x^{(i)} + \beta))}{1 + \exp(-(w_\alpha^T x^{(i)} + \beta))}, \quad \text{where } w_\alpha = -(w_2 - w_1),\ \beta = -(b_2 - b_1)
\end{aligned}
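A quick numerical sanity check of this reduction (the parameter values and function names below are arbitrary choices of ours):

```python
import numpy as np

def softmax(z):
    exp_z = np.exp(z - np.max(z))
    return exp_z / np.sum(exp_z)

def logistic(z):
    return 1.0 / (1.0 + np.exp(-z))

# Arbitrary K = 2 softmax parameters and one input point.
w1, b1 = np.array([0.5, -1.0]), 0.2
w2, b2 = np.array([1.5,  0.3]), -0.4
x = np.array([2.0, 1.0])

# p(y = 1 | x) from the K = 2 softmax model.
p_softmax = softmax(np.array([w1 @ x + b1, w2 @ x + b2]))[0]

# The same probability from logistic regression with w_alpha = w1 - w2, beta = b1 - b2.
p_logistic = logistic((w1 - w2) @ x + (b1 - b2))

print(np.isclose(p_softmax, p_logistic))   # True
```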
