Deep Feedforward Networks

Feedforward neural networks are artificial neural networks in which the nodes do not form loops. This type of network is also known as a multi-layer neural network; all information is passed forward only.


Deep Learning Srihari

Deep Feedforward Networks:


Overview


Topics in DFF Networks


1. Overview
2. Example: Learning XOR
3. Hidden Units
4. Architecture Design
5. Backpropagation and Other Differentiation
6. Historical Notes


Sub-topics in Overview of DFF


1. Goal of a Feed-Forward Network
2. Feedforward vs Recurrent Networks
3. Function Approximation as Goal
4. Extending Linear Models (SVM)
5. Example of XOR


Goal of a feedforward network


• Feedforward nets are the quintessential deep learning models
• Deep feedforward networks are also called
– Feedforward neural networks, or
– Multilayer perceptrons (MLPs)
• Their goal is to approximate some function f*
– E.g., a classifier y = f*(x) maps input x to category y
– A feedforward network defines a mapping y = f(x; θ)
– It learns the values of the parameters θ that result in the best function approximation

Feedforward network for MNIST

MNIST 28x28 images

Source: https://towardsdatascience.com/probability-and-statistics-explained-in-the-context-of-deep-learning-ed1509b2eb3f

Another View of 2-hidden layers

Source: https://www.easy-tensorflow.com/tf-tutorials/neural-networks/two-layer-neural-network

Flow of Information
• Models are called feedforward because, in y = f(x):
– To evaluate f(x), information flows one way from x, through the intermediate computations that define f, to the output y
• There are no feedback connections
– No outputs of the model are fed back into itself


Feedforward Net: US Election


• US presidential election viewed as y = f(x)
• Output: y = {y1, y2}
– Electoral college votes for each of the 2 candidates
• Input: X = {x1,..,x50}
– Vote vectors cast for the 2 candidates, one per state
• W converts votes to electoral votes
– E.g., winner takes all, or proportionate
• h is the electoral college, as shown in the map
– h is defined for each state
– Each state has a fixed number of electors
• w maps the 50 states to 2 outputs
– Simple addition

Importance of Feedforward Networks


• They are extremely important to ML practice
• They form the basis for many commercial applications
1. CNNs are a special kind of feedforward network
• They are used for recognizing objects in photos
2. They are a conceptual stepping stone to RNNs
• RNNs power many NLP applications


Feedforward vs. Recurrent

• When feedforward neural networks are extended to include feedback connections they are called Recurrent Neural Networks (RNNs)

Figure panels: RNN; Unrolled RNN; RNN with learning component


Feedforward Neural Network Structures

• They are called networks because they are composed of many different functions
• The model is associated with a directed acyclic graph describing how the functions are composed
– E.g., functions f(1), f(2), f(3) connected in a chain to form f(x) = f(3)[f(2)[f(1)(x)]]
• f(1) is called the first layer of the network (its output is a vector)
• f(2) is called the second layer, etc.
• These chain structures are the most commonly used structures of neural networks

Definition of Depth
• The overall length of the chain is the depth of the model
– E.g., the composite function f(x) = f(3)[f(2)[f(1)(x)]] has a depth of 3
• The name deep learning arises from this terminology
• The final layer of a feedforward network, e.g., f(3), is called the output layer
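
The chain structure and its depth can be sketched in a few lines of Python. This is only an illustration: the helper `compose` and the three toy layers `f1`, `f2`, `f3` are hypothetical names and functions chosen to keep the arithmetic easy to follow, not layers from the slides.

```python
from functools import reduce

def compose(*layers):
    """Chain layers so that compose(f1, f2, f3)(x) == f3(f2(f1(x)))."""
    def network(x):
        return reduce(lambda value, layer: layer(value), layers, x)
    network.depth = len(layers)  # depth = length of the chain
    return network

# Hypothetical layers (assumptions for illustration only).
def f1(x):   # first layer: vector -> vector
    return [2.0 * v for v in x]

def f2(h):   # second layer: vector -> vector
    return [v + 1.0 for v in h]

def f3(h):   # output layer: vector -> scalar
    return sum(h)

f = compose(f1, f2, f3)
print(f.depth)        # 3
print(f([1.0, 2.0]))  # 8.0
```

Here the last function passed to `compose` plays the role of the output layer, and `f.depth` is simply the number of functions in the chain.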


Training the Network


• In network training we drive f (x) to match f* (x)
• Training data provides us with noisy,
approximate examples of f* (x) evaluated at
different training points
• Each example accompanied by label y ≈ f*(x)
• Training examples specify directly what the
output layer must do at each point x
– It must produce a value that is close to y


Definition of Hidden Layer


• Behavior of other layers is not directly specified
by the data
• Learning algorithm must decide how to use
those layers to produce value that is close to y
• Training data does not say what individual
layers should do
• Since the desired output for these layers is not
shown, they are called hidden layers


A net with depth 2: one hidden layer

K outputs y1,..,yK for a given input x
Hidden layer consists of M units

$$y_k(x, w) = \sigma\left( \sum_{j=1}^{M} w_{kj}^{(2)} \, h\left( \sum_{i=1}^{D} w_{ji}^{(1)} x_i + w_{j0}^{(1)} \right) + w_{k0}^{(2)} \right)$$

f(x) = f(2)[f(1)(x)]

f(1) is a vector of M dimensions and
f(2) is a vector of K dimensions

fm(1) = zm = h(xTw(1)), m=1,..,M
fk(2) = σ(zTw(2)), k=1,..,K
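
A minimal sketch of this depth-2 forward pass, in plain Python. The slide does not fix the hidden activation h, so tanh is assumed here; the weight values and the names `forward`, `W1`, `b1`, `W2`, `b2` are illustrative assumptions (the biases correspond to the w_j0 and w_k0 terms).

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def forward(x, W1, b1, W2, b2, h=math.tanh):
    """One hidden layer: z = h(W1 x + b1), y = sigmoid(W2 z + b2)."""
    # Hidden layer f(1): M units, each a nonlinear function of all D inputs.
    z = [h(sum(w * xi for w, xi in zip(row, x)) + b) for row, b in zip(W1, b1)]
    # Output layer f(2): K units, each a sigmoid of a weighted sum of z.
    y = [sigmoid(sum(w * zj for w, zj in zip(row, z)) + b) for row, b in zip(W2, b2)]
    return z, y

# D = 2 inputs, M = 3 hidden units, K = 2 outputs (made-up values).
W1 = [[0.5, -0.2], [0.1, 0.3], [-0.4, 0.8]]   # M x D first-layer weights
b1 = [0.0, 0.1, -0.1]                          # first-layer biases
W2 = [[1.0, -1.0, 0.5], [0.2, 0.4, -0.6]]      # K x M second-layer weights
b2 = [0.0, 0.0]                                # second-layer biases

z, y = forward([1.0, 2.0], W1, b1, W2, b2)
print(len(z), len(y))  # 3 2
```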


Feedforward net with depth 2


• Recognition of printed characters (OCR)
f (x)= f (2) [ f (1)(x)]
– Hidden layer f (1) compares raw pixel inputs to
component patterns


Width of Model
• Each hidden layer is typically vector-valued
• Dimensionality of hidden layer vector is width of
the model


Units of a model
• Each element of the vector is viewed as a neuron
– Instead of thinking of the layer as a single vector-to-vector function, we regard it as many units acting in parallel
• Each unit receives inputs from many other units and computes its own activation value


Depth versus Width


• Going deeper makes the network more expressive
– It can capture variations of the data better
– Depth yields expressiveness more efficiently than width
• The tradeoff for more expressiveness is an increased tendency to overfit
– You will need more data or additional regularization
• The network should be as deep as the training data allows
– But you can only determine a suitable depth by experiment
• Computation also increases with the number of layers

Very Deep CNNs


CNNs with depth 11 to 19
Depth increases from left (A) to right (E) as more layers are added (the added layers are shown in bold)

Convolutional layer parameters are denoted "conv (receptive field size) – (no. of channels)"

ReLU activation not shown for brevity


Why are they neural networks?


• These networks are loosely inspired by
neuroscience
• Each unit resembles a neuron
– Receives input from many other units
– Computes its own activation value
• Choice of functions f (i)(x):
– Loosely guided by neuroscientific observations
about biological neurons
• Modern neural networks are guided by many mathematical and engineering disciplines
• The goal is not to perfectly model the brain

Function Approximation is goal


• Think of feedforward networks as function approximation machines
– Designed to achieve statistical generalization
– They occasionally draw insights from what we know about the brain
– They are not models of brain function


Understanding Feedforward Nets


• Begin with linear networks and understand their
limitations
• Linear models such as logistic regression and
linear regression can be fit reliably and
efficiently using either
– Closed-form solution
– Convex optimization
• Limitation: model capacity is restricted to linear functions


Extending Linear Models


• To represent non-linear functions of x
– apply a linear model to the transformed input ϕ(x)
• where ϕ is non-linear
– Equivalently, the kernel trick of SVMs obtains nonlinearity

SVM Kernel trick

• Many ML algorithms can be rewritten in terms of dot products between examples:
f(x) = wTx + b can be written as b + Σi αi xTx(i)
where x(i) is a training example and α is a vector of coefficients
– This allows us to replace x with a feature function ϕ(x), and the dot product with a function k(x,x(i)) = ϕ(x)·ϕ(x(i)) called a kernel
• The · operator represents an inner product analogous to ϕ(x)Tϕ(x(i))
• For some feature spaces we may not literally use a vector inner product
– In continuous spaces, an inner product based on integration is used
– Gaussian kernel
• Consider k(u,v) = exp(-||u-v||²/2σ²)
– By expanding the square, ||u-v||² = uTu + vTv - 2uTv
– we get k(u,v) = exp(-uTu/2σ²) exp(uTv/σ²) exp(-vTv/2σ²)
• Validity follows from kernel construction rules
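
The factorization of the Gaussian kernel can be checked numerically. The sketch below assumes small made-up vectors `u` and `v`; the function names are hypothetical. Note that the middle factor carries a positive exponent, because the cross term -2uTv is multiplied by -1/2σ².

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def gaussian_kernel(u, v, sigma=1.0):
    """k(u, v) = exp(-||u - v||^2 / (2 sigma^2))."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))

def gaussian_kernel_factored(u, v, sigma=1.0):
    """Same kernel written as a product of three exponentials."""
    s2 = sigma ** 2
    return (math.exp(-dot(u, u) / (2.0 * s2))
            * math.exp(dot(u, v) / s2)
            * math.exp(-dot(v, v) / (2.0 * s2)))

u, v = [1.0, 2.0], [0.5, -1.0]   # illustrative vectors
print(abs(gaussian_kernel(u, v) - gaussian_kernel_factored(u, v)) < 1e-12)  # True
```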
SVM Prediction

• Use linear regression on the Lagrangian to determine the weights αi
• We can make predictions using
– f(x) = b + Σi αi k(x,x(i))
– The function is nonlinear with respect to x, but the relationship between ϕ(x) and f(x) is linear
– The relationship between α and f(x) is also linear
– We can think of ϕ as providing a set of features
• describing x, or providing a new representation for x

Disadvantages of Kernel Methods


• Cost of decision function evaluation: linear in m, the number of training examples
– Because the ith example contributes a term αi k(x, x(i)) to the decision function
– Can mitigate this by learning an α with mostly zeros
• Classification then requires evaluating the kernel function only for training examples that have a nonzero αi
• These are known as support vectors
• Cost of training: high with large data sets
• Generic kernels struggle to generalize well
– A neural net outperformed the RBF-SVM on MNIST
• Also, how to choose the mapping ϕ?
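
To make the evaluation cost concrete, here is a hedged sketch of the kernel decision function f(x) = b + Σi αi k(x, x(i)) that skips examples with αi = 0. The training points, the α values, and the names `rbf_kernel` and `predict` are assumptions made up for illustration; no SVM training is performed.

```python
import math

def rbf_kernel(u, v, sigma=1.0):
    sq_dist = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))

def predict(x, examples, alphas, b=0.0, kernel=rbf_kernel):
    """Return f(x) and the number of kernel evaluations actually needed."""
    # Only examples with nonzero alpha (the support vectors) contribute.
    support = [(a, xi) for a, xi in zip(alphas, examples) if a != 0.0]
    value = b + sum(a * kernel(x, xi) for a, xi in support)
    return value, len(support)

examples = [[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]]   # made-up training points
alphas = [1.0, 0.0, -1.0]                          # made-up coefficients
value, n_evals = predict([0.0, 0.0], examples, alphas)
print(n_evals)  # 2 kernel evaluations instead of 3
```

With a dense α the loop would touch every training example, which is the linear-in-m cost noted above.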

Options for choosing mapping ϕ


1. Generic feature function ϕ (x)
– Radial basis function
2. Manually engineer ϕ
– Feature engineering
3. Principle of Deep Learning: Learn ϕ


Option 1 to choose the mapping ϕ


• Generic feature function ϕ(x)
– Infinite-dimensional ϕ that is implicitly used by kernel machines based on RBF
• RBF: N(x; x(i), σ²I) centered at x(i)
• x(i): from k-means clustering
• σ = mean distance between each unit j and its closest neighbor
– If ϕ(x) is of high enough dimension we can have enough capacity to fit the training set
• But generalization to the test set often remains poor
• Generic feature mappings are based only on smoothness
– They do not include enough prior information to solve advanced problems

Option 2 to choose the mapping ϕ


• Manually engineer ϕ
• This was the dominant approach until arrival of
deep learning
• Requires decades of effort
– e.g., speech recognition, computer vision
• Little transfer between domains


Option 3 to choose the mapping ϕ


• Strategy of deep learning: learn ϕ
• Model is y = f(x; θ,w) = ϕ(x; θ)T w
– θ is used to learn ϕ from a broad class of functions
– Parameters w map from ϕ(x) to the output
– This defines a feedforward network in which ϕ defines a hidden layer
• Unlike the other two approaches (basis functions, manual engineering), this approach gives up on the convexity of training
– But its benefits outweigh the harms

Extend Linear Methods to Learn ϕ


K outputs y1,..,yK for a given input x
Hidden layer consists of M units

$$y_k(x; \theta, w) = \sum_{j=1}^{M} w_{kj} \, \phi_j\left( \sum_{i=1}^{D} \theta_{ji} x_i + \theta_{j0} \right) + w_{k0}$$

yk = fk(x; θ,w) = ϕ(x; θ)T w

Can be viewed as a generalization of linear models
• Nonlinear function fk with M+1 parameters wk = (wk0,..,wkM), with
• M basis functions ϕj, j=1,..,M, each with D parameters θj = (θj1,..,θjD)
• Both wk and θj are learnt from data


Approaches to Learning ϕ
• Parameterize the basis functions as ϕ(x;θ)
– Use optimization to find θ that corresponds to a
good representation
• This approach can capture the benefit of the first approach (fixed basis functions) by being highly generic
– By using a broad family for ϕ(x;θ)
• It can also capture the benefit of the second approach
– Human practitioners design families of ϕ(x;θ) that will perform well
– Need only find the right function family rather than precisely the right function

Importance of Learning ϕ
• Learning ϕ is discussed beyond this first introduction to feedforward networks
– It is a recurring theme throughout deep learning, applicable to all kinds of models
• Feedforward networks are the application of this principle to learning deterministic mappings from x to y without feedback
• The principle is also applicable to
– learning stochastic mappings
– functions with feedback
– learning probability distributions over a single vector

Plan of Discussion: Feedforward Networks


1. A simple example: learning XOR
2. Design decisions for a feedforward network
– Many are the same as for designing a linear model
• Basics of gradient descent
– Choosing the optimizer, cost function, form of output units
– Some are unique
• Concept of hidden layer
– Makes it necessary to have activation functions
• Architecture of network
– How many layers, how they are connected to each other, how many units in each layer
• Learning requires gradients of complicated functions
– Backpropagation and modern generalizations

1. Ex: XOR problem


• XOR: an operation on binary variables x1 and x2
– When exactly one value equals 1 it returns 1
otherwise it returns 0
– Target function is y=f *(x) that we want to learn
• Our model is y =f ([x1, x2] ; θ) which we learn, i.e., adapt
parameters θ to make it similar to f *
• Not concerned with statistical generalization
– Perform correctly on four training points:
• X={[0,0]T, [0,1]T,[1,0]T, [1,1]T}
– Challenge is to fit the training set
• We want f([0,0]T; θ) = f([1,1]T; θ) = 0
• f([0,1]T; θ) = f([1,0]T; θ) = 1

ML for XOR: linear model doesn’t fit


• Treat it as regression with MSE loss function
$$J(\theta) = \frac{1}{4} \sum_{x \in X} \left( f^*(x) - f(x;\theta) \right)^2 = \frac{1}{4} \sum_{n=1}^{4} \left( f^*(x_n) - f(x_n;\theta) \right)^2$$
– MSE is usually not used for binary data
– But the math is simple
• Alternative is cross-entropy
$$J(\theta) = -\ln p(t \mid \theta) = -\sum_{n=1}^{N} \{ t_n \ln y_n + (1 - t_n) \ln(1 - y_n) \}, \quad y_n = \sigma(\theta^T x_n)$$
• We must choose the form of the model
• Consider a linear model with θ = {w,b} where f(x; w,b) = xTw + b
– Minimize
$$J(\theta) = \frac{1}{4} \sum_{n=1}^{4} \left( t_n - x_n^T w - b \right)^2$$
to get a closed-form solution
• Differentiate with respect to w and b to obtain w = 0 and b = ½
– Then the linear model f(x; w,b) = ½ simply outputs 0.5 everywhere
– Why does this happen?
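
The closed-form result w = 0, b = ½ can be reproduced by solving the normal equations for the four XOR points. This is a sketch under assumptions: the small Gauss-Jordan helper `solve` is written here for illustration (it is not from the slides), and exact rational arithmetic is used so that the result is not blurred by rounding.

```python
from fractions import Fraction

X = [(0, 0), (0, 1), (1, 0), (1, 1)]
t = [0, 1, 1, 0]   # XOR targets

# Each row is [x1, x2, 1], so the last coefficient plays the role of b.
A = [[Fraction(x1), Fraction(x2), Fraction(1)] for x1, x2 in X]

def solve(M, rhs):
    """Gauss-Jordan elimination for a small nonsingular linear system."""
    n = len(M)
    aug = [list(row) + [r] for row, r in zip(M, rhs)]
    for col in range(n):
        pivot_row = next(r for r in range(col, n) if aug[r][col] != 0)
        aug[col], aug[pivot_row] = aug[pivot_row], aug[col]
        pivot = aug[col][col]
        aug[col] = [v / pivot for v in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [a - factor * p for a, p in zip(aug[r], aug[col])]
    return [row[-1] for row in aug]

# Normal equations: (A^T A) [w1, w2, b]^T = A^T t
AtA = [[sum(row[i] * row[j] for row in A) for j in range(3)] for i in range(3)]
Att = [sum(row[i] * tn for row, tn in zip(A, t)) for i in range(3)]
w1, w2, b = solve(AtA, Att)
print(w1, w2, b)   # 0 0 1/2
```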

Linear model cannot solve XOR


• Bold numbers (in the figure) are the values the system must output
• When x1=0, the output has to increase with x2
• When x1=1, the output has to decrease with x2
• The linear model f(x; w,b) = x1w1 + x2w2 + b has to assign a single weight to x2, so it cannot solve this problem
• A better solution:
– use a model to learn a different representation
• in which a linear model is able to represent the solution
– We use a simple feedforward network
• one hidden layer containing two hidden units


Feedforward Network for XOR

• Introduce a simple feedforward network
– with one hidden layer containing two units
• Same network drawn in two different styles
– Matrix W describes the mapping from x to h
– Vector w describes the mapping from h to y
– Intercept parameters b are omitted

Functions computed by Network


• Layer 1 (hidden layer): vector of hidden units h computed by function f(1)(x; W,c)
– c are bias variables
• Layer 2 (output layer) computes f(2)(h; w,b)
– w are linear regression weights
– Output is linear regression applied to h rather than to x
• Complete model is f(x; W,c,w,b) = f(2)(f(1)(x))

Linear vs Nonlinear functions


• If we choose both f(1) and f(2) to be linear, the total function will still be linear: f(x) = xTw'
– Suppose f(1)(x) = WTx and f(2)(h) = hTw
– Then we could represent this function as f(x) = xTw' where w' = Ww
• Since a linear model is insufficient, we must use a nonlinear function to describe the features
– We use the strategy of neural networks
– by using a nonlinear activation function: h = g(WTx + c)
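
The collapse of two linear layers into one can be verified numerically. In this sketch the matrix `W`, the vector `w`, and the input `x` are arbitrary made-up values, and the helper names are hypothetical.

```python
def dot(a, b):
    return sum(p * q for p, q in zip(a, b))

def matvec_T(W, x):
    """Compute W^T x, where W has one row per input and one column per hidden unit."""
    return [sum(W[i][j] * x[i] for i in range(len(x))) for j in range(len(W[0]))]

W = [[1.0, -2.0], [0.5, 3.0]]   # maps x to h (illustrative values)
w = [2.0, -1.0]                 # maps h to y
x = [4.0, -1.0]

two_layer = dot(matvec_T(W, x), w)      # f(2)(f(1)(x)) = (W^T x)^T w
w_prime = [dot(row, w) for row in W]    # w' = W w
one_layer = dot(x, w_prime)             # x^T w'
print(two_layer, one_layer)             # 18.0 18.0
```

Stacking linear layers therefore adds no representational power, which is why the nonlinearity g is needed.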

Activation Function
• In linear regression we used a vector of weights w and a scalar bias b
f(x; w,b) = xTw + b
– to describe an affine transformation from an input vector to an output scalar
• Now we describe an affine transformation from a vector x to a vector h, so an entire vector of bias parameters is needed
• Activation function g is typically chosen to be applied element-wise: hi = g(xTW:,i + ci)


Default Activation Function


• Activation: g(z) = max{0,z}
– Applying this to the output of a linear transformation yields a nonlinear transformation
– However, the function remains close to linear
• Piecewise linear with two pieces
• Therefore it preserves properties that make linear models easy to optimize with gradient-based methods
• It also preserves many properties that make linear models generalize well

A principle of CS: build complicated systems from minimal components.
– A Turing machine's memory needs only 0 and 1 states.
– We can build a universal function approximator from ReLUs.

Specifying the Network using ReLU


• Activation: g(z)=max{0,z}
• We can now specify the complete network as
f (x; W,c,w,b)=f (2)(f (1)(x))=wT max {0,WTx+c}+b

We can now specify XOR Solution


f(x; W,c,w,b) = wT max{0, WTx + c} + b

• Let
$$W = \begin{bmatrix} 1 & 1 \\ 1 & 1 \end{bmatrix}, \quad c = \begin{bmatrix} 0 \\ -1 \end{bmatrix}, \quad w = \begin{bmatrix} 1 \\ -2 \end{bmatrix}, \quad b = 0$$
• Now walk through how the model processes a batch of inputs
• Design matrix X of all four points:
$$X = \begin{bmatrix} 0 & 0 \\ 0 & 1 \\ 1 & 0 \\ 1 & 1 \end{bmatrix}$$
• First step is XW:
$$XW = \begin{bmatrix} 0 & 0 \\ 1 & 1 \\ 1 & 1 \\ 2 & 2 \end{bmatrix}$$
• Adding c:
$$XW + c = \begin{bmatrix} 0 & -1 \\ 1 & 0 \\ 1 & 0 \\ 2 & 1 \end{bmatrix}$$
In this space all points lie along a line with slope 1. The required outputs cannot be implemented by a linear model here.
• Compute h using ReLU:
$$\max\{0, XW + c\} = \begin{bmatrix} 0 & 0 \\ 1 & 0 \\ 1 & 0 \\ 2 & 1 \end{bmatrix}$$
This has changed the relationship among the examples. They no longer lie on a single line. A linear model suffices.
• Finish by multiplying by w:
$$f(x) = \begin{bmatrix} 0 \\ 1 \\ 1 \\ 0 \end{bmatrix}$$
• Network has obtained the correct answer for all 4 examples
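
The walk-through can be replayed in plain Python using the parameter values W, c, w, b stated in the slide; only the function names `relu`, `hidden`, and `xor_net` are introduced here for illustration.

```python
W = [[1, 1], [1, 1]]   # maps x to h
c = [0, -1]            # hidden-layer biases
w = [1, -2]            # maps h to y
b = 0

def relu(z):
    return max(0, z)

def hidden(x):
    """h = max{0, W^T x + c}, applied element-wise."""
    return [relu(sum(x[i] * W[i][j] for i in range(2)) + c[j]) for j in range(2)]

def xor_net(x):
    """f(x) = w^T h + b."""
    return sum(hj * wj for hj, wj in zip(hidden(x), w)) + b

X = [[0, 0], [0, 1], [1, 0], [1, 1]]
print([hidden(x) for x in X])    # [[0, 0], [1, 0], [1, 0], [2, 1]]
print([xor_net(x) for x in X])   # [0, 1, 1, 0]
```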


Learned representation for XOR


• Two points that must have output 1 have been collapsed into one
• Points x=[0,1]T and x=[1,0]T have been mapped into h=[1,0]T
• Described in linear model
– For fixed h2, output increases in h1

In the original x space:
– When x1=0, output has to increase with x2
– When x1=1, output has to decrease with x2

In the learned h space:
– When h1=0, output is constant 0 with h2
– When h1=1, output is constant 1 with h2
– When h1=2, output is constant 0 with h2

About the XOR example


• We simply specified the solution
– Then showed that it achieves zero error
• In real situations there might be billions of
parameters and billions of training examples
– So one cannot simply guess the solution
• Instead gradient descent optimization can find
parameters that produce very little error
– The solution described is at the global minimum
• Gradient descent could converge to this solution
• Convergence depends on initial values
• Would not always find easily understood integer solutions

Gradient-Based Learning


Topics
• Overview
1. Example: Learning XOR
2. Gradient-Based Learning
3. Hidden Units
4. Architecture Design
5. Backpropagation and Other Differentiation
6. Historical Notes


Topics in Gradient-based Learning


• Overview
1. Cost Functions
   1. Learning Conditional Distributions with Max Likelihood
   2. Learning Conditional Statistics
2. Output Units
   1. Linear Units for Gaussian Output Distributions
   2. Sigmoid Units for Bernoulli Output Distributions
   3. Softmax Units for Multinoulli Output Distributions
   4. Other Output Types

Overview of Gradient-based Learning


Neural network Training


• It is not much different from training any other
ML model with gradient descent. Need
1. optimization procedure
2. cost function
3. model family
• In the case of logistic regression:
1. optimization procedure is gradient descent
2. cost function is cross entropy
3. model family is Bernoulli distribution


Standard ML Training vs NN Training


• Largest difference between simple ML models and neural networks is:
– Nonlinearity of a neural network causes interesting loss functions to be non-convex

Figure panels: Logistic Regression loss; Neural Network loss

Linear regression with basis functions:
$$E_D(w) = \frac{1}{2} \sum_{n=1}^{N} \{ t_n - w^T \phi(x_n) \}^2$$
Loss:
$$J(\theta) = -E_{x,y \sim \hat{p}_{data}} \log p_{model}(y \mid x)$$

– Use iterative gradient-based optimizers that merely drive the cost to a low value, rather than
• exact linear equation solvers used for linear regression, or
• convex optimization algorithms used for logistic regression or SVMs

Convex vs Non-convex
• Convex methods:
– Converge from any initial parameters
– Robust in theory, but can encounter numerical problems
• SGD with non-convex loss:
– Sensitive to initial parameters
– For feedforward networks, it is important to initialize
• weights to small random values, biases to zero or small positive values
– SGD can also train linear regression and SVMs, especially with large training sets
– Training a neural net is not much different from training other models
• Except that computing the gradient is more complex

Choices for gradient learning


• We must choose a cost function
• We must choose how to represent the output of
the model
• We now visit these design considerations


Cost Functions


Cost Functions for Deep Learning


• An important aspect of the design of deep neural networks is the cost function
– Cost functions are similar to those for parametric models such as linear models
• A parametric model defines a distribution p(y|x;θ)
– We use the principle of maximum likelihood
– This means we use the cross-entropy between the training data and the model's prediction
– We will see how maximum likelihood leads to cross-entropy in the case of logistic regression

Logistic Regression & Cross entropy


• Binary-valued output
– Training data: {xn, tn}, tn ∈ {0,1}
• Model is a conditional of the form p(y|x;θ)
y = σ(θTx)
– Model assigns a probability to each observation
$$p(t_n \mid x_n) = y_n^{t_n} \{1 - y_n\}^{1 - t_n}$$
– Likelihood function
$$p(t \mid \theta) = \prod_{n=1}^{N} y_n^{t_n} \{1 - y_n\}^{1 - t_n}, \quad y_n = \sigma(\theta^T x_n)$$
• We use maximum likelihood to define the cost
• Negative log-likelihood is
$$J(\theta) = -\ln p(t \mid \theta) = -\sum_{n=1}^{N} \{ t_n \ln y_n + (1 - t_n) \ln(1 - y_n) \}$$
• This is the cross-entropy between data tn and prediction yn
– Definition of cross-entropy is given next
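
A small numerical check that the cross-entropy cost is the negative log of the Bernoulli likelihood. The data `xs`, `ts` and the parameter vector `theta` are made-up values, and the function names are assumptions for illustration.

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def predictions(theta, xs):
    """y_n = sigmoid(theta^T x_n) for every input x_n."""
    return [sigmoid(sum(th * xi for th, xi in zip(theta, x))) for x in xs]

def likelihood(theta, xs, ts):
    """p(t | theta) = prod_n y_n^t_n (1 - y_n)^(1 - t_n)."""
    p = 1.0
    for y, tn in zip(predictions(theta, xs), ts):
        p *= (y ** tn) * ((1.0 - y) ** (1 - tn))
    return p

def cross_entropy_cost(theta, xs, ts):
    """J(theta) = -sum_n [t_n ln y_n + (1 - t_n) ln(1 - y_n)]."""
    return -sum(tn * math.log(y) + (1 - tn) * math.log(1.0 - y)
                for y, tn in zip(predictions(theta, xs), ts))

xs = [[1.0, 0.5], [1.0, -1.5], [1.0, 2.0]]   # illustrative inputs
ts = [1, 0, 1]                               # illustrative binary targets
theta = [0.1, 0.8]
gap = cross_entropy_cost(theta, xs, ts) + math.log(likelihood(theta, xs, ts))
print(abs(gap) < 1e-9)   # True
```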


Definition of Cross entropy


• If p is the training set distribution and q the learned one, their cross-entropy is
$$H(p, q) = E_p[-\log q] = -\sum_x p(x) \log q(x)$$
– Compare this to the logistic regression cost
$$J(\theta) = H(t, y) = -\sum_{n=1}^{N} \{ t_n \ln y_n + (1 - t_n) \ln(1 - y_n) \}, \quad y_n = \sigma(\theta^T x_n)$$
– Furthermore, there is an interesting connection of cross-entropy to K-L divergence
• Minimizing cross-entropy is the same as making the learned and data distributions the same
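
The connection to K-L divergence is the identity H(p,q) = H(p) + KL(p||q): since H(p) does not depend on q, minimizing cross-entropy over q minimizes the divergence. The sketch below checks this on two made-up discrete distributions; the function names are hypothetical.

```python
import math

def cross_entropy(p, q):
    """H(p, q) = -sum_x p(x) log q(x)."""
    return -sum(px * math.log(qx) for px, qx in zip(p, q) if px > 0.0)

def entropy(p):
    return cross_entropy(p, p)

def kl_divergence(p, q):
    """KL(p || q) = sum_x p(x) log(p(x) / q(x))."""
    return sum(px * math.log(px / qx) for px, qx in zip(p, q) if px > 0.0)

p = [0.5, 0.25, 0.25]   # stands in for the training distribution
q = [0.4, 0.4, 0.2]     # stands in for the learned distribution
print(abs(cross_entropy(p, q) - (entropy(p) + kl_divergence(p, q))) < 1e-12)  # True
print(kl_divergence(p, p))  # 0.0
```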

Gradient descent cross-entropy minimization

Learning Conditional Distributions with maximum likelihood

• Most modern neural networks are trained using maximum likelihood
– This means the cost is simply the negative log-likelihood
• Equivalently, the cross-entropy between the training set and the model distribution
• This cost function is given by
$$J(\theta) = -E_{x,y \sim \hat{p}_{data}} \log p_{model}(y \mid x)$$
– The specific form of the cost function changes from model to model, depending on the form of log pmodel
• It may yield terms that do not depend on model parameters, which we may discard

Cost Function with Gaussian model


• If pmodel(y|x) = N(y; f(x;θ), I)
– then using maximum likelihood we recover the mean squared error cost
$$J(\theta) = \frac{1}{2} E_{x,y \sim \hat{p}_{data}} \| y - f(x;\theta) \|^2 + \text{const}$$
– up to a scaling factor ½ and a term independent of θ
• const depends on the variance of the Gaussian, which we chose not to parameterize
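
For a scalar output with unit variance, the negative log-likelihood equals half the squared error plus the constant ½ log(2π). A minimal check, with hypothetical function names and made-up (y, mean) pairs:

```python
import math

def gaussian_density(y, mean):
    """Density of N(y; mean, 1), a scalar unit-variance Gaussian."""
    return math.exp(-0.5 * (y - mean) ** 2) / math.sqrt(2.0 * math.pi)

def nll(y, mean):
    return -math.log(gaussian_density(y, mean))

CONST = 0.5 * math.log(2.0 * math.pi)   # the term independent of the prediction

def half_squared_error(y, mean):
    return 0.5 * (y - mean) ** 2

pairs = [(0.0, 0.0), (1.5, -0.5), (-2.0, 1.0)]   # (target, predicted mean)
print(all(abs(nll(y, m) - (half_squared_error(y, m) + CONST)) < 1e-12
          for y, m in pairs))   # True
```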


Advantage of approach to cost


• Deriving the cost from maximum likelihood removes the burden of designing cost functions for each model
• Specifying the model p(y|x) automatically determines a cost function −log p(y|x)


Desirable Property of Gradient


• Recurring theme in neural network design is:
– Gradient must be large and predictable enough to
serve as good guide to the learning algorithm
• Functions that saturate (become very flat)
undermine this objective
– Because the gradient becomes very small
• Happens when activation functions producing output of
hidden/output units saturate


Keeping the Gradient Large


• Negative log-likelihood helps avoid the saturation problem for many models
– Many output units involve an exp function that saturates when its argument is very negative
– The log function in the negative log-likelihood cost undoes the exp of some units

$$p(t \mid \theta) = \prod_{n=1}^{N} y_n^{t_n} \{1 - y_n\}^{1 - t_n}, \quad y_n = \sigma(\theta^T x_n)$$
$$J(\theta) = -\ln p(t \mid \theta) = -\sum_{n=1}^{N} \{ t_n \ln y_n + (1 - t_n) \ln(1 - y_n) \}$$
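
A sketch of why the log helps, for a single sigmoid output y = σ(z) with target t. Differentiating the two costs with respect to z gives σ(z) − t for the negative log-likelihood and (y − t)·y·(1 − y) for the squared error ½(y − t)². The chosen values z = −10, t = 1 (a confidently wrong prediction) are an assumption for illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def grad_nll(z, t):
    """d/dz of -[t ln y + (1 - t) ln(1 - y)] with y = sigmoid(z)."""
    return sigmoid(z) - t

def grad_mse(z, t):
    """d/dz of 0.5 * (y - t)^2 with y = sigmoid(z)."""
    y = sigmoid(z)
    return (y - t) * y * (1.0 - y)

z, t = -10.0, 1   # model is confidently wrong
print(abs(grad_nll(z, t)) > 0.99)   # True: strong learning signal
print(abs(grad_mse(z, t)) < 1e-4)   # True: gradient has nearly vanished
```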


Cross Entropy and Regularization


• A property of the cross-entropy cost used for MLE is that it usually does not have a minimum value
– For discrete output variables, most models cannot represent a probability of zero or one, but can come arbitrarily close
• Logistic regression is an example
– For real-valued output variables, it becomes possible to assign extremely high density to the correct training set outputs, e.g., by learning the variance parameter of a Gaussian output, and the resulting cross-entropy approaches negative infinity
• Regularization modifies the learning problem so the model cannot reap unlimited reward this way

Learning Conditional Statistics


• Instead of learning a full probability distribution,
we often want to learn just one conditional
statistic of y given x
– E.g., a predictor f (x;θ) which we wish to employ to
predict the mean of y
• E.g., predict mean salary, predict mean gas pedal

y = annual salary


Learning a function
• If we have a sufficiently powerful neural network, we can think of it as being able to represent any function f from a wide class of functions
– This class is limited only by features such as
1. Boundedness and
2. Continuity
– rather than by having a specific parametric form
– From this point of view, the cost function is a functional rather than a function
• Choose a function that is best for the task

What is a Functional?

• View the cost as a functional, not a function
– A functional is a mapping from functions to real numbers
• E.g., entropy is a functional: H[p] = −Σx p(x) log p(x)
– Maximum entropy is to find the p(x) that maximizes H[p]
• We can think of learning as a task of choosing a function rather than a set of parameters
– We can design our cost functional to have its minimum occur at a specific function we desire
• E.g., design the cost functional to have its minimum lie on the function that maps x to the expected value of y given x
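
Entropy as a functional can be illustrated by treating a whole distribution as the argument and searching for the one that maximizes H[p]. The sketch restricts the search to Bernoulli distributions on a coarse grid of P(x=1) values, which is an assumption made to keep the example small; it is not calculus of variations.

```python
import math

def entropy(p):
    """Functional H[p] = -sum_x p(x) log p(x): maps a distribution to a real number."""
    return -sum(px * math.log(px) for px in p if px > 0.0)

# Candidate Bernoulli distributions, indexed by q = P(x=1).
grid = [i / 100 for i in range(1, 100)]
best = max(grid, key=lambda q: entropy([q, 1.0 - q]))
print(best)   # 0.5
```

The maximizer is the uniform distribution, whose entropy is log 2.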
Definition of Functionals

Examples of functionals (figure panels): AUC; curve length; entropy for a Bernoulli distribution with different P(x=1)

Optimization via Calculus of Variations


• Solving an optimization problem with respect to
a function requires a mathematical tool called
calculus of variations
• For the present, only need to know that calculus
of variations is used to derive two results
1. First concerns predicting the mean of a distribution
2. Second one concerns predicting the median


Predicting the mean of a distribution


• We want predictor f(x) to predict the mean of y
• From calculus of variations, solving the optimization problem
$$f^* = \arg\min_f E_{x,y \sim \hat{p}_{data}} \| y - f(x) \|^2$$
(note: the optimization is over functions) yields
$$f^*(x) = E_{y \sim p_{data}(y \mid x)}[y]$$
– This means that if we could train on infinitely many samples from the true data-generating distribution,
– minimizing the MSE cost function gives a function that predicts the mean of y for each value of x

Predicting the median of a distribution


• Different cost functions give different statistics
• Second result from calculus of variations is that
$$f^* = \arg\min_f E_{x,y \sim p_{data}} \| y - f(x) \|_1$$
– yields a function that predicts the median of y for each x
– This cost function is referred to as mean absolute error
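
A finite-sample analogue of the two results: for a fixed x, the best constant prediction under squared error is the sample mean, and under absolute error it is the sample median. The sample `ys` and the integer candidate grid are made-up assumptions; a grid search stands in for the calculus-of-variations argument.

```python
import statistics

ys = [1, 2, 3, 4, 100]        # illustrative targets observed for one x
candidates = range(0, 101)    # candidate constant predictions

best_squared = min(candidates, key=lambda c: sum((y - c) ** 2 for y in ys))
best_absolute = min(candidates, key=lambda c: sum(abs(y - c) for y in ys))

print(best_squared, statistics.mean(ys))      # 22 22
print(best_absolute, statistics.median(ys))   # 3 3
```

The outlier 100 pulls the squared-error minimizer far from the bulk of the data, while the absolute-error minimizer stays at the median.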


MSE/MAE vs Cross Entropy Cost


• Mean squared error and mean absolute error often lead to poor results when used with gradient-based optimization
– Some output units that saturate produce very small gradients when combined with these cost functions
• This is one reason the cross-entropy cost is more popular than mean squared error and mean absolute error
– Even when it is not necessary to estimate the entire distribution p(y|x)

Output Units


Output Units
• Choice of cost function is tightly coupled with
choice of output unit
– Most of the time we use cross-entropy between
data distribution and model distribution
• Choice of how to represent the output then determines
the form of the cross-entropy function
• In logistic regression output is binary-valued

Cross-entropy in logistic regression, θ = {w,b}:

$$J(\theta) = -\ln p(t \mid \theta) = -\sum_{n=1}^{N} \{ t_n \ln y_n + (1 - t_n) \ln(1 - y_n) \}, \quad y_n = \sigma(\theta^T x_n)$$


Role of Output Units


• Any output unit is also usable as a hidden unit
• Our focus here is on units used as outputs, not internally
– We revisit them when discussing hidden units
• A feedforward network provides a set of hidden features h = f(x; θ)
• The role of the output layer is to provide some additional transformation from the features to complete the task that the network must perform

Types of output units


1. Linear units: for mean of a Gaussian output
2. Sigmoid units for Bernoulli Output Distributions
3. Softmax units for Multinoulli Output
4. Other Output Types
– Not direct prediction of y but provide parameters of
distribution over y


Linear Units for Gaussian Output Distributions


• Linear unit: simple output based on an affine transformation with no nonlinearity
– Given features h, a layer of linear output units produces a vector ŷ = WTh + b
• Linear units are often used to produce the mean ŷ of a conditional Gaussian distribution
P(y|x) = N(y; ŷ, I)
• Maximizing the log-likelihood is then equivalent to minimizing the mean squared error

Linear units for Gaussian covariance


• Linear units can be used to learn the covariance of a Gaussian too, or to make the covariance a function of the input
• However, the covariance needs to be constrained to be a positive definite matrix for all inputs
– This is typically difficult to satisfy with a linear output layer, so other output units are used to parameterize the covariance


Advantage of linear units


• Because they do not saturate, they pose little
difficulty for gradient-based optimization and
may be used with a variety of optimization
techniques
• Possible use in VAEs


Sigmoid Units for Bernoulli Output Distributions


• Task of predicting value of binary variable y
– Classification problem with two classes
• Maximum likelihood approach is to define a
Bernoulli distribution over y conditioned on x
• Neural net needs to predict P(y=1 | x)
– To be a valid probability it needs to lie in the interval [0,1]
– Satisfying this constraint needs careful design
• If we use P(y=1 | x) = max{0, min{1, w^T h + b}}
• We would define a valid conditional distribution, but
cannot train it effectively with gradient descent
• When w^T h + b strays outside the unit interval, gradient = 0
– so the learning algorithm has no guidance on how to improve the parameters
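
The zero-gradient problem can be illustrated with a short sketch. The function names below are hypothetical, and the sigmoid gradient formula σ(z) − y is the standard result for the negative log-likelihood of a sigmoid output; it is shown only for contrast.

```python
import math

def clipped_prob(z):
    """P(y=1 | x) = max{0, min{1, z}} with z = w^T h + b."""
    return max(0.0, min(1.0, z))

def clipped_prob_grad(z):
    """Derivative of clipped_prob with respect to z: 1 inside (0, 1), else 0."""
    return 1.0 if 0.0 < z < 1.0 else 0.0

def sigmoid(z):
    """Logistic sigmoid, written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def sigmoid_nll_grad(y, z):
    """d/dz of -log P(y | z) when P(y=1) = sigmoid(z)."""
    return sigmoid(z) - y

# Target is y = 1 but z is strongly negative: the model is confidently wrong.
z_wrong = -5.0
grad_clipped = clipped_prob_grad(z_wrong)    # no signal reaches the parameters
grad_sigmoid = sigmoid_nll_grad(1, z_wrong)  # close to -1, a strong signal
```

The clipped unit gives no gradient at exactly the point where the model most needs correcting, while the sigmoid unit trained by maximum likelihood does.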

Sigmoid and Logistic Regression


• To ensure a strong gradient whenever the model
has the wrong answer, use sigmoid output units
– combined with maximum likelihood
ŷ = σ(w^T h + b),   σ(x) = 1 / (1 + exp(−x))
• where σ(x) is the logistic sigmoid function
• A sigmoid output unit has two components:
1. A linear layer to compute z = w^T h + b
2. A sigmoid activation function to convert z into a
probability
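
The two components can be sketched in a few lines of plain Python. The name `sigmoid_output_unit` and the example values of h, w and b are made up for illustration; the branch on the sign of x is an implementation choice to avoid overflow, not something the slides prescribe.

```python
import math

def sigmoid(x):
    """Logistic sigmoid 1 / (1 + exp(-x)), evaluated without overflow."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)

def sigmoid_output_unit(h, w, b):
    """Step 1: linear layer z = w^T h + b. Step 2: squash z to a probability."""
    z = sum(wi * hi for wi, hi in zip(w, h)) + b
    return sigmoid(z)

h = [0.5, -1.0, 2.0]   # features from the last hidden layer (illustrative)
w = [1.0, 0.5, 0.25]   # output weights (illustrative)
b = -0.5
p = sigmoid_output_unit(h, w, b)  # here z = 0, so p = 0.5
```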


Probability distribution using Sigmoid


• Describe probability distribution over y using z
z = w^T h + b    (y is output, z is input)
– Construct unnormalized probability distribution P̃(y)
• Assuming unnormalized log probability is linear in y and z
log P̃(y) = yz
P̃(y) = exp(yz)
• Normalizing yields a Bernoulli distribution controlled by σ
P(y) = exp(yz) / Σ_{y'=0}^{1} exp(y'z)
     = σ((2y − 1)z)
– Probability distributions based on exponentiation
and normalization are common throughout
statistical modeling
• The z variable defining such a distribution over binary
variables is called a logit
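
The normalization step can be verified numerically. This is a minimal sketch with illustrative helper names; it uses the plain sigmoid formula, so it assumes logits of moderate size.

```python
import math

def sigmoid(x):
    """Plain logistic sigmoid; adequate for the moderate inputs used here."""
    return 1.0 / (1.0 + math.exp(-x))

def bernoulli_from_logit(y, z):
    """Normalize the unnormalized probabilities exp(y' z) over y' in {0, 1}."""
    normalizer = sum(math.exp(y_prime * z) for y_prime in (0, 1))
    return math.exp(y * z) / normalizer

# For every (y, z) pair the normalized value should equal sigmoid((2y - 1) z).
checks = [
    (bernoulli_from_logit(y, z), sigmoid((2 * y - 1) * z))
    for z in (-2.0, 0.3, 4.0)
    for y in (0, 1)
]
```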

Max Likelihood Loss Function


• Given binary y and some z, the normalized
probability distribution over y is
log P̃(y) = yz
P̃(y) = exp(yz)
P(y) = exp(yz) / Σ_{y'=0}^{1} exp(y'z) = σ((2y − 1)z)
• We can use this approach in maximum
likelihood learning
– Loss for max likelihood learning is −log P(y | x)
J(θ) = −log P(y | x)
     = −log σ((2y − 1)z)
     = ζ((1 − 2y)z)
where ζ is the softplus function
• This is for a single sample



Softplus function
• Sigmoid saturates when its argument is very
positive or very negative
– i.e., function is insensitive to small changes in input
• Compare it to the softplus function
ζ(x) = log(1+ exp(x))

– Its range is (0,∞). It arises in expressions involving
sigmoids.
– Its name comes from its being a smoothed or
softened version of x⁺ = max(0, x)
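
A direct evaluation of log(1 + exp(x)) overflows for large x, so implementations usually rearrange it. The sketch below uses the identity ζ(x) = max(0, x) + log(1 + exp(−|x|)), which also makes the relation to the positive part max(0, x) explicit. The rearrangement is a common numerical technique and is not taken from the slides.

```python
import math

def softplus(x):
    """zeta(x) = log(1 + exp(x)), rearranged so exp never sees a large argument."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))

def positive_part(x):
    """x+ = max(0, x), the function that softplus smooths."""
    return max(0.0, x)

# Far from zero the two functions are nearly identical; near zero softplus
# is a smooth curve passing through log(2) at x = 0.
gap_at_zero = softplus(0.0) - positive_part(0.0)
gap_at_ten = softplus(10.0) - positive_part(10.0)
```

In floating point, `softplus(-1000.0)` rounds to exactly 0.0 even though the mathematical range excludes 0.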

Properties of Sigmoid & Softplus

Justification for the name ‘softplus’


Smoothed version of positive part function
x⁺ = max{0, x}
The positive part function is the counterpart
of the negative part function x⁻ = max{0, −x}


Loss Function for Bernoulli MLE


J(θ) = −log P(y | x)
     = −log σ((2y − 1)z)
     = ζ((1 − 2y)z)
– By rewriting the loss in terms of the softplus
function, we can see that it saturates only when
(1 − 2y)z << 0.
– Saturation occurs only when the model already has the
right answer
• i.e., when y=1 and z >> 0, or y=0 and z << 0
• When z has the wrong sign, (1 − 2y)z can be simplified to |z|
– As |z| becomes large while z has the wrong sign, softplus
asymptotes towards simply returning its argument |z|, and the derivative
with respect to z asymptotes to sign(z); so, in the limit of extremely incorrect z,
softplus does not shrink the gradient at all
– This is a useful property because gradient-based learning can act
quickly to correct a mistaken z
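
The saturation behaviour can be seen by evaluating the loss and its derivative at a few logits. This is a sketch with hypothetical function names; the closed form dJ/dz = σ(z) − y follows from differentiating ζ((1 − 2y)z) and is checked against a finite difference in the code.

```python
import math

def sigmoid(x):
    """Logistic sigmoid, evaluated without overflow."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)

def softplus(x):
    """zeta(x) = log(1 + exp(x)), evaluated without overflow."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))

def bernoulli_nll(y, z):
    """J = zeta((1 - 2y) z) for a single example with label y in {0, 1}."""
    return softplus((1 - 2 * y) * z)

def bernoulli_nll_grad(y, z):
    """dJ/dz = sigmoid(z) - y."""
    return sigmoid(z) - y

grad_right = bernoulli_nll_grad(1, 30.0)   # correct and confident: near 0
grad_wrong = bernoulli_nll_grad(1, -30.0)  # wrong and confident: near -1 = sign(z)

# Finite-difference check of the closed-form gradient at a moderate logit.
eps = 1e-6
numeric = (bernoulli_nll(1, 0.7 + eps) - bernoulli_nll(1, 0.7 - eps)) / (2 * eps)
```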

Cross-Entropy vs Softplus Loss


Cross-entropy loss (N samples):
p(y | θ) = ∏_{n=1}^{N} σ(θ^T x_n)^{y_n} {1 − σ(θ^T x_n)}^{1 − y_n}
J(θ) = −ln p(y | θ)
     = −Σ_{n=1}^{N} { y_n ln σ(θ^T x_n) + (1 − y_n) ln(1 − σ(θ^T x_n)) }

Softplus loss (single sample), with z = θ^T x + b:
J(θ) = −log P(y | x)
     = −log σ((2y − 1)z)
     = ζ((1 − 2y)z)

– Cross-entropy evaluated directly from σ(z) can saturate,
in finite precision, anytime σ(z) saturates
• Sigmoid saturates to 0 when z becomes very negative
and saturates to 1 when z becomes very positive
– The computed gradient can then become too small to be useful for
learning, whether the model has the correct or incorrect
answer
– For a single sample the two losses are algebraically the same, so the
softplus form is an alternative implementation of
logistic regression
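
The difference between the two forms shows up only in floating point. The sketch below compares them in plain Python; `naive_cross_entropy` and `softplus_cross_entropy` are illustrative names, and returning infinity when the probability rounds to zero is a choice made here to keep the example from raising an error.

```python
import math

def sigmoid(x):
    """Logistic sigmoid, evaluated without overflow."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)

def softplus(x):
    """zeta(x) = log(1 + exp(x)), evaluated without overflow."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))

def naive_cross_entropy(y, z):
    """-[y log sigma(z) + (1 - y) log(1 - sigma(z))], computed via sigma(z)."""
    p = sigmoid(z)
    q = p if y == 1 else 1.0 - p
    # When sigma(z) has rounded to exactly 0 or 1, the log is undefined.
    return -math.log(q) if q > 0.0 else math.inf

def softplus_cross_entropy(y, z):
    """The same loss written as zeta((1 - 2y) z), computed from the logit."""
    return softplus((1 - 2 * y) * z)

# Moderate logit: both forms agree.
agree = abs(naive_cross_entropy(1, 0.5) - softplus_cross_entropy(1, 0.5))

# Confidently wrong (y = 0, z = 40): sigma(z) rounds to 1.0 in double precision.
naive_loss = naive_cross_entropy(0, 40.0)
stable_loss = softplus_cross_entropy(0, 40.0)
```

Working from the logit z rather than from σ(z) keeps the loss finite and informative for badly wrong predictions.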

Softmax units for Multinoulli Output


• Any time we want a probability distribution over
a discrete variable with n values we may use
the softmax function
– Can be seen as a generalization of sigmoid function
used to represent probability distribution over a
binary variable
• Softmax most often used for output of classifier
to represent distribution over n classes
– Also inside the model itself when we wish to choose
between one of n options

From Sigmoid to Softmax


• Binary case: we wished to produce a single number
ŷ = P(y = 1 | x)
• Since (i) this number needed to lie between 0 and 1 and
(ii) we wanted its logarithm to be well-behaved
for a gradient-based optimization of log-likelihood, we
chose instead to predict a number
z = w^T h + b,   p(C_1 | φ) = y(φ) = σ(θ^T φ)
• Exponentiating and normalizing gave us a Bernoulli
distribution controlled by the sigmoidal transformation of z
log P̃(y) = yz
P̃(y) = exp(yz)
P(y) = exp(yz) / Σ_{y'=0}^{1} exp(y'z) = σ((2y − 1)z)
• Case of n values: need to produce vector ŷ
• with values ŷ_i = P(y = i | x)

Softmax definition
• We need to produce a vector ŷ with values
ŷ_i = P(y = i | x)
• We need the elements of ŷ to lie in [0,1] and to sum to 1
• Same approach as with Bernoulli works for
Multinoulli distribution
• First a linear layer predicts unnormalized log probabilities
z = W^T h + b
– where z_i = log P̃(y = i | x)
• Softmax can then exponentiate and normalize z
to obtain the desired ŷ
• Softmax is given by: softmax(z)_i = exp(z_i) / Σ_j exp(z_j)

Softmax Regression

Generalization of Logistic Regression to multivalued output


Softmax definition
y_i = softmax(z)_i = exp(z_i) / Σ_j exp(z_j)

Network computes, in matrix multiplication notation:
z = W^T x + b

[Figure: an example]
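
A forward pass of softmax regression can be sketched as follows. The weight layout (`W[d][k]` holds the weight from input d to class k), the function names, and the numbers are all assumptions made for illustration. The softmax here follows the definition literally, so it assumes logits of moderate size.

```python
import math

def softmax(z):
    """softmax(z)_i = exp(z_i) / sum_j exp(z_j), straight from the definition."""
    exps = [math.exp(v) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def softmax_regression(x, W, b):
    """Compute z = W^T x + b, then turn z into class probabilities."""
    n_inputs, n_classes = len(x), len(b)
    z = [sum(W[d][k] * x[d] for d in range(n_inputs)) + b[k]
         for k in range(n_classes)]
    return softmax(z)

x = [1.0, 2.0]                 # two input features
W = [[0.5, -0.5, 0.0],         # weights from feature 0 to the three classes
     [0.0, 0.25, 0.5]]         # weights from feature 1 to the three classes
b = [0.0, 0.0, 0.0]
y_hat = softmax_regression(x, W, b)      # logits are [0.5, 0.0, 1.0]
predicted_class = y_hat.index(max(y_hat))
```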


Intuition of Log-likelihood Terms


softmax(z)_i = exp(z_i) / Σ_j exp(z_j)
• The exp within softmax works
very well when training using log-likelihood
– Log-likelihood can undo the exp of softmax
log softmax(z)_i = z_i − log Σ_j exp(z_j)
– Input z_i always has a direct contribution to cost
• Because this term cannot saturate, learning can proceed
even if the contribution of z_i to the second term becomes very small
– First term encourages z_i to be pushed up
– Second term encourages all z to be pushed down

Intuition of second term of likelihood


• Log likelihood is log softmax(z)_i = z_i − log Σ_j exp(z_j)
• Consider second term: log Σ_j exp(z_j)
• It can be approximated by max_j z_j
– Based on the idea that exp(z_k) is insignificant for any
z_k noticeably less than max_j z_j
• Intuition gained:
– Cost penalizes most active incorrect prediction
– If the correct answer already has the largest input to
softmax, then the −z_i term and the log Σ_j exp(z_j) ≈ max_j z_j = z_i
terms will roughly cancel. This example will then
contribute little to overall training cost
• Which will be dominated by other, incorrectly classified examples
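
The approximation and the resulting cancellation can be checked on a small example. This is a sketch; the logits are made up and the helper names are illustrative.

```python
import math

def logsumexp(z):
    """log sum_j exp(z_j), shifted by max(z) for numerical safety."""
    m = max(z)
    return m + math.log(sum(math.exp(v - m) for v in z))

def nll(z, i):
    """Per-example cost -log softmax(z)_i = log sum_j exp(z_j) - z_i."""
    return logsumexp(z) - z[i]

z = [8.0, 1.0, -2.0]  # class 0 has by far the largest input

approx_error = logsumexp(z) - max(z)  # small: the largest logit dominates
cost_if_correct = nll(z, 0)           # true class 0: the two terms nearly cancel
cost_if_wrong = nll(z, 1)             # true class 1: cost is about max(z) - z_1
```

When the true class is 1, the cost is close to 8 − 1 = 7, the margin by which the most active incorrect prediction beats the correct one.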

Generalization to Training Set


• So far we discussed only a single example
• Overall, unregularized maximum likelihood will
drive the model to learn parameters that drive
the softmax to predict a fraction of counts of
each outcome observed in training set

softmax(z(x; θ))_i ≈ Σ_{j=1}^{m} 1[y^(j) = i, x^(j) = x] / Σ_{j=1}^{m} 1[x^(j) = x]
where 1[·] equals 1 when the condition holds and 0 otherwise
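
The right-hand side of this expression is just a ratio of counts, which the following sketch computes for a toy dataset. The data and the function name `empirical_conditional` are invented for illustration.

```python
from collections import Counter

def empirical_conditional(data, x):
    """Fraction of each label among training pairs whose input equals x."""
    labels = [label for (x_j, label) in data if x_j == x]
    if not labels:
        return {}  # x never appears in the training set
    counts = Counter(labels)
    return {label: n / len(labels) for label, n in counts.items()}

# Toy training set of (input, label) pairs with a repeated input "a".
data = [("a", 0), ("a", 0), ("a", 1), ("b", 2)]
target = empirical_conditional(data, "a")
```

For input "a", an unregularized maximum likelihood fit with enough capacity would be driven towards predicting these fractions.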


Softmax and Objective Functions


• Objective functions that do not use a log to
undo the exp of softmax fail to learn when
argument of exp becomes very negative,
causing gradient to vanish
• Squared error is a poor loss function for
softmax units
– It can fail to train the model to change its output, even when the
model makes highly confident incorrect predictions
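
The vanishing gradient under squared error can be seen by comparing gradients with respect to the logits for a confidently wrong prediction. This sketch assumes a one-hot target and a sum-of-squares loss on the softmax outputs; the function names are illustrative.

```python
import math

def softmax(z):
    """Softmax with the maximum subtracted for numerical safety."""
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def nll_grad(z, target):
    """Gradient of -log softmax(z)_target w.r.t. z: softmax(z) - one_hot."""
    p = softmax(z)
    return [p[i] - (1.0 if i == target else 0.0) for i in range(len(z))]

def squared_error_grad(z, target):
    """Gradient of sum_k (softmax(z)_k - one_hot_k)^2 w.r.t. z, by chain rule."""
    p = softmax(z)
    t = [1.0 if k == target else 0.0 for k in range(len(z))]
    grad = []
    for i in range(len(z)):
        g = 0.0
        for k in range(len(z)):
            dp_k_dz_i = p[k] * ((1.0 if k == i else 0.0) - p[i])  # softmax Jacobian
            g += 2.0 * (p[k] - t[k]) * dp_k_dz_i
        grad.append(g)
    return grad

z = [10.0, -10.0]  # the model strongly prefers class 0
target = 1         # but the true class is 1
g_nll = nll_grad(z, target)           # about [1, -1]: a strong correction
g_sq = squared_error_grad(z, target)  # on the order of 1e-8: almost no signal
```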


Saturation of Sigmoid and Softmax


• Sigmoid has a single output that saturates
– When input is extremely negative or positive

• Like sigmoid, softmax activation can saturate


– In case of softmax there are multiple output values
• These output values can saturate when the differences
between input values become extreme
– Many cost functions based on softmax also saturate

Softmax & Input Difference


• Softmax is invariant to adding the same scalar to
all inputs:
softmax(z) = softmax(z + c)
• Using this property we can derive a numerically
stable variant of softmax
softmax(z) = softmax(z − max_i z_i)
• Reformulation allows us to evaluate softmax
– With only small numerical errors even when z
contains extremely large or extremely negative numbers
– It also shows that softmax is driven by the amount that its inputs deviate from
max_i z_i
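
The stable variant is a one-line change from the definition. The sketch below contrasts the two in plain Python, where `math.exp` raises `OverflowError` for large arguments; the function names are illustrative.

```python
import math

def softmax_naive(z):
    """Direct use of the definition; exp overflows for large logits."""
    exps = [math.exp(v) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def softmax_stable(z):
    """softmax(z - max_i z_i): the largest exponent is exp(0) = 1."""
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

# Shift invariance: adding the same constant to every input changes nothing.
base = softmax_stable([1.0, 2.0, 3.0])
shifted = softmax_stable([101.0, 102.0, 103.0])

# Large logits: the naive version fails, the stable version does not.
try:
    softmax_naive([1000.0, 1000.0])
    naive_failed = False
except OverflowError:
    naive_failed = True
stable_result = softmax_stable([1000.0, 1000.0])
```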

Saturation of Softmax
• An output softmax(z)_i saturates to 1 when the
corresponding input is maximal (z_i = max_i z_i) and
z_i is much greater than all the other inputs
• The output can also saturate to 0 when z_i is not
maximal and the maximum is much greater
• This is a generalization of the way the sigmoid
units saturate
– They can cause similar difficulties in learning if the
loss function is not designed to compensate for it


Other Output Types


• Linear, Sigmoid and Softmax output units are
the most common
• Neural networks can generalize to any kind of
output layer
• Principle of maximum likelihood provides a
guide for how to design a good cost function for
any output layer
– If we define conditional distribution p(y | x), principle
of maximum likelihood suggests we use
−log p(y | x) for our cost function

Determining Distribution Parameters


• We can think of the neural network as
representing a function f (x ; θ)
• Outputs are not direct predictions of value of y
• Instead f (x ; θ)=ω provides the parameters for a
distribution over y
• Our loss function can then be interpreted as
-log p(y ; ω(x))


Ex: Learning a Distribution Parameter


• We wish to learn the variance of a conditional
Gaussian of y given x
• Simple case: variance σ² is constant
– Has closed-form expression: empirical mean of
squared difference between observations y and
their expected value
– Computationally more expensive approach
• Does not require writing special-case code
• Include variance as one of the properties of distribution
p(y | x) that is controlled by ω = f(x; θ)
• Negative log-likelihood −log p(y; ω(x)) will then provide
cost function with appropriate terms to learn variance
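
Both routes can be compared on a toy dataset. In this sketch the variance is represented by its logarithm so that it stays positive during gradient descent; that parameterization, the learning rate, the iteration count, and the data are assumptions made for illustration.

```python
import math

def gaussian_nll(y, mean, log_var):
    """-log N(y; mean, exp(log_var)) for scalar y."""
    var = math.exp(log_var)
    return 0.5 * (math.log(2.0 * math.pi) + log_var + (y - mean) ** 2 / var)

ys = [1.0, 2.0, 4.0, 5.0]
mean = sum(ys) / len(ys)

# Route 1: closed form, the empirical mean of squared deviations.
closed_form_var = sum((y - mean) ** 2 for y in ys) / len(ys)

# Route 2: treat log-variance as a learned parameter and follow the gradient
# of the average negative log-likelihood.
log_var = 0.0
learning_rate = 0.5
for _ in range(500):
    mean_sq = sum((y - mean) ** 2 for y in ys) / len(ys)
    grad = 0.5 * (1.0 - mean_sq / math.exp(log_var))  # d(mean NLL)/d(log_var)
    log_var -= learning_rate * grad
learned_var = math.exp(log_var)
```

The gradient route needs no variance-specific derivation beyond the negative log-likelihood itself, which is the point of the more expensive approach; in a network, `log_var` would be one of the outputs ω = f(x; θ).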
