Deep Feedforward Networks
Source: https://towardsdatascience.com/probability-and-statistics-explained-in-the-context-of-deep-learning-ed1509b2eb3f
https://www.easy-tensorflow.com/tf-tutorials/neural-networks/two-layer-neural-network
Flow of Information
• Models are called feedforward because, for y = f(x):
– To evaluate f(x), information flows one way from x, through the computations defining f, to the output y
• There are no feedback connections
– No outputs of model are fed back into itself
Definition of Depth
• Overall length of the chain is the depth of the
model
– Ex: the composite function $f(x) = f^{(3)}(f^{(2)}(f^{(1)}(x)))$ has a depth of 3
• The name deep learning arises from this
terminology
• The final layer of a feedforward network, e.g. $f^{(3)}$, is called the output layer
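To make the chain structure concrete, here is a minimal Python/NumPy sketch (the three layer functions are made up purely for illustration):

```python
import numpy as np

# Three illustrative layer functions; composing them gives a depth-3 model.
def f1(x):
    return np.tanh(x)              # first layer

def f2(h):
    return np.maximum(0.0, h)      # second layer (ReLU)

def f3(h):
    return h.sum()                 # third (output) layer

def f(x):
    # f(x) = f3(f2(f1(x))): a chain of length 3, so the model has depth 3
    return f3(f2(f1(x)))

print(f(np.array([0.5, -1.0, 2.0])))
```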
$y_k(\mathbf{x},\mathbf{w}) = \sigma\!\left(\sum_{j=1}^{M} w_{kj}^{(2)}\, h\!\left(\sum_{i=1}^{D} w_{ji}^{(1)} x_i + w_{j0}^{(1)}\right) + w_{k0}^{(2)}\right)$
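A minimal NumPy sketch of this two-layer computation (the tanh hidden activation h and the weight shapes are illustrative assumptions):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def two_layer_net(x, W1, b1, W2, b2, h=np.tanh):
    # Hidden units: a_j = sum_i w_ji^(1) x_i + w_j0^(1), then z_j = h(a_j)
    z = h(W1 @ x + b1)               # W1: (M, D), b1: (M,)
    # Outputs: y_k = sigma(sum_j w_kj^(2) z_j + w_k0^(2))
    return sigmoid(W2 @ z + b2)      # W2: (K, M), b2: (K,)

rng = np.random.default_rng(0)
D, M, K = 3, 4, 2
y = two_layer_net(rng.normal(size=D),
                  rng.normal(size=(M, D)), np.zeros(M),
                  rng.normal(size=(K, M)), np.zeros(K))
print(y)   # K outputs, each in (0, 1)
```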
Width of Model
• Each hidden layer is typically vector-valued
• Dimensionality of hidden layer vector is width of
the model
Units of a model
• Each element of the vector is viewed as a neuron
– Instead of thinking of the layer as a vector-to-vector function, its elements are regarded as units acting in parallel
• Each unit receives inputs from many other units
and computes its own activation value
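A short sketch of the two views (a made-up layer with an assumed tanh activation): the layer-as-vector-function computation and the unit-by-unit computation give the same result:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=3)          # inputs received from other units
W = rng.normal(size=(4, 3))     # one weight row per unit in this layer
b = np.zeros(4)

# Vector-to-vector view: the whole layer at once
layer = np.tanh(W @ x + b)

# Units-in-parallel view: each unit computes its own activation value
units = np.array([np.tanh(W[j] @ x + b[j]) for j in range(4)])

print(np.allclose(layer, units))   # True: the two views agree
```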
$y_k(\mathbf{x};\theta,\mathbf{w}) = \sum_{j=1}^{M} w_{kj}\,\phi_j\!\left(\sum_{i=1}^{D} \theta_{ji} x_i + \theta_{j0}\right) + w_{k0}$
$y_k = f_k(\mathbf{x};\theta,\mathbf{w}) = \phi(\mathbf{x};\theta)^T \mathbf{w}$
Can be viewed as a generalization of linear models:
• Nonlinear function f_k with M+1 parameters w_k = (w_k0, .., w_kM), and
• M basis functions ϕ_j, j = 1,..,M, each with D parameters θ_j = (θ_j1, .., θ_jD)
• Both w_k and θ_j are learnt from data
Approaches to Learning ϕ
• Parameterize the basis functions as ϕ(x;θ)
– Use optimization to find θ that corresponds to a
good representation
• Approach can capture benefit of first approach
(fixed basis functions) by being highly generic
– By using a broad family for ϕ(x;θ)
• Can also capture benefits of second approach
– Human practitioners design families of ϕ(x;θ) that
will perform well
– Need only find the right function family rather than the precise right function
Importance of Learning ϕ
• Learning ϕ is discussed beyond this first
introduction to feed-forward networks
– It is a recurring theme throughout deep learning
applicable to all kinds of models
• Feedforward networks are an application of this principle to learning deterministic mappings from x to y without feedback
• Applicable to
– learning stochastic mappings
– functions with feedback
– learning probability distributions over a single vector
• MSE loss over the training set:
$J(\mathbf{w},b) = \frac{1}{4}\sum_{x\in\mathbb{X}}\left(f^{*}(x) - f(x;\mathbf{w},b)\right)^{2}$
– Usually not used for binary data
– But math is simple
• Alternative is cross-entropy:
$J(\theta) = -\ln p(\mathbf{t}\,|\,\theta) = -\sum_{n=1}^{N}\left\{t_{n}\ln y_{n} + (1-t_{n})\ln(1-y_{n})\right\},\quad y_{n} = \sigma(\theta^{T}\mathbf{x}_{n})$
• Minimize $J(\mathbf{w},b) = \frac{1}{4}\sum_{n=1}^{4}\left(t_{n} - \mathbf{x}_{n}^{T}\mathbf{w} - b\right)^{2}$
• Differentiate wrt w and b to obtain w = 0 and b = ½
– Then the linear model f(x;w,b) = ½ simply outputs 0.5 everywhere
– Why does this happen?
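A small NumPy check (a least-squares solve stands in for minimizing the MSE by calculus; the XOR inputs and targets are the four standard points) reproduces this solution:

```python
import numpy as np

# XOR inputs and targets
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
t = np.array([0, 1, 1, 0], dtype=float)

# Fit the linear model f(x; w, b) = x^T w + b by least squares (MSE)
A = np.hstack([X, np.ones((4, 1))])            # extra column of ones for the bias b
params, *_ = np.linalg.lstsq(A, t, rcond=None)
w, b = params[:2], params[2]

print(w, b)        # w is approximately [0, 0] and b is approximately 0.5
print(A @ params)  # the fitted model outputs 0.5 for all four points
```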
Activation Function
• In linear regression we used a vector of weights w and a scalar bias b:
$f(\mathbf{x};\mathbf{w},b) = \mathbf{x}^{T}\mathbf{w} + b$
• Finish by multiplying by w:
$f(\mathbf{x}) = \begin{bmatrix} 0 \\ 1 \\ 1 \\ 0 \end{bmatrix}$
• The network has obtained the correct answer for all 4 examples
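As a quick verification (the weights below are the well-known ReLU solution to XOR: W = [[1, 1], [1, 1]], c = [0, -1], w = [1, -2], b = 0), the network maps the four inputs to exactly these outputs:

```python
import numpy as np

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)  # the 4 XOR examples
W = np.array([[1, 1], [1, 1]], dtype=float)   # hidden-layer weights
c = np.array([0, -1], dtype=float)            # hidden-layer biases
w = np.array([1, -2], dtype=float)            # output weights
b = 0.0                                       # output bias

h = np.maximum(0.0, X @ W + c)   # hidden layer with ReLU activation
f = h @ w + b                    # finish by multiplying by w (and adding b)
print(f)                         # [0. 1. 1. 0.]: correct for all 4 examples
```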
Gradient-Based Learning
Topics
• Overview
1. Example: Learning XOR
2. Gradient-Based Learning
3. Hidden Units
4. Architecture Design
5. Backpropagation and Other Differentiation
6. Historical Notes
Convex vs Non-convex
• Convex methods:
– Converge from any initial parameters
– Robust, but can encounter numerical problems
• SGD with non-convex:
– Sensitive to initial parameters
– For feedforward networks, important to initialize
• Weights to small values, Biases to zero or small positives
– SGD can also train Linear Regression and SVM, especially with large training sets
– Training a neural net is not much different from training other models
• Except computing the gradient is more complex
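A minimal sketch of the initialization the slide recommends (the 0.01 weight scale and the layer sizes are illustrative choices, not prescribed values):

```python
import numpy as np

def init_layer(n_in, n_out, rng, weight_scale=0.01, bias_value=0.0):
    # Small random weights break the symmetry between units;
    # biases start at zero (or a small positive value, e.g. for ReLU units).
    W = weight_scale * rng.normal(size=(n_out, n_in))
    b = np.full(n_out, bias_value)
    return W, b

rng = np.random.default_rng(0)
W1, b1 = init_layer(784, 256, rng)                   # hidden layer
W2, b2 = init_layer(256, 10, rng, bias_value=0.1)    # output layer, small positive bias
```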
Cost Functions
Minimizing cross-entropy is the same as making the learned and data distributions the same
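One way to see this (a standard identity, stated here for completeness): the cross-entropy between the data and model distributions decomposes into the data entropy plus a KL divergence, and only the KL term depends on the model:

$$H(p_{\text{data}}, p_{\text{model}}) = -\,\mathbb{E}_{x\sim p_{\text{data}}}\big[\log p_{\text{model}}(x)\big] = H(p_{\text{data}}) + D_{\mathrm{KL}}\big(p_{\text{data}} \,\|\, p_{\text{model}}\big)$$

Since $H(p_{\text{data}})$ is fixed by the data, minimizing the cross-entropy minimizes the KL divergence, which is zero exactly when the two distributions are the same.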
$p(\mathbf{t}\,|\,\theta) = \prod_{n=1}^{N} y_{n}^{t_{n}}\,(1-y_{n})^{1-t_{n}}$
$J(\theta) = -\ln p(\mathbf{t}\,|\,\theta) = -\sum_{n=1}^{N}\left\{t_{n}\ln y_{n} + (1-t_{n})\ln(1-y_{n})\right\},\quad y_{n} = \sigma(\theta^{T}\mathbf{x}_{n})$
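A small NumPy sketch of this negative log-likelihood (the clipping of y_n and the example data are added illustrative details, not part of the formula):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def cross_entropy(theta, X, t, eps=1e-12):
    y = sigmoid(X @ theta)             # y_n = sigma(theta^T x_n)
    y = np.clip(y, eps, 1.0 - eps)     # avoid log(0)
    # J(theta) = -sum_n { t_n ln y_n + (1 - t_n) ln(1 - y_n) }
    return -np.sum(t * np.log(y) + (1.0 - t) * np.log(1.0 - y))

X = np.array([[1.0, 0.5], [1.0, -1.0], [1.0, 2.0]])   # rows are x_n
t = np.array([1.0, 0.0, 1.0])                          # binary targets
print(cross_entropy(np.array([0.1, -0.2]), X, t))
```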
Learning a function
• If we have a sufficiently powerful neural network, we can think of it as being able to determine any function f
– This function is limited only by
1. Boundedness and
2. Continuity
rather than by having a specific parametric form
– From this point of view, the cost function is a functional rather than a function
• Choose a function that is best for the task
What is a Functional?
• A functional is a mapping from functions to real numbers
• Examples of functionals: entropy of a Bernoulli distribution (as a function of P(x=1)), AUC, curve length
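As a concrete instance, the entropy of a Bernoulli distribution maps the whole distribution (determined by p = P(x=1)) to a single number:

$$H[P] = -\,p\log p \;-\; (1-p)\log(1-p), \qquad p = P(x{=}1)$$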
Minimizing the MSE cost yields $f^{*}(\mathbf{x}) = \mathbb{E}_{\mathbf{y}\sim p_{\text{data}}(\mathbf{y}|\mathbf{x})}[\mathbf{y}]$
Output Units
• Choice of cost function is tightly coupled with
choice of output unit
– Most of the time we use cross-entropy between
data distribution and model distribution
• Choice of how to represent the output then determines
the form of the cross-entropy function
• In logistic regression output is binary-valued
$P(y) = \frac{\exp(yz)}{\sum_{y'=0}^{1}\exp(y'z)} = \sigma\big((2y-1)z\big)$
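A quick numerical check of this identity (an illustrative sketch; z = 1.7 is an arbitrary value):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

z = 1.7                       # an arbitrary unnormalized log probability
for y in (0, 1):
    lhs = np.exp(y * z) / (np.exp(0 * z) + np.exp(1 * z))  # exp(yz) / sum_{y'} exp(y'z)
    rhs = sigmoid((2 * y - 1) * z)                          # sigma((2y - 1) z)
    print(y, lhs, rhs)        # the two expressions agree for y = 0 and y = 1
```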
Softplus function
• Sigmoid saturates when its argument is very
positive or very negative
– i.e., function is insensitive to small changes in input
• Compare it to the softplus function
ζ(x) = log(1+ exp(x))
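A short numerical comparison of the two functions (using np.logaddexp for a stable softplus is an implementation choice, not from the slide):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def softplus(a):
    # zeta(x) = log(1 + exp(x)), computed in a numerically stable way
    return np.logaddexp(0.0, a)

x = np.array([-10.0, -1.0, 0.0, 1.0, 10.0])
print(sigmoid(x))    # flattens near 0 and 1 for large |x| (saturation)
print(softplus(x))   # near 0 for very negative x, grows roughly linearly for positive x
```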
Softmax definition
• We need to produce a vector ŷ with values
ŷi = P(y = i | x)
• We need the elements of ŷ to lie in [0,1] and sum to 1
• Same approach as with Bernoulli works for
Multinoulli distribution
• First a linear layer predicts unnormalized log probabilities
$\mathbf{z} = W^{T}\mathbf{h} + \mathbf{b}$
– where zi = log P̂(y = i | x)
• Softmax can then exponentiate and normalize z
to obtain the desired ŷ
• Softmax is given by: $\mathrm{softmax}(\mathbf{z})_{i} = \frac{\exp(z_{i})}{\sum_{j}\exp(z_{j})}$
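A minimal implementation of this definition (subtracting max(z) before exponentiating is a standard numerical-stability step that leaves the result unchanged):

```python
import numpy as np

def softmax(z):
    # softmax(z)_i = exp(z_i) / sum_j exp(z_j)
    e = np.exp(z - np.max(z))   # shift by max(z) for numerical stability
    return e / e.sum()

z = np.array([2.0, 1.0, 0.1])   # unnormalized log probabilities
y_hat = softmax(z)
print(y_hat, y_hat.sum())       # entries lie in [0, 1] and sum to 1
```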
Softmax Regression
• The network computes, in matrix multiplication notation: $\mathbf{z} = W^{T}\mathbf{x} + \mathbf{b}$
Saturation of Softmax
• An output softmax(z)_i saturates to 1 when the corresponding input is maximal (z_i = max_j z_j) and z_i is much greater than all the other inputs
• The output can also saturate to 0 when z_i is not maximal and the maximum is much greater
• This is a generalization of the way the sigmoid
units saturate
– They can cause similar difficulties in learning if the
loss function is not designed to compensate for it
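A small numeric illustration of this saturation (the values of z are chosen arbitrarily):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

z = np.array([10.0, 0.0, -10.0])   # one input much larger than the rest
print(softmax(z))
# roughly [1.0, 4.5e-05, 2.1e-09]: the maximal entry saturates toward 1
# and the others toward 0, so gradients through these outputs become tiny
```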