Lecture 8: Deep Learning
DEEP LEARNING
Big Data Science (Master in Statistical Data Analysis)
TURING AWARD 2018
3
A MODEL THAT CAN LEARN ANYTHING?
̶ Most machine learning models:
̶ Make assumptions about the data
6
̶ "A model that can learn anything"
7
NEURAL NETWORKS
8
THE ARTIFICIAL NEURON
• Biological analogy
9
THE ARTIFICIAL NEURON
[Diagram: inputs $x_1, \dots, x_n$ are multiplied by weights $w_1, \dots, w_n$, summed together with a bias $b$, and passed through a non-linear activation function $g$ to produce the output:]

$$y = g\left(\sum_{i=1}^{n} w_i x_i + b\right)$$
11
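To make the formula concrete, here is a minimal sketch of a single artificial neuron (assuming NumPy is available; the sigmoid activation, inputs, weights and bias are illustrative choices, not values from the slides):

```python
import numpy as np

def sigmoid(z):
    """Non-linear activation function g."""
    return 1.0 / (1.0 + np.exp(-z))

def neuron(x, w, b):
    """Single artificial neuron: y = g(sum_i w_i * x_i + b)."""
    return sigmoid(np.dot(w, x) + b)

# Example with 3 inputs (values chosen arbitrarily)
x = np.array([0.5, -1.0, 2.0])
w = np.array([0.1, 0.4, -0.2])
b = 0.05
print(neuron(x, w, b))
```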
A SINGLE NEURON MODEL
Geometrical interpretation
̶ weights determine the slope
of the line (direction)
̶ bias determines the offset of
the line (shift)
12
ACTIVATION FUNCTION
̶ Rectified linear unit (ReLU)
̶ Exponential linear unit (ELU)
̶ Scaled exponential linear unit (SELU)
13
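As a hedged illustration of the three activations (the SELU constants below are the commonly published values; the exact parametrization used in the slide's figures is assumed, not given):

```python
import numpy as np

def relu(z):
    # max(0, z)
    return np.maximum(0.0, z)

def elu(z, alpha=1.0):
    # z for z > 0, alpha * (exp(z) - 1) otherwise
    return np.where(z > 0, z, alpha * (np.exp(z) - 1.0))

def selu(z, alpha=1.6732632423543772, scale=1.0507009873554805):
    # Scaled ELU; the constants keep activations approximately normalized
    return scale * np.where(z > 0, z, alpha * (np.exp(z) - 1.0))

z = np.linspace(-3, 3, 7)
print(relu(z), elu(z), selu(z), sep="\n")
```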
MANY NEURONS MAKE A LAYER (OR A PERCEPTRON)
A (single-layer) perceptron is a very simple type of feed-forward network
Each output neuron computes its own weighted sum of the same inputs $x_1, \dots, x_n$:

$$y_1 = g\left(\sum_{i=1}^{n} w_{1i} x_i + b_1\right), \qquad y_2 = g\left(\sum_{i=1}^{n} w_{2i} x_i + b_2\right)$$
14
FEED-FORWARD LAYER
̶ A fully connected feed-forward layer can also be
expressed as a matrix multiplication:
$$\mathbf{y} = g(\mathbf{W}\mathbf{x} + \mathbf{b})$$

[Diagram: the $d$-dimensional output $\mathbf{x}$ of layer $i$ is multiplied by the weight matrix $\mathbf{W}$, the bias $\mathbf{b}$ is added to give $\mathbf{z}$, and the activation $g$ produces the output $\mathbf{y}$ of layer $i+1$.]
15
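A minimal sketch of this matrix form (assuming NumPy; the shapes, the tanh activation and the random values are illustrative):

```python
import numpy as np

def dense_layer(x, W, b, g=np.tanh):
    """Fully connected feed-forward layer: y = g(W x + b)."""
    return g(W @ x + b)

rng = np.random.default_rng(0)
d_in, d_out = 4, 3
x = rng.normal(size=d_in)           # output of layer i
W = rng.normal(size=(d_out, d_in))  # weight matrix
b = rng.normal(size=d_out)          # bias vector
y = dense_layer(x, W, b)            # input to layer i+1
print(y.shape)                      # (3,)
```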
PERCEPTRON LEARNING (TRAINING)
̶ Adapting the weights and bias of the perceptron
̶ The goal is to minimize the errors on the training samples
̶ Weights and biases are updated in an iterative procedure:
$$w_i(t+1) = w_i(t) + \Delta w_i(t), \qquad b(t+1) = b(t) + \Delta b(t)$$
16
PERCEPTRON
̶ How to compute $\Delta w_i(t)$ and $\Delta b(t)$?
̶ Three learning rules (the delta rule is sketched in code after this slide):
̶ Hebbian rule
‒ $\Delta w_i(t) = \gamma\, y\, x_i$
‒ If two units are activated simultaneously, their connection should be strengthened (always update)
̶ Perceptron rule
‒ $\Delta w_i(t) = d(\mathbf{x})\, x_i$
‒ Weights are only updated if $y \neq d(\mathbf{x})$
̶ Delta rule (Widrow-Hoff rule)
‒ $\Delta w_i(t) = \gamma\, (d(\mathbf{x}) - y)\, x_i$
‒ Uses the difference between the actual and the desired activation to adapt the connection strength
17
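A minimal sketch of the delta rule on a toy problem (the logical-AND data, the step activation, the learning rate and the number of epochs are all arbitrary choices for illustration; with a step activation this coincides with the perceptron rule):

```python
import numpy as np

def step(z):
    return np.where(z >= 0, 1.0, 0.0)

# Toy linearly separable problem: logical AND
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
d = np.array([0, 0, 0, 1], dtype=float)   # desired outputs d(x)

rng = np.random.default_rng(0)
w = rng.normal(size=2)
b = 0.0
gamma = 0.1                                # learning rate

for epoch in range(50):
    for x, target in zip(X, d):
        y = step(np.dot(w, x) + b)
        # Delta rule: dw_i = gamma * (d(x) - y) * x_i
        w += gamma * (target - y) * x
        b += gamma * (target - y)

print(step(X @ w + b))  # should converge to [0. 0. 0. 1.]
```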
PERCEPTRON
New York Times (1958): "The Navy revealed the embryo of an electronic computer today that it expects will be able to walk, talk, see, write, reproduce itself and be conscious of its existence."
18
MANY LAYERS MAKE A MULTI-LAYER PERCEPTRON
[Diagram: inputs $x_1, \dots, x_n$ feed into hidden-layer units $h_1, h_2, \dots$, whose outputs feed into output units $y_1, y_2, y_3$.]
AN MLP WITH SEVERAL HIDDEN LAYERS IS A "DEEP NEURAL NETWORK"
[Diagram: the same network, now with several hidden layers between the inputs $x_1, \dots, x_n$ and the outputs $y_1, y_2, y_3$.]
MULTI-LAYER PERCEPTRON (MLP)
21
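A minimal sketch of a forward pass through an MLP with one hidden layer (assuming NumPy; the sizes, tanh activations and random weights are illustrative; stacking more hidden layers in the same way gives a deep network):

```python
import numpy as np

def mlp_forward(x, W1, b1, W2, b2):
    """Multi-layer perceptron: input -> hidden layer -> output layer."""
    h = np.tanh(W1 @ x + b1)   # hidden layer activations
    y = np.tanh(W2 @ h + b2)   # output layer activations
    return y

rng = np.random.default_rng(1)
n_in, n_hidden, n_out = 5, 4, 3
W1, b1 = rng.normal(size=(n_hidden, n_in)), np.zeros(n_hidden)
W2, b2 = rng.normal(size=(n_out, n_hidden)), np.zeros(n_out)
print(mlp_forward(rng.normal(size=n_in), W1, b1, W2, b2))
```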
UNIVERSAL APPROXIMATION THEOREM
̶ A feed-forward network with a single hidden layer
containing a finite number of neurons can approximate
continuous functions on compact subsets of $\mathbb{R}^n$
Cybenko, 1989
Hornik, 1991
22
SO WHAT NOW?
̶ Deep learning was invented in the 1960's
̶ It was stated in the 1980's that an MLP with a single hidden layer is enough (?)
̶ The theorem still restricts the class of functions that can be learned (continuous functions on compact subsets)
̶ We don't know:
‒ Size of the hidden layer needed
‒ Learning algorithm needed
25
LOSS FUNCTIONS
̶ Loss function for classification (𝑀 classes):
̶ Typically, the outputs are 1-hot encoded:
‒ $M$ output neurons $\hat{y}_1, \dots, \hat{y}_M$
‒ $\sum_{j=1}^{M} \hat{y}_j = 1$
‒ Desired output for an instance of class $c$:
‒ Neuron $c$ should be 1: $y_c = 1$
‒ All other neurons should be 0: $y_j = 0 \;\; \forall j \neq c$
26
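A small sketch of one-hot targets and of outputs that satisfy $\sum_j \hat{y}_j = 1$ (the softmax used here is one common way to obtain such outputs and is an assumption, as are the class index and the logits):

```python
import numpy as np

def one_hot(c, M):
    """Desired output for class c: y_c = 1, all other entries 0."""
    y = np.zeros(M)
    y[c] = 1.0
    return y

def softmax(z):
    """Output activations that are positive and sum to 1."""
    e = np.exp(z - z.max())
    return e / e.sum()

M = 4                                  # number of classes
z = np.array([2.0, 0.5, -1.0, 0.0])   # arbitrary output-layer pre-activations
y_hat = softmax(z)
y = one_hot(2, M)                      # instance of class c = 2
print(y_hat.sum())                     # 1.0
print(y)                               # [0. 0. 1. 0.]
```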
FORMULATION AS AN OPTIMIZATION PROBLEM
̶ Minimize 𝐽(𝜃) with respect to the parameters of the model 𝜃
Training set: $\{(\mathbf{x}_1, \mathbf{y}_1), \dots, (\mathbf{x}_N, \mathbf{y}_N)\}$
27
OPTIMIZING THE LOSS
Model → Loss function → Optimization
28
OPTIMIZING THE LOSS
Loss function: $J(\theta)$, with parameters $\theta = \{\mathbf{W}^1, \dots, \mathbf{W}^L\}$
Loss for a single input $\mathbf{x}$: $J(\mathbf{x}; \theta)$
First layer: $\mathbf{h}^0 = \mathbf{x}$
Hidden layers: $\mathbf{h}^l = g(\mathbf{W}^l \mathbf{h}^{l-1})$
Last layer: $\mathbf{h}^L = g(\mathbf{W}^L \mathbf{h}^{L-1}) = g(\mathbf{W}^L\, g(\mathbf{W}^{L-1} g(\dots))) = \hat{\mathbf{y}}$

For an instance $\mathbf{x}$, each weight is moved against the gradient of the loss:
$$\Delta w_{jk}^{l} = -\gamma \, \frac{\partial J(\mathbf{x}; \theta)}{\partial w_{jk}^{l}}$$
29
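As a hedged illustration of the update rule $\Delta w = -\gamma\, \partial J / \partial w$ (before backpropagation is introduced), the gradient of a tiny one-layer model is approximated below by finite differences; the model, data, learning rate and step size are all made up for the example:

```python
import numpy as np

def loss(W, x, y):
    """Squared error of a one-layer model y_hat = g(W x) for a single instance."""
    y_hat = np.tanh(W @ x)
    return 0.5 * np.sum((y - y_hat) ** 2)

rng = np.random.default_rng(2)
W = rng.normal(size=(2, 3))
x, y = rng.normal(size=3), np.array([0.0, 1.0])
gamma, eps = 0.1, 1e-6

# Finite-difference approximation of dJ/dw_jk for every weight
grad = np.zeros_like(W)
for j in range(W.shape[0]):
    for k in range(W.shape[1]):
        Wp = W.copy()
        Wp[j, k] += eps
        grad[j, k] = (loss(Wp, x, y) - loss(W, x, y)) / eps

W_new = W - gamma * grad                  # delta w_jk = -gamma * dJ/dw_jk
print(loss(W, x, y), loss(W_new, x, y))   # the loss should decrease
```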
BACKPROPAGATION
̶ In a multi-layered network:
̶ The gradient of the cost is easily computed for the
last layer
̶ The gradient in the previous layers is computed using the
chain rule of calculus
30
Introduced in 1986 by Rumelhart et al.
BACKPROPAGATION
Mean-squared error loss:
$$J(\theta) = \frac{1}{2N} \sum_{i=1}^{N} \|\mathbf{y}_i - \hat{\mathbf{y}}_i\|^2$$

Cross-entropy loss:
$$J(\theta) = -\frac{1}{N} \sum_{i=1}^{N} \sum_{j=1}^{M} y_{ij} \log \hat{y}_{ij}$$

Gradient of the cost: $\nabla_\theta J(\theta)$

[Diagram: a three-layer network with weights $\mathbf{W}^1, \mathbf{W}^2, \mathbf{W}^3$ mapping the input $\mathbf{x}$ to the output $\hat{\mathbf{y}}$.]
31
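A minimal sketch of the two losses on this slide (assuming NumPy; the array shapes and example values are illustrative, with rows indexing instances and columns indexing outputs):

```python
import numpy as np

def mse_loss(Y, Y_hat):
    """J(theta) = 1/(2N) * sum_i ||y_i - y_hat_i||^2"""
    N = Y.shape[0]
    return np.sum((Y - Y_hat) ** 2) / (2 * N)

def cross_entropy_loss(Y, Y_hat, eps=1e-12):
    """J(theta) = -1/N * sum_i sum_j y_ij * log(y_hat_ij)"""
    N = Y.shape[0]
    return -np.sum(Y * np.log(Y_hat + eps)) / N

Y = np.array([[1.0, 0.0], [0.0, 1.0]])        # one-hot targets
Y_hat = np.array([[0.8, 0.2], [0.3, 0.7]])    # predicted probabilities
print(mse_loss(Y, Y_hat), cross_entropy_loss(Y, Y_hat))
```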
START WITH THE OUTPUT NODES
̶ Take output neuron 𝑘, for an input pair (𝐱,𝐲):
$$\hat{y}_k = h_k^L = g(z_k^L)$$
$$z_k^L = \mathbf{w}_k^L \cdot \mathbf{h}^{L-1} = \sum_{j=1}^{n_{L-1}} w_{kj}^L\, h_j^{L-1}$$

Mean-squared error: $J(\theta) = \dfrac{1}{2N} \sum_{i=1}^{N} \|\mathbf{y}_i - \hat{\mathbf{y}}_i\|^2$
32
INTERPRETATION OF THE GRADIENT
$$\Delta w_{kj}^{L} = -\gamma\, (\hat{y}_k - y_k)\, g'(z_k^L)\, h_j^{L-1}$$
33
CONTINUE WITH THE HIDDEN NODES
̶ Take hidden node $h_j^{L-1}$, for an input pair $(\mathbf{x}, \mathbf{y})$:
$$h_j^{L-1} = g(z_j^{L-1}), \qquad z_j^{L-1} = \sum_{i=1}^{n_{L-2}} w_{ji}^{L-1}\, h_i^{L-2}$$

$$\Delta w_{ji}^{L-1} = -\gamma\, \frac{\partial J(\mathbf{x}; \theta)}{\partial w_{ji}^{L-1}}
= -\gamma\, \frac{\partial J}{\partial h_j^{L-1}}\, \frac{\partial h_j^{L-1}}{\partial z_j^{L-1}}\, \frac{\partial z_j^{L-1}}{\partial w_{ji}^{L-1}}
= -\gamma\, \frac{\partial J}{\partial h_j^{L-1}}\, g'(z_j^{L-1})\, h_i^{L-2}$$

The first factor, $\partial J / \partial h_j^{L-1}$, is trickier!
34
CONTINUE WITH THE HIDDEN NODES
̶ We don't know directly the contribution of $h_j^{L-1}$ to $J(\mathbf{x}; \theta)$
̶ But we can write the error as a function of the weighted sums of the inputs from the hidden layer:
$$J(\mathbf{x}; \theta) = J(z_1^L, \dots, z_{n_L}^L)$$
̶ We can now calculate the derivative:
$$\frac{\partial J}{\partial h_j^{L-1}}
= \sum_{l=1}^{n_L} \frac{\partial J}{\partial z_l^L}\, \frac{\partial z_l^L}{\partial h_j^{L-1}}
= \sum_{l=1}^{n_L} \frac{\partial J}{\partial z_l^L}\, \frac{\partial \left( \sum_{t=1}^{n_{L-1}} w_{lt}^L h_t^{L-1} \right)}{\partial h_j^{L-1}}
= \sum_{l=1}^{n_L} \frac{\partial J}{\partial z_l^L}\, w_{lj}^L$$

We already computed $\partial J / \partial z_l^L$ for the gradient of the output layer:
$$\frac{\partial J}{\partial z_l^L} = (\hat{y}_l - y_l)\, g'(z_l^L)
\quad \Rightarrow \quad
\frac{\partial J}{\partial h_j^{L-1}} = \sum_{l=1}^{n_L} (\hat{y}_l - y_l)\, g'(z_l^L)\, w_{lj}^L$$

[Diagram: the hidden node $h_j^{L-1}$ is connected to every output node $\hat{y}_1, \dots, \hat{y}_{n_L}$ through the weights $w_{1j}^L, \dots, w_{n_L j}^L$.]
35
CONTINUE WITH THE HIDDEN NODES
̶ Take hidden node $h_j^{L-1}$, for an input pair $(\mathbf{x}, \mathbf{y})$:
$$\Delta w_{ji}^{L-1} = -\gamma\, \frac{\partial J(\mathbf{x}; \theta)}{\partial w_{ji}^{L-1}}
= -\gamma\, \frac{\partial J}{\partial h_j^{L-1}}\, g'(z_j^{L-1})\, h_i^{L-2}$$

Substituting $\dfrac{\partial J}{\partial h_j^{L-1}} = \sum_{l=1}^{n_L} (\hat{y}_l - y_l)\, g'(z_l^L)\, w_{lj}^L$ gives

$$\Delta w_{ji}^{L-1} = -\gamma\, g'(z_j^{L-1})\, h_i^{L-2} \sum_{l=1}^{n_L} (\hat{y}_l - y_l)\, g'(z_l^L)\, w_{lj}^L$$
36
WEIGHT UPDATE FOR THE HIDDEN NODES
$$\Delta w_{ji}^{L-1} = -\gamma\, g'(z_j^{L-1})\, h_i^{L-2} \sum_{l=1}^{n_L} (\hat{y}_l - y_l)\, g'(z_l^L)\, w_{lj}^L$$

̶ $\gamma$: step size of the gradient descent
̶ $h_i^{L-2}$: output of the previous layer
̶ $w_{lj}^L$: influence on node $l$ of the next layer
̶ $\sum_{l=1}^{n_L} (\hat{y}_l - y_l)\, g'(z_l^L)\, w_{lj}^L$: contribution to the error on all nodes of the next layer, weighted by influence
37
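Putting the derived updates together, here is a minimal sketch of one gradient step for a network with a single hidden layer and MSE loss (biases are omitted and all data and sizes are made up, to stay close to the formulas above):

```python
import numpy as np

g = np.tanh
g_prime = lambda z: 1.0 - np.tanh(z) ** 2

rng = np.random.default_rng(3)
n_in, n_hidden, n_out = 4, 3, 2
W1 = rng.normal(size=(n_hidden, n_in))    # weights of layer L-1
W2 = rng.normal(size=(n_out, n_hidden))   # weights of layer L
x, y = rng.normal(size=n_in), np.array([1.0, -1.0])
gamma = 0.05

# Forward pass
z1 = W1 @ x
h1 = g(z1)                                # hidden layer: h^{L-1}
z2 = W2 @ h1
y_hat = g(z2)                             # output layer: h^L = y_hat
loss_before = 0.5 * np.sum((y - y_hat) ** 2)

# Backward pass
# Output layer: dJ/dz_k^L = (y_hat_k - y_k) * g'(z_k^L)
delta_out = (y_hat - y) * g_prime(z2)
# Hidden layer: dJ/dz_j^{L-1} = g'(z_j^{L-1}) * sum_l dJ/dz_l^L * w_lj^L
delta_hidden = g_prime(z1) * (W2.T @ delta_out)

# Weight updates: delta w = -gamma * delta * (output of the previous layer)
W2 -= gamma * np.outer(delta_out, h1)     # h^{L-1} is the hidden output
W1 -= gamma * np.outer(delta_hidden, x)   # h^{L-2} is the input x here

y_hat_after = g(W2 @ g(W1 @ x))
print(loss_before, 0.5 * np.sum((y - y_hat_after) ** 2))  # loss should decrease
```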
VANISHING AND EXPLODING GRADIENT
̶ The gradient of the loss in one layer is computed as
the product of the gradients in all subsequent layers
̶ Vanishing gradient: if one of the elements is close
to zero, the product is close to zero
̶ Exploding gradient: if one of the elements is very
large, the product becomes very large
39
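A small numeric illustration of this effect (the depth, pre-activation and weight magnitudes are arbitrary; the backpropagated gradient is treated, roughly, as a product of one $g'(z)\,w$ factor per layer):

```python
import numpy as np

def sigmoid_prime(z):
    s = 1.0 / (1.0 + np.exp(-z))
    return s * (1.0 - s)

# Sigmoid derivatives are at most 0.25, so with moderate weights the product
# shrinks exponentially with depth (vanishing gradient); with large weights
# it grows exponentially (exploding gradient).
depth = 20
z = 0.5                        # arbitrary pre-activation
small_w, large_w = 1.0, 5.0    # arbitrary weight magnitudes
print((sigmoid_prime(z) * small_w) ** depth)   # ~3e-13: vanishes
print((sigmoid_prime(z) * large_w) ** depth)   # ~25: grows
```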
EXAMPLE: MNIST
Handwritten digit classification:
̶ 28x28 images (784 pixels)
̶ 10 classes
40
EXAMPLE: MNIST
Imagine a network with:
̶ 784 inputs
̶ A single hidden layer with 1000 units
̶ 10 output units

$$\mathbf{W}^1 \in \mathbb{R}^{1000 \times 784}, \qquad \mathbf{W}^2 \in \mathbb{R}^{10 \times 1000}$$
41
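A quick count of how many parameters this small fully connected network already has (whether a bias per unit is included is an assumption; the slide only lists the two weight matrices):

```python
n_in, n_hidden, n_out = 784, 1000, 10

weights = n_in * n_hidden + n_hidden * n_out   # W1 and W2
biases = n_hidden + n_out                      # if one bias per unit is included
print(weights)            # 794000
print(weights + biases)   # 795010
```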
HOW TO GO DEEPER
42
DEEP NEURAL NETWORKS
̶ By adding more layers:
̶ The number of parameters can be kept smaller than in a wide shallow network, provided the width of the layers is not too large
̶ Complex non-linearities can be modeled
̶ Deeper layers in the model represent more abstract
features based on the input data
43
CHALLENGES
̶ Neural networks can have huge numbers of
parameters
̶ This requires enormous training datasets
̶ The availability of such data is limited
̶ The computational effort to train such a network is
considerable
46
STOCHASTIC GRADIENT DESCENT
̶ Instead of computing the total gradient for the entire training
set, compute it for a “mini-batch” containing a few instances
̶ Update the weights
$$\boldsymbol{\theta} \leftarrow \boldsymbol{\theta} - \epsilon \mathbf{g}$$
̶ Repeat many, many times
̶ Iteration: gradient descent update for a single mini-batch
̶ Epoch: one run through the entire training set
̶ Data augmentation
̶ Sometimes it is easy to generate new training data
by adding realistic variations to available instances
‒ Images can be translated, rotated, flipped
‒ Adding noise
̶ Noisy training data reduces the chance of overfitting
55
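A minimal sketch of the mini-batch loop (the model is a simple linear least-squares problem so the gradient can be written in closed form; the data, batch size, learning rate, number of epochs and the noise-based augmentation are all arbitrary choices for illustration):

```python
import numpy as np

rng = np.random.default_rng(4)
N, d = 1000, 5
X = rng.normal(size=(N, d))
y = X @ rng.normal(size=d) + 0.1 * rng.normal(size=N)   # synthetic regression data

theta = np.zeros(d)
epsilon, batch_size, n_epochs = 0.01, 32, 5

for epoch in range(n_epochs):                 # one epoch = one run through the data
    order = rng.permutation(N)
    for start in range(0, N, batch_size):     # one iteration = one mini-batch update
        idx = order[start:start + batch_size]
        Xb, yb = X[idx], y[idx]
        Xb = Xb + 0.01 * rng.normal(size=Xb.shape)     # data augmentation: add noise
        grad = 2.0 * Xb.T @ (Xb @ theta - yb) / len(idx)  # gradient on the mini-batch
        theta = theta - epsilon * grad        # theta <- theta - epsilon * g
    print(epoch, np.mean((X @ theta - y) ** 2))
```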
CONVOLUTIONAL
NEURAL NETWORKS
56
LEARNING FROM IMAGES
̶ A very simplistic way of working with images is to serialize (flatten) each picture into a vector
57
CONVOLUTION
̶ A convolution is an operation on two functions
̶ At each point, the "input" function f is weighted across its entire domain by a "kernel" function g
58
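A tiny illustration of the operation using a 1-D discrete convolution (the sampled signal and the smoothing kernel are made up; np.convolve flips the kernel, matching the mathematical definition):

```python
import numpy as np

f = np.array([0.0, 1.0, 2.0, 3.0, 2.0, 1.0, 0.0])   # "input" function, sampled
g = np.array([0.25, 0.5, 0.25])                     # "kernel" function (smoothing)

# At each point, the input is weighted by the (flipped) kernel and summed
print(np.convolve(f, g, mode="valid"))
```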
CONVOLUTION = LOCAL RECEPTIVE FIELD
59
EXAMPLES OF CONVOLUTIONS
̶ A convolutional layer has few trainable weights, no matter the size of its
input
̶ This allows us to stack many convolutional layers!
61
CONVOLUTION
https://cs231n.github.io/convolutional-networks/
62
SHARED WEIGHTS AND BIASES
̶ Each hidden neuron has a bias and 5×5 weights connected to its local
receptive field
̶ We are going to use the same weights and bias for each of the 24×24
hidden neurons
̶ The convolution defines a filter or kernel, that is applied to different local
regions in the input
$$y_{j,k} = g\left(b + \sum_{l=0}^{4} \sum_{m=0}^{4} w_{l,m}\, a_{j+l,\, k+m}\right)$$

Krizhevsky, A., Sutskever, I., and Hinton, G. E. ImageNet classification with deep convolutional neural networks. In NIPS, pp. 1106–1114, 2012
65
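A minimal sketch of the shared-weight computation: one 5×5 kernel and one bias slide over a 28×28 input and produce a 24×24 feature map (assuming NumPy; the tanh activation and the random values are illustrative):

```python
import numpy as np

rng = np.random.default_rng(5)
image = rng.normal(size=(28, 28))   # e.g. an MNIST-sized input
w = rng.normal(size=(5, 5))         # the SAME 5x5 weights for every hidden neuron
b = 0.1                             # the SAME bias for every hidden neuron

feature_map = np.zeros((24, 24))
for j in range(24):
    for k in range(24):
        # y_{j,k} = g(b + sum_{l,m} w_{l,m} * a_{j+l, k+m})
        feature_map[j, k] = np.tanh(b + np.sum(w * image[j:j+5, k:k+5]))

print(feature_map.shape)            # (24, 24), yet only 5*5 + 1 = 26 trainable parameters
```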
HYPERPARAMETERS
66
LIST OF HYPERPARAMETERS
̶ Architecture:
̶ Number (and type) of layers
̶ Size of the fully connected layers
̶ Size of the convolution (and pooling) kernels
̶ Activation functions
̶ Non-standard connections (e.g. skip connections)
̶ …
̶ Parameters for stochastic gradient descent:
̶ Learning rate
̶ Learning rate updates
̶ Mini-batch size
̶ Number of iterations
̶ Momentum
̶ …
̶ Regularization parameters:
̶ Weight decay
̶ Dropout rate
̶ …
̶ …
67
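Purely as an illustration (every value below is an arbitrary placeholder, not a recommendation from the slides), such choices are often collected into a single configuration object so that experiments can be tracked and compared:

```python
# Hypothetical hyperparameter configuration for one experiment
config = {
    "architecture": {
        "conv_layers": [(32, 3), (64, 3)],   # (number of kernels, kernel size)
        "dense_layers": [128, 10],           # sizes of the fully connected layers
        "activation": "relu",
    },
    "sgd": {
        "learning_rate": 0.01,
        "lr_schedule": "step_decay",
        "batch_size": 64,
        "iterations": 10_000,
        "momentum": 0.9,
    },
    "regularization": {
        "weight_decay": 1e-4,
        "dropout_rate": 0.5,
    },
}
```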
HOW TO CHOOSE HYPERPARAMETER VALUES
̶ Deep neural networks are extremely prone to over-fitting
̶ They may also take a lot of iterations before they reach a good area of
the parameter space and start converging
̶ A few general guidelines:
̶ Larger networks have more capacity to learn, but require more:
‒ Memory
‒ Time
‒ Training data
̶ Minibatches should be as large as possible, for more robust
convergence
68
DEEP NEURAL NETS
AND ABSTRACTION
69
TRADITIONAL ML APPROACH
Input → hand-crafted feature extractor → feature-based representation → trainable classifier (supervised model)
70
EXAMPLE: SPEECH RECOGNITION
MFCC (fixed) → Gaussian Mixture Model (GMM, unsupervised) → Classifier (MLP, supervised)
71
EXAMPLE: OBJECT RECOGNITION
SIFT / HoG (fixed, low-level features) → K-Means / sparse coding + pooling (unsupervised, mid-level features) → Classifier (MLP, supervised)
72
REPRESENTATION LEARNING
Input → trainable feature extractor → trainable classifier (supervised model)
73
INSPIRATION FROM THE VISUAL CORTEX
74
LEARNING HIERARCHICAL REPRESENTATIONS
̶ It's deep if it has more than one stage of non-linear feature
transformation
75
TRAINABLE FEATURE HIERARCHIES
̶ Hierarchy of representations with increasing level of abstraction
̶ Each stage is a kind of trainable feature transform
̶ Image recognition
‒ Pixel → edge → texton → motif → part → object
̶ Text
‒ Character → word → word group → clause → sentence → story
̶ Speech
‒ Sample → spectral band → sound → ... → phone → phoneme → word
76
TRAINABLE FEATURE HIERARCHIES
77
DEEP LEARNING
̶ Each layer corresponds to a ‘‘distributed representation’’
̶ units in a layer are not mutually exclusive
̶ each unit is a separate feature of the input
̶ two units can be ‘‘active’’ at the same time
̶ They do not correspond to a partitioning (clustering) of the
inputs
̶ in clustering, an input can only belong to a single cluster
78
DEEP LEARNING ARCHITECTURES
79
TRAINING DEEP NETWORKS
̶ Purely Supervised
̶ Initialize parameters randomly
̶ Train in supervised mode, typically with SGD, using backprop
to compute gradients
̶ Used in most practical systems for speech and image
recognition
80
TRAINING DEEP NETWORKS
̶ Unsupervised, layerwise + supervised classifier on top
̶ Train each layer unsupervised, one after the other
̶ Train a supervised classifier on top, keeping the other layers
fixed
̶ Good when very few labeled samples are available
81
TRAINING DEEP NETWORKS
̶ Unsupervised, layerwise + global supervised fine-tuning
̶ Train each layer unsupervised, one after the other
̶ Add a classifier layer, and retrain the whole thing
supervised
̶ Good when label set is poor (e.g. pedestrian detection)
̶ Unsupervised pre-training often uses regularized auto-
encoders
82
ADVERSARIAL
EXAMPLES
83
ADVERSARIAL EXAMPLES
84
ADVERSARIAL EXAMPLES
85
ADVERSARIAL EXAMPLES
86
WHY DOES THIS HAPPEN?
̶ Neural networks are built out of ~linear building blocks
̶ In high dimensions, the value of a linear function can
change rapidly:
̶ With a perturbation of $\epsilon$, a linear function with weights $\mathbf{w}$ can change by up to $\epsilon \|\mathbf{w}\|_1$
̶ This value can be very large!
87
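A small numeric check of this claim (the dimension, $\epsilon$ and the weights are arbitrary): perturbing each coordinate by $\epsilon$ in the direction of $\mathrm{sign}(\mathbf{w})$ changes a linear function $\mathbf{w} \cdot \mathbf{x}$ by exactly $\epsilon \|\mathbf{w}\|_1$, which grows with the dimension even though each coordinate barely moves.

```python
import numpy as np

rng = np.random.default_rng(6)
d = 10_000                       # high-dimensional input
w = rng.normal(size=d)           # weights of a linear function f(x) = w . x
x = rng.normal(size=d)
eps = 0.01                       # tiny per-coordinate perturbation

x_adv = x + eps * np.sign(w)     # adversarial perturbation of max-norm eps
change = np.dot(w, x_adv) - np.dot(w, x)
print(change, eps * np.sum(np.abs(w)))   # both ~ eps * ||w||_1, large despite tiny eps
```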
TAKING ADVANTAGE OF ADVERSARIAL EXAMPLES
̶ Adversarial training:
̶ Including adversarial examples in the training set
̶ Good for regularization!
̶ Encourages the learned function to be locally
constant (instead of locally linear)
88
VIRTUAL ADVERSARIAL TRAINING
̶ A variant of adversarial training for semi-supervised learning
89
ADVERSARIAL TRAINING
̶ Assumption:
̶ Different classes lie on disconnected manifolds
90
APPLICATIONS: EVADING RECOGNITION
Xu, K., Zhang, G., Liu, S., Fan, Q., Sun, M., Chen, H., Chen, P. Y., Wang, Y., & Lin, X.
(2019). Adversarial T-shirt! Evading Person Detectors in A Physical World. Lecture Notes in
Computer Science (Including Subseries Lecture Notes in Artificial Intelligence and Lecture
Notes in Bioinformatics), 12350 LNCS, 665–681. https://doi.org/10.48550/arxiv.1910.11099
91
APPLICATIONS: EVADING RECOGNITION
https://www.deeplearningbook.org/
deeplearning.ai
coursera.org
udacity.com
http://neuralnetworksanddeeplearning.com/
94