
DEPARTMENT OF APPLIED MATHEMATICS, COMPUTER SCIENCE AND STATISTICS

DEEP LEARNING
Big Data Science (Master in Statistical Data Analysis)
TURING AWARD 2018

Yann LeCun Geoffrey Hinton Yoshua Bengio

“for conceptual and engineering breakthroughs that have made deep neural networks a critical component of computing”
A MODEL THAT CAN LEARN ANYTHING?
̶ Translating text
̶ Moving a robot
̶ Detecting cancer from a picture
̶ Driving a car
̶ Playing computer games (and beating human players)

3
A MODEL THAT CAN LEARN ANYTHING?

4
A MODEL THAT CAN LEARN ANYTHING?

5
A MODEL THAT CAN LEARN ANYTHING?
̶ Most machine learning models:
̶ Make assumptions on the data

̶ Rely on expert-driven feature engineering

6
̶ "A model that can learn anything"

̶ "A model that can approximate


continuous functions on compact
𝑛
subsets of ℝ "

7
NEURAL NETWORKS
8
THE ARTIFICIAL NEURON
• Biological analogy

9
THE ARTIFICIAL NEURON

Figure: a neuron with inputs x_1 … x_n, weights w_1 … w_n, a bias b and a non-linear activation function g, producing the output

  y = g(Σ_{i=1}^{n} w_i x_i + b)

First proposed in 1943 (McCulloch and Pitts)
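Below is a minimal NumPy sketch of this neuron model; the input values, weights, bias and the tanh activation are illustrative choices, not values from the slides.

import numpy as np

def neuron(x, w, b, g=np.tanh):
    # Single artificial neuron: y = g(sum_i w_i * x_i + b)
    return g(np.dot(w, x) + b)

# Illustrative example with three inputs
x = np.array([0.5, -1.0, 2.0])
w = np.array([0.1, 0.4, -0.3])
b = 0.2
print(neuron(x, w, b))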


10
A SINGLE NEURON MODEL
Geometrical interpretation
̶ Output can be written as y = g(w·x + b)

̶ Separating line (decision boundary) is given by w·x + b = 0

̶ In two dimensions this can be rewritten as: x_2 = -(w_1/w_2) x_1 - b/w_2

11
A SINGLE NEURON MODEL
Geometrical interpretation
̶ weights determine the slope
of the line (direction)
̶ bias determines the offset of
the line (shift)

12
ACTIVATION FUNCTION

Common choices: identity, sigmoid (logistic), hyperbolic tangent (tanh), rectified linear unit (ReLU), exponential linear unit (ELU), scaled exponential linear unit (SELU)
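A small NumPy sketch of these activation functions; the SELU constants are the ones from Klambauer et al. (2017), everything else follows the standard definitions.

import numpy as np

def identity(z):
    return z

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def tanh(z):
    return np.tanh(z)

def relu(z):
    return np.maximum(0.0, z)

def elu(z, alpha=1.0):
    return np.where(z > 0, z, alpha * (np.exp(z) - 1.0))

def selu(z, alpha=1.6732632423543772, scale=1.0507009873554805):
    # scale * ELU(z, alpha); constants from Klambauer et al. (2017)
    return scale * np.where(z > 0, z, alpha * (np.exp(z) - 1.0))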

13
MANY NEURONS MAKE A LAYER (OR A PERCEPTRON)
A (single-layer) perceptron is a very simple type of feed-forward network

Figure: inputs x_1 … x_n feeding two output neurons,

  y_1 = g(Σ_{i=1}^{n} w_1i x_i + b_1)
  y_2 = g(Σ_{i=1}^{n} w_2i x_i + b_2)

Introduced in 1958 by Frank Rosenblatt

14
FEED-FORWARD LAYER
̶ A fully connected feed-forward layer can also be
expressed as a matrix multiplication:
Figure: layer i (input x of dimension d) connected to layer i+1 through a weight matrix W and bias vector b, so that z = Wx + b and

  y = g(Wx + b)
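As a sketch, the whole layer is one matrix-vector product plus a bias (NumPy); the dimensions and the tanh activation below are illustrative.

import numpy as np

def dense_layer(x, W, b, g=np.tanh):
    # y = g(Wx + b) for one fully connected layer
    return g(W @ x + b)

rng = np.random.default_rng(0)
d_in, d_out = 4, 3
W = rng.standard_normal((d_out, d_in))   # weight matrix of layer i+1
b = rng.standard_normal(d_out)           # bias vector
x = rng.standard_normal(d_in)            # activations of layer i
y = dense_layer(x, W, b)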

15
PERCEPTRON LEARNING (TRAINING)
̶ Adapting the weights and bias of the perceptron
̶ The goal is to minimize the errors on the training samples
̶ Weights and biases are updated in an iterative procedure:
  w_i(t+1) = w_i(t) + Δw_i(t),  b(t+1) = b(t) + Δb(t)

̶ How to compute Δw_i(t) and Δb(t)?

16
PERCEPTRON
̶ How to compute Δw_i(t) and Δb(t)?
̶ Three learning rules (sketched in code below):
̶ Hebbian rule
‒ Δw_i(t) = γ y x_i
‒ If two units are activated simultaneously, their connection should be strengthened (always update)
̶ Perceptron rule
‒ Δw_i(t) = d(x) x_i
‒ Weights are only updated if y ≠ d(x)
̶ Delta rule (Widrow-Hoff rule)
‒ Δw_i(t) = γ (d(x) - y) x_i
‒ Uses the difference between the actual and the desired activation to adapt the connection strength
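A compact NumPy sketch of the three rules for a single neuron; the learning rates are illustrative defaults, and d denotes the desired output d(x).

import numpy as np

def hebbian_update(w, x, y, gamma=0.1):
    return w + gamma * y * x                    # always strengthen active connections

def perceptron_update(w, x, y, d, gamma=1.0):
    return w + gamma * d * x if y != d else w   # update only on misclassification

def delta_update(w, x, y, d, gamma=0.1):
    return w + gamma * (d - y) * x              # Widrow-Hoff: use the error d - y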

17
PERCEPTRON
New York Times:

"The Navy revealed the embryo of an electronic computer today that it expects will be able to walk, talk, see, write, reproduce itself and be conscious of its existence."

18
MANY LAYERS MAKE A MULTI-LAYER PERCEPTRON

Figure: inputs x_1 … x_n (input layer), hidden units h_1, h_2 (hidden layer), outputs y_1, y_2, y_3 (output layer)
AN MLP WITH SEVERAL HIDDEN LAYERS IS A "DEEP NEURAL NETWORK"

Figure: inputs x_1 … x_n (input layer), several hidden layers, outputs y_1, y_2, y_3 (output layer)
MULTI-LAYER PERCEPTRON (MLP)

21
UNIVERSAL APPROXIMATION THEOREM
̶ A feed-forward network with a single hidden layer containing a finite number of neurons can approximate continuous functions on compact subsets of ℝ^n

Cybenko, 1989
Hornik, 1991
22
SO WHAT NOW?
̶ Deep learning was invented in the 1960's
̶ It was stated in the 1980's that an MLP with a single hidden layer is enough (?)
̶ The theorem still restricts the class of functions that can be learned
̶ We don't know:
‒ Size of the hidden layer needed
‒ Learning algorithm needed

23


TRAINING A NEURAL
NETWORK
24
LOSS FUNCTIONS
̶ As for other models (e.g. linear regression), we choose a cost function that is minimized during training

̶ Example of a cost function for regression:

̶ Mean Squared Error (MSE):

  J(θ) = (1/N) Σ_{i=1}^{N} (y_i - ŷ_i)²
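As a one-line NumPy sketch:

import numpy as np

def mse(y, y_hat):
    # J = (1/N) * sum_i (y_i - y_hat_i)^2
    return np.mean((y - y_hat) ** 2)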

25
LOSS FUNCTIONS
̶ Loss function for classification (M classes):
̶ Typically, the outputs are one-hot encoded:
‒ M output neurons ŷ_1, …, ŷ_M
‒ Σ_{j=1}^{M} ŷ_j = 1
‒ Desired output for an instance of class c:
‒ Neuron c should be 1: y_c = 1
‒ All other neurons should be 0: y_j = 0 ∀ j ≠ c

̶ Softmax activation function for the output layer:

  ŷ_j = e^{z_j} / Σ_{i=1}^{M} e^{z_i}

̶ Cross-entropy loss:

  J(θ) = -(1/N) Σ_{i=1}^{N} Σ_{j=1}^{M} y_j log ŷ_j
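A NumPy sketch of both; subtracting the maximum before exponentiating is a standard numerical-stability trick that is not on the slide.

import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z, axis=-1, keepdims=True))
    return e / np.sum(e, axis=-1, keepdims=True)

def cross_entropy(Y, Y_hat, eps=1e-12):
    # Y: one-hot targets, Y_hat: predicted probabilities, both of shape (N, M)
    return -np.mean(np.sum(Y * np.log(Y_hat + eps), axis=1))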

26
FORMULATION AS AN OPTIMIZATION PROBLEM
̶ Minimize 𝐽(𝜃) with respect to the parameters of the model 𝜃

Training set: {(x_1, y_1), …, (x_N, y_N)}

27
OPTIMIZING THE LOSS
Model | Loss function | Optimization
Linear regression | Convex | Analytical solution
Logistic regression | Convex | Newton-Raphson method
Neural network | Non-convex | Gradient descent

28
OPTIMIZING THE LOSS
Loss function: J(θ)
Loss for a single input x: J(x; θ)

Parameters: θ = {W^1, …, W^L}
First layer: h^0 = x
Hidden layers: h^l = g(W^l h^{l-1})
Last layer: h^L = g(W^L h^{L-1}) = g(W^L g(W^{L-1} g(…))) = ŷ

Gradient descent: adapt the weights by taking a step in the negative direction of the gradient

For an instance x:  Δw_jk^l = -γ ∂J(x; θ) / ∂w_jk^l
29
BACKPROPAGATION
̶ In a multi-layered network:
̶ The gradient of the cost is easily computed for the
last layer
̶ The gradient in the previous layers is computed using the chain rule of calculus

̶ Backpropagation allows us to compute the gradient for


all weights in the network

30
Introduced in 1986 by Rumelhart
BACKPROPAGATION

Cost function J(θ)

Figure: input x flows through W^1 (hidden layer 1), W^2 (hidden layer 2) and W^3 (output layer) to produce ŷ, which is compared against y by the cost function.

MSE:
  J(θ) = (1/2N) Σ_{i=1}^{N} ||y_i - ŷ_i||²

Cross-entropy loss:
  J(θ) = -(1/N) Σ_{i=1}^{N} Σ_{j=1}^{M} y_ij log ŷ_ij

Gradient of the cost: ∇_θ J(θ)
31
START WITH THE OUTPUT NODES
̶ Take output neuron k, for an input pair (x, y):

  ŷ_k = h_k^L = g(z_k^L)
  z_k^L = w_k^L · h^{L-1} = Σ_{j=1}^{n_{L-1}} w_kj^L h_j^{L-1}

  Mean Squared Error:  J(θ) = (1/2N) Σ_{i=1}^{N} ||y_i - ŷ_i||²

  For a single instance:
  J(x; θ) = (1/2) Σ_{k=1}^{n_L} (y_k - ŷ_k)²
          = (1/2) Σ_{k=1}^{n_L} (y_k² - 2 y_k ŷ_k + ŷ_k²)

̶ Compute the gradient descent update for the weights in this layer:

  Δw_kj^L = -γ ∂J(x; θ) / ∂w_kj^L
          = -γ (∂J/∂ŷ_k) (∂ŷ_k/∂z_k^L) (∂z_k^L/∂w_kj^L)

  with  ∂J/∂ŷ_k = ŷ_k - y_k,   ∂ŷ_k/∂z_k^L = g'(z_k^L),   ∂z_k^L/∂w_kj^L = h_j^{L-1}

  Δw_kj^L = -γ (ŷ_k - y_k) g'(z_k^L) h_j^{L-1}
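A NumPy sketch of this output-layer update, assuming the MSE loss above and an activation g whose derivative g_prime is available; the learning rate is illustrative.

import numpy as np

def output_layer_update(W_L, h_prev, z_L, y, y_hat, g_prime, gamma=0.01):
    delta = (y_hat - y) * g_prime(z_L)   # error signal at the output neurons
    grad_W = np.outer(delta, h_prev)     # dJ/dW^L, one row per output neuron
    return W_L - gamma * grad_W          # gradient-descent step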

32
INTERPRETATION OF THE GRADIENT

  Δw_kj^L = -γ (ŷ_k - y_k) g'(z_k^L) h_j^{L-1}

̶ γ: step size of the gradient descent
̶ (ŷ_k - y_k): difference between the expected and the observed output
̶ h_j^{L-1}: output of hidden node j in the previous layer

33
CONTINUE WITH THE HIDDEN NODES
̶ Take hidden node h_j^{L-1}, for an input pair (x, y):

  h_j^{L-1} = g(z_j^{L-1})
  z_j^{L-1} = w_j^{L-1} · h^{L-2} = Σ_{i=1}^{n_{L-2}} w_ji^{L-1} h_i^{L-2}

̶ Compute the gradient descent update for the weights in this layer:

  Δw_ji^{L-1} = -γ ∂J(x; θ) / ∂w_ji^{L-1}
              = -γ (∂J/∂h_j^{L-1}) (∂h_j^{L-1}/∂z_j^{L-1}) (∂z_j^{L-1}/∂w_ji^{L-1})

  with  ∂h_j^{L-1}/∂z_j^{L-1} = g'(z_j^{L-1})  and  ∂z_j^{L-1}/∂w_ji^{L-1} = h_i^{L-2}

  The remaining factor ∂J/∂h_j^{L-1} is trickier!
34
CONTINUE WITH THE HIDDEN NODES
̶ We don’t know directly the contribution of h_j^{L-1} to J(x; θ)
̶ But we can write the error as a function of the weighted sums of the inputs to the output layer:

  J(x; θ) = J(z_1^L, …, z_{n_L}^L)

̶ We can now calculate the derivative:

  ∂J/∂h_j^{L-1} = Σ_{l=1}^{n_L} (∂J/∂z_l^L) (∂z_l^L/∂h_j^{L-1})
                = Σ_{l=1}^{n_L} (∂J/∂z_l^L) ∂(Σ_{t=1}^{n_{L-1}} w_lt^L h_t^{L-1}) / ∂h_j^{L-1}
                = Σ_{l=1}^{n_L} (∂J/∂z_l^L) w_lj^L

  We already computed ∂J/∂z_l^L = (ŷ_l - y_l) g'(z_l^L) for the gradient of the output layer, so

  ∂J/∂h_j^{L-1} = Σ_{l=1}^{n_L} (ŷ_l - y_l) g'(z_l^L) w_lj^L
35
CONTINUE WITH THE HIDDEN NODES
̶ Take hidden node h_j^{L-1}, for an input pair (x, y):

  h_j^{L-1} = g(z_j^{L-1})
  z_j^{L-1} = w_j^{L-1} · h^{L-2} = Σ_{i=1}^{n_{L-2}} w_ji^{L-1} h_i^{L-2}

̶ Compute the gradient descent update for the weights in this layer:

  Δw_ji^{L-1} = -γ (∂J/∂h_j^{L-1}) g'(z_j^{L-1}) h_i^{L-2}

  Substituting ∂J/∂h_j^{L-1} = Σ_{l=1}^{n_L} (ŷ_l - y_l) g'(z_l^L) w_lj^L gives

  Δw_ji^{L-1} = -γ g'(z_j^{L-1}) h_i^{L-2} Σ_{l=1}^{n_L} (ŷ_l - y_l) g'(z_l^L) w_lj^L

36
WEIGHT UPDATE FOR THE HIDDEN NODES

  Δw_ji^{L-1} = -γ g'(z_j^{L-1}) h_i^{L-2} Σ_{l=1}^{n_L} (ŷ_l - y_l) g'(z_l^L) w_lj^L

̶ γ: step size of the gradient descent
̶ h_i^{L-2}: output of the previous layer
̶ w_lj^L: influence on node l of the next layer
̶ The sum is the contribution to the error on all nodes of the next layer, weighted by this influence
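In vector form this is the usual backpropagation step: multiply the output error by the transposed weights and the local derivative. A NumPy sketch, under the same assumptions as before:

import numpy as np

def hidden_layer_update(W_prev, h_before, z_prev, W_L, delta_L, g_prime, gamma=0.01):
    # delta_L = (y_hat - y) * g'(z^L) is the error signal of the output layer
    delta_prev = g_prime(z_prev) * (W_L.T @ delta_L)  # backpropagated error at layer L-1
    grad_W = np.outer(delta_prev, h_before)           # dJ/dW^{L-1}
    return W_prev - gamma * grad_W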

37
VANISHING AND EXPLODING GRADIENT
̶ The gradient of the loss in one layer is computed as
the product of the gradients in all subsequent layers
̶ Vanishing gradient: if one of the elements is close
to zero, the product is close to zero
̶ Exploding gradient: if one of the elements is very
large, the product becomes very large

Discovered by Sepp Hochreiter in 1991


38
BIAS-VARIANCE TRADEOFF
̶ Remember! The more parameters the model has, the
higher the variance and the lower the bias.

̶ How many parameters do we have in a neural net?


̶ For each layer:
‒ Weights (matrix)
‒ Biases (vector)

39
EXAMPLE: MNIST
Handwritten digit classification:
̶ 28x28 images (784 pixels)
̶ 10 classes

Imagine a network with:


̶ 784 inputs
̶ A single hidden layer with 1000
units
̶ 10 output units

40
EXAMPLE: MNIST
Imagine a network with:
̶ 784 inputs
̶ A single hidden layer with 1000 units
̶ 10 output units

Figure: input layer (784) → hidden layer (1000) via W^1 → output layer (10) via W^2, producing ŷ

  W^1 ∈ ℝ^{1000×784},  W^2 ∈ ℝ^{10×1000}

̶ Total parameters counting biases:


784,000 + 10,000 + 1000 + 10
= 795,010
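The same count, done in code:

n_in, n_hidden, n_out = 784, 1000, 10
weights = n_in * n_hidden + n_hidden * n_out   # 784,000 + 10,000
biases = n_hidden + n_out                      # 1,000 + 10
print(weights + biases)                        # 795,010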

41
HOW TO GO DEEPER
42
DEEP NEURAL NETWORKS
̶ By adding more layers:
̶ The number of parameters stays smaller, provided the width of the layers is not too large
̶ Complex non-linearities can be modeled
̶ Deeper layers in the model represent more abstract
features based on the input data

43
CHALLENGES
̶ Neural networks can have huge numbers of
parameters
̶ This requires enormous training datasets
̶ The availability of such data is limited
̶ The computational effort to train such a network is
considerable

44


WHY NOW?
̶ Two main factors have enabled the rise of deep
learning in the 2000’s and 2010’s:
̶ Data availability
̶ GPUs as general-purpose platforms
‒ GPUs can be used to compute matrix/tensor
operations very efficiently
‒ Therefore, forward- and backpropagation can be
implemented in GPU

45


HOW TO DEAL WITH THE LARGE PARAMETER SPACE
̶ Huge training set
̶ But the gradient becomes very expensive to compute!
‒ Stochastic gradient descent
̶ Assumptions on the inputs to reduce the number of parameters
̶ Some assumptions are bound to the type of input:
‒ Images
‒ Sounds
‒ ...
̶ New architectures:
‒ Convolutional neural networks
̶ Regularization
̶ Parameter regularization
̶ Data augmentation
̶ Momentum

46
STOCHASTIC GRADIENT DESCENT
̶ Instead of computing the total gradient for the entire training
set, compute it for a “mini-batch” containing a few instances
̶ Update the weights
𝜽 ⟵ 𝜽 − 𝜖𝒈
̶ Repeat many, many times
̶ Iteration: one gradient descent update for a single mini-batch
̶ Epoch: one full pass through the entire training set
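A minimal sketch of such a training loop (NumPy); loss_grad is a hypothetical function that returns the gradient of the loss on one mini-batch, and all hyperparameter values are illustrative.

import numpy as np

def sgd(theta, X, Y, loss_grad, lr=0.01, batch_size=32, epochs=10, seed=0):
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    for epoch in range(epochs):                   # epoch: full pass over the training set
        order = rng.permutation(n)
        for start in range(0, n, batch_size):     # iteration: one mini-batch update
            batch = order[start:start + batch_size]
            g = loss_grad(theta, X[batch], Y[batch])
            theta = theta - lr * g                # theta <- theta - epsilon * g
    return theta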

47


STOCHASTIC GRADIENT DESCENT

48


MOMENTUM
̶ In each iteration, update the weights with a
combination of the current and the previous gradients
̶ Potential benefits:
̶ Accelerate convergence
̶ Reduce the risk of undesired leaps in the search
space
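A sketch of the update; the momentum coefficient 0.9 is a common default, not taken from the slides.

def momentum_step(theta, grad, velocity, lr=0.01, beta=0.9):
    # Blend the previous step with the current gradient
    velocity = beta * velocity - lr * grad
    return theta + velocity, velocity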

49


REGULARIZATION
̶ As in simpler models, regularization can be added by penalizing the cost:
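For example, an L2 (weight-decay) penalty can be added to the cost; the sketch below assumes a list of weight matrices and an illustrative penalty strength lam.

import numpy as np

def l2_regularized_loss(loss, weight_matrices, lam=1e-4):
    return loss + lam * sum(np.sum(W ** 2) for W in weight_matrices)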

50


HUGE TRAINING SET
̶ In some cases, we just don’t have enough data to train

̶ Data augmentation
̶ Sometimes it is easy to generate new training data
by adding realistic variations to available instances
‒ Images can be translated, rotated, flipped
‒ Adding noise
̶ Noisy training data reduces the chance of overfitting
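A minimal augmentation sketch for image arrays (NumPy): random flips, 90-degree rotations and additive Gaussian noise; the noise level and the 90-degree rotation are illustrative simplifications.

import numpy as np

def augment(img, rng):
    if rng.random() < 0.5:
        img = np.fliplr(img)                        # horizontal flip
    img = np.rot90(img, k=int(rng.integers(0, 4)))  # random 90-degree rotation
    return img + rng.normal(0.0, 0.05, img.shape)   # add a little Gaussian noise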

51


BATCH NORMALIZATION
̶ To deal with vanishing and exploding gradient, all
gradients and weights should be kept within a
reasonable range
̶ This range is usually a standard normal distribution

̶ Batch normalization layers normalize the activations of


the network for every minibatch

Proposed by Ioffe and Szegedy in 2015
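A sketch of the batch-normalization forward pass for one mini-batch (NumPy); gamma and beta are the learnable scale and shift parameters from Ioffe and Szegedy (2015).

import numpy as np

def batch_norm_forward(X, gamma, beta, eps=1e-5):
    # X has shape (batch_size, n_features); normalize each feature over the mini-batch
    mu = X.mean(axis=0)
    var = X.var(axis=0)
    X_hat = (X - mu) / np.sqrt(var + eps)
    return gamma * X_hat + beta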

52


SELF NORMALIZING NETWORKS
̶ The SELU activation function ensures that activations and gradients remain close to a normal distribution with mean 0 and variance 1

Proposed by Klambauer et al. in 2017


53
PROPERTIES OF THE ACTIVATION FUNCTION

1. Controlling the mean: the function takes both negative and positive values
2. Dampening the variance if it is too large in the lower layer: saturation region for large negative inputs
3. Increasing the variance if it is too small in the lower layer: region with slope larger than one
4. A single point where variance damping is equalized by variance increase: continuous curve
54
DROPOUT
̶ Dropout consists of randomly setting a subset of the activations of a layer to zero during training

̶ This reduces overfitting, as the output is not allowed to depend on a single, highly specific path across the network
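A sketch of the commonly used "inverted dropout" variant (NumPy), which rescales the surviving activations so that no change is needed at test time.

import numpy as np

def dropout(h, p_drop, rng, training=True):
    if not training or p_drop == 0.0:
        return h
    mask = rng.random(h.shape) >= p_drop   # keep each activation with probability 1 - p_drop
    return h * mask / (1.0 - p_drop)       # rescale to preserve the expected activation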

55
CONVOLUTIONAL
NEURAL NETWORKS
56
LEARNING FROM IMAGES
̶ A very simplistic way of working with images is flattening each picture into a vector

̶ However, there are known facts about images:


̶ There is a spatial structure
̶ A single pixel doesn’t have much value by itself
̶ Pixels close to each other tend to be similar
̶ Patterns can appear in different parts of the image
̶ In some problems, concepts are translation and/or rotation
invariant

57
CONVOLUTION
̶ A convolution is an operation on two functions
̶ At each point, the “input” function f is weighted across its
entire domain, by using a “kernel” function g

58
CONVOLUTION = LOCAL RECEPTIVE FIELD

Input: 28x28 inputs (e.g. image pixels)

Convolution: 5x5 mask

59
EXAMPLES OF CONVOLUTIONS

Figure: original image, mean kernel (9x9), median kernel (9x9), Sobel filter [-1 0 1]


60
LEARNING THE KERNEL
̶ What if we don’t know the weights?
̶ We want to TRAIN the kernel:
̶ Include the convolution operation as a layer of the neural network
̶ The convolution is applied to the layer, producing an output of
(almost) the same size
̶ The same kernel is used for the entire input

̶ A convolutional layer has few trainable weights, no matter the size of its
input
̶ This makes it possible to stack many convolutional layers!

61
CONVOLUTION

https://cs231n.github.io/convolutional-networks/

62
SHARED WEIGHTS AND BIASES
̶ Each hidden neuron has a bias and 5×5 weights connected to its local
receptive field
̶ We are going to use the same weights and bias for each of the 24×24
hidden neurons
̶ The convolution defines a filter or kernel, that is applied to different local
regions in the input

  y_{j,k} = g(b + Σ_{l=0}^{4} Σ_{m=0}^{4} w_{l,m} a_{j+l, k+m})

  where w_{l,m} are the shared connection weights, b is the shared bias, and a_{i,j} are the inputs


63
CONVOLUTIONAL NEURAL NETWORKS

Proposed in 1989 by Yann LeCun


64
IMAGENET CNN ARCHITECTURE

Krizhevsky, A., Sutskever, I., and Hinton, G. E. ImageNet classification with deep convolutional neural networks. In NIPS, pp. 1106–1114, 2012.
65
HYPERPARAMETERS
66
LIST OF HYPERPARAMETERS
̶ Architecture:
̶ Number (and type) of layers
̶ Size of the fully connected layers
̶ Size of the convolution (and pooling) kernels
̶ Activation functions
̶ Non-standard connections (e.g. skip connections)
̶ …
̶ Parameters for stochastic gradient descent:
̶ Learning rate
̶ Learning rate updates
̶ Mini-batch size
̶ Number of iterations
̶ Momentum
̶ …
̶ Regularization parameters:
̶ Weight decay
̶ Dropout rate
̶ …
̶ …

67
HOW TO CHOOSE HYPERPARAMETER VALUES
̶ Deep neural networks are extremely prone to over-fitting
̶ They may also take a lot of iterations before they reach a good area of
the parameter space and start converging
̶ A few general guidelines:
̶ Larger networks have more capacity to learn, but require more:
‒ Memory
‒ Time
‒ Training data
̶ Minibatches should be as large as possible, for more robust
convergence

68
DEEP NEURAL NETS
AND ABSTRACTION
69
TRADITIONAL ML APPROACH

Figure: Input → hand-crafted feature extractor → feature-based representation → trainable classifier (supervised model)

70
EXAMPLE: SPEECH RECOGNITION

Figure: MFCC (fixed) → Gaussian Mixture Model (GMM, unsupervised) → classifier (MLP, supervised)

71
EXAMPLE: OBJECT RECOGNITION

Figure: SIFT / HoG (fixed, low-level features) → K-means / sparse coding + pooling (unsupervised, mid-level features) → classifier (MLP, supervised)
72
REPRESENTATION LEARNING

Figure: Input → trainable feature extractor → trainable classifier (supervised model)

73
INSPIRATION FROM THE VISUAL CORTEX

74
LEARNING HIERARCHICAL REPRESENTATIONS
̶ It's deep if it has more than one stage of non-linear feature
transformation

75
TRAINABLE FEATURE HIERARCHIES
̶ Hierarchy of representations with increasing level of abstraction
̶ Each stage is a kind of trainable feature transform
̶ Image recognition
‒ Pixel → edge → texton → motif → part → object
̶ Text
‒ Character → word → word group → clause → sentence → story
̶ Speech
‒ Sample → spectral band → sound → ... → phone → phoneme → word →

76
TRAINABLE FEATURE HIERARCHIES

̶ Each module transforms its input representation into a


higher-level one
̶ High-level features are more global and more invariant
̶ Low-level features are shared among categories

77
DEEP LEARNING
̶ Each layer corresponds to a ‘‘distributed representation’’
̶ units in a layer are not mutually exclusive
̶ each unit is a separate feature of the input
̶ two units can be ‘‘active’’ at the same time
̶ They do not correspond to a partitioning (clustering) of the
inputs
̶ in clustering, an input can only belong to a single cluster

78
DEEP LEARNING ARCHITECTURES

79
TRAINING DEEP NETWORKS
̶ Purely Supervised
̶ Initialize parameters randomly
̶ Train in supervised mode, typically with SGD, using backprop
to compute gradients
̶ Used in most practical systems for speech and image
recognition

80
TRAINING DEEP NETWORKS
̶ Unsupervised, layerwise + supervised classifier on top
̶ Train each layer unsupervised, one after the other
̶ Train a supervised classifier on top, keeping the other layers
fixed
̶ Good when very few labeled samples are available

81
TRAINING DEEP NETWORKS
̶ Unsupervised, layerwise + global supervised fine-tuning
̶ Train each layer unsupervised, one after the other
̶ Add a classifier layer, and retrain the whole thing
supervised
̶ Good when label set is poor (e.g. pedestrian detection)
̶ Unsupervised pre-training often uses regularized auto-encoders

82
ADVERSARIAL
EXAMPLES
83
ADVERSARIAL EXAMPLES

84
ADVERSARIAL EXAMPLES

85
ADVERSARIAL EXAMPLES

86
WHY DOES THIS HAPPEN?
̶ Neural networks are built out of approximately linear building blocks
̶ In high dimensions, the value of a linear function can change rapidly:
̶ With a perturbation of size ε, a linear function with weights w can change by up to ε ||w||_1
̶ This value can be very large!
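A quick numeric check of this bound: for a linear function f(x) = w·x, perturbing the input by ε·sign(w) changes the output by exactly ε ||w||_1. The dimension and ε below are illustrative.

import numpy as np

rng = np.random.default_rng(0)
w = rng.standard_normal(1000)   # high-dimensional weight vector
x = rng.standard_normal(1000)
eps = 0.01
change = w @ (x + eps * np.sign(w)) - w @ x
print(change, eps * np.sum(np.abs(w)))   # the two values match (up to rounding)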

87
TAKING ADVANTAGE OF ADVERSARIAL EXAMPLES
̶ Adversarial training:
̶ Including adversarial examples in the training set
̶ Good for regularization!
̶ Encourages the learned function to be locally
constant (instead of locally linear)

88
VIRTUAL ADVERSARIAL TRAINING
̶ A variant of adversarial training for semi-supervised learning

1. Take an unlabeled instance 𝒙


2. Use our (partially) trained model to predict its label ŷ (assumed to have a high probability of being the true label)
3. Generate an adversarial example x', for which the model predicts y ≠ ŷ
4. Include the pairs (x, ŷ) and (x', ŷ) in the training set
5. Re-train the model

89
ADVERSARIAL TRAINING
̶ Assumption:
̶ Different classes lie on disconnected manifolds

̶ A small perturbation should not be able to jump between manifolds

90
APPLICATIONS: EVADING RECOGNITION

Xu, K., Zhang, G., Liu, S., Fan, Q., Sun, M., Chen, H., Chen, P. Y., Wang, Y., & Lin, X.
(2019). Adversarial T-shirt! Evading Person Detectors in A Physical World. Lecture Notes in
Computer Science (Including Subseries Lecture Notes in Artificial Intelligence and Lecture
Notes in Bioinformatics), 12350 LNCS, 665–681. https://doi.org/10.48550/arxiv.1910.11099
91
APPLICATIONS: EVADING RECOGNITION

Zolfi, A., Avidan, S., Elovici, Y., & Shabtai, A. (2021).


Adversarial Mask: Real-World Adversarial Attack Against
Face Recognition Models.
https://doi.org/10.48550/arxiv.2111.10759
92
GOING FURTHER
93
BOOKS AND COURSES

https://www.deeplearningbook.org/

deeplearning.ai

coursera.org

udacity.com

http://neuralnetworksanddeeplearning.com/

94
