Lecture 8: Deep Learning
DEEP LEARNING
Big Data Science (Master in Statistical Data Analysis)
TURING AWARD 2018
3
A MODEL THAT CAN LEARN ANYTHING?
̶ Most machine learning models:
̶ Make assumptions about the data
6
̶ "A model that can learn anything"
7
NEURAL NETWORKS
8
THE ARTIFICIAL NEURON
• Biological analogy
9
THE ARTIFICIAL NEURON
[Diagram: inputs $x_1, \dots, x_n$ are multiplied by weights $w_1, \dots, w_n$, summed together with a bias $b$, and passed through a non-linear activation function $g$ to produce the output:]

$$y = g\left(\sum_{i=1}^{n} w_i x_i + b\right)$$
11
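To make the formula concrete, here is a minimal sketch of a single artificial neuron (assuming NumPy is available; the sigmoid activation, inputs, weights and bias are illustrative choices, not values from the slides):

```python
import numpy as np

def sigmoid(z):
    """Non-linear activation function g."""
    return 1.0 / (1.0 + np.exp(-z))

def neuron(x, w, b):
    """Single artificial neuron: y = g(sum_i w_i * x_i + b)."""
    return sigmoid(np.dot(w, x) + b)

# Example with 3 inputs (values chosen arbitrarily)
x = np.array([0.5, -1.0, 2.0])
w = np.array([0.1, 0.4, -0.2])
b = 0.05
print(neuron(x, w, b))
```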
A SINGLE NEURON MODEL
Geometrical interpretation
̶ weights determine the slope
of the line (direction)
̶ bias determines the offset of
the line (shift)
12
ACTIVATION FUNCTION
̶ Rectified linear unit (ReLU)
̶ Exponential linear unit (ELU)
̶ Scaled exponential linear unit (SELU)
13
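As a hedged illustration of the three activations (the SELU constants below are the commonly published values; the exact parametrization used in the slide's figures is assumed, not given):

```python
import numpy as np

def relu(z):
    # max(0, z)
    return np.maximum(0.0, z)

def elu(z, alpha=1.0):
    # z for z > 0, alpha * (exp(z) - 1) otherwise
    return np.where(z > 0, z, alpha * (np.exp(z) - 1.0))

def selu(z, alpha=1.6732632423543772, scale=1.0507009873554805):
    # Scaled ELU; the constants keep activations approximately normalized
    return scale * np.where(z > 0, z, alpha * (np.exp(z) - 1.0))

z = np.linspace(-3, 3, 7)
print(relu(z), elu(z), selu(z), sep="\n")
```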
MANY NEURONS MAKE A LAYER (OR A PERCEPTRON)
A (single-layer) perceptron is a very simple type of feed-forward network
Each output neuron computes its own weighted sum of the same inputs $x_1, \dots, x_n$:

$$y_1 = g\left(\sum_{i=1}^{n} w_{1i} x_i + b_1\right), \qquad y_2 = g\left(\sum_{i=1}^{n} w_{2i} x_i + b_2\right)$$
14
FEED-FORWARD LAYER
̶ A fully connected feed-forward layer can also be
expressed as a matrix multiplication:
$$\mathbf{y} = g(\mathbf{W}\mathbf{x} + \mathbf{b})$$

[Diagram: the $d$-dimensional output $\mathbf{x}$ of layer $i$ is multiplied by the weight matrix $\mathbf{W}$, the bias $\mathbf{b}$ is added to give $\mathbf{z}$, and the activation $g$ produces the output $\mathbf{y}$ of layer $i+1$.]
15
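A minimal sketch of this matrix form (assuming NumPy; the shapes, the tanh activation and the random values are illustrative):

```python
import numpy as np

def dense_layer(x, W, b, g=np.tanh):
    """Fully connected feed-forward layer: y = g(W x + b)."""
    return g(W @ x + b)

rng = np.random.default_rng(0)
d_in, d_out = 4, 3
x = rng.normal(size=d_in)           # output of layer i
W = rng.normal(size=(d_out, d_in))  # weight matrix
b = rng.normal(size=d_out)          # bias vector
y = dense_layer(x, W, b)            # input to layer i+1
print(y.shape)                      # (3,)
```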
PERCEPTRON LEARNING (TRAINING)
̶ Adapting the weights and bias of the perceptron
̶ The goal is to minimize the errors on the training samples
̶ Weights and biases are updated in an iterative procedure:
$$w_i(t+1) = w_i(t) + \Delta w_i(t), \qquad b(t+1) = b(t) + \Delta b(t)$$
16
PERCEPTRON
̶ How to compute $\Delta w_i(t)$ and $\Delta b(t)$?
̶ Three learning rules (the delta rule is sketched in code after this slide):
̶ Hebbian rule
‒ $\Delta w_i(t) = \gamma\, y\, x_i$
‒ If two units are activated simultaneously, their connection should be strengthened (always update)
̶ Perceptron rule
‒ $\Delta w_i(t) = d(\mathbf{x})\, x_i$
‒ Weights are only updated if $y \neq d(\mathbf{x})$
̶ Delta rule (Widrow-Hoff rule)
‒ $\Delta w_i(t) = \gamma\, (d(\mathbf{x}) - y)\, x_i$
‒ Uses the difference between the actual and the desired activation to adapt the connection strength
17
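A minimal sketch of the delta rule on a toy problem (the logical-AND data, the step activation, the learning rate and the number of epochs are all arbitrary choices for illustration; with a step activation this coincides with the perceptron rule):

```python
import numpy as np

def step(z):
    return np.where(z >= 0, 1.0, 0.0)

# Toy linearly separable problem: logical AND
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
d = np.array([0, 0, 0, 1], dtype=float)   # desired outputs d(x)

rng = np.random.default_rng(0)
w = rng.normal(size=2)
b = 0.0
gamma = 0.1                                # learning rate

for epoch in range(50):
    for x, target in zip(X, d):
        y = step(np.dot(w, x) + b)
        # Delta rule: dw_i = gamma * (d(x) - y) * x_i
        w += gamma * (target - y) * x
        b += gamma * (target - y)

print(step(X @ w + b))  # should converge to [0. 0. 0. 1.]
```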
PERCEPTRON
New York Times (1958): "The Navy revealed the embryo of an electronic computer today that it expects will be able to walk, talk, see, write, reproduce itself and be conscious of its existence."
18
MANY LAYERS MAKE A MULTI-LAYER PERCEPTRON
[Diagram: inputs $x_1, \dots, x_n$ feed into hidden-layer units $h_1, h_2, \dots$, whose outputs feed into output units $y_1, y_2, y_3$.]
AN MLP WITH SEVERAL HIDDEN LAYERS IS A "DEEP NEURAL NETWORK"
[Diagram: the same network, now with several hidden layers between the inputs $x_1, \dots, x_n$ and the outputs $y_1, y_2, y_3$.]
MULTI-LAYER PERCEPTRON (MLP)
21
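A minimal sketch of a forward pass through an MLP with one hidden layer (assuming NumPy; the sizes, tanh activations and random weights are illustrative; stacking more hidden layers in the same way gives a deep network):

```python
import numpy as np

def mlp_forward(x, W1, b1, W2, b2):
    """Multi-layer perceptron: input -> hidden layer -> output layer."""
    h = np.tanh(W1 @ x + b1)   # hidden layer activations
    y = np.tanh(W2 @ h + b2)   # output layer activations
    return y

rng = np.random.default_rng(1)
n_in, n_hidden, n_out = 5, 4, 3
W1, b1 = rng.normal(size=(n_hidden, n_in)), np.zeros(n_hidden)
W2, b2 = rng.normal(size=(n_out, n_hidden)), np.zeros(n_out)
print(mlp_forward(rng.normal(size=n_in), W1, b1, W2, b2))
```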
UNIVERSAL APPROXIMATION THEOREM
̶ A feed-forward network with a single hidden layer
containing a finite number of neurons can approximate
continuous functions on compact subsets of $\mathbb{R}^n$
Cybenko, 1989
Hornik, 1991
22
SO WHAT NOW?
̶ Deep learning was invented in the 1960's
̶ It was stated in the 1980's that an MLP with a single hidden layer is enough (?)
̶ The theorem still restricts the class of functions that can be learned (continuous functions on compact subsets)
̶ We don't know:
‒ Size of the hidden layer needed
‒ Learning algorithm needed
25
LOSS FUNCTIONS
̶ Loss function for classification (𝑀 classes):
̶ Typically, the outputs are 1-hot encoded:
‒ $M$ output neurons $\hat{y}_1, \dots, \hat{y}_M$
‒ $\sum_{j=1}^{M} \hat{y}_j = 1$
‒ Desired output for an instance of class $c$:
‒ Neuron $c$ should be 1: $y_c = 1$
‒ All other neurons should be 0: $y_j = 0 \;\; \forall j \neq c$
26
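A small sketch of one-hot targets and of outputs that satisfy $\sum_j \hat{y}_j = 1$ (the softmax used here is one common way to obtain such outputs and is an assumption, as are the class index and the logits):

```python
import numpy as np

def one_hot(c, M):
    """Desired output for class c: y_c = 1, all other entries 0."""
    y = np.zeros(M)
    y[c] = 1.0
    return y

def softmax(z):
    """Output activations that are positive and sum to 1."""
    e = np.exp(z - z.max())
    return e / e.sum()

M = 4                                  # number of classes
z = np.array([2.0, 0.5, -1.0, 0.0])   # arbitrary output-layer pre-activations
y_hat = softmax(z)
y = one_hot(2, M)                      # instance of class c = 2
print(y_hat.sum())                     # 1.0
print(y)                               # [0. 0. 1. 0.]
```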
FORMULATION AS AN OPTIMIZATION PROBLEM
̶ Minimize 𝐽(𝜃) with respect to the parameters of the model 𝜃
Training set: $\{(\mathbf{x}_1, \mathbf{y}_1), \dots, (\mathbf{x}_N, \mathbf{y}_N)\}$
27
OPTIMIZING THE LOSS
Model → Loss function → Optimization
28
OPTIMIZING THE LOSS
Loss function: $J(\theta)$, with parameters $\theta = \{\mathbf{W}^1, \dots, \mathbf{W}^L\}$
Loss for a single input $\mathbf{x}$: $J(\mathbf{x}; \theta)$
First layer: $\mathbf{h}^0 = \mathbf{x}$
Hidden layers: $\mathbf{h}^l = g(\mathbf{W}^l \mathbf{h}^{l-1})$
Last layer: $\mathbf{h}^L = g(\mathbf{W}^L \mathbf{h}^{L-1}) = g(\mathbf{W}^L\, g(\mathbf{W}^{L-1} g(\dots))) = \hat{\mathbf{y}}$

For an instance $\mathbf{x}$, each weight is moved against the gradient of the loss:
$$\Delta w_{jk}^{l} = -\gamma \, \frac{\partial J(\mathbf{x}; \theta)}{\partial w_{jk}^{l}}$$
29
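As a hedged illustration of the update rule $\Delta w = -\gamma\, \partial J / \partial w$ (before backpropagation is introduced), the gradient of a tiny one-layer model is approximated below by finite differences; the model, data, learning rate and step size are all made up for the example:

```python
import numpy as np

def loss(W, x, y):
    """Squared error of a one-layer model y_hat = g(W x) for a single instance."""
    y_hat = np.tanh(W @ x)
    return 0.5 * np.sum((y - y_hat) ** 2)

rng = np.random.default_rng(2)
W = rng.normal(size=(2, 3))
x, y = rng.normal(size=3), np.array([0.0, 1.0])
gamma, eps = 0.1, 1e-6

# Finite-difference approximation of dJ/dw_jk for every weight
grad = np.zeros_like(W)
for j in range(W.shape[0]):
    for k in range(W.shape[1]):
        Wp = W.copy()
        Wp[j, k] += eps
        grad[j, k] = (loss(Wp, x, y) - loss(W, x, y)) / eps

W_new = W - gamma * grad                  # delta w_jk = -gamma * dJ/dw_jk
print(loss(W, x, y), loss(W_new, x, y))   # the loss should decrease
```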
BACKPROPAGATION
̶ In a multi-layered network:
̶ The gradient of the cost is easily computed for the
last layer
̶ The gradient in the previous layers is computed using the
chain rule of calculus
30
Introduced in 1986 by Rumelhart et al.
BACKPROPAGATION
Mean-squared error loss:
$$J(\theta) = \frac{1}{2N} \sum_{i=1}^{N} \|\mathbf{y}_i - \hat{\mathbf{y}}_i\|^2$$

Cross-entropy loss:
$$J(\theta) = -\frac{1}{N} \sum_{i=1}^{N} \sum_{j=1}^{M} y_{ij} \log \hat{y}_{ij}$$

Gradient of the cost: $\nabla_\theta J(\theta)$

[Diagram: a three-layer network with weights $\mathbf{W}^1, \mathbf{W}^2, \mathbf{W}^3$ mapping the input $\mathbf{x}$ to the output $\hat{\mathbf{y}}$.]
31
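A minimal sketch of the two losses on this slide (assuming NumPy; the array shapes and example values are illustrative, with rows indexing instances and columns indexing outputs):

```python
import numpy as np

def mse_loss(Y, Y_hat):
    """J(theta) = 1/(2N) * sum_i ||y_i - y_hat_i||^2"""
    N = Y.shape[0]
    return np.sum((Y - Y_hat) ** 2) / (2 * N)

def cross_entropy_loss(Y, Y_hat, eps=1e-12):
    """J(theta) = -1/N * sum_i sum_j y_ij * log(y_hat_ij)"""
    N = Y.shape[0]
    return -np.sum(Y * np.log(Y_hat + eps)) / N

Y = np.array([[1.0, 0.0], [0.0, 1.0]])        # one-hot targets
Y_hat = np.array([[0.8, 0.2], [0.3, 0.7]])    # predicted probabilities
print(mse_loss(Y, Y_hat), cross_entropy_loss(Y, Y_hat))
```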
START WITH THE OUTPUT NODES
̶ Take output neuron 𝑘, for an input pair (𝐱,𝐲):
$$\hat{y}_k = h_k^L = g(z_k^L)$$
$$z_k^L = \mathbf{w}_k^L \cdot \mathbf{h}^{L-1} = \sum_{j=1}^{n_{L-1}} w_{kj}^L\, h_j^{L-1}$$

Mean-squared error: $J(\theta) = \dfrac{1}{2N} \sum_{i=1}^{N} \|\mathbf{y}_i - \hat{\mathbf{y}}_i\|^2$
32
INTERPRETATION OF THE GRADIENT
$$\Delta w_{kj}^{L} = -\gamma\, (\hat{y}_k - y_k)\, g'(z_k^L)\, h_j^{L-1}$$
33
CONTINUE WITH THE HIDDEN NODES
̶ Take hidden node $h_j^{L-1}$, for an input pair $(\mathbf{x}, \mathbf{y})$:
$$h_j^{L-1} = g(z_j^{L-1}), \qquad z_j^{L-1} = \sum_{i=1}^{n_{L-2}} w_{ji}^{L-1}\, h_i^{L-2}$$

$$\Delta w_{ji}^{L-1} = -\gamma\, \frac{\partial J(\mathbf{x}; \theta)}{\partial w_{ji}^{L-1}}
= -\gamma\, \frac{\partial J}{\partial h_j^{L-1}}\, \frac{\partial h_j^{L-1}}{\partial z_j^{L-1}}\, \frac{\partial z_j^{L-1}}{\partial w_{ji}^{L-1}}
= -\gamma\, \frac{\partial J}{\partial h_j^{L-1}}\, g'(z_j^{L-1})\, h_i^{L-2}$$

The first factor, $\partial J / \partial h_j^{L-1}$, is trickier!
34
CONTINUE WITH THE HIDDEN NODES
̶ We don't know directly the contribution of $h_j^{L-1}$ to $J(\mathbf{x}; \theta)$
̶ But we can write the error as a function of the weighted sums of the inputs from the hidden layer:
$$J(\mathbf{x}; \theta) = J(z_1^L, \dots, z_{n_L}^L)$$
̶ We can now calculate the derivative:
$$\frac{\partial J}{\partial h_j^{L-1}}
= \sum_{l=1}^{n_L} \frac{\partial J}{\partial z_l^L}\, \frac{\partial z_l^L}{\partial h_j^{L-1}}
= \sum_{l=1}^{n_L} \frac{\partial J}{\partial z_l^L}\, \frac{\partial \left( \sum_{t=1}^{n_{L-1}} w_{lt}^L h_t^{L-1} \right)}{\partial h_j^{L-1}}
= \sum_{l=1}^{n_L} \frac{\partial J}{\partial z_l^L}\, w_{lj}^L$$

We already computed $\partial J / \partial z_l^L$ for the gradient of the output layer:
$$\frac{\partial J}{\partial z_l^L} = (\hat{y}_l - y_l)\, g'(z_l^L)
\quad \Rightarrow \quad
\frac{\partial J}{\partial h_j^{L-1}} = \sum_{l=1}^{n_L} (\hat{y}_l - y_l)\, g'(z_l^L)\, w_{lj}^L$$

[Diagram: the hidden node $h_j^{L-1}$ is connected to every output node $\hat{y}_1, \dots, \hat{y}_{n_L}$ through the weights $w_{1j}^L, \dots, w_{n_L j}^L$.]
35
CONTINUE WITH THE HIDDEN NODES
̶ Take hidden node $h_j^{L-1}$, for an input pair $(\mathbf{x}, \mathbf{y})$:
$$\Delta w_{ji}^{L-1} = -\gamma\, \frac{\partial J(\mathbf{x}; \theta)}{\partial w_{ji}^{L-1}}
= -\gamma\, \frac{\partial J}{\partial h_j^{L-1}}\, g'(z_j^{L-1})\, h_i^{L-2}$$

Substituting $\dfrac{\partial J}{\partial h_j^{L-1}} = \sum_{l=1}^{n_L} (\hat{y}_l - y_l)\, g'(z_l^L)\, w_{lj}^L$ gives

$$\Delta w_{ji}^{L-1} = -\gamma\, g'(z_j^{L-1})\, h_i^{L-2} \sum_{l=1}^{n_L} (\hat{y}_l - y_l)\, g'(z_l^L)\, w_{lj}^L$$
36
WEIGHT UPDATE FOR THE HIDDEN NODES
$$\Delta w_{ji}^{L-1} = -\gamma\, g'(z_j^{L-1})\, h_i^{L-2} \sum_{l=1}^{n_L} (\hat{y}_l - y_l)\, g'(z_l^L)\, w_{lj}^L$$

̶ $\gamma$: step size of the gradient descent
̶ $h_i^{L-2}$: output of the previous layer
̶ $w_{lj}^L$: influence on node $l$ of the next layer
̶ $\sum_{l=1}^{n_L} (\hat{y}_l - y_l)\, g'(z_l^L)\, w_{lj}^L$: contribution to the error on all nodes of the next layer, weighted by influence
37
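Putting the derived updates together, here is a minimal sketch of one gradient step for a network with a single hidden layer and MSE loss (biases are omitted and all data and sizes are made up, to stay close to the formulas above):

```python
import numpy as np

g = np.tanh
g_prime = lambda z: 1.0 - np.tanh(z) ** 2

rng = np.random.default_rng(3)
n_in, n_hidden, n_out = 4, 3, 2
W1 = rng.normal(size=(n_hidden, n_in))    # weights of layer L-1
W2 = rng.normal(size=(n_out, n_hidden))   # weights of layer L
x, y = rng.normal(size=n_in), np.array([1.0, -1.0])
gamma = 0.05

# Forward pass
z1 = W1 @ x
h1 = g(z1)                                # hidden layer: h^{L-1}
z2 = W2 @ h1
y_hat = g(z2)                             # output layer: h^L = y_hat
loss_before = 0.5 * np.sum((y - y_hat) ** 2)

# Backward pass
# Output layer: dJ/dz_k^L = (y_hat_k - y_k) * g'(z_k^L)
delta_out = (y_hat - y) * g_prime(z2)
# Hidden layer: dJ/dz_j^{L-1} = g'(z_j^{L-1}) * sum_l dJ/dz_l^L * w_lj^L
delta_hidden = g_prime(z1) * (W2.T @ delta_out)

# Weight updates: delta w = -gamma * delta * (output of the previous layer)
W2 -= gamma * np.outer(delta_out, h1)     # h^{L-1} is the hidden output
W1 -= gamma * np.outer(delta_hidden, x)   # h^{L-2} is the input x here

y_hat_after = g(W2 @ g(W1 @ x))
print(loss_before, 0.5 * np.sum((y - y_hat_after) ** 2))  # loss should decrease
```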
VANISHING AND EXPLODING GRADIENT
̶ The gradient of the loss in one layer is computed as
the product of the gradients in all subsequent layers
̶ Vanishing gradient: if one of the elements is close
to zero, the product is close to zero
̶ Exploding gradient: if one of the elements is very
large, the product becomes very large
39
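A small numeric illustration of this effect (the depth, pre-activation and weight magnitudes are arbitrary; the backpropagated gradient is treated, roughly, as a product of one $g'(z)\,w$ factor per layer):

```python
import numpy as np

def sigmoid_prime(z):
    s = 1.0 / (1.0 + np.exp(-z))
    return s * (1.0 - s)

# Sigmoid derivatives are at most 0.25, so with moderate weights the product
# shrinks exponentially with depth (vanishing gradient); with large weights
# it grows exponentially (exploding gradient).
depth = 20
z = 0.5                        # arbitrary pre-activation
small_w, large_w = 1.0, 5.0    # arbitrary weight magnitudes
print((sigmoid_prime(z) * small_w) ** depth)   # ~3e-13: vanishes
print((sigmoid_prime(z) * large_w) ** depth)   # ~25: grows
```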
EXAMPLE: MNIST
Handwritten digit classification:
̶ 28x28 images (784 pixels)
̶ 10 classes
40
EXAMPLE: MNIST
Imagine a network with:
̶ 784 inputs
̶ A single hidden layer with 1000 units
̶ 10 output units

$$\mathbf{W}^1 \in \mathbb{R}^{1000 \times 784}, \qquad \mathbf{W}^2 \in \mathbb{R}^{10 \times 1000}$$
41
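A quick count of how many parameters this small fully connected network already has (whether a bias per unit is included is an assumption; the slide only lists the two weight matrices):

```python
n_in, n_hidden, n_out = 784, 1000, 10

weights = n_in * n_hidden + n_hidden * n_out   # W1 and W2
biases = n_hidden + n_out                      # if one bias per unit is included
print(weights)            # 794000
print(weights + biases)   # 795010
```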
HOW TO GO DEEPER
42
DEEP NEURAL NETWORKS
̶ By adding more layers:
̶ The number of parameters can be kept smaller than in a wide shallow network, provided the width of the layers is not too large
̶ Complex non-linearities can be modeled
̶ Deeper layers in the model represent more abstract
features based on the input data
43
CHALLENGES
̶ Neural networks can have huge numbers of
parameters
̶ This requires enormous training datasets
̶ The availability of such data is limited
̶ The computational effort to train such a network is
considerable
46
STOCHASTIC GRADIENT DESCENT
̶ Instead of computing the total gradient for the entire training
set, compute it for a “mini-batch” containing a few instances
̶ Update the weights
$$\boldsymbol{\theta} \leftarrow \boldsymbol{\theta} - \epsilon \mathbf{g}$$
̶ Repeat many, many times
̶ Iteration: gradient descent update for a single mini-batch
̶ Epoch: one run through the entire training set
̶ Data augmentation
̶ Sometimes it is easy to generate new training data
by adding realistic variations to available instances
‒ Images can be translated, rotated, flipped
‒ Adding noise
̶ Noisy training data reduces the chance of overfitting
55
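A minimal sketch of the mini-batch loop (the model is a simple linear least-squares problem so the gradient can be written in closed form; the data, batch size, learning rate, number of epochs and the noise-based augmentation are all arbitrary choices for illustration):

```python
import numpy as np

rng = np.random.default_rng(4)
N, d = 1000, 5
X = rng.normal(size=(N, d))
y = X @ rng.normal(size=d) + 0.1 * rng.normal(size=N)   # synthetic regression data

theta = np.zeros(d)
epsilon, batch_size, n_epochs = 0.01, 32, 5

for epoch in range(n_epochs):                 # one epoch = one run through the data
    order = rng.permutation(N)
    for start in range(0, N, batch_size):     # one iteration = one mini-batch update
        idx = order[start:start + batch_size]
        Xb, yb = X[idx], y[idx]
        Xb = Xb + 0.01 * rng.normal(size=Xb.shape)     # data augmentation: add noise
        grad = 2.0 * Xb.T @ (Xb @ theta - yb) / len(idx)  # gradient on the mini-batch
        theta = theta - epsilon * grad        # theta <- theta - epsilon * g
    print(epoch, np.mean((X @ theta - y) ** 2))
```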
CONVOLUTIONAL
NEURAL NETWORKS
56
LEARNING FROM IMAGES
̶ A very simplistic way of working with images is to serialize (flatten) each picture into a vector
57
CONVOLUTION
̶ A convolution is an operation on two functions
̶ At each point, the "input" function f is weighted across its entire domain by a "kernel" function g
58
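A tiny illustration of the operation using a 1-D discrete convolution (the sampled signal and the smoothing kernel are made up; np.convolve flips the kernel, matching the mathematical definition):

```python
import numpy as np

f = np.array([0.0, 1.0, 2.0, 3.0, 2.0, 1.0, 0.0])   # "input" function, sampled
g = np.array([0.25, 0.5, 0.25])                     # "kernel" function (smoothing)

# At each point, the input is weighted by the (flipped) kernel and summed
print(np.convolve(f, g, mode="valid"))
```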
CONVOLUTION = LOCAL RECEPTIVE FIELD
59
EXAMPLES OF CONVOLUTIONS
̶ A convolutional layer has few trainable weights, no matter the size of its
input
̶ This allows us to stack many convolutional layers!
61
CONVOLUTION
https://cs231n.github.io/convolutional-networks/
62
SHARED WEIGHTS AND BIASES
̶ Each hidden neuron has a bias and 5×5 weights connected to its local
receptive field
̶ We are going to use the same weights and bias for each of the 24×24
hidden neurons
̶ The convolution defines a filter or kernel, that is applied to different local
regions in the input
$$y_{j,k} = g\left(b + \sum_{l=0}^{4} \sum_{m=0}^{4} w_{l,m}\, a_{j+l,\, k+m}\right)$$

Krizhevsky, A., Sutskever, I., and Hinton, G. E. ImageNet classification with deep convolutional neural networks. In NIPS, pp. 1106–1114, 2012
65
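A minimal sketch of the shared-weight computation: one 5×5 kernel and one bias slide over a 28×28 input and produce a 24×24 feature map (assuming NumPy; the tanh activation and the random values are illustrative):

```python
import numpy as np

rng = np.random.default_rng(5)
image = rng.normal(size=(28, 28))   # e.g. an MNIST-sized input
w = rng.normal(size=(5, 5))         # the SAME 5x5 weights for every hidden neuron
b = 0.1                             # the SAME bias for every hidden neuron

feature_map = np.zeros((24, 24))
for j in range(24):
    for k in range(24):
        # y_{j,k} = g(b + sum_{l,m} w_{l,m} * a_{j+l, k+m})
        feature_map[j, k] = np.tanh(b + np.sum(w * image[j:j+5, k:k+5]))

print(feature_map.shape)            # (24, 24), yet only 5*5 + 1 = 26 trainable parameters
```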
HYPERPARAMETERS
66
LIST OF HYPERPARAMETERS
̶ Architecture:
̶ Number (and type) of layers
̶ Size of the fully connected layers
̶ Size of the convolution (and pooling) kernels
̶ Activation functions
̶ Non-standard connections (e.g. skip connections)
̶ …
̶ Parameters for stochastic gradient descent:
̶ Learning rate
̶ Learning rate updates
̶ Mini-batch size
̶ Number of iterations
̶ Momentum
̶ …
̶ Regularization parameters:
̶ Weight decay
̶ Dropout rate
̶ …
̶ …
67
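Purely as an illustration (every value below is an arbitrary placeholder, not a recommendation from the slides), such choices are often collected into a single configuration object so that experiments can be tracked and compared:

```python
# Hypothetical hyperparameter configuration for one experiment
config = {
    "architecture": {
        "conv_layers": [(32, 3), (64, 3)],   # (number of kernels, kernel size)
        "dense_layers": [128, 10],           # sizes of the fully connected layers
        "activation": "relu",
    },
    "sgd": {
        "learning_rate": 0.01,
        "lr_schedule": "step_decay",
        "batch_size": 64,
        "iterations": 10_000,
        "momentum": 0.9,
    },
    "regularization": {
        "weight_decay": 1e-4,
        "dropout_rate": 0.5,
    },
}
```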
HOW TO CHOOSE HYPERPARAMETER VALUES
̶ Deep neural networks are extremely prone to over-fitting
̶ They may also take a lot of iterations before they reach a good area of
the parameter space and start converging
̶ A few general guidelines:
̶ Larger networks have more capacity to learn, but require more:
‒ Memory
‒ Time
‒ Training data
̶ Minibatches should be as large as possible, for more robust
convergence
68
DEEP NEURAL NETS
AND ABSTRACTION
69
TRADITIONAL ML APPROACH
Input → hand-crafted feature extractor → feature-based representation → trainable classifier (supervised model)
70
EXAMPLE: SPEECH RECOGNITION
MFCC (fixed) → Gaussian Mixture Model (GMM, unsupervised) → Classifier (MLP, supervised)
71
EXAMPLE: OBJECT RECOGNITION
SIFT / HoG (fixed, low-level features) → K-Means / sparse coding + pooling (unsupervised, mid-level features) → Classifier (MLP, supervised)
72
REPRESENTATION LEARNING
Input → trainable feature extractor → trainable classifier (supervised model)
73
INSPIRATION FROM THE VISUAL CORTEX
74
LEARNING HIERARCHICAL REPRESENTATIONS
̶ It's deep if it has more than one stage of non-linear feature
transformation
75
TRAINABLE FEATURE HIERARCHIES
̶ Hierarchy of representations with increasing level of abstraction
̶ Each stage is a kind of trainable feature transform
̶ Image recognition
‒ Pixel → edge → texton → motif → part → object
̶ Text
‒ Character → word → word group → clause → sentence → story
̶ Speech
‒ Sample → spectral band → sound → ... → phone → phoneme → word
76
TRAINABLE FEATURE HIERARCHIES
77
DEEP LEARNING
̶ Each layer corresponds to a ‘‘distributed representation’’
̶ units in a layer are not mutually exclusive
̶ each unit is a separate feature of the input
̶ two units can be ‘‘active’’ at the same time
̶ They do not correspond to a partitioning (clustering) of the
inputs
̶ in clustering, an input can only belong to a single cluster
78
DEEP LEARNING ARCHITECTURES
79
TRAINING DEEP NETWORKS
̶ Purely Supervised
̶ Initialize parameters randomly
̶ Train in supervised mode, typically with SGD, using backprop
to compute gradients
̶ Used in most practical systems for speech and image
recognition
80
TRAINING DEEP NETWORKS
̶ Unsupervised, layerwise + supervised classifier on top
̶ Train each layer unsupervised, one after the other
̶ Train a supervised classifier on top, keeping the other layers
fixed
̶ Good when very few labeled samples are available
81
TRAINING DEEP NETWORKS
̶ Unsupervised, layerwise + global supervised fine-tuning
̶ Train each layer unsupervised, one after the other
̶ Add a classifier layer, and retrain the whole thing
supervised
̶ Good when label set is poor (e.g. pedestrian detection)
̶ Unsupervised pre-training often uses regularized auto-
encoders
82
ADVERSARIAL
EXAMPLES
83
ADVERSARIAL EXAMPLES
84
ADVERSARIAL EXAMPLES
85
ADVERSARIAL EXAMPLES
86
WHY DOES THIS HAPPEN?
̶ Neural networks are built out of ~linear building blocks
̶ In high dimensions, the value of a linear function can
change rapidly:
̶ With a perturbation of $\epsilon$, a linear function with weights $\mathbf{w}$ can change by up to $\epsilon \|\mathbf{w}\|_1$
̶ This value can be very large!
87
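A small numeric check of this claim (the dimension, $\epsilon$ and the weights are arbitrary): perturbing each coordinate by $\epsilon$ in the direction of $\mathrm{sign}(\mathbf{w})$ changes a linear function $\mathbf{w} \cdot \mathbf{x}$ by exactly $\epsilon \|\mathbf{w}\|_1$, which grows with the dimension even though each coordinate barely moves.

```python
import numpy as np

rng = np.random.default_rng(6)
d = 10_000                       # high-dimensional input
w = rng.normal(size=d)           # weights of a linear function f(x) = w . x
x = rng.normal(size=d)
eps = 0.01                       # tiny per-coordinate perturbation

x_adv = x + eps * np.sign(w)     # adversarial perturbation of max-norm eps
change = np.dot(w, x_adv) - np.dot(w, x)
print(change, eps * np.sum(np.abs(w)))   # both ~ eps * ||w||_1, large despite tiny eps
```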
TAKING ADVANTAGE OF ADVERSARIAL EXAMPLES
̶ Adversarial training:
̶ Including adversarial examples in the training set
̶ Good for regularization!
̶ Encourages the learned function to be locally
constant (instead of locally linear)
88
VIRTUAL ADVERSARIAL TRAINING
̶ A variant of adversarial training for semi-supervised learning
89
ADVERSARIAL TRAINING
̶ Assumption:
̶ Different classes lie on disconnected manifolds
90
APPLICATIONS: EVADING RECOGNITION
Xu, K., Zhang, G., Liu, S., Fan, Q., Sun, M., Chen, H., Chen, P. Y., Wang, Y., & Lin, X.
(2019). Adversarial T-shirt! Evading Person Detectors in A Physical World. Lecture Notes in
Computer Science (Including Subseries Lecture Notes in Artificial Intelligence and Lecture
Notes in Bioinformatics), 12350 LNCS, 665–681. https://doi.org/10.48550/arxiv.1910.11099
91
APPLICATIONS: EVADING RECOGNITION
https://www.deeplearningbook.org/
deeplearning.ai
coursera.org
udacity.com
http://neuralnetworksanddeeplearning.com/
94