MIT 6.S191: Introduction to Deep Learning
Alexander Amini
January 28, 2019
Artificial Intelligence: any technique that enables computers to mimic human behavior.
Machine Learning: the ability to learn without explicitly being programmed.
Deep Learning: extract patterns from data using neural networks.
Hierarchy of learned features: Lines & Edges → Eyes & Nose & Ears → Facial Structure
A brief history:
1958: Perceptron (learnable weights)
1986: Backpropagation (multi-layer perceptron)
1995: Deep convolutional NN (digit recognition)

Why now?
1. Big Data: larger datasets; easier collection & storage
2. Hardware: graphics processing units (GPUs); massively parallelizable
3. Software: improved techniques; new models; toolboxes
The Perceptron: Forward Propagation

Output: a non-linear activation function applied to a linear combination of the inputs:

$$\hat{y} = g\left(\sum_{i=1}^{m} x_i\, w_i\right)$$
Adding a bias term $w_0$:

$$\hat{y} = g\left(w_0 + \sum_{i=1}^{m} x_i\, w_i\right) = g\left(w_0 + X^{T} W\right)$$

where $X = \begin{bmatrix} x_1 \\ \vdots \\ x_m \end{bmatrix}$ and $W = \begin{bmatrix} w_1 \\ \vdots \\ w_m \end{bmatrix}$
Activation Functions

$$\hat{y} = g\left(w_0 + X^{T} W\right)$$

Example: the sigmoid function,

$$g(z) = \sigma(z) = \frac{1}{1 + e^{-z}}$$
The Perceptron: Example

We have: $w_0 = 1$ and $W = \begin{bmatrix} 3 \\ -2 \end{bmatrix}$

$$\hat{y} = g\left(w_0 + X^{T} W\right) = g\left(1 + \begin{bmatrix} 3 & -2 \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \end{bmatrix}\right) = g\left(1 + 3x_1 - 2x_2\right)$$

This is just a line in 2D!
[Plot: the line $1 + 3x_1 - 2x_2 = 0$ in the $(x_1, x_2)$ plane, alongside the perceptron diagram with weights 1, 3, and -2.]
Assume we have the input $X = \begin{bmatrix} -1 \\ 2 \end{bmatrix}$:

$$\hat{y} = g\big(1 + (3 \cdot -1) - (2 \cdot 2)\big) = g(-6) \approx 0.002$$
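A quick NumPy check of this computation:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x = np.array([-1.0, 2.0])                 # input from the example
w0, W = 1.0, np.array([3.0, -2.0])        # bias and weights from the example
print(sigmoid(w0 + x @ W))                # sigmoid(-6) is roughly 0.0025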
The Perceptron: Example

$$\hat{y} = g\left(1 + 3x_1 - 2x_2\right)$$

On one side of the line, $1 + 3x_1 - 2x_2 < 0$ and so $\hat{y} < 0.5$; on the other side, $1 + 3x_1 - 2x_2 > 0$ and so $\hat{y} > 0.5$.
The Perceptron: Simplified

$$z = w_0 + \sum_{j=1}^{m} x_j\, w_j, \qquad \hat{y} = g(z)$$
Multi Output Perceptron

$$z_i = w_{0,i} + \sum_{j=1}^{m} x_j\, w_{j,i}, \qquad y_i = g(z_i)$$
Single Layer Neural Network

Inputs $x_1, \dots, x_m$ feed a hidden layer $z_1, \dots, z_{d_1}$ through weights $W^{(1)}$, and the hidden layer feeds the outputs $\hat{y}_1, \hat{y}_2$ through weights $W^{(2)}$. For the second hidden unit, for example:

$$z_2 = w_{0,2}^{(1)} + \sum_{j=1}^{m} x_j\, w_{j,2}^{(1)} = w_{0,2}^{(1)} + x_1 w_{1,2}^{(1)} + x_2 w_{2,2}^{(1)} + \cdots + x_m w_{m,2}^{(1)}$$
Single Layer Neural Network in TensorFlow

from tensorflow.keras.layers import Input, Dense
from tensorflow.keras.models import Model

m, d1 = 2, 3                       # e.g. number of input features and hidden units
inputs = Input(shape=(m,))         # input layer
hidden = Dense(d1)(inputs)         # hidden layer with d1 units
outputs = Dense(2)(hidden)         # 2 output units
model = Model(inputs, outputs)
[Diagrams: the single-layer network, and a deep neural network whose stacked hidden layers z_{k,1}, ..., z_{k,n_k} sit between the inputs x_1, ..., x_m and the outputs ŷ_1, ŷ_2.]
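For deeper models, the same Dense-layer pattern simply stacks; a minimal sketch, where the layer widths and activations are illustrative choices rather than values from the lecture:

from tensorflow.keras.layers import Input, Dense
from tensorflow.keras.models import Model

# Deep network sketch: two hidden layers between the inputs and the 2 outputs.
m, n1, n2 = 3, 32, 32              # illustrative sizes
inputs = Input(shape=(m,))
h = Dense(n1, activation="relu")(inputs)
h = Dense(n2, activation="relu")(h)
outputs = Dense(2)(h)
model = Model(inputs, outputs)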
! " = Hours
spent on the
final project
Legend
Pass
Fail
! " = Hours
spent on the
final project
Legend
? Pass
4
5 Fail
Feeding the new input x = [4, 5]^T through the network gives a predicted probability of passing of 0.1, while the actual outcome is 1.
The loss of our network measures the cost incurred from incorrect predictions
The empirical loss measures the total loss over our entire dataset
The empirical loss is also known as the objective function, cost function, or empirical risk:

x        f(x)   y
[4, 5]   0.1    1
[2, 1]   0.8    0
[5, 8]   0.6    1
⋮        ⋮      ⋮

$$J(W) = \frac{1}{n} \sum_{i=1}^{n} \mathcal{L}\left(f\big(x^{(i)}; W\big),\, y^{(i)}\right)$$

(f: predicted, y: actual)
Binary Cross Entropy Loss

Cross entropy loss can be used with models that output a probability between 0 and 1.

x        f(x)   y
[4, 5]   0.1    1
[2, 1]   0.8    0
[5, 8]   0.6    1
⋮        ⋮      ⋮

$$J(W) = -\frac{1}{n} \sum_{i=1}^{n} \left[\, y^{(i)} \log\big(f\big(x^{(i)}; W\big)\big) + \big(1 - y^{(i)}\big) \log\big(1 - f\big(x^{(i)}; W\big)\big) \right]$$

(y: actual, f: predicted)
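A minimal NumPy sketch of this loss on the table's values (the clipping constant is an implementation detail, not something from the slide):

import numpy as np

def binary_cross_entropy(y, p, eps=1e-7):
    p = np.clip(p, eps, 1 - eps)                      # guard against log(0)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

y = np.array([1, 0, 1])                               # actual
p = np.array([0.1, 0.8, 0.6])                         # predicted f(x; W)
print(binary_cross_entropy(y, p))                     # roughly 1.47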
Mean Squared Error Loss

Mean squared error loss can be used with regression models that output continuous real numbers.

x        f(x)   y (final grade, %)
[4, 5]   30     90
[2, 1]   80     20
[5, 8]   85     95
⋮        ⋮      ⋮

$$J(W) = \frac{1}{n} \sum_{i=1}^{n} \left( y^{(i)} - f\big(x^{(i)}; W\big) \right)^{2}$$
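And a matching NumPy sketch for the regression table:

import numpy as np

def mean_squared_error(y, y_pred):
    return np.mean((y - y_pred) ** 2)

y = np.array([90, 20, 95])                            # actual final grades (%)
y_pred = np.array([30, 80, 85])                       # predicted
print(mean_squared_error(y, y_pred))                  # roughly 2433.3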
We want to find the network weights that achieve the lowest loss:

$$W^{*} = \underset{W}{\arg\min} \; \frac{1}{n} \sum_{i=1}^{n} \mathcal{L}\left(f\big(x^{(i)}; W\big),\, y^{(i)}\right) = \underset{W}{\arg\min} \; J(W)$$

Remember: $W = \{ W^{(0)}, W^{(1)}, \cdots \}$

[Plot: the loss surface J(w_0, w_1) over two weights w_0 and w_1.]
Loss Optimization (Gradient Descent)

1. Randomly pick an initial (w_0, w_1)
2. Compute the gradient, ∂J(W)/∂W
3. Take a small step in the opposite direction of the gradient
4. Repeat until convergence

[Plots: successive steps on the loss surface J(w_0, w_1).]
Gradient Descent

Algorithm
1. Initialize weights randomly ~ N(0, σ²)          weights = tf.random_normal(shape, stddev=sigma)
2. Loop until convergence:
3.   Compute gradient, ∂J(W)/∂W
4.   Update weights, W ← W − η ∂J(W)/∂W            weights_new = weights.assign(weights - lr * grads)
5. Return weights
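To make the loop concrete, a minimal NumPy sketch on a toy one-dimensional loss J(w) = (w - 3)²; the loss, learning rate, and iteration count are illustrative choices, not values from the lecture:

import numpy as np

def grad_J(w):
    return 2 * (w - 3)                 # dJ/dw for the toy loss J(w) = (w - 3)^2

w = np.random.normal(0, 1)             # 1. initialize weights randomly ~ N(0, sigma^2)
lr = 0.1                               #    learning rate (eta)
for _ in range(100):                   # 2. loop (here: a fixed number of steps)
    g = grad_J(w)                      # 3. compute gradient dJ(W)/dW
    w = w - lr * g                     # 4. update weights W <- W - eta * dJ/dW
print(w)                               # 5. return weights (converges to ~3.0)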
Computing Gradients: Backpropagation

[Diagram: x → z_1 → ŷ → J(W), with weight w_1 between x and z_1 and weight w_2 between z_1 and ŷ.]

How does a small change in one weight (e.g. w_2) affect the final loss J(W)? Apply the chain rule:

$$\frac{\partial J(W)}{\partial w_2} = \frac{\partial J(W)}{\partial \hat{y}} \cdot \frac{\partial \hat{y}}{\partial w_2}$$

$$\frac{\partial J(W)}{\partial w_1} = \frac{\partial J(W)}{\partial \hat{y}} \cdot \frac{\partial \hat{y}}{\partial z_1} \cdot \frac{\partial z_1}{\partial w_1}$$

Repeat this for every weight in the network using gradients from later layers.
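Frameworks apply this chain rule automatically through automatic differentiation; a minimal TensorFlow sketch on a two-weight network like the one above, where the specific values and the squared-error loss are illustrative:

import tensorflow as tf

x, y = tf.constant(2.0), tf.constant(1.0)      # input and true label (illustrative)
w1, w2 = tf.Variable(0.5), tf.Variable(-0.3)   # the two weights in the diagram

with tf.GradientTape() as tape:
    z1 = w1 * x                                # hidden pre-activation
    y_hat = w2 * tf.sigmoid(z1)                # network output
    J = (y - y_hat) ** 2                       # loss J(W)

dJ_dw1, dJ_dw2 = tape.gradient(J, [w1, w2])    # chain rule applied for us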
Neural Networks in Practice: Optimization
Training Neural Networks is Difficult
Remember: optimization through gradient descent

$$W \leftarrow W - \eta \, \frac{\partial J(W)}{\partial W}$$

The learning rate η sets the step size.
Setting the Learning Rate

Small learning rate converges slowly and gets stuck in false local minima.

[Plots: gradient descent trajectories on J(W) from an initial guess, for different learning rates.]
How to deal with this?
Idea 1: Try lots of different learning rates and see what works “just right”.

Idea 2: Do something smarter! Design an adaptive learning rate that “adapts” to the landscape.
Adaptive learning rate algorithms:
• Adam: tf.train.AdamOptimizer (Kingma et al., “Adam: A Method for Stochastic Optimization,” 2014)
• RMSProp: tf.train.RMSPropOptimizer
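A hedged usage sketch: the slide names the TF 1.x classes, and the equivalent tf.keras optimizers (TF 2.x naming) are shown below; the model, layer sizes, and learning rate are illustrative.

import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Dense(32, activation="relu", input_shape=(10,)),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
# Adaptive-learning-rate optimizer; tf.keras.optimizers.RMSprop can be swapped in the same way.
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-3),
              loss="binary_crossentropy")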
Gradient Descent

Algorithm
1. Initialize weights randomly ~ N(0, σ²)
2. Loop until convergence:
3.   Compute gradient, ∂J(W)/∂W
4.   Update weights, W ← W − η ∂J(W)/∂W
5. Return weights

Computing the gradient over the entire dataset at every step can be very computationally intensive!
Stochastic Gradient Descent

Algorithm
1. Initialize weights randomly ~ N(0, σ²)
2. Loop until convergence:
3.   Pick single data point i
4.   Compute gradient, ∂J_i(W)/∂W
5.   Update weights, W ← W − η ∂J_i(W)/∂W
6. Return weights

Easy to compute, but very noisy (stochastic)!
Stochastic Gradient Descent
Algorithm
1. Initialize weights randomly ~ N(0, σ²)
2. Loop until convergence:
3.   Pick batch of B data points
4.   Compute gradient, ∂J(W)/∂W = (1/B) Σ_{k=1}^{B} ∂J_k(W)/∂W
5.   Update weights, W ← W − η ∂J(W)/∂W
6. Return weights

Fast to compute and a much better estimate of the true gradient!
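A minimal NumPy sketch of mini-batch SGD on a toy linear-regression problem; the synthetic data, batch size B, and learning rate are illustrative choices:

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))                     # toy dataset
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=1000)

w = rng.normal(size=3)                             # initialize weights randomly
lr, B = 0.1, 32                                    # learning rate and batch size
for _ in range(500):
    idx = rng.integers(0, len(X), size=B)          # pick a batch of B data points
    Xb, yb = X[idx], y[idx]
    grad = 2 * Xb.T @ (Xb @ w - yb) / B            # average of the B per-point gradients
    w -= lr * grad                                 # update weights
print(w)                                           # close to [1.0, -2.0, 0.5]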
Mini-batches while training: the smoother, more accurate gradient estimate allows for larger learning rates, and batches can be parallelized (e.g. on GPUs) for significant speedups.
Regularization

What is it? A technique that constrains our optimization problem to discourage complex models.

Regularization 1: Dropout

[Diagram: during training, some hidden units in each layer are randomly dropped, so the network cannot rely on any single node.]
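A minimal tf.keras sketch of dropout; the layer widths and the 0.5 drop rate are illustrative choices, not values recovered from this extraction:

import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation="relu", input_shape=(10,)),
    tf.keras.layers.Dropout(0.5),      # randomly zero ~50% of activations during training
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dropout(0.5),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])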
Regularization 2: Early Stopping

• Stop training before we have a chance to overfit

[Plot: training and testing loss vs. training iterations (legend: Testing, Training). The training loss keeps decreasing while the testing loss eventually rises; stop at the point between under-fitting and over-fitting.]
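In tf.keras this policy is available as a callback; a hedged, self-contained sketch with synthetic data, where the patience and validation split are illustrative:

import numpy as np
import tensorflow as tf

x = np.random.normal(size=(500, 10)).astype("float32")       # synthetic data
y = (x[:, 0] > 0).astype("float32")

model = tf.keras.Sequential([
    tf.keras.layers.Dense(32, activation="relu", input_shape=(10,)),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy")

# Stop once the held-out ("testing") loss stops improving, keeping the best weights.
early_stop = tf.keras.callbacks.EarlyStopping(monitor="val_loss", patience=5,
                                              restore_best_weights=True)
model.fit(x, y, validation_split=0.2, epochs=100, callbacks=[early_stop])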
Core Foundation Review

[Diagrams: the perceptron, dense layers, and a deep network, reviewed from Lecture 1.]

Example classification output for a portrait image: Lincoln 0.8, Washington 0.1, Jefferson 0.05, Obama 0.05.
Problems? How can we use spatial structure in the input to inform the architecture of the network?

The convolution operation: we slide the 3x3 filter over the input image, element-wise multiply, and add the outputs.

[Example: for one placement, the element-wise multiply-and-add of the filter and the image patch equals 9.]
Producing a feature map with a 4x4 filter (a matrix of weights $w_{ij}$): 1) apply a window of weights, 2) compute a linear combination, 3) activate with a non-linear function.

For neuron (p, q) in the hidden layer:

$$\sum_{i=1}^{4} \sum_{j=1}^{4} w_{ij}\, x_{i+p,\, j+q} + b$$
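A minimal NumPy sketch of this computation for a single output location (p, q), using ReLU as the non-linear function, which is an illustrative choice:

import numpy as np

def conv_at(image, W, b, p, q):
    patch = image[p:p + 4, q:q + 4]       # 1) window of the input at (p, q)
    z = np.sum(W * patch) + b             # 2) linear combination of weights and pixels
    return max(z, 0.0)                    # 3) non-linear activation (ReLU here)

image = np.random.rand(8, 8)              # toy input image
W, b = np.random.rand(4, 4), 0.1          # 4x4 filter of weights and a bias
feature_map = np.array([[conv_at(image, W, b, p, q) for q in range(5)]
                        for p in range(5)])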
CNNs: Spatial Arrangement of Output Volume

Layer dimensions: h × w × d, where h and w are the spatial dimensions (height, width) and d (depth) = number of filters.

Stride: filter step size.

Receptive field: locations in the input image that a node is path-connected to.
Pooling: 1) reduced dimensionality, 2) spatial invariance.
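A hedged tf.keras sketch that stacks convolution and pooling; the filter counts, kernel sizes, and 28x28 input shape are illustrative choices:

import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Conv2D(32, kernel_size=3, activation="relu",
                           input_shape=(28, 28, 1)),
    tf.keras.layers.MaxPool2D(pool_size=2),   # pooling: downsample the feature maps
    tf.keras.layers.Conv2D(64, kernel_size=3, activation="relu"),
    tf.keras.layers.MaxPool2D(pool_size=2),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(10, activation="softmax"),
])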
Classification task: produce a list of object categories present in image. 1000 categories.
“Top 5 error”: rate at which the model does not output correct label in top 5 predictions
Other tasks include:
single-object localization, object detection from video/image, scene classification, scene parsing
ImageNet Challenge: classification results (top-5 error, %)

2012: AlexNet, 16.4
2013: ZFNet (8 layers, more filters), 11.7
2014: VGG (19 layers), 7.3
2014: GoogLeNet (“Inception” modules; 22 layers, 5 million parameters), 6.7
2015: ResNet (152 layers), 3.57
Human performance: 5.1

[Chart: over the same years, the top-5 error falls while the number of layers in the winning networks grows.]
Datasets:
• MNIST: handwritten digits
• CIFAR-10: airplane, automobile, bird, cat, deer, dog, frog, horse, ship, truck
• ImageNet: 22K categories, 14M images
• Places: natural scenes
Deep Learning for Computer Vision: Impact
How can we use latent distributions to create fair and representative datasets?
Can we learn the true explanatory factors, e.g. latent variables, from only observed data?
! "
“Encoder” learns mapping from the data, !, to a low-dimensional latent space, "
6.S191 Introduction to Deep Learning
1/29/19
introtodeeplearning.com
Autoencoders: background
How can we learn this latent space?
Train the model to use these features to reconstruct the original data
! " !#
! " !#
! " !#
! " !#
$
! " !#
%
$
! " !#
%
standard deviation
vector
$
! " !#
%
$
! " !#
%
$
! " !#
%
e.g. ! − !# 5
$
! " !#
%
$
! " !#
%
" ' = ) * = 0, - . = 1
" ' = ) * = 0, - . = 1
$
! " !#
%
$
! " !#
%
⇒ ! = " + #⨀.
where .~%(0,1)
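A minimal TensorFlow sketch of this sampling step; parameterizing the encoder output as log σ is a common implementation choice, not something stated here:

import tensorflow as tf

def sample_latent(mu, log_sigma):
    eps = tf.random.normal(shape=tf.shape(mu))   # epsilon ~ N(0, 1): the stochastic part
    return mu + tf.exp(log_sigma) * eps          # z = mu + sigma (elementwise) * eps

mu = tf.zeros((1, 2))                            # illustrative 2-D latent
log_sigma = tf.zeros((1, 2))
z = sample_latent(mu, log_sigma)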
Reparameterizing the sampling node

Original form: $z \sim q_{\phi}(z \mid x)$. Here z is a stochastic node, so backpropagation cannot flow through the sampling operation to the encoder parameters φ.

Reparameterized form: $z = g(\phi, x, \varepsilon)$ with $\varepsilon \sim \mathcal{N}(0, 1)$. Now z is a deterministic node, the gradients $\partial f / \partial z$ and $\partial f / \partial \phi$ can flow back through it, and the randomness is isolated in the stochastic node ε.
VAEs: Latent Perturbation

Perturbing a single latent variable (e.g. one encoding head pose) while holding the others fixed reveals disentanglement of the latent factors.

Example: Google BeatBlender.
! " !#
! " !#
! " !#
! " !#
! " !#
#$%&'
noise ! " “fake” sample from the
training distribution
$%&'(
data from fakes created by the generator.
$)'*&
noise ! "
Generator
Fake data
Discriminator Generator
Fake data
Discriminator Generator
Discriminator Generator
! "#$% = 1
Discriminator Generator
! "#$% = 1
Discriminator Generator
! "#$% = 1
Discriminator Generator
! "#$% = 1
Discriminator Generator
! "#$% = 1
Discriminator Generator
! "#$% = 1
Discriminator Generator
! "#$% = 1
Discriminator Generator
! "#$% = 1
Discriminator Generator
! "#$% = 1
Discriminator Generator
! "#$% = 1
Discriminator Generator
! "#$% = 1
Discriminator Generator
! "#$% = 1
Discriminator Generator
! "#$% = 1
Discriminator Generator
! "#$% = 1
Discriminator Generator
! "#$% = 1
$$\min_{G} \max_{D} \; \mathbb{E}_{x \sim p_{\text{data}}}\big[\log D(x)\big] + \mathbb{E}_{z \sim p(z)}\big[\log\big(1 - D(G(z))\big)\big]$$
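As a hedged sketch, both terms are commonly implemented with binary cross entropy on the discriminator's outputs D(x) and D(G(z)); the helper names are illustrative, and the generator loss below uses the common non-saturating variant rather than the literal min-max term:

import tensorflow as tf

bce = tf.keras.losses.BinaryCrossentropy()       # expects probabilities in [0, 1]

def discriminator_loss(d_real, d_fake):
    # max over D: log D(x) on real samples + log(1 - D(G(z))) on fakes
    return bce(tf.ones_like(d_real), d_real) + bce(tf.zeros_like(d_fake), d_fake)

def generator_loss(d_fake):
    # train G to make D label fakes as real (non-saturating form of min log(1 - D(G(z))))
    return bce(tf.ones_like(d_fake), d_fake)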
Project Proposals!

Judging and Awards: Jan Kautz, VP of Research, Learning and Perception

Pizza Celebration!
Data (signals, images, sensors, …) → Decision (prediction, detection, action, …)
Caveats: Capacity of Deep Neural Networks

Modern deep networks can perfectly fit to random data.

[Chart (Zhang et al., ICLR 2017): as labels are corrupted from the original labels toward completely random, training-set accuracy stays near 100% while testing-set accuracy collapses toward 0%.]
Neural Networks as Function Approximators
Neural networks are excellent function approximators
Remember: we train our networks with gradient descent,

$$W \leftarrow W - \eta \, \frac{\partial J(W, x, y)}{\partial W}$$

“How does a small change in weights decrease our loss?” Here the image x and the true label y are fixed.
Adversarial Image: modify the image to increase the error,

$$x \leftarrow x + \eta \, \frac{\partial J(W, x, y)}{\partial x}$$

“How does a small change in the input increase our loss?” Here the weights W and the true label y are fixed.
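A hedged sketch of one such step with tf.GradientTape; model, image, and label are assumed to exist (with the model outputting class probabilities), and eta is an illustrative step size:

import tensorflow as tf

def adversarial_step(model, image, label, eta=0.01):
    loss_fn = tf.keras.losses.SparseCategoricalCrossentropy()
    image = tf.convert_to_tensor(image)
    with tf.GradientTape() as tape:
        tape.watch(image)                        # differentiate w.r.t. the input, not W
        loss = loss_fn(label, model(image))      # weights W and true label y stay fixed
    grad = tape.gradient(loss, image)
    return image + eta * grad                    # small input change that increases the loss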
ℙ(cat) or ℙ(dog)?
ℙ(cat) = 0.2, ℙ(dog) = 0.8
[Diagram: AutoML, where a sampled network architecture, predicted by an RNN controller, is trained on the training data and its performance guides the controller. Offered as a service, e.g. Google Cloud AutoML.]
AutoML Spawns a Powerful Idea
• Design an AI algorithm that can build new models capable of solving a task
• Reduces the need for experienced engineers to design the networks
• Makes deep learning more accessible to the public

Connection to Artificial General Intelligence: the ability to intelligently reason about how we learn.