
Introduction to Deep Learning

MIT 6.S191
Alexander Amini
January 28, 2019

The Rise of Deep Learning

6.S191 Introduction to Deep Learning


1/28/19
introtodeeplearning.com
What is Deep Learning?

ARTIFICIAL INTELLIGENCE: Any technique that enables computers to mimic human behavior
MACHINE LEARNING: Ability to learn without explicitly being programmed
DEEP LEARNING: Extract patterns from data using neural networks

6.S191 Introduction to Deep Learning


1/28/19
introtodeeplearning.com
Why Deep Learning and Why Now?
Why Deep Learning?
Hand-engineered features are time-consuming, brittle, and not scalable in practice
Can we learn the underlying features directly from data?

Low Level Features Mid Level Features High Level Features

Lines & Edges Eyes & Nose & Ears Facial Structure

6.S191 Introduction to Deep Learning


1/28/19
introtodeeplearning.com
Why Now?
Neural Networks date back decades, so why the resurgence?
1952 – Stochastic Gradient Descent
1958 – Perceptron: learnable weights
1986 – Backpropagation: multi-layer perceptron
1995 – Deep Convolutional NN: digit recognition

1. Big Data: larger datasets; easier collection & storage
2. Hardware: Graphics Processing Units (GPUs); massively parallelizable
3. Software: improved techniques; new models; toolboxes

6.S191 Introduction to Deep Learning


1/28/19
introtodeeplearning.com
The Perceptron
The structural building block of deep learning
The Perceptron: Forward Propagation

The perceptron takes a linear combination of its inputs and passes it through a non-linear activation function:

\hat{y} = g\left( \sum_{i=1}^{m} x_i w_i \right)

Inputs → Weights → Sum → Non-Linearity → Output

6.S191 Introduction to Deep Learning


1/28/19
introtodeeplearning.com
The Perceptron: Forward Propagation

Adding a bias term w_0:

\hat{y} = g\left( w_0 + \sum_{i=1}^{m} x_i w_i \right)

Inputs → Weights → Sum → Non-Linearity → Output

6.S191 Introduction to Deep Learning


1/28/19
introtodeeplearning.com
The Perceptron: Forward Propagation

\hat{y} = g\left( w_0 + \sum_{i=1}^{m} x_i w_i \right) = g\left( w_0 + X^T W \right)

where X = [x_1, \dots, x_m]^T and W = [w_1, \dots, w_m]^T

Inputs → Weights → Sum → Non-Linearity → Output

6.S191 Introduction to Deep Learning


1/28/19
introtodeeplearning.com
The Perceptron: Forward Propagation

Activation Functions

\hat{y} = g\left( w_0 + X^T W \right)

Example: sigmoid function

g(z) = \sigma(z) = \frac{1}{1 + e^{-z}}

Inputs → Weights → Sum → Non-Linearity → Output
6.S191 Introduction to Deep Learning
1/28/19
introtodeeplearning.com
Common Activation Functions
Sigmoid Function:
g(z) = \frac{1}{1 + e^{-z}},   g'(z) = g(z)\,(1 - g(z))        tf.nn.sigmoid(z)

Hyperbolic Tangent:
g(z) = \frac{e^{z} - e^{-z}}{e^{z} + e^{-z}},   g'(z) = 1 - g(z)^2        tf.nn.tanh(z)

Rectified Linear Unit (ReLU):
g(z) = \max(0, z),   g'(z) = 1 if z > 0, 0 otherwise        tf.nn.relu(z)

NOTE: All activation functions are non-linear


6.S191 Introduction to Deep Learning
1/28/19
introtodeeplearning.com
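A minimal sketch of these three activations applied to the same values (assumes TensorFlow 2.x with eager execution; the input values are illustrative):

import tensorflow as tf

z = tf.constant([-2.0, 0.0, 2.0])
print(tf.nn.sigmoid(z).numpy())   # smooth squashing to (0, 1)
print(tf.nn.tanh(z).numpy())      # squashing to (-1, 1)
print(tf.nn.relu(z).numpy())      # zeroes out negative values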
Importance of Activation Functions
The purpose of activation functions is to introduce non-linearities into the network

What if we wanted to build a Neural Network to


distinguish green vs red points?

6.S191 Introduction to Deep Learning


1/28/19
introtodeeplearning.com
Importance of Activation Functions
The purpose of activation functions is to introduce non-linearities into the network

Linear Activation functions produce linear


decisions no matter the network size

6.S191 Introduction to Deep Learning


1/28/19
introtodeeplearning.com
Importance of Activation Functions
The purpose of activation functions is to introduce non-linearities into the network

Linear Activation functions produce linear Non-linearities allow us to approximate


decisions no matter the network size arbitrarily complex functions

6.S191 Introduction to Deep Learning


1/28/19
introtodeeplearning.com
The Perceptron: Example

We have: w_0 = 1 and W = [3, -2]^T

\hat{y} = g\left( w_0 + X^T W \right) = g\left( 1 + 3x_1 - 2x_2 \right)

This is just a line in 2D!

6.S191 Introduction to Deep Learning


1/28/19
introtodeeplearning.com
The Perceptron: Example
\hat{y} = g( 1 + 3x_1 - 2x_2 )

The decision boundary is the line 1 + 3x_1 - 2x_2 = 0 in the (x_1, x_2) plane.

6.S191 Introduction to Deep Learning


1/28/19
introtodeeplearning.com
The Perceptron: Example
\hat{y} = g( 1 + 3x_1 - 2x_2 )

Assume we have the input X = [-1, 2]^T:

\hat{y} = g\left( 1 + 3 \cdot (-1) - 2 \cdot 2 \right) = g(-6) \approx 0.002
6.S191 Introduction to Deep Learning
1/28/19
introtodeeplearning.com
The Perceptron: Example
\hat{y} = g( 1 + 3x_1 - 2x_2 )

On one side of the decision boundary, 1 + 3x_1 - 2x_2 < 0 and \hat{y} < 0.5; on the other side, 1 + 3x_1 - 2x_2 > 0 and \hat{y} > 0.5.

6.S191 Introduction to Deep Learning


1/28/19
introtodeeplearning.com
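As a quick sanity check, here is a minimal NumPy sketch of the perceptron from this example — bias w_0 = 1, weights W = [3, -2], sigmoid activation — evaluated at the input x = [-1, 2] (variable names are illustrative):

import numpy as np

def sigmoid(z):
    # g(z) = 1 / (1 + e^{-z})
    return 1.0 / (1.0 + np.exp(-z))

w0 = 1.0                      # bias
W = np.array([3.0, -2.0])     # weights
x = np.array([-1.0, 2.0])     # input from the example

z = w0 + x @ W                # 1 + 3*(-1) - 2*2 = -6
y_hat = sigmoid(z)            # ≈ 0.002
print(z, y_hat)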
Building Neural Networks with Perceptrons
The Perceptron: Simplified

1
!%

&" !"

!$ Σ *)
&$
!#

&#

Inputs Weights Sum Non-Linearity Output

6.S191 Introduction to Deep Learning


1/28/19
introtodeeplearning.com
The Perceptron: Simplified

\hat{y} = g(z), \quad z = w_0 + \sum_{j=1}^{m} x_j w_j
6.S191 Introduction to Deep Learning
1/28/19
introtodeeplearning.com
Multi Output Perceptron

y_i = g(z_i), \quad z_i = w_{0,i} + \sum_{j=1}^{m} x_j w_{j,i}
6.S191 Introduction to Deep Learning
1/28/19
introtodeeplearning.com
Single Layer Neural Network

Inputs → Hidden layer → Final Output

z_i = w_{0,i}^{(1)} + \sum_{j=1}^{m} x_j w_{j,i}^{(1)}

\hat{y}_i = g\left( w_{0,i}^{(2)} + \sum_{j=1}^{d_1} g(z_j)\, w_{j,i}^{(2)} \right)
6.S191 Introduction to Deep Learning
1/28/19
introtodeeplearning.com
Single Layer Neural Network

z_2 = w_{0,2}^{(1)} + \sum_{j=1}^{m} x_j w_{j,2}^{(1)} = w_{0,2}^{(1)} + x_1 w_{1,2}^{(1)} + x_2 w_{2,2}^{(1)} + x_3 w_{3,2}^{(1)}
6.S191 Introduction to Deep Learning
1/28/19
introtodeeplearning.com
Multi Output Perceptron
from tensorflow.keras.layers import Input, Dense
from tensorflow.keras.models import Model

inputs = Input(shape=(m,))        # m input features
hidden = Dense(d1)(inputs)        # hidden layer with d1 units
outputs = Dense(2)(hidden)        # 2 output units
model = Model(inputs, outputs)

Inputs → Hidden → Output

6.S191 Introduction to Deep Learning


1/28/19
introtodeeplearning.com
Deep Neural Network

Inputs → Hidden layers → Output

z_{k,i} = w_{0,i}^{(k)} + \sum_{j=1}^{n_{k-1}} g\left(z_{k-1,j}\right) w_{j,i}^{(k)}
6.S191 Introduction to Deep Learning
1/28/19
introtodeeplearning.com
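A minimal sketch of stacking Dense layers into a deep network with tf.keras (the layer sizes and the two-unit output are illustrative placeholders, following the slide's notation):

import tensorflow as tf
from tensorflow.keras import layers, Model

m, n1, n2 = 3, 16, 16                              # illustrative: m inputs, two hidden layers

inputs = layers.Input(shape=(m,))
h1 = layers.Dense(n1, activation="relu")(inputs)   # hidden layer 1
h2 = layers.Dense(n2, activation="relu")(h1)       # hidden layer 2
outputs = layers.Dense(2)(h2)                      # 2 outputs, as in the slide
model = Model(inputs, outputs)
model.summary()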
Applying Neural Networks
Example Problem

Will I pass this class?

Let’s start with a simple two feature model

x_1 = Number of lectures you attend
x_2 = Hours spent on the final project

6.S191 Introduction to Deep Learning


1/28/19
introtodeeplearning.com
Example Problem: Will I pass this class?

(Scatter plot: x_1 = Number of lectures you attend on the horizontal axis, x_2 = Hours spent on the final project on the vertical axis. Legend: Pass / Fail.)


6.S191 Introduction to Deep Learning
1/28/19
introtodeeplearning.com
Example Problem: Will I pass this class?

(Same scatter plot, with a new query point at x_1 = 4 lectures, x_2 = 5 hours: will this student pass?)


6.S191 Introduction to Deep Learning
1/28/19
introtodeeplearning.com
Example Problem: Will I pass this class?

Feed x = [4, 5] through the network → Predicted: 0.1

6.S191 Introduction to Deep Learning


1/28/19
introtodeeplearning.com
Example Problem: Will I pass this class?

Feed x = [4, 5] through the network → Predicted: 0.1, Actual: 1

6.S191 Introduction to Deep Learning


1/28/19
introtodeeplearning.com
Quantifying Loss

The loss of our network measures the cost incurred from incorrect predictions

x = [4, 5] → Predicted: 0.1, Actual: 1

\mathcal{L}\left( f\left(x^{(i)}; W\right),\; y^{(i)} \right)
     (Predicted)          (Actual)
6.S191 Introduction to Deep Learning
1/28/19
introtodeeplearning.com
Empirical Loss

The empirical loss measures the total loss over our entire dataset

Example predictions f(x) vs. labels y: (4, 5) → 0.1, label 1;  (2, 1) → 0.8, label 0;  (5, 8) → 0.6, label 1;  …

J(W) = \frac{1}{n} \sum_{i=1}^{n} \mathcal{L}\left( f\left(x^{(i)}; W\right),\; y^{(i)} \right)

Also known as: objective function, cost function, empirical risk
6.S191 Introduction to Deep Learning
1/28/19
introtodeeplearning.com
Binary Cross Entropy Loss

Cross entropy loss can be used with models that output a probability between 0 and 1
Example predictions f(x) vs. labels y: (4, 5) → 0.1, label 1;  (2, 1) → 0.8, label 0;  (5, 8) → 0.6, label 1;  …

J(W) = -\frac{1}{n} \sum_{i=1}^{n} \left[ y^{(i)} \log f\left(x^{(i)}; W\right) + \left(1 - y^{(i)}\right) \log\left(1 - f\left(x^{(i)}; W\right)\right) \right]

loss = tf.reduce_mean( tf.nn.softmax_cross_entropy_with_logits(labels=model.y, logits=model.pred) )


Mean Squared Error Loss

Mean squared error loss can be used with regression models that output continuous real numbers
Example: predicting final grades (percentage) — predictions f(x) vs. actual grades y: 30 vs 90; 80 vs 20; 85 vs 95; …

J(W) = \frac{1}{n} \sum_{i=1}^{n} \left( y^{(i)} - f\left(x^{(i)}; W\right) \right)^2

loss = tf.reduce_mean( tf.square( tf.subtract(model.y, model.pred) ) )
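A minimal sketch of both losses computed with tf.keras on the toy numbers above (assumes TensorFlow 2.x; the arrays are the example values from the slides):

import tensorflow as tf

# Binary cross entropy on probabilities in [0, 1]
y_true = tf.constant([1.0, 0.0, 1.0])
y_pred = tf.constant([0.1, 0.8, 0.6])
bce = tf.keras.losses.BinaryCrossentropy()
print(bce(y_true, y_pred).numpy())

# Mean squared error on continuous outputs (final grades)
grades_true = tf.constant([90.0, 20.0, 95.0])
grades_pred = tf.constant([30.0, 80.0, 85.0])
mse = tf.keras.losses.MeanSquaredError()
print(mse(grades_true, grades_pred).numpy())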


Training Neural Networks
Loss Optimization

We want to find the network weights that achieve the lowest loss

W^* = \arg\min_{W} \frac{1}{n} \sum_{i=1}^{n} \mathcal{L}\left( f\left(x^{(i)}; W\right),\; y^{(i)} \right)

W^* = \arg\min_{W} J(W)

6.S191 Introduction to Deep Learning


1/28/19
introtodeeplearning.com
Loss Optimization

We want to find the network weights that achieve the lowest loss

W^* = \arg\min_{W} \frac{1}{n} \sum_{i=1}^{n} \mathcal{L}\left( f\left(x^{(i)}; W\right),\; y^{(i)} \right)

W^* = \arg\min_{W} J(W)

Remember: W = \left\{ W^{(0)}, W^{(1)}, \cdots \right\}

6.S191 Introduction to Deep Learning


1/28/19
introtodeeplearning.com
Loss Optimization
W^* = \arg\min_{W} J(W)

Remember: our loss is a function of the network weights!
(Loss landscape J(w_0, w_1) plotted over the weights w_0 and w_1.)
Loss Optimization
Randomly pick an initial (w_0, w_1).
Loss Optimization
Compute the gradient, \frac{\partial J(W)}{\partial W}.
Loss Optimization
Take small step in opposite direction of gradient

Gradient Descent
Repeat until convergence

Gradient Descent

Algorithm
1. Initialize weights randomly ~ N(0, σ²)          weights = tf.random_normal(shape, stddev=sigma)
2. Loop until convergence:
3.     Compute gradient, ∂J(W)/∂W                  grads = tf.gradients(ys=loss, xs=weights)
4.     Update weights, W ← W − η ∂J(W)/∂W          weights_new = weights.assign(weights - lr * grads)
5. Return weights

6.S191 Introduction to Deep Learning


1/28/19
introtodeeplearning.com
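The snippets above use the TensorFlow 1.x graph API from the lecture; a minimal TensorFlow 2.x sketch of the same loop using tf.GradientTape (the data and loss here are illustrative placeholders, fitting y = 3x − 2):

import tensorflow as tf

x = tf.constant([[1.0], [2.0], [3.0]])
y = tf.constant([[1.0], [4.0], [7.0]])

weights = tf.Variable(tf.random.normal([1, 1], stddev=0.1))   # 1. initialize randomly
bias = tf.Variable(tf.zeros([1]))
lr = 0.05

for step in range(500):                                       # 2. loop until convergence
    with tf.GradientTape() as tape:
        preds = x @ weights + bias
        loss = tf.reduce_mean(tf.square(y - preds))            # J(W)
    grads = tape.gradient(loss, [weights, bias])               # 3. compute gradient
    weights.assign_sub(lr * grads[0])                          # 4. W <- W - lr * dJ/dW
    bias.assign_sub(lr * grads[1])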
Computing Gradients: Backpropagation

x → z_1 → \hat{y} → J(W)    (with weights w_1 and w_2 along the path)

How does a small change in one weight (e.g., w_2) affect the final loss J(W)?

6.S191 Introduction to Deep Learning


1/28/19
introtodeeplearning.com
Computing Gradients: Backpropagation

x → z_1 → \hat{y} → J(W)

\frac{\partial J(W)}{\partial w_2} = \;?

Let's use the chain rule!

6.S191 Introduction to Deep Learning


1/28/19
introtodeeplearning.com
Computing Gradients: Backpropagation

x → z_1 → \hat{y} → J(W)

\frac{\partial J(W)}{\partial w_2} = \frac{\partial J(W)}{\partial \hat{y}} \cdot \frac{\partial \hat{y}}{\partial w_2}

6.S191 Introduction to Deep Learning


1/28/19
introtodeeplearning.com
Computing Gradients: Backpropagation

x → z_1 → \hat{y} → J(W)

\frac{\partial J(W)}{\partial w_1} = \frac{\partial J(W)}{\partial \hat{y}} \cdot \frac{\partial \hat{y}}{\partial w_1}

Apply the chain rule again to expand \frac{\partial \hat{y}}{\partial w_1}.

6.S191 Introduction to Deep Learning


1/28/19
introtodeeplearning.com
Computing Gradients: Backpropagation

x → z_1 → \hat{y} → J(W)

\frac{\partial J(W)}{\partial w_1} = \frac{\partial J(W)}{\partial \hat{y}} \cdot \frac{\partial \hat{y}}{\partial z_1} \cdot \frac{\partial z_1}{\partial w_1}

6.S191 Introduction to Deep Learning


1/28/19
introtodeeplearning.com
Computing Gradients: Backpropagation

x → z_1 → \hat{y} → J(W)

\frac{\partial J(W)}{\partial w_1} = \frac{\partial J(W)}{\partial \hat{y}} \cdot \frac{\partial \hat{y}}{\partial z_1} \cdot \frac{\partial z_1}{\partial w_1}

Repeat this for every weight in the network using gradients from later layers
6.S191 Introduction to Deep Learning
1/28/19
introtodeeplearning.com
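A minimal sketch of this chain rule on a tiny network ŷ = w_2 · z_1 with z_1 = w_1 · x and loss J = (y − ŷ)², computed by hand and checked against tf.GradientTape (the values are illustrative):

import tensorflow as tf

x, y = 2.0, 1.0
w1 = tf.Variable(0.5)
w2 = tf.Variable(-1.0)

with tf.GradientTape() as tape:
    z1 = w1 * x           # hidden value (no non-linearity, to keep the math short)
    y_hat = w2 * z1       # output
    J = (y - y_hat) ** 2  # loss

dJ_dw1, dJ_dw2 = tape.gradient(J, [w1, w2])

# Chain rule by hand:
#   dJ/dy_hat = -2 (y - y_hat)
#   dJ/dw2    = dJ/dy_hat * z1
#   dJ/dw1    = dJ/dy_hat * w2 * x
dJ_dyhat = -2 * (y - (w2 * w1 * x))
print(dJ_dw2.numpy(), (dJ_dyhat * w1 * x).numpy())
print(dJ_dw1.numpy(), (dJ_dyhat * w2 * x).numpy())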
Neural Networks in Practice:
Optimization
Training Neural Networks is Difficult

“Visualizing the loss landscape


of neural nets”. Dec 2017.
6.S191 Introduction to Deep Learning
1/28/19
introtodeeplearning.com
Loss Functions Can Be Difficult to Optimize

Remember:
Optimization through gradient descent

W \leftarrow W - \eta \frac{\partial J(W)}{\partial W}

6.S191 Introduction to Deep Learning


1/28/19
introtodeeplearning.com
Loss Functions Can Be Difficult to Optimize

Remember:
Optimization through gradient descent

W \leftarrow W - \eta \frac{\partial J(W)}{\partial W}

How can we set the


learning rate?

6.S191 Introduction to Deep Learning


1/28/19
introtodeeplearning.com
Setting the Learning Rate

Small learning rates converge slowly and get stuck in false local minima

(Plot: loss J(W) over W, starting from an initial guess.)
6.S191 Introduction to Deep Learning
1/28/19
introtodeeplearning.com
Setting the Learning Rate

Large learning rates overshoot, become unstable and diverge

(Plot: loss J(W) over W, starting from an initial guess.)
6.S191 Introduction to Deep Learning
1/28/19
introtodeeplearning.com
Setting the Learning Rate

Stable learning rates converge smoothly and avoid local minima

(Plot: loss J(W) over W, starting from an initial guess.)
6.S191 Introduction to Deep Learning
1/28/19
introtodeeplearning.com
How to deal with this?

Idea 1:
Try lots of different learning rates and see what works “just right”

6.S191 Introduction to Deep Learning


1/28/19
introtodeeplearning.com
How to deal with this?

Idea 1:
Try lots of different learning rates and see what works “just right”

Idea 2:
Do something smarter!
Design an adaptive learning rate that “adapts” to the landscape

6.S191 Introduction to Deep Learning


1/28/19
introtodeeplearning.com
Adaptive Learning Rates

• Learning rates are no longer fixed


• Can be made larger or smaller depending on:
• how large gradient is
• how fast learning is happening
• size of particular weights
• etc...

6.S191 Introduction to Deep Learning


1/28/19
introtodeeplearning.com
Adaptive Learning Rate Algorithms
• Momentum     tf.train.MomentumOptimizer     Qian et al. "On the momentum term in gradient descent learning algorithms." 1999.
• Adagrad      tf.train.AdagradOptimizer      Duchi et al. "Adaptive Subgradient Methods for Online Learning and Stochastic Optimization." 2011.
• Adadelta     tf.train.AdadeltaOptimizer     Zeiler et al. "ADADELTA: An Adaptive Learning Rate Method." 2012.
• Adam         tf.train.AdamOptimizer         Kingma et al. "Adam: A Method for Stochastic Optimization." 2014.
• RMSProp      tf.train.RMSPropOptimizer

Additional details: http://ruder.io/optimizing-gradient-descent/


6.S191 Introduction to Deep Learning
1/28/19
introtodeeplearning.com
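The classes above are the TensorFlow 1.x optimizers; a minimal sketch of the corresponding tf.keras optimizers in TensorFlow 2.x (the learning rates shown are common defaults, not values from the lecture):

import tensorflow as tf

sgd_momentum = tf.keras.optimizers.SGD(learning_rate=0.01, momentum=0.9)
adagrad = tf.keras.optimizers.Adagrad(learning_rate=0.01)
adadelta = tf.keras.optimizers.Adadelta(learning_rate=1.0)
adam = tf.keras.optimizers.Adam(learning_rate=0.001)
rmsprop = tf.keras.optimizers.RMSprop(learning_rate=0.001)

# e.g. model.compile(optimizer=adam, loss="binary_crossentropy")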
Neural Networks in Practice:
Mini-batches
Gradient Descent

Algorithm
1. Initialize weights randomly ~ N(0, σ²)
2. Loop until convergence:
3.     Compute gradient, ∂J(W)/∂W
4.     Update weights, W ← W − η ∂J(W)/∂W
5. Return weights

6.S191 Introduction to Deep Learning


1/28/19
introtodeeplearning.com
Gradient Descent

Algorithm
1. Initialize weights randomly ~ N(0, σ²)
2. Loop until convergence:
3.     Compute gradient, ∂J(W)/∂W
4.     Update weights, W ← W − η ∂J(W)/∂W
5. Return weights
Computing this gradient over the entire dataset can be very expensive!

6.S191 Introduction to Deep Learning


1/28/19
introtodeeplearning.com
Stochastic Gradient Descent

Algorithm
1. Initialize weights randomly ~ N(0, σ²)
2. Loop until convergence:
3.     Pick single data point i
4.     Compute gradient, ∂J_i(W)/∂W
5.     Update weights, W ← W − η ∂J_i(W)/∂W
6. Return weights

6.S191 Introduction to Deep Learning


1/28/19
introtodeeplearning.com
Stochastic Gradient Descent

Algorithm
1. Initialize weights randomly ~ N(0, σ²)
2. Loop until convergence:
3.     Pick single data point i
4.     Compute gradient, ∂J_i(W)/∂W
5.     Update weights, W ← W − η ∂J_i(W)/∂W
6. Return weights
Easy to compute but
very noisy
(stochastic)!
6.S191 Introduction to Deep Learning
1/28/19
introtodeeplearning.com
Stochastic Gradient Descent

Algorithm
1. Initialize weights randomly ~ N(0, σ²)
2. Loop until convergence:
3.     Pick batch of B data points
4.     Compute gradient, \frac{\partial J(W)}{\partial W} = \frac{1}{B} \sum_{k=1}^{B} \frac{\partial J_k(W)}{\partial W}
5.     Update weights, W ← W − η ∂J(W)/∂W
6. Return weights

6.S191 Introduction to Deep Learning


1/28/19
introtodeeplearning.com
Stochastic Gradient Descent

Algorithm
1. Initialize weights randomly ~ N(0, σ²)
2. Loop until convergence:
3.     Pick batch of B data points
4.     Compute gradient, \frac{\partial J(W)}{\partial W} = \frac{1}{B} \sum_{k=1}^{B} \frac{\partial J_k(W)}{\partial W}
5.     Update weights, W ← W − η ∂J(W)/∂W
6. Return weights
Fast to compute and a much better
estimate of the true gradient!
6.S191 Introduction to Deep Learning
1/28/19
introtodeeplearning.com
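A minimal sketch of mini-batch SGD using tf.data to draw batches (the dataset, batch size, and model here are illustrative placeholders):

import tensorflow as tf

# illustrative toy dataset: 1000 points, 10 features, binary labels
x = tf.random.normal([1000, 10])
y = tf.cast(tf.reduce_sum(x, axis=1, keepdims=True) > 0, tf.float32)

dataset = tf.data.Dataset.from_tensor_slices((x, y)).shuffle(1000).batch(32)

model = tf.keras.Sequential([
    tf.keras.layers.Dense(16, activation="relu"),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
loss_fn = tf.keras.losses.BinaryCrossentropy()
optimizer = tf.keras.optimizers.SGD(learning_rate=0.1)

for epoch in range(5):
    for x_batch, y_batch in dataset:                       # pick a batch of B data points
        with tf.GradientTape() as tape:
            loss = loss_fn(y_batch, model(x_batch))         # average loss over the batch
        grads = tape.gradient(loss, model.trainable_variables)
        optimizer.apply_gradients(zip(grads, model.trainable_variables))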
Mini-batches while training

More accurate estimation of gradient


Smoother convergence
Allows for larger learning rates

6.S191 Introduction to Deep Learning


1/28/19
introtodeeplearning.com
Mini-batches while training

More accurate estimation of gradient


Smoother convergence
Allows for larger learning rates

Mini-batches lead to fast training!


Can parallelize computation and achieve significant speed increases on GPUs

6.S191 Introduction to Deep Learning


1/28/19
introtodeeplearning.com
Neural Networks in Practice:
Overfitting
The Problem of Overfitting

Underfitting: model does not have capacity to fully learn the data
Ideal fit
Overfitting: too complex, extra parameters, does not generalize well

6.S191 Introduction to Deep Learning


1/28/19
introtodeeplearning.com
Regularization

What is it?
Technique that constrains our optimization problem to discourage complex models

6.S191 Introduction to Deep Learning


1/28/19
introtodeeplearning.com
Regularization

What is it?
Technique that constrains our optimization problem to discourage complex models

Why do we need it?


Improve generalization of our model on unseen data

6.S191 Introduction to Deep Learning


1/28/19
introtodeeplearning.com
Regularization 1: Dropout
• During training, randomly set some activations to 0


6.S191 Introduction to Deep Learning


1/28/19
introtodeeplearning.com
Regularization 1: Dropout
• During training, randomly set some activations to 0
• Typically ‘drop’ 50% of activations in layer      tf.keras.layers.Dropout(rate=0.5)
• Forces network to not rely on any 1 node


6.S191 Introduction to Deep Learning


1/28/19
introtodeeplearning.com
Regularization 2: Early Stopping
• Stop training before we have a chance to overfit

Loss

Training Iterations
6.S191 Introduction to Deep Learning
1/28/19
introtodeeplearning.com
Regularization 2: Early Stopping
• Stop training before we have a chance to overfit

Legend

Loss Testing

Training

Training Iterations
6.S191 Introduction to Deep Learning
1/28/19
introtodeeplearning.com
Regularization 2: Early Stopping
• Stop training before we have a chance to overfit

Legend

Loss Stop training Testing


here!
Training

Training Iterations
6.S191 Introduction to Deep Learning
1/28/19
introtodeeplearning.com
Regularization 2: Early Stopping
• Stop training before we have a chance to overfit

Under-fitting Over-fitting

Legend

Loss Stop training Testing


here!
Training

Training Iterations
6.S191 Introduction to Deep Learning
1/28/19
introtodeeplearning.com
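A minimal sketch of early stopping with the tf.keras EarlyStopping callback, which watches the testing (validation) curve and stops near the point marked in the figure (the model, data, and `patience` value are illustrative placeholders, not the lecture's):

import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Dense(32, activation="relu"),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy")

early_stop = tf.keras.callbacks.EarlyStopping(
    monitor="val_loss",          # watch the testing/validation loss curve
    patience=5,                  # stop after 5 epochs with no improvement
    restore_best_weights=True,   # roll back to the best epoch
)

# x_train, y_train, x_val, y_val are assumed to be defined elsewhere:
# model.fit(x_train, y_train, validation_data=(x_val, y_val),
#           epochs=200, callbacks=[early_stop])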
Core Foundation Review

The Perceptron: structural building blocks; nonlinear activation functions
Neural Networks: stacking perceptrons to form neural networks; optimization through backpropagation
Training in Practice: adaptive learning; batching; regularization


6.S191 Introduction to Deep Learning


1/28/19
introtodeeplearning.com
6.S191 Introduction to Deep Learning
1/29/19
introtodeeplearning.com
What Computers “See”
Images are Numbers

6.S191 Introduction to Deep Learning


[1] 1/29/19
introtodeeplearning.com
Images are Numbers
What the computer sees

An image is just a matrix of numbers [0,255]!


i.e., 1080x1080x3 for an RGB image

6.S191 Introduction to Deep Learning


[1] 1/29/19
introtodeeplearning.com
Tasks in Computer Vision

(Example: classification over an input image's pixel representation — Lincoln 0.8, Washington 0.1, Jefferson 0.05, Obama 0.05)

- Regression: output variable takes continuous value


- Classification: output variable takes class label. Can produce probability of belonging to a particular class

6.S191 Introduction to Deep Learning


1/29/19
introtodeeplearning.com
High Level Feature Detection
Let’s identify key features in each image category

Faces: nose, eyes, mouth
Cars: wheels, license plate, headlights
Houses: door, windows, steps

6.S191 Introduction to Deep Learning


1/29/19
introtodeeplearning.com
Manual Feature Extraction
Domain knowledge → Define features → Detect features to classify

Problems?

6.S191 Introduction to Deep Learning


1/29/19
introtodeeplearning.com
Manual Feature Extraction
Domain knowledge → Define features → Detect features to classify

6.S191 Introduction to Deep Learning


[2] 1/29/19
introtodeeplearning.com
Learning Feature Representations
Can we learn a hierarchy of features directly from the data
instead of hand engineering?
Low level features Mid level features High level features

Edges, dark spots Eyes, ears, nose Facial structure

6.S191 Introduction to Deep Learning


[3] 1/29/19
introtodeeplearning.com
Learning Visual Features
Fully Connected Neural Network

6.S191 Introduction to Deep Learning


1/29/19
introtodeeplearning.com
Fully Connected Neural Network

Input:
• 2D image
• Vector of pixel values

Fully Connected:
• Connect a neuron in the hidden layer to all neurons in the input layer
• No spatial information!
• And many, many parameters!

6.S191 Introduction to Deep Learning


1/29/19
introtodeeplearning.com
Fully Connected Neural Network

Input:
• 2D image
• Vector of pixel values

Fully Connected:
• Connect a neuron in the hidden layer to all neurons in the input layer
• No spatial information!
• And many, many parameters!

How can we use spatial structure in the input to inform the architecture of the network?

6.S191 Introduction to Deep Learning


1/29/19
introtodeeplearning.com
Using Spatial Structure

Input: 2D image, an array of pixel values.

Idea: connect patches of the input to neurons in the hidden layer. Each neuron is connected to a region of the input and only "sees" these values.

6.S191 Introduction to Deep Learning


1/29/19
introtodeeplearning.com
Using Spatial Structure

Connect patch in input layer to a single neuron in subsequent layer.


Use a sliding window to define connections.
How can we weight the patch to detect particular features?

6.S191 Introduction to Deep Learning


1/29/19
introtodeeplearning.com
Applying Filters to Extract Features

1) Apply a set of weights – a filter – to extract local features

2) Use multiple filters to extract different features

3) Spatially share parameters of each filter


(features that matter in one part of the input should matter elsewhere)

6.S191 Introduction to Deep Learning


1/29/19
introtodeeplearning.com
Feature Extraction with Convolution
- Filter of size 4x4 : 16 different weights
- Apply this same filter to 4x4 patches in input
- Shift by 2 pixels for next patch

This “patchy” operation is convolution

1) Apply a set of weights – a filter – to extract local features

2) Use multiple filters to extract different features

3) Spatially share parameters of each filter

6.S191 Introduction to Deep Learning


1/29/19
introtodeeplearning.com
Feature Extraction and Convolution
A Case Study
X or X?

Image is represented as matrix of pixel values… and computers are literal!


We want to be able to classify an X as an X even if it’s shifted, shrunk, rotated, deformed.

6.S191 Introduction to Deep Learning


[4] 1/29/19
introtodeeplearning.com
Features of X

6.S191 Introduction to Deep Learning


[4] 1/29/19
introtodeeplearning.com
Filters to Detect X Features

filters

6.S191 Introduction to Deep Learning


[4] 1/29/19
introtodeeplearning.com
The Convolution Operation
Element-wise multiply the filter and the image patch, then add the outputs:
here every product is 1 × 1 = 1, and the 3×3 patch sums to 9.

6.S191 Introduction to Deep Learning


[4] 1/29/19
introtodeeplearning.com
The Convolution Operation
Suppose we want to compute the convolution of a 5x5 image and a 3x3 filter:

filter

image
We slide the 3x3 filter over the input image, element-wise multiply, and add the outputs…

6.S191 Introduction to Deep Learning


[5] 1/29/19
introtodeeplearning.com
The Convolution Operation
We slide the 3x3 filter over the input image, element-wise multiply, and add the outputs:

filter feature map

6.S191 Introduction to Deep Learning


[5] 1/29/19
introtodeeplearning.com
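A minimal NumPy sketch of this sliding-window computation for a 5×5 image and a 3×3 filter (the image and filter values are made up; stride 1 and no padding, so the feature map is 3×3):

import numpy as np

image = np.arange(25, dtype=float).reshape(5, 5)    # illustrative 5x5 input
filt = np.array([[1.0, 0.0, 1.0],
                 [0.0, 1.0, 0.0],
                 [1.0, 0.0, 1.0]])                   # illustrative 3x3 filter

out = np.zeros((3, 3))                               # (5 - 3 + 1) x (5 - 3 + 1) feature map
for i in range(3):
    for j in range(3):
        patch = image[i:i + 3, j:j + 3]              # slide the filter over the image
        out[i, j] = np.sum(patch * filt)             # element-wise multiply, then add
print(out)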
Producing Feature Maps

Original Sharpen Edge Detect “Strong” Edge


Detect

6.S191 Introduction to Deep Learning


1/29/19
introtodeeplearning.com
Feature Extraction with Convolution

1) Apply a set of weights – a filter – to extract local features


2) Use multiple filters to extract different features
3) Spatially share parameters of each filter
6.S191 Introduction to Deep Learning
1/29/19
introtodeeplearning.com
Convolutional Neural Networks (CNNs)
CNNs for Classification

1. Convolution: Apply filters with learned weights to generate feature maps.


2. Non-linearity: Often ReLU.
3. Pooling: Downsampling operation on each feature map.
Train model with image data.
Learn weights of filters in convolutional layers.
6.S191 Introduction to Deep Learning
1/29/19
introtodeeplearning.com
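A minimal tf.keras sketch of this pipeline — convolution, ReLU non-linearity, pooling, then a fully connected classifier (the input size and the numbers of filters/classes are illustrative placeholders):

import tensorflow as tf
from tensorflow.keras import layers, Model

inputs = layers.Input(shape=(28, 28, 1))                          # e.g. grayscale images
x = layers.Conv2D(32, kernel_size=3, activation="relu")(inputs)   # 1. convolution + 2. ReLU
x = layers.MaxPooling2D(pool_size=2)(x)                           # 3. pooling (downsampling)
x = layers.Conv2D(64, kernel_size=3, activation="relu")(x)
x = layers.MaxPooling2D(pool_size=2)(x)
x = layers.Flatten()(x)
outputs = layers.Dense(10, activation="softmax")(x)               # class probabilities
cnn = Model(inputs, outputs)
cnn.compile(optimizer="adam", loss="sparse_categorical_crossentropy", metrics=["accuracy"])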
Convolutional Layers: Local Connectivity

For a neuron in hidden layer:


- Take inputs from patch
- Compute weighted sum
- Apply bias

6.S191 Introduction to Deep Learning


1/29/19
introtodeeplearning.com
Convolutional Layers: Local Connectivity

For a neuron in hidden layer:


- Take inputs from patch
- Compute weighted sum
- Apply bias

4x4 filter: a matrix of weights w_{ij}

\sum_{i=1}^{4} \sum_{j=1}^{4} w_{ij}\, x_{i+p,\, j+q} + b   for neuron (p, q) in the hidden layer

1) applying a window of weights
2) computing linear combinations
3) activating with a non-linear function
6.S191 Introduction to Deep Learning
1/29/19
introtodeeplearning.com
CNNs: Spatial Arrangement of Output Volume
Layer Dimensions: h × w × d
where h and w are spatial dimensions, and d (depth) = number of filters

Stride: filter step size

Receptive Field: locations in the input image that a node is path-connected to

6.S191 Introduction to Deep Learning


[3] 1/29/19
introtodeeplearning.com
Introducing Non-Linearity
Rectified Linear Unit (ReLU)
- Apply after every convolution operation (i.e., after convolutional layers)
- ReLU: pixel-by-pixel operation that replaces all negative values by zero. Non-linear operation!

g(z) = \max(0, z)

6.S191 Introduction to Deep Learning


[5] 1/29/19
introtodeeplearning.com
Pooling

1) Reduced dimensionality
2) Spatial invariance

How else can we downsample and preserve spatial invariance?


6.S191 Introduction to Deep Learning
[3] 1/29/19
introtodeeplearning.com
Representation Learning in Deep CNNs

Low level features Mid level features High level features

Edges, dark spots Eyes, ears, nose Facial structure


Conv Layer 1 Conv Layer 2 Conv Layer 3

6.S191 Introduction to Deep Learning


[3] 1/29/19
introtodeeplearning.com
CNNs for Classification: Feature Learning

1. Learn features in input image through convolution


2. Introduce non-linearity through activation function (real-world data is non-linear!)
3. Reduce dimensionality and preserve spatial invariance with pooling
6.S191 Introduction to Deep Learning
1/29/19
introtodeeplearning.com
CNNs for Classification: Class Probabilities

- CONV and POOL layers output high-level features of the input
- Fully connected layer uses these features for classifying the input image
- Express output as probability of the image belonging to a particular class:

\text{softmax}(y_i) = \frac{e^{y_i}}{\sum_j e^{y_j}}

6.S191 Introduction to Deep Learning


1/29/19
introtodeeplearning.com
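A minimal sketch of the softmax computation on a vector of class scores (the scores are made up for illustration):

import tensorflow as tf

logits = tf.constant([2.0, 1.0, 0.1])                # raw class scores from the FC layer
probs = tf.nn.softmax(logits)                        # e^{y_i} / sum_j e^{y_j}
print(probs.numpy(), tf.reduce_sum(probs).numpy())   # probabilities sum to 1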
CNNs: Training with Backpropagation

Learn weights for convolutional filters and fully connected layers


Backpropagation with the cross-entropy loss:

J(W) = -\sum_{i} y^{(i)} \log \hat{y}^{(i)}
6.S191 Introduction to Deep Learning
1/29/19
introtodeeplearning.com
CNNs for Classification: ImageNet
ImageNet Dataset
Dataset of over 14 million images across 21,841 categories
“Elongated crescent-shaped yellow fruit with soft sweet flesh”

1409 pictures of bananas.


6.S191 Introduction to Deep Learning
[6,7] 1/29/19
introtodeeplearning.com
ImageNet Challenge

Classification task: produce a list of object categories present in image. 1000 categories.
“Top 5 error”: rate at which the model does not output correct label in top 5 predictions
Other tasks include:
single-object localization, object detection from video/image, scene classification, scene parsing

6.S191 Introduction to Deep Learning


[6,7] 1/29/19
introtodeeplearning.com
ImageNet Challenge: Classification Task
Top-5 classification error (%) by year:
2010: 28.2    2011: 25.8
2012: 16.4 — AlexNet, first CNN to win (8 layers, 61 million parameters)
2013: 11.7 — ZFNet (8 layers, more filters)
2014: 7.3 — VGG (19 layers);  6.7 — GoogLeNet ("Inception" modules, 22 layers, 5 million parameters)
2015: 3.57 — ResNet (152 layers)
Human: ~5.1
6.S191 Introduction to Deep Learning


[6,7] 1/29/19
introtodeeplearning.com
ImageNet Challenge: Classification Task
(Chart: top-5 classification error (%) falling from 28.2 in 2010 to 3.57 in 2015, alongside network depth growing to roughly 150 layers.)

6.S191 Introduction to Deep Learning


[6,7] 1/29/19
introtodeeplearning.com
An Architecture for Many Applications
An Architecture for Many Applications

Object detection with R-CNNs


Segmentation with fully convolutional networks
Image captioning with RNNs
6.S191 Introduction to Deep Learning
1/29/19
introtodeeplearning.com
Beyond Classification

Semantic Segmentation Object Detection Image Captioning

CAT CAT, DOG, DUCK The cat is in the grass.

6.S191 Introduction to Deep Learning


1/29/19
introtodeeplearning.com
Semantic Segmentation: FCNs
FCN: Fully Convolutional Network.
Network designed with all convolutional layers,
with downsampling and upsampling operations

6.S191 Introduction to Deep Learning


[3,8,9] 1/29/19
introtodeeplearning.com
Driving Scene Segmentation

6.S191 Introduction to Deep Learning [10] 1/29/19


introtodeeplearning.com
Driving Scene Segmentation

6.S191 Introduction to Deep Learning


[11, 12] 1/29/19
introtodeeplearning.com
Object Detection with R-CNNs
R-CNN: Find regions that we think have objects. Use CNN to classify.

6.S191 Introduction to Deep Learning


[13] 1/29/19
introtodeeplearning.com
Image Captioning using RNNs

6.S191 Introduction to Deep Learning


[14,15] 1/29/19
introtodeeplearning.com
Deep Learning for Computer Vision:
Impact and Summary
Data, Data, Data
MNIST: handwritten digits
ImageNet: 22K categories, 14M images
Places: natural scenes
CIFAR-10: Airplane, Automobile, Bird, Cat, Deer, Dog, Frog, Horse, Ship, Truck
6.S191 Introduction to Deep Learning
1/29/19
introtodeeplearning.com
Deep Learning for Computer Vision: Impact

6.S191 Introduction to Deep Learning


1/29/19
introtodeeplearning.com
Impact: Face Detection 6.S191 Lab!

6.S191 Introduction to Deep Learning


1/29/19
introtodeeplearning.com
Impact: Self-Driving Cars

6.S191 Introduction to Deep Learning


[16] 1/29/19
introtodeeplearning.com
Impact: Healthcare
Identifying facial phenotypes of genetic disorders using deep learning
Gurovich et al., Nature Med. 2019

6.S191 Introduction to Deep Learning


[17] 1/29/19
introtodeeplearning.com
Deep Learning for Computer Vision: Summary

Foundations: Why computer vision? Representing images; convolutions for feature extraction
CNNs: CNN architecture; application to classification; ImageNet
Applications: Segmentation, object detection, image captioning; visualization

6.S191 Introduction to Deep Learning


1/29/19
introtodeeplearning.com
Which face is fake?

6.S191 Introduction to Deep Learning [1] 1/29/19


introtodeeplearning.com
Supervised vs unsupervised learning
Supervised Learning:
Data: (x, y) — x is data, y is label
Goal: Learn a function to map x → y
Examples: classification, regression, object detection, semantic segmentation, etc.

Unsupervised Learning:
Data: x — x is data, no labels!
Goal: Learn some hidden or underlying structure of the data
Examples: clustering, feature or dimensionality reduction, etc.
6.S191 Introduction to Deep Learning
1/29/19
introtodeeplearning.com
Generative modeling
Goal: Take as input training samples from some distribution
and learn a model that represents that distribution
Density Estimation and Sample Generation:
Training data (input samples) ~ P_{data}(x);  Generated samples ~ P_{model}(x)

How can we learn P_{model}(x) similar to P_{data}(x)?


6.S191 Introduction to Deep Learning
1/29/19
introtodeeplearning.com
Why generative models? Debiasing
Capable of uncovering underlying latent variables in a dataset

vs

Homogeneous skin color, pose Diverse skin color, pose, illumination

How can we use latent distributions to create fair and representative datasets?

6.S191 Introduction to Deep Learning [2] 1/29/19


introtodeeplearning.com
Why generative models? Outlier detection
• Problem: How can we detect when we encounter something new or rare?
• Strategy: Leverage generative models, detect outliers in the distribution
• Use outliers during training to improve even more!

95% of driving data: (1) sunny, (2) highway, (3) straight road
Detect outliers to avoid unpredictable behavior when training

Edge Cases   Harsh Weather   Pedestrians

6.S191 Introduction to Deep Learning [3] 1/29/19
introtodeeplearning.com
Latent variable models

Autoencoders and Variational Autoencoders (VAEs)
Generative Adversarial Networks (GANs)
What is a latent variable?

Myth of the Cave

6.S191 Introduction to Deep Learning [4] 1/29/19


introtodeeplearning.com
What is a latent variable?

Can we learn the true explanatory factors, e.g. latent variables, from only observed data?

6.S191 Introduction to Deep Learning


1/29/19
introtodeeplearning.com
Autoencoders
Autoencoders: background
Unsupervised approach for learning a lower-dimensional feature
representation from unlabeled training data

Why do we care about a low-dimensional z?

x → encoder → z

"Encoder" learns a mapping from the data, x, to a low-dimensional latent space, z
6.S191 Introduction to Deep Learning
1/29/19
introtodeeplearning.com
Autoencoders: background
How can we learn this latent space?
Train the model to use these features to reconstruct the original data

x → encoder → z → decoder → \hat{x}

"Decoder" learns a mapping back from the latent space, z, to a reconstructed observation, \hat{x}


6.S191 Introduction to Deep Learning
1/29/19
introtodeeplearning.com
Autoencoders: background
How can we learn this latent space?
Train the model to use these features to reconstruct the original data

x → encoder → z → decoder → \hat{x}

\mathcal{L}(x, \hat{x}) = \lVert x - \hat{x} \rVert^2 — the loss function doesn't use any labels!
6.S191 Introduction to Deep Learning
1/29/19
introtodeeplearning.com
Dimensionality of latent space → reconstruction quality
Autoencoding is a form of compression!
Smaller latent space will force a larger training bottleneck

2D latent space 5D latent space Ground Truth

6.S191 Introduction to Deep Learning [5] 1/29/19


introtodeeplearning.com
Autoencoders for representation learning

Bottleneck hidden layer forces network to learn a compressed


latent representation

Reconstruction loss forces the latent representation to capture


(or encode) as much “information” about the data as possible

Autoencoding = Automatically encoding data

6.S191 Introduction to Deep Learning


1/29/19
introtodeeplearning.com
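A minimal sketch of a fully connected autoencoder in tf.keras with a small latent bottleneck (the 784-dimensional input, e.g. flattened 28×28 images, and the layer sizes are illustrative):

import tensorflow as tf
from tensorflow.keras import layers, Model

input_dim, latent_dim = 784, 2              # e.g. flattened 28x28 images, 2-D latent space

x_in = layers.Input(shape=(input_dim,))
h = layers.Dense(256, activation="relu")(x_in)
z = layers.Dense(latent_dim)(h)             # bottleneck: compressed latent code z

h_dec = layers.Dense(256, activation="relu")(z)
x_hat = layers.Dense(input_dim, activation="sigmoid")(h_dec)    # reconstruction

autoencoder = Model(x_in, x_hat)
autoencoder.compile(optimizer="adam", loss="mse")   # L(x, x_hat) = ||x - x_hat||^2
# autoencoder.fit(x_train, x_train, epochs=10)      # no labels: the input is its own target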
Variational Autoencoders (VAEs)
VAEs: key difference with traditional autoencoder

x → z → \hat{x}

6.S191 Introduction to Deep Learning


1/29/19
introtodeeplearning.com
VAEs: key difference with traditional autoencoder

x → (\mu, \sigma) → z → \hat{x}

6.S191 Introduction to Deep Learning [6] 1/29/19


introtodeeplearning.com
VAEs: key difference with traditional autoencoder
x → encoder → mean vector \mu and standard deviation vector \sigma → z → decoder → \hat{x}

Variational autoencoders are a probabilistic twist on autoencoders!


Sample from the mean and standard dev. to compute latent sample
6.S191 Introduction to Deep Learning [6] 1/29/19
introtodeeplearning.com
VAE optimization

x → (\mu, \sigma) → z → \hat{x}

Encoder computes: q_\phi(z \mid x)        Decoder computes: p_\theta(x \mid z)

6.S191 Introduction to Deep Learning [6] 1/29/19


introtodeeplearning.com
VAE optimization

x → (\mu, \sigma) → z → \hat{x}

Encoder computes: q_\phi(z \mid x)        Decoder computes: p_\theta(x \mid z)

\mathcal{L}(\phi, \theta) = (reconstruction loss) + (regularization term)


6.S191 Introduction to Deep Learning [6] 1/29/19
introtodeeplearning.com
VAE optimization

x → (\mu, \sigma) → z → \hat{x}

Encoder computes: q_\phi(z \mid x)        Decoder computes: p_\theta(x \mid z)

\mathcal{L}(\phi, \theta, x) = (reconstruction loss) + (regularization term)


6.S191 Introduction to Deep Learning [6] 1/29/19
introtodeeplearning.com
VAE optimization

Reconstruction loss: e.g. \lVert x - \hat{x} \rVert^2

x → (\mu, \sigma) → z → \hat{x}

Encoder computes: q_\phi(z \mid x)        Decoder computes: p_\theta(x \mid z)

\mathcal{L}(\phi, \theta, x) = (reconstruction loss) + (regularization term)


6.S191 Introduction to Deep Learning [6] 1/29/19
introtodeeplearning.com
VAE optimization
Regularization term: D\left( q_\phi(z \mid x) \,\Vert\, p(z) \right)
(inferred latent distribution vs. a fixed prior on the latent distribution)

x → (\mu, \sigma) → z → \hat{x}

Encoder computes: q_\phi(z \mid x)        Decoder computes: p_\theta(x \mid z)

\mathcal{L}(\phi, \theta, x) = (reconstruction loss) + (regularization term)


6.S191 Introduction to Deep Learning [6] 1/29/19
introtodeeplearning.com
Priors on the latent distribution
D\left( q_\phi(z \mid x) \,\Vert\, p(z) \right)
(inferred latent distribution vs. a fixed prior on the latent distribution)

Common choice of prior: p(z) = \mathcal{N}(\mu = 0, \sigma^2 = 1)

• Encourages encodings to be distributed evenly around the center of the latent space
• Penalizes the network when it tries to "cheat" by clustering points in specific regions (i.e., memorizing the data)

6.S191 Introduction to Deep Learning [7] 1/29/19


introtodeeplearning.com
Priors on the latent distribution
D\left( q_\phi(z \mid x) \,\Vert\, p(z) \right) = -\frac{1}{2} \sum_{j} \left( \sigma_j + \mu_j^2 - 1 - \log \sigma_j \right)
(KL divergence between the two distributions)

Common choice of prior: p(z) = \mathcal{N}(\mu = 0, \sigma^2 = 1)

• Encourages encodings to be distributed evenly around the center of the latent space
• Penalizes the network when it tries to "cheat" by clustering points in specific regions (i.e., memorizing the data)

6.S191 Introduction to Deep Learning [7] 1/29/19


introtodeeplearning.com
VAEs computation graph

x → (\mu, \sigma) → z → \hat{x}

Encoder computes: q_\phi(z \mid x)        Decoder computes: p_\theta(x \mid z)

\mathcal{L}(\phi, \theta, x) = (reconstruction loss) + (regularization term)


6.S191 Introduction to Deep Learning [6] 1/29/19
introtodeeplearning.com
VAEs computation graph
Problem: We cannot backpropagate gradients through sampling layers!

x → (\mu, \sigma) → z → \hat{x}

Encoder computes: q_\phi(z \mid x)        Decoder computes: p_\theta(x \mid z)

\mathcal{L}(\phi, \theta, x) = (reconstruction loss) + (regularization term)


6.S191 Introduction to Deep Learning [6] 1/29/19
introtodeeplearning.com
Reparametrizing the sampling layer
Key Idea:
z \sim \mathcal{N}(\mu, \sigma^2)

Consider the sampled latent vector z as a sum of
• a fixed \mu vector,
• and a fixed \sigma vector, scaled by random constants drawn from the prior distribution

\Rightarrow z = \mu + \sigma \odot \epsilon, \quad where \; \epsilon \sim \mathcal{N}(0, 1)

6.S191 Introduction to Deep Learning [6] 1/29/19


introtodeeplearning.com
Reparametrizing the sampling layer

(Original form: z ~ q_\phi(z \mid x) is a stochastic node, so backpropagation cannot flow through the sampling step.)

6.S191 Introduction to Deep Learning [6] 1/29/19


introtodeeplearning.com
Reparametrizing the sampling layer

(Reparametrized form: z = g(\mu, \sigma, \epsilon) = \mu + \sigma \odot \epsilon with \epsilon \sim \mathcal{N}(0, 1). The path from \mu and \sigma to z is now deterministic, so the gradients \partial z / \partial \mu and \partial z / \partial \sigma can flow; the randomness is isolated in the stochastic node \epsilon.)

6.S191 Introduction to Deep Learning [6] 1/29/19


introtodeeplearning.com
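A minimal sketch of the reparameterization trick as a sampling step that stays differentiable with respect to the encoder outputs (assumes TensorFlow 2.x; `mu` and `log_var` stand for the encoder's mean and log-variance outputs, and the downstream loss is a placeholder):

import tensorflow as tf

def sample_latent(mu, log_var):
    # z = mu + sigma ⊙ eps, with eps ~ N(0, 1)
    eps = tf.random.normal(tf.shape(mu))          # randomness isolated here
    sigma = tf.exp(0.5 * log_var)
    return mu + sigma * eps                       # deterministic in mu, sigma -> gradients flow

mu = tf.Variable([[0.0, 0.0]])
log_var = tf.Variable([[0.0, 0.0]])
with tf.GradientTape() as tape:
    z = sample_latent(mu, log_var)
    loss = tf.reduce_sum(z ** 2)                  # placeholder downstream loss
print(tape.gradient(loss, [mu, log_var]))         # gradients exist thanks to the trick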
VAEs: Latent perturbation
Slowly increase or decrease a single latent variable
Keep all other variables fixed

Head pose

Different dimensions of z encode different interpretable latent features

6.S191 Introduction to Deep Learning [8] 1/29/19


introtodeeplearning.com
VAEs: Latent perturbation

Ideally, we want latent variables that are uncorrelated with each other.

Enforce a diagonal prior on the latent variables to encourage independence → disentanglement.

(Example axes: smile vs. head pose.)
6.S191 Introduction to Deep Learning [8] 1/29/19
introtodeeplearning.com
VAEs: Latent perturbation

Google BeatBlender

6.S191 Introduction to Deep Learning [9] 1/29/19


introtodeeplearning.com
VAEs: Latent perturbation

6.S191 Introduction to Deep Learning [10] 1/29/19


introtodeeplearning.com
VAE summary
1. Compress representation of world to something we can use to learn

! " !#

6.S191 Introduction to Deep Learning


1/29/19
introtodeeplearning.com
VAE summary
1. Compress representation of world to something we can use to learn
2. Reconstruction allows for unsupervised learning (no labels!)

! " !#

6.S191 Introduction to Deep Learning


1/29/19
introtodeeplearning.com
VAE summary
1. Compress representation of world to something we can use to learn
2. Reconstruction allows for unsupervised learning (no labels!)
3. Reparameterization trick to train end-to-end

! " !#

6.S191 Introduction to Deep Learning


1/29/19
introtodeeplearning.com
VAE summary
1. Compress representation of world to something we can use to learn
2. Reconstruction allows for unsupervised learning (no labels!)
3. Reparameterization trick to train end-to-end
4. Interpret hidden latent variables using perturbation

! " !#

6.S191 Introduction to Deep Learning


1/29/19
introtodeeplearning.com
VAE summary
1. Compress representation of world to something we can use to learn
2. Reconstruction allows for unsupervised learning (no labels!)
3. Reparameterization trick to train end-to-end
4. Interpret hidden latent variables using perturbation
5. Generating new examples

! " !#

6.S191 Introduction to Deep Learning


1/29/19
introtodeeplearning.com
Generative Adversarial Networks (GANs)
What if we just want to sample?
Idea: don’t explicitly model density, and instead just sample to generate new instances.

Problem: want to sample from complex distribution – can’t do this directly!

Solution: sample from something simple (noise), learn a


transformation to the training distribution.

noise z → Generator Network G → \hat{x}_{fake}: a "fake" sample from the training distribution

6.S191 Introduction to Deep Learning [11] 1/29/19


introtodeeplearning.com
Generative Adversarial Networks (GANs)
Generative Adversarial Networks (GANs) are a way to make a generative
model by having two neural networks compete with each other.

The discriminator tries to identify real data from fakes created by the generator.
The generator turns noise into an imitation of the data to try to trick the discriminator.

noise z → Generator G → \hat{x}_{fake};   real data x_{real} and \hat{x}_{fake} → Discriminator D → real or fake?

6.S191 Introduction to Deep Learning [11] 1/29/19


introtodeeplearning.com
Intuition behind GANs
Generator starts from noise to try to create an imitation of the data.

Generator

Fake data

6.S191 Introduction to Deep Learning


1/29/19
introtodeeplearning.com
Intuition behind GANs
Discriminator looks at both real data and fake data created by the generator.

Discriminator Generator

Fake data

6.S191 Introduction to Deep Learning


1/29/19
introtodeeplearning.com
Intuition behind GANs
Discriminator looks at both real data and fake data created by the generator.

Discriminator Generator

Real data Fake data

6.S191 Introduction to Deep Learning


1/29/19
introtodeeplearning.com
Intuition behind GANs
Discriminator tries to predict what’s real and what’s fake.

Discriminator Generator

Real data Fake data

6.S191 Introduction to Deep Learning


1/29/19
introtodeeplearning.com
Intuition behind GANs
Generator tries to improve its imitation of the data.

Discriminator Generator

Real data Fake data

6.S191 Introduction to Deep Learning


1/29/19
introtodeeplearning.com
Intuition behind GANs
Discriminator tries to identify real data from fakes created by the generator.
Generator tries to create imitations of data to trick the discriminator.

Discriminator Generator

Real data Fake data

6.S191 Introduction to Deep Learning


1/29/19
introtodeeplearning.com
Training GANs
Discriminator tries to identify real data from fakes created by the generator.
Generator tries to create imitations of data to trick the discriminator.

Train the GAN jointly via a minimax game:

\min_{\theta_g} \max_{\theta_d} \; \mathbb{E}_{x \sim p_{data}}\left[\log D_{\theta_d}(x)\right] + \mathbb{E}_{z \sim p(z)}\left[\log\left(1 - D_{\theta_d}\left(G_{\theta_g}(z)\right)\right)\right]

The discriminator wants to maximize the objective such that D(x) is close to 1 and D(G(z)) is close to 0.
The generator wants to minimize the objective such that D(G(z)) is close to 1.

6.S191 Introduction to Deep Learning [11] 1/29/19


introtodeeplearning.com
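A minimal sketch of one alternating training step for this minimax game in TensorFlow 2.x (the generator/discriminator architectures and hyperparameters are illustrative placeholders, not the lecture's):

import tensorflow as tf

latent_dim = 100
generator = tf.keras.Sequential([
    tf.keras.layers.Dense(128, activation="relu", input_shape=(latent_dim,)),
    tf.keras.layers.Dense(784, activation="sigmoid"),         # fake flattened image
])
discriminator = tf.keras.Sequential([
    tf.keras.layers.Dense(128, activation="relu", input_shape=(784,)),
    tf.keras.layers.Dense(1),                                  # logit: real vs. fake
])
bce = tf.keras.losses.BinaryCrossentropy(from_logits=True)
g_opt = tf.keras.optimizers.Adam(1e-4)
d_opt = tf.keras.optimizers.Adam(1e-4)

def train_step(real_images):
    z = tf.random.normal([tf.shape(real_images)[0], latent_dim])
    with tf.GradientTape() as g_tape, tf.GradientTape() as d_tape:
        fake_images = generator(z)
        real_logits = discriminator(real_images)
        fake_logits = discriminator(fake_images)
        # Discriminator: push D(x) -> 1 and D(G(z)) -> 0
        d_loss = bce(tf.ones_like(real_logits), real_logits) + \
                 bce(tf.zeros_like(fake_logits), fake_logits)
        # Generator: push D(G(z)) -> 1
        g_loss = bce(tf.ones_like(fake_logits), fake_logits)
    d_opt.apply_gradients(zip(d_tape.gradient(d_loss, discriminator.trainable_variables),
                              discriminator.trainable_variables))
    g_opt.apply_gradients(zip(g_tape.gradient(g_loss, generator.trainable_variables),
                              generator.trainable_variables))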
Why GANs?

A. Courville, 6S191 2018.


6.S191 Introduction to Deep Learning
1/29/19
introtodeeplearning.com
Generating new data with GANs
After training, use generator network to create new data that’s never been seen before.

noise z → Generator G → new sample \hat{x}_{fake}

6.S191 Introduction to Deep Learning


1/29/19
introtodeeplearning.com
GANs: Recent Advances
Progressive growing of GANs (NVIDIA)

Karras et al., ICLR 2018.


6.S191 Introduction to Deep Learning [12] 1/29/19
introtodeeplearning.com
Progressive growing of GANs: results

Karras et al., ICLR 2018.


6.S191 Introduction to Deep Learning [12] 1/29/19
introtodeeplearning.com
Style-based generator: results

Karras et al., Arxiv 2018.


6.S191 Introduction to Deep Learning
1/29/19
introtodeeplearning.com
Style-based transfer: results

Karras et al., Arxiv 2018.


6.S191 Introduction to Deep Learning
1/29/19
introtodeeplearning.com
CycleGAN: domain transformation
CycleGAN learns transformations across domains with unpaired data.

Zhu et al., ICCV 2017.


6.S191 Introduction to Deep Learning
1/29/19
introtodeeplearning.com
Deep Generative Modeling: Summary
Autoencoders and Variational Autoencoders (VAEs): learn a lower-dimensional latent space and sample to generate input reconstructions
Generative Adversarial Networks (GANs): competing generator and discriminator networks

6.S191 Introduction to Deep Learning


1/29/19
introtodeeplearning.com
T-shirts! Today!

6.S191 Introduction to Deep Learning


1/30/19
introtodeeplearning.com
Course Schedule

6.S191 Introduction to Deep Learning


1/30/19
introtodeeplearning.com
Final Class Project
Option 1: Proposal Presentation
• Present a novel deep learning research idea or application
• Groups of 1 welcome; listeners welcome
• Groups of 2 to 4, incl. at least 1 for-credit student, to be eligible for prizes
• 3 minutes
• Proposal instructions: goo.gl/JGJ5E7
• Judged by a panel of industry judges
• Top winners are awarded: 3x NVIDIA RTX 2080 Ti (MSRP: $4000), 4x Google Home (MSRP: $400)

6.S191 Introduction to Deep Learning


1/30/19
introtodeeplearning.com
Final Class Project
Option 1: Proposal Presentation
• Present a novel deep learning research idea or application
• Groups of 1 welcome; listeners welcome
• Groups of 2 to 4, incl. at least 1 for-credit student, to be eligible for prizes
• 3 minutes
• Proposal instructions: goo.gl/JGJ5E7

Proposal Logistics
• >= 1 for-credit student to be eligible for prizes
• Prepare slides on Google Slides
• Group submit by today 10pm: goo.gl/rV6rLK
• In-class project work: Thu, Jan 31
• Slide submit by Thu 11:59 pm: goo.gl/7smL8w
• Presentations on Friday, Feb 1

6.S191 Introduction to Deep Learning


1/30/19
introtodeeplearning.com
Final Class Project
Option 1: Proposal Presentation
• Present a novel deep learning research idea or application
• Groups of 1 welcome; listeners welcome
• Groups of 2 to 4, incl. at least 1 for-credit student, to be eligible for prizes
• 3 minutes
• Proposal instructions: goo.gl/JGJ5E7

Option 2: Write a 1-page review of a deep learning paper
• Grade is based on clarity of writing and technical communication of main ideas
• Due Friday 1:00pm (before lecture)

6.S191 Introduction to Deep Learning


1/30/19
introtodeeplearning.com
Thursday: Visualization in ML +
Biologically Inspired Learning
Fernanda Viegas, Co-Director Google PAIR: Data Visualization for Machine Learning
Dmitry Krotov, MIT-IBM Watson AI Lab: Biologically Inspired Deep Learning

Final project work — ask us questions! Open office hours! Work with group members!

6.S191 Introduction to Deep Learning


1/30/19
introtodeeplearning.com
Friday: Learning and Perception +
Project Proposals + Awards + Pizza

Jan Kautz, VP of Research: Learning and Perception

Project Proposals! Judging and Awards! Pizza Celebration!

6.S191 Introduction to Deep Learning


1/30/19
introtodeeplearning.com
So far in 6.S191…
The Rise of Deep Learning

6.S191 Introduction to Deep Learning


1/30/19
introtodeeplearning.com
So far in 6.S191…

Data
• Signals
• Images
• Sensors

6.S191 Introduction to Deep Learning


1/30/19
introtodeeplearning.com
So far in 6.S191…

Data Decision
• Signals • Prediction
• Images • Detection
• Sensors • Action
… …

6.S191 Introduction to Deep Learning


1/30/19
introtodeeplearning.com
Power of Neural Nets
Universal Approximation Theorem
A feedforward network with a single layer is sufficient to approximate, to
an arbitrary precision, any continuous function.

Hornik et al. Neural Networks. (1989)


6.S191 Introduction to Deep Learning
1/30/19
introtodeeplearning.com
Power of Neural Nets
Universal Approximation Theorem
A feedforward network with a single layer is sufficient to approximate, to
an arbitrary precision, any continuous function.

Caveats:
• The number of hidden units may be infeasibly large
• The resulting model may not generalize

Hornik et al. Neural Networks. (1989)


6.S191 Introduction to Deep Learning
1/30/19
introtodeeplearning.com
Artificial Intelligence “Hype”: Historical Perspective

6.S191 Introduction to Deep Learning


1/30/19
introtodeeplearning.com
Limitations
Rethinking Generalization
“Understanding Deep Neural Networks Requires Rethinking Generalization”

[Figure: example images shown with their true labels (dog, banana, dog, tree) and with randomly shuffled labels (banana, dog, tree, dog)]

Zhang et al. ICLR. (2017)


6.S191 Introduction to Deep Learning
1/30/19
introtodeeplearning.com
Capacity of Deep Neural Networks
Modern deep networks can perfectly fit to random data.

[Figure: accuracy (0%–100%) vs. degree of label randomization, from the original labels to completely random labels, plotted for the training set and the testing set]
Zhang et al. ICLR. (2017)
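A sketch of the experiment behind this plot, under assumptions of mine (CIFAR-10, a fully connected tf.keras model, Adam — not the paper's exact architecture or schedule): replace every training label with a random class, train, and watch training accuracy climb while test accuracy stays near chance.

import numpy as np
import tensorflow as tf

(x_train, y_train), (x_test, y_test) = tf.keras.datasets.cifar10.load_data()
x_train, x_test = x_train / 255.0, x_test / 255.0

# Replace every training label with a uniformly random class.
rng = np.random.default_rng(0)
y_random = rng.integers(0, 10, size=y_train.shape)

model = tf.keras.Sequential([
    tf.keras.layers.Flatten(input_shape=(32, 32, 3)),
    tf.keras.layers.Dense(512, activation="relu"),
    tf.keras.layers.Dense(512, activation="relu"),
    tf.keras.layers.Dense(10, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

# With enough capacity and epochs, training accuracy on the random labels approaches 100%,
# while accuracy on the (untouched) test labels stays near chance (~10%).
model.fit(x_train, y_random, epochs=50, batch_size=128, verbose=2)
model.evaluate(x_test, y_test, verbose=2)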
6.S191 Introduction to Deep Learning
1/30/19
introtodeeplearning.com
Neural Networks as Function Approximators
Neural networks are excellent function approximators
…when they have training data

How do we know when our network doesn’t know?

6.S191 Introduction to Deep Learning


1/30/19
introtodeeplearning.com
Adversarial Attacks on Neural Networks

Despois. “Adversarial examples and their implications” (2017).


6.S191 Introduction to Deep Learning
1/30/19
introtodeeplearning.com
Adversarial Attacks on Neural Networks

Remember:
We train our networks with gradient descent:

        W ← W − η · ∂J(W, x, y)/∂W        (fix your image x and true label y)

“How does a small change in the weights decrease our loss?”
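In code, the same update is one automatic-differentiation step; a sketch using tf.GradientTape, where model, loss_fn, x, and y stand in for your own network, loss, and (fixed) data:

import tensorflow as tf

def train_step(model, loss_fn, x, y, lr=0.01):
    # x and y are held fixed; we differentiate the loss with respect to the WEIGHTS.
    with tf.GradientTape() as tape:
        loss = loss_fn(y, model(x))                          # J(W, x, y)
    grads = tape.gradient(loss, model.trainable_variables)   # dJ/dW
    for w, g in zip(model.trainable_variables, grads):
        w.assign_sub(lr * g)                                 # W <- W - lr * dJ/dW
    return loss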

6.S191 Introduction to Deep Learning


1/30/19
introtodeeplearning.com
Adversarial Attacks on Neural Networks

Adversarial Image:
Modify the image to increase the error:

        x ← x + η · ∂J(W, x, y)/∂x        (fix your weights W and true label y)

“How does a small change in the input increase our loss?”

Goodfellow et al. NIPS (2014)
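The adversarial update simply swaps which argument the gradient is taken with respect to; a sketch in the same style (the step size eps is an illustrative choice of mine):

import tensorflow as tf

def adversarial_step(model, loss_fn, x, y, eps=0.01):
    # W (the model's weights) and y are held fixed; we differentiate w.r.t. the INPUT.
    x_adv = tf.Variable(x)                        # treat the image itself as the "parameter"
    with tf.GradientTape() as tape:
        loss = loss_fn(y, model(x_adv))           # J(W, x, y)
    grad = tape.gradient(loss, x_adv)             # dJ/dx
    # (Using eps * tf.sign(grad) instead of the raw gradient gives the FGSM attack of Goodfellow et al.)
    return x_adv + eps * grad                     # x <- x + eps * dJ/dx, i.e. increase the loss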


6.S191 Introduction to Deep Learning
1/30/19
introtodeeplearning.com
Synthesizing Robust Adversarial Examples

Athalye et al. ICML. (2018)


6.S191 Introduction to Deep Learning
1/30/19
introtodeeplearning.com
Neural Network Limitations…
• Very data hungry (e.g., often millions of examples)
• Computationally intensive to train and deploy (tractably requires GPUs)
• Easily fooled by adversarial examples
• Can be subject to algorithmic bias
• Poor at representing uncertainty (how do you know what the model knows?)
• Uninterpretable black boxes, difficult to trust
• Finicky to optimize: non-convex, choice of architecture, learning parameters
• Often require expert knowledge to design and fine-tune architectures

6.S191 Introduction to Deep Learning


1/30/19
introtodeeplearning.com
New Frontiers I:
Bayesian Deep Learning
Why Care About Uncertainty?

ℙ(cat)
OR
ℙ(dog)

6.S191 Introduction to Deep Learning


1/30/19
introtodeeplearning.com
Why Care About Uncertainty?

ℙ cat = 0.2

ℙ dog = 0.8

Remember: ℙ cat + ℙ dog = 1
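This constraint comes from the softmax output layer, and it is exactly why a confident-looking prediction says nothing about whether the input resembles the training data; a tiny sketch (the logit values are made up):

import numpy as np

def softmax(z):
    z = z - np.max(z)          # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

logits = np.array([1.2, 2.6])  # hypothetical network outputs for [cat, dog]
p = softmax(logits)
print("P(cat) =", round(float(p[0]), 2), " P(dog) =", round(float(p[1]), 2), " sum =", float(p.sum()))
# The two probabilities always sum to 1, so the network must "choose" cat or dog
# even if the image is something else entirely (e.g., a horse).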

6.S191 Introduction to Deep Learning


1/30/19
introtodeeplearning.com
Bayesian Deep Learning for Uncertainty
The network tries to learn the output, y, directly from raw data, x.

Find a mapping, f, parameterized by weights W, such that:

        min_W ℒ(y, f(x; W))

Bayesian neural networks instead aim to learn a posterior over the weights, ℙ(W | x, y):

        ℙ(W | x, y) = ℙ(y | x, W) · ℙ(W) / ℙ(y | x)        ← Intractable!
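The intractability comes from the evidence term in the denominator, which marginalizes over every possible weight configuration (spelled out here in its standard Bayesian-neural-network form for clarity, rather than taken from the slide):

        ℙ(y | x) = ∫ ℙ(y | x, W) · ℙ(W) dW

This is a high-dimensional integral over all weights, which is why practical methods rely on approximations such as the dropout-based sampling on the next slide.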

6.S191 Introduction to Deep Learning


1/30/19
introtodeeplearning.com
Elementwise Dropout for Uncertainty
Evaluate T stochastic forward passes through the network, {Wₜ}, t = 1…T.

Dropout as a form of stochastic sampling:  zᵢ,ₜ ~ Bernoulli(p)  ∀ i ∈ W,   Wₜ = W ⊙ zₜ

[Figure: the unregularized kernel W is multiplied elementwise (⊙) by a Bernoulli dropout mask zₜ to give a stochastically sampled kernel Wₜ]

        𝔼[y] ≈ (1/T) Σₜ f(x; Wₜ)

        Var[y] ≈ (1/T) Σₜ f(x; Wₜ)ᵀ f(x; Wₜ) − 𝔼[y]ᵀ 𝔼[y]

Gal and Ghahramani, ICML, 2016.


Amini, Soleimany, et al., NIPS Workshop on Bayesian Deep Learning, 2017.
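A sketch of this Monte Carlo dropout estimate, assuming model is a tf.keras network that contains Dropout layers; passing training=True keeps dropout active at test time, so each forward pass samples a different thinned network Wₜ:

import tensorflow as tf

def mc_dropout_predict(model, x, T=100):
    # T stochastic forward passes with dropout left ON (training=True)
    samples = tf.stack([model(x, training=True) for _ in range(T)], axis=0)   # shape (T, N, D)
    mean = tf.reduce_mean(samples, axis=0)                                    # ~ E[y]
    var = tf.reduce_mean(samples ** 2, axis=0) - mean ** 2                    # ~ Var[y], per output
    return mean, var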
6.S191 Introduction to Deep Learning
1/30/19
introtodeeplearning.com
Model Uncertainty Application

[Figure: input image, predicted depth, and model uncertainty]

Kendall, Gal, NIPS, 2017.


6.S191 Introduction to Deep Learning
1/30/19
introtodeeplearning.com
Multi-Task Learning Using Uncertainty

Kendall, et al., CVPR, 2018.
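The idea, as I read Kendall et al., is to weight each task's loss by a learned, task-dependent uncertainty so that no manual loss-balancing coefficients are needed; a sketch of the regression form of that weighting (the two-task setup and variable names are mine):

import tensorflow as tf

# Learned log-variances, one per task (trained jointly with the network weights).
log_var_1 = tf.Variable(0.0)   # log(sigma_1^2) for task 1, e.g. depth regression
log_var_2 = tf.Variable(0.0)   # log(sigma_2^2) for task 2, another regression target

def multitask_loss(loss_1, loss_2):
    # L_total = L_1 / (2*sigma_1^2) + L_2 / (2*sigma_2^2) + log(sigma_1) + log(sigma_2)
    total = 0.5 * tf.exp(-log_var_1) * loss_1 + 0.5 * log_var_1
    total += 0.5 * tf.exp(-log_var_2) * loss_2 + 0.5 * log_var_2
    return total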


6.S191 Introduction to Deep Learning
1/30/19
introtodeeplearning.com
New Frontiers II:
Learning to Learn
Motivation: Learning to Learn
Standard deep neural networks are optimized for a single task

Complexity of models increases → greater need for specialized engineers

Often require expert knowledge to build an architecture for a given task


Build a learning algorithm that learns which model to use to solve a given problem
AutoML
6.S191 Introduction to Deep Learning
1/30/19
introtodeeplearning.com
AutoML: Learning to Learn

Zoph and Le, ICLR 2017.


6.S191 Introduction to Deep Learning
1/30/19
introtodeeplearning.com
AutoML: Model Controller
At each step, the model samples a brand new network

Zoph and Le, ICLR 2017.


6.S191 Introduction to Deep Learning
1/30/19
introtodeeplearning.com
AutoML: The Child Network

[Diagram: training data → child network (sampled from the RNN controller) → prediction]

Compute the final accuracy on this dataset.

Update the RNN controller based on the accuracy of the child network after training.

Zoph and Le, ICLR 2017.
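A high-level sketch of this controller/child loop; controller and build_and_train are hypothetical placeholders (the actual Zoph & Le system uses an RNN controller trained with a policy-gradient update, with the child network's validation accuracy as the reward):

def automl_search(controller, build_and_train, num_rounds=100):
    # controller.sample()          -> (architecture description, log-probability of that sample)
    # build_and_train(arch)        -> validation accuracy of the trained child network
    # controller.reinforce_update(log_prob, reward) -> one policy-gradient step on the controller
    baseline = 0.0
    for _ in range(num_rounds):
        architecture, log_prob = controller.sample()      # 1. sample a child network
        accuracy = build_and_train(architecture)          # 2. train it and measure accuracy
        reward = accuracy - baseline                      # 3. accuracy (minus baseline) is the reward
        controller.reinforce_update(log_prob, reward)     # 4. update the controller
        baseline = 0.9 * baseline + 0.1 * accuracy        # running baseline to reduce variance
    return controller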


6.S191 Introduction to Deep Learning
1/30/19
introtodeeplearning.com
AutoML on the Cloud

Google Cloud.
6.S191 Introduction to Deep Learning
1/30/19
introtodeeplearning.com
AutoML Spawns a Powerful Idea
• Design an AI algorithm that can build new models capable of solving a task
• Reduces the need for experienced engineers to design the networks
• Makes deep learning more accessible to the public

Connection to Artificial General Intelligence: the ability to intelligently reason about how we learn

