CS 236 Section 3
Outline
ReLU Activation
Properties:
● No saturation (in the positive regime)
● Computationally cheap
● Empirically known to converge faster
Problems:
● Gradient is zero for negative inputs, so units can stop updating ("die"); see the sketch below
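A minimal NumPy sketch of ReLU and its gradient, illustrating the properties above (the function names and sample inputs are illustrative, not from the slides):

```python
import numpy as np

def relu(x):
    # ReLU: max(0, x) elementwise; no exponentials, so it is cheap to compute
    return np.maximum(0.0, x)

def relu_grad(x):
    # Gradient is 1 for positive inputs (no saturation) and 0 otherwise
    return (x > 0).astype(x.dtype)

x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(relu(x))       # [0.  0.  0.  0.5 2. ]
print(relu_grad(x))  # [0. 0. 0. 1. 1.]
```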
Backpropagation
● Problem statement
● Simple example
Problem Statement
Given a compound function, introduce intermediate variables and compute them in a forward pass (forward propagation).

Modularity - Neural Network Example
The same compound function is evaluated through its intermediate variables (forward propagation); the corresponding intermediate gradients are then computed with the chain rule in a backward pass (backward propagation). A worked example follows below.
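A minimal sketch of this idea, assuming a small compound function f(x, y, z) = (x + y) * z (this particular function is my choice, not taken from the slides): intermediate variables are cached on the forward pass, then intermediate gradients are filled in backward via the chain rule.

```python
# Forward propagation: compute and cache intermediate variables
x, y, z = -2.0, 5.0, -4.0
q = x + y            # intermediate variable q = x + y
f = q * z            # output f = q * z

# Backward propagation: intermediate gradients via the chain rule
df_df = 1.0          # df/df
df_dq = z * df_df    # df/dq = z
df_dz = q * df_df    # df/dz = q
df_dx = 1.0 * df_dq  # dq/dx = 1, so df/dx = df/dq
df_dy = 1.0 * df_dq  # dq/dy = 1, so df/dy = df/dq

print(f)                    # -12.0
print(df_dx, df_dy, df_dz)  # -4.0 -4.0 3.0
```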
Chain Rule Behavior
● Fully-Connected Layers
● Convolutional Layers
Fully Connected Layer
32x32x3 image -> stretch to 3072 x 1
[Figure: input x (3072 x 1), weights W (10 x 3072), output activation (10 x 1)]
Each output activation is 1 number: the result of taking a dot product between a row of W and the input (a 3072-dimensional dot product).
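A minimal NumPy sketch of this fully connected layer, assuming a random 32x32x3 input and a 10x3072 weight matrix (the variable names and random data are illustrative):

```python
import numpy as np

image = np.random.randn(32, 32, 3)  # 32x32x3 input image
x = image.reshape(3072, 1)          # stretch to a 3072 x 1 column vector

W = np.random.randn(10, 3072)       # 10 x 3072 weight matrix
b = np.random.randn(10, 1)          # one bias per output

activation = W @ x + b              # 10 x 1 output activation
# Each entry is a dot product between one row of W and the input
# (a 3072-dimensional dot product).
print(activation.shape)             # (10, 1)
```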
Convolution Layer
32x32x3 image -> preserve spatial structure
[Figure: input volume with width 32, height 32, depth 3]
Convolution Layer
32x32x3 image, 5x5x3 filter
Convolve the filter with the image, i.e. “slide over the image spatially, computing dot products.”
Filters always extend the full depth of the input volume.
Convolution Layer
32x32x3 image, 5x5x3 filter
At each spatial position we get 1 number: the result of taking a dot product between the filter and a small 5x5x3 chunk of the image (i.e. 5*5*3 = 75-dimensional dot product + bias).
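A minimal NumPy sketch of this single output value, assuming a random image and filter (data and names are illustrative): the filter is dotted with one 5x5x3 chunk of the image and a bias is added.

```python
import numpy as np

image = np.random.randn(32, 32, 3)   # 32x32x3 input volume
filt = np.random.randn(5, 5, 3)      # 5x5x3 filter (extends the full input depth)
bias = 0.1

# One spatial position, e.g. the top-left corner:
chunk = image[0:5, 0:5, :]           # a small 5x5x3 chunk of the image
value = np.sum(chunk * filt) + bias  # 75-dimensional dot product + bias
print(value)                         # 1 number
```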
Convolution Layer
Sliding the 5x5x3 filter over all spatial positions of the 32x32x3 image produces a 28x28x1 activation map.
Now consider a second, green filter: it produces its own 28x28x1 activation map.
Convolution Layer
For example, if we had 6 5x5 filters, we’ll get 6 separate activation maps (each 28x28).
We stack these up to get a “new image” of size 28x28x6! A sketch of the full layer is given below.
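A minimal sketch of the whole layer under the same assumptions (stride 1, no padding, random data): sliding 6 filters of size 5x5x3 over a 32x32x3 image gives a 28x28x6 output volume.

```python
import numpy as np

image = np.random.randn(32, 32, 3)     # 32x32x3 input
filters = np.random.randn(6, 5, 5, 3)  # 6 filters, each 5x5x3
biases = np.random.randn(6)

out = np.zeros((28, 28, 6))            # one 28x28 activation map per filter
for k in range(6):                     # for each filter
    for i in range(28):                # slide spatially (stride 1, no padding)
        for j in range(28):
            chunk = image[i:i+5, j:j+5, :]
            out[i, j, k] = np.sum(chunk * filters[k]) + biases[k]

print(out.shape)                       # (28, 28, 6): the stacked "new image"
```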
Preview: a ConvNet is a sequence of Convolution Layers, interspersed with activation functions.
[Figure: 32x32x3 input -> CONV, ReLU (e.g. 6 5x5x3 filters) -> 28x28x6 -> CONV, ReLU -> 24x24 output; spatial size shrinks 32 -> 28 -> 24]
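A minimal sketch of why the spatial size shrinks 32 -> 28 -> 24, using the standard output-size formula (input + 2*padding - filter) / stride + 1 with 5x5 filters, stride 1, and no padding:

```python
def conv_output_size(n_in, filter_size, stride=1, padding=0):
    # Output spatial size of a convolution with the given hyperparameters
    return (n_in + 2 * padding - filter_size) // stride + 1

size = 32
for layer in range(2):              # two CONV (5x5) + ReLU layers
    size = conv_output_size(size, 5)
    print(size)                     # 28, then 24
```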
RNNs
● Review of RNNs
● RNN Language Models
● Vanishing Gradient Problem
● GRUs
● LSTMs
Vanishing Gradient Problem
● When terms are less than 1, their product can get small very quickly (see the numeric sketch below)
● Vanishing gradients → RNNs fail to learn, since parameters barely update
● GRUs and LSTMs to the rescue!
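A minimal numeric sketch of the first bullet (the per-step factor 0.9 and the sequence length are arbitrary choices for illustration): multiplying many terms smaller than 1, as happens when backpropagating through time, drives the product toward zero.

```python
import numpy as np

factor = 0.9              # a per-step term with magnitude < 1
steps = np.arange(1, 101)
grads = factor ** steps   # product of `steps` such terms

print(grads[9])           # ~0.35 after 10 steps
print(grads[49])          # ~0.005 after 50 steps
print(grads[99])          # ~3e-5 after 100 steps: the gradient has vanished
```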
Transformers: High-Level Idea
● Architecture
● Self-Attention
● Multi-Head Attention
● Prediction vs. Sampling
Transformer Architecture
● Encoder stack
○ 6 layers, each with 2 sublayers: Multi-Head Attention + FFN
● Decoder stack
○ Same as the encoder, but with an additional encoder-decoder attention sublayer
● Positional encodings
○ Added to the input embeddings (see the sketch after this list)
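A minimal NumPy sketch of the last bullet, using the sinusoidal positional encodings from the original Transformer paper and adding them to random, illustrative input embeddings:

```python
import numpy as np

def positional_encoding(seq_len, d_model):
    # PE[pos, 2i]   = sin(pos / 10000^(2i/d_model))
    # PE[pos, 2i+1] = cos(pos / 10000^(2i/d_model))
    pos = np.arange(seq_len)[:, None]       # (seq_len, 1)
    i = np.arange(d_model // 2)[None, :]    # (1, d_model/2)
    angles = pos / (10000 ** (2 * i / d_model))
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

seq_len, d_model = 10, 512
embeddings = np.random.randn(seq_len, d_model)          # token embeddings
x = embeddings + positional_encoding(seq_len, d_model)  # added to the input embeddings
print(x.shape)                                          # (10, 512)
```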
Transformer (Simplified)
Self-Attention
Key Idea:
● Reference by context
● Attention: query with a vector, and look at similar things in your past
○ Find the most similar keys and retrieve the values that correspond to them
● Softmax gives you a probability distribution over keys
● Normalize by sqrt(d_k) for numerical stability
Self-Attention Example
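A minimal NumPy sketch of scaled dot-product self-attention matching the key idea above (queries scored against keys, softmax over keys, normalization by sqrt(d_k)); the toy shapes and random data are illustrative:

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)  # shift for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    Q, K, V = X @ Wq, X @ Wk, X @ Wv     # project inputs to queries, keys, values
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # query-key similarities, scaled by sqrt(d_k)
    weights = softmax(scores, axis=-1)   # probability distribution over keys
    return weights @ V                   # weighted sum of the corresponding values

seq_len, d_model, d_k = 4, 8, 8
X = np.random.randn(seq_len, d_model)    # one token embedding per row
Wq, Wk, Wv = (np.random.randn(d_model, d_k) for _ in range(3))
print(self_attention(X, Wq, Wk, Wv).shape)  # (4, 8)
```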
3 Types of Self-Attention
● Encoder self-attention
● Decoder (masked) self-attention
● Encoder-decoder attention

Prediction vs. Sampling
● During training, our examples are “labeled” in the sense that we know the true word that we are supposed to decode
● During sampling, we don’t know our target
○ Generate autoregressively, using the previously generated token as input (a sketch follows below)
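A minimal sketch of the autoregressive sampling loop described above; the toy next_token_probs model and vocabulary are placeholders for an actual trained decoder.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size = 5

def next_token_probs(tokens):
    # Placeholder for a trained decoder: ignores its input here and just
    # returns some distribution over the vocabulary.
    logits = rng.normal(size=vocab_size)
    return np.exp(logits) / np.exp(logits).sum()

tokens = [0]                              # start-of-sequence token
for _ in range(10):
    probs = next_token_probs(tokens)      # condition on what we generated so far
    nxt = rng.choice(vocab_size, p=probs) # sample the next token
    tokens.append(int(nxt))               # feed it back in as input

print(tokens)
```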
Acknowledgements
CNN slides adapted from Fei-Fei Li, Justin Johnson & Serena Yeung, Stanford CS231n Lecture 5 (April 17, 2018).