CS 231N Midterm Review

Midterm Logistics
● Multiple Choice
● True/False
● Short Answer Questions
● More emphasis on topics covered earlier in the course than on those discussed more recently

Focus is more on high-level understanding of concepts


How many layers are in a ResNet?
What problem does ResNet solve, and how?
More Logistics…
○ The midterm exam will take place from 12:00 to 1:20pm PT on Tuesday, May 16, in person at NVIDIA Auditorium, 420-040, and Hewlett 200.
○ If your last name begins with a letter between A and G (inclusive), you will take the exam at NVIDIA Auditorium.
○ If your last name begins with a letter between H and M (inclusive), you will take the exam at 420-040.
○ If your last name begins with a letter between N and Z (inclusive), you will take the exam at Hewlett 200.
○ Closed-book, no internet. One double-sided cheat sheet (written or typed) is allowed.
○ The exam may cover material from Assignments 1 and 2 and all lectures up to and including Lecture 12 (Visualizing and Understanding).
○ You will have 80 minutes to complete the exam. The exam will start promptly at 12:00pm; if you arrive late, you will not be given additional time.
Midterm Review
Plan
1. Transformers & Attention
2. RNNs
3. Backpropagation
4. Optimizers
5. CNNs
6. Normalization Layers
7. Regularization Techniques
General Attention
Self-Attention
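
For reference, a minimal NumPy sketch of scaled dot-product self-attention; the names X, Wq, Wk, Wv and the single-head, unmasked setup are illustrative assumptions, not taken from the slides.

```python
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    """Scaled dot-product self-attention for one sequence.

    X:          (T, D) input token embeddings
    Wq, Wk, Wv: (D, D_h) projection matrices (illustrative shapes)
    """
    Q = X @ Wq                                    # queries (T, D_h)
    K = X @ Wk                                    # keys    (T, D_h)
    V = X @ Wv                                    # values  (T, D_h)
    scores = Q @ K.T / np.sqrt(K.shape[-1])       # (T, T), scaled by sqrt(d)
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    attn = np.exp(scores)
    attn /= attn.sum(axis=-1, keepdims=True)      # softmax over keys
    return attn @ V                               # (T, D_h) weighted sum of values
```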
Transformer Encoder
Transformer Decoder
2 Layer Transformer Example

Credit: Medium article by Ketan Doshi


RNNs
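
As a refresher, a minimal sketch of the vanilla RNN recurrence h_t = tanh(x_t W_xh + h_{t-1} W_hh + b); shapes and names here are illustrative assumptions.

```python
import numpy as np

def rnn_step(x_t, h_prev, Wxh, Whh, b):
    """One step of a vanilla RNN: h_t = tanh(x_t W_xh + h_{t-1} W_hh + b)."""
    return np.tanh(x_t @ Wxh + h_prev @ Whh + b)

def rnn_forward(X, h0, Wxh, Whh, b):
    """Run the recurrence over a sequence X of shape (T, D); returns all hidden states."""
    h, hs = h0, []
    for x_t in X:
        h = rnn_step(x_t, h, Wxh, Whh, b)
        hs.append(h)
    return np.stack(hs)   # (T, H)
```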
Backpropagation

[Figure not reproduced: a worked example on a small computational graph with nodes a, m, n, p and an intermediate node z feeding the loss L, built up over several slides with the local and upstream gradient values filled in at each node.]
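
Since the graph from the slides is not reproduced above, here is an analogous toy example of backpropagation by hand: multiply each node's local gradient by its upstream gradient, working backward from the loss.

```python
# Toy computational graph: L = (a * m + n) * p
# (an illustrative stand-in for the graph drawn on the slides)
a, m, n, p = 2.0, 3.0, 4.0, 5.0

# Forward pass
z = a * m            # intermediate node
q = z + n
L = q * p            # L = 50.0

# Backward pass: local gradient times upstream gradient at each node
dL_dL = 1.0
dL_dq = p * dL_dL    # d(q*p)/dq = p
dL_dp = q * dL_dL    # d(q*p)/dp = q
dL_dz = 1.0 * dL_dq  # d(z+n)/dz = 1 (add gate passes gradients through)
dL_dn = 1.0 * dL_dq  # d(z+n)/dn = 1
dL_da = m * dL_dz    # d(a*m)/da = m (multiply gate swaps inputs)
dL_dm = a * dL_dz    # d(a*m)/dm = a

print(dL_da, dL_dm, dL_dn, dL_dp)   # 15.0 10.0 5.0 10.0
```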
Optimizers

Optimizer      | Per-Parameter Learning Rate | Momentum
-------------- | --------------------------- | --------
SGD            | No                          | No
SGD + Momentum | No                          | Yes
AdaGrad        | Yes                         | No
RMSProp        | Yes                         | No
Adam           | Yes                         | Yes


SGD
AdaGrad - Per-parameter learning rate scaling (no momentum)
RMSProp - AdaGrad with a decaying average of squared gradients, so the effective learning rate does not keep shrinking
Adam - Momentum + RMSProp (with bias correction)
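
A hedged NumPy sketch of the update rules compared above; the names (w, dw, cache, m, v) are illustrative, state variables are assumed to start as zeros of w's shape, and the momentum formulation shown is one of several equivalent conventions.

```python
import numpy as np

def sgd(w, dw, lr):
    return w - lr * dw

def sgd_momentum(w, dw, v, lr, rho=0.9):
    v = rho * v + dw                                # accumulate a velocity vector
    return w - lr * v, v

def adagrad(w, dw, cache, lr, eps=1e-8):
    cache += dw ** 2                                # sum of squared gradients (never decays)
    return w - lr * dw / (np.sqrt(cache) + eps), cache

def rmsprop(w, dw, cache, lr, decay=0.99, eps=1e-8):
    cache = decay * cache + (1 - decay) * dw ** 2   # leaky running average of squared gradients
    return w - lr * dw / (np.sqrt(cache) + eps), cache

def adam(w, dw, m, v, t, lr, beta1=0.9, beta2=0.999, eps=1e-8):
    m = beta1 * m + (1 - beta1) * dw                # first moment (momentum)
    v = beta2 * v + (1 - beta2) * dw ** 2           # second moment (RMSProp-style)
    m_hat = m / (1 - beta1 ** t)                    # bias correction, t starts at 1
    v_hat = v / (1 - beta2 ** t)
    return w - lr * m_hat / (np.sqrt(v_hat) + eps), m, v
```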
Visualization

Video credit: Lily Jiang
CNNs

● Each filter has the same number of channels as the input image.
● Each filter outputs a single-channel feature map.
● Therefore, the number of channels in the output is equal to the number of filters.
CNNs

● The learnable parameters are the weights and biases.
● Each filter has one bias, shared across all positions of its output feature map.
● For an input image with C channels and N filters, each of size F×F, the layer has N*(C*F*F + 1) learnable parameters (see the sketch below).
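
A tiny illustrative helper (not from the assignments) as a sanity check on the parameter-count formula above.

```python
def conv_param_count(C, N, F):
    """Learnable parameters in a conv layer: N filters of shape (C, F, F),
    each with one bias term."""
    return N * (C * F * F + 1)

# e.g. 32 filters of size 3x3 over an RGB input:
print(conv_param_count(C=3, N=32, F=3))   # 896
```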
CNNs

● Input shape: (C, H, W)
● User specifies: N filters, each of shape F×F, padding P, and stride S
● Output shape: (N, H', W'), where
  W' = (W − F + 2P)/S + 1
  H' = (H − F + 2P)/S + 1
  (see the sketch below)
● Note: the image on the left uses stride = 2.
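
A small illustrative helper for the output-shape formulas above, assuming (W − F + 2P) and (H − F + 2P) divide evenly by S.

```python
def conv_output_shape(C, H, W, N, F, P, S):
    """Output shape (N, H', W') of a conv layer, per the formulas above."""
    H_out = (H - F + 2 * P) // S + 1
    W_out = (W - F + 2 * P) // S + 1
    return (N, H_out, W_out)

# e.g. a 3x32x32 input with 10 filters of size 5x5, padding 2, stride 1:
print(conv_output_shape(C=3, H=32, W=32, N=10, F=5, P=2, S=1))   # (10, 32, 32)
```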


BatchNorm vs LayerNorm
BatchNorm: Normalize across all data points in the batch

LayerNorm: Normalize across the features of each data point

Input shape: (N, D)

BatchNorm: normalizes across N
LayerNorm: normalizes across D


BatchNorm vs LayerNorm
BatchNorm: Normalize across all data points in the batch

LayerNorm: Normalize across the features of each data point

Input shape: (N, C, H, W)

BatchNorm: normalizes across N*H*W (calculates mean and var for each channel, across all images in the batch)

LayerNorm: normalizes across C*H*W (calculates mean and var for each image, across all pixels in all channels)
BatchNorm vs LayerNorm
Input shape: (N, C, H, W)

BatchNorm: normalizes across N*H*W (calculates mean and var for each channel, across all images in the batch)

LayerNorm: normalizes across C*H*W (calculates mean and var for each image, across all channels)

What is the size of their learnable parameters?


BatchNorm vs LayerNorm
Input shape: (N, C, H, W)

BatchNorm: normalizes across N*H*W (reshape to (N*H*W, C)); calculates mean and var for each channel, across all images in the batch

LayerNorm: normalizes across C*H*W (reshape to (N, C*H*W)); calculates mean and var for each image, across all channels

What is the size of their learnable parameters?

BatchNorm: C (gamma and beta each have shape (C,))
LayerNorm: C*H*W (gamma and beta each have shape (C, H, W))
BatchNorm vs LayerNorm
One important difference:

BatchNorm calculates the mean and variance across the batch and stores running averages, which are used at test time.

LayerNorm behaves identically at train and test time (see the sketch below).
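
A rough NumPy sketch of the two layers over an (N, C, H, W) input, to make the axes, parameter shapes, and running statistics concrete; this is an illustrative simplification, not the assignment implementation.

```python
import numpy as np

def batchnorm_train(x, gamma, beta, running_mean, running_var,
                    momentum=0.9, eps=1e-5):
    """BatchNorm over (N, C, H, W): statistics per channel, across N*H*W.
    gamma and beta each have shape (C,)."""
    mean = x.mean(axis=(0, 2, 3), keepdims=True)          # (1, C, 1, 1)
    var = x.var(axis=(0, 2, 3), keepdims=True)
    x_hat = (x - mean) / np.sqrt(var + eps)
    out = gamma[None, :, None, None] * x_hat + beta[None, :, None, None]
    # Running averages are stored at train time and reused at test time.
    running_mean = momentum * running_mean + (1 - momentum) * mean.squeeze()
    running_var = momentum * running_var + (1 - momentum) * var.squeeze()
    return out, running_mean, running_var

def layernorm(x, gamma, beta, eps=1e-5):
    """LayerNorm over (N, C, H, W): statistics per image, across C*H*W.
    gamma and beta each have shape (C, H, W); identical at train and test time."""
    mean = x.mean(axis=(1, 2, 3), keepdims=True)          # (N, 1, 1, 1)
    var = x.var(axis=(1, 2, 3), keepdims=True)
    x_hat = (x - mean) / np.sqrt(var + eps)
    return gamma[None] * x_hat + beta[None]
```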


Regularization / Training a Neural Network
● L1 and L2 regularization penalize the magnitude of the weights (L2 encourages small weights; L1 also encourages sparsity).
● Dropout randomly zeroes activations during training, which discourages co-adaptation and adds redundancy to the learned representation (see the sketch below).
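
A brief illustrative sketch of an L2 penalty and inverted dropout; the names reg and p are assumptions, not taken from the slides.

```python
import numpy as np

def l2_regularized_loss(data_loss, W, reg=1e-4):
    """Add an L2 penalty on the weights: L = data_loss + reg * sum(W**2)."""
    return data_loss + reg * np.sum(W ** 2)

def dropout_forward(x, p=0.5, train=True):
    """Inverted dropout: at train time, zero each activation with probability
    1 - p and rescale by 1/p, so no extra scaling is needed at test time."""
    if not train:
        return x
    mask = (np.random.rand(*x.shape) < p) / p
    return x * mask
```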
