CS 236 Section 3

The document discusses neural network basics including forward and backpropagation, activation functions like ReLU and their properties, optimization algorithms like stochastic gradient descent, and common neural network architectures like convolutional neural networks. It provides details on convolutional layers, activation maps, and regularization techniques like dropout to prevent overfitting.

Section 3: Neural Networks

Outline

● Neural Network basics


● Backpropagation
● CNNs
● RNNs
● Transformers
Neural Network (NN) Basics
Dataset: (x, y) where x: inputs, y: labels

Steps to train a 1-hidden layer NN:

● Do a forward pass: ŷ = f(xW + b)


● Compute loss: loss(y, ŷ)
● Compute gradients using backprop
● Update weights using an optimization
algorithm, like SGD
● Do hyperparameter tuning on Dev set
● Evaluate on Test set
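
A minimal PyTorch sketch of these training steps (PyTorch, the layer sizes, and the random batch are illustrative assumptions, not part of the slides):

import torch
import torch.nn as nn

# Hypothetical sizes: 100-d inputs, 50 hidden units, 10 classes.
model = nn.Sequential(nn.Linear(100, 50), nn.ReLU(), nn.Linear(50, 10))
loss_fn = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

x = torch.randn(32, 100)           # a batch of inputs (placeholder data)
y = torch.randint(0, 10, (32,))    # labels

y_hat = model(x)                   # forward pass: ŷ = f(xW + b)
loss = loss_fn(y_hat, y)           # compute loss(y, ŷ)
optimizer.zero_grad()
loss.backward()                    # compute gradients using backprop
optimizer.step()                   # update weights using SGD

Hyperparameters (hidden size, learning rate, …) would then be tuned on the dev set before a single final evaluation on the test set.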
Activation Functions: Sigmoid

Properties:

● Squashes input between 0 and 1.

Problems:

● Saturation of neurons kills gradients.


● Output is not centered at 0.
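
For reference, the standard sigmoid and its derivative (not written out on the slide) make the saturation problem concrete:

σ(x) = 1 / (1 + e^(−x)),    σ'(x) = σ(x) (1 − σ(x)) ≤ 1/4

For large |x|, σ'(x) ≈ 0, so gradients flowing through a saturated neuron are effectively killed.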
Activation Functions: Tanh

Properties:

● Squashes input between -1 and 1.


● Output centered at 0.

Problems:

● Saturation of neurons kills gradients.


Activation Functions: ReLU

Properties:

● No saturation
● Computationally cheap
● Empirically known to converge faster

Problems:

● Output not centered at 0
● Dead ReLUs: when the input is < 0, the ReLU gradient is 0, so the
weights feeding that neuron never change.
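
A tiny NumPy sketch of that dead-ReLU behavior (NumPy and the example values are assumptions):

import numpy as np

x = np.array([-2.0, -0.5, 0.5, 3.0])
out = np.maximum(x, 0.0)          # ReLU forward: [0., 0., 0.5, 3.]
grad = (x > 0).astype(float)      # local gradient: [0., 0., 1., 1.]
# Any upstream gradient is multiplied by 0 for the negative inputs, so a
# neuron whose input is always < 0 never gets a weight update ("dead ReLU").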
Stochastic Gradient Descent (SGD)

● SGD update rule: θ ← θ − α ∇θ J(θ)
○ θ : weights/parameters
○ α : learning rate
○ J : loss function
● The SGD update happens after every training example.
● Minibatch SGD (sometimes also abbreviated as SGD) considers a small
batch of training examples at once, averages their loss, and updates θ.
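
A minimal NumPy sketch of one minibatch SGD step; the linear model with squared error is purely a stand-in gradient, and the data and sizes are made up:

import numpy as np

def grad_J(theta, X, y):
    # Average gradient of the squared error of a linear model over the batch.
    return X.T @ (X @ theta - y) / len(y)

def sgd_step(theta, X_batch, y_batch, alpha=0.1):
    return theta - alpha * grad_J(theta, X_batch, y_batch)  # θ ← θ − α ∇θ J

rng = np.random.default_rng(0)
X_batch, y_batch = rng.normal(size=(8, 3)), rng.normal(size=8)  # toy minibatch
theta = np.zeros(3)
theta = sgd_step(theta, X_batch, y_batch)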
Backpropagation

● Problem statement
● Simple example
Problem Statement

Given a function f with respect to inputs x, labels y, and parameters θ,
compute the gradient of the Loss with respect to θ.
Backpropagation

An algorithm for computing the gradient of a compound function as a
series of local, intermediate gradients.
Backpropagation

1. Identify intermediate functions (forward prop)


2. Compute local gradients
3. Combine with upstream error signal to get full gradient
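
A self-contained sketch of these three steps on the toy function L = (w·x + b − y)², with made-up scalar values:

# 1. Identify intermediate functions (forward prop).
x, y, w, b = 2.0, 1.0, 0.5, 0.1
z = w * x + b            # intermediate: z = wx + b
r = z - y                # intermediate: residual
L = r ** 2               # loss

# 2. Compute local gradients.
dL_dr = 2 * r
dr_dz = 1.0
dz_dw, dz_db = x, 1.0

# 3. Combine with the upstream error signal (chain rule).
dL_dz = dL_dr * dr_dz
dL_dw = dL_dz * dz_dw    # = 2 * (w*x + b - y) * x
dL_db = dL_dz * dz_db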
Modularity - Simple Example

(Figure: a compound function broken into intermediate variables for forward propagation.)

Modularity - Neural Network Example

(Figure: a neural network’s compound function broken into intermediate variables
(forward propagation) and the corresponding intermediate gradients (backward propagation).)
Chain Rule Behavior

Key chain rule intuition:


Slopes multiply
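
In symbols, for L = f(z) with z = g(x):

∂L/∂x = (∂L/∂z) · (∂z/∂x)

The local slope ∂z/∂x multiplies the upstream slope ∂L/∂z, and longer chains simply multiply more slopes together.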
Backprop Menu for Success

1. Write down variable graph

2. Compute derivative of cost function

3. Keep track of error signals

4. Enforce shape rule on error signals

5. Use matrix balancing when deriving over a linear transformation
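
As a concrete instance of steps 4 and 5: for a linear transformation Z = XW, with X of shape (n × d), W of shape (d × h), and upstream error signal δ = ∂L/∂Z of shape (n × h), the only shape-consistent combinations are

∂L/∂W = Xᵀ δ   (d × h),    ∂L/∂X = δ Wᵀ   (n × d),

so enforcing the shape rule essentially dictates how the matrices must be balanced.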


Regularization: Dropout

● Randomly drop neurons during the forward pass at training time.
● At test time, turn dropout off. Dropout prevents overfitting by forcing the
network to learn redundancies.
● Think of dropout as training an ensemble of networks.
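
A minimal NumPy sketch of one common (“inverted”) dropout implementation; the slide does not commit to a specific variant, so treat this as illustrative:

import numpy as np

def dropout(h, p_drop=0.5, train=True):
    if not train:
        return h                                         # dropout off at test time
    mask = (np.random.rand(*h.shape) > p_drop) / (1.0 - p_drop)
    return h * mask                                      # zero random neurons, rescale the rest

h = np.random.rand(4, 8)              # hypothetical hidden activations
h_train = dropout(h, p_drop=0.5)      # training: random neurons dropped
h_test = dropout(h, train=False)      # test: identity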
Training Tips and Tricks

● Learning rate:
○ If the loss curve seems to be unstable (jagged line), decrease the learning rate.
○ If the loss curve appears to be “linear”, increase the learning rate.

(Figure: loss curves over training for very high, high, low, and good learning rates.)

Training Tips and Tricks

● Regularization (Dropout, L2 Norm, …): if the gap between train and dev
accuracies is large (overfitting), increase the regularization constant.

DO NOT test your model on the test set until overfitting is no longer an issue.
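
For the L2-norm case, “increase the regularization constant” just means raising λ in a loss of the form sketched below (the data loss value and weight matrix are toy placeholders):

import numpy as np

def l2_regularized_loss(data_loss, weights, lam=1e-3):
    # Larger lam penalizes large weights more strongly, shrinking the train/dev gap.
    return data_loss + lam * sum(np.sum(W ** 2) for W in weights)

total = l2_regularized_loss(0.42, [np.ones((3, 2))], lam=1e-3)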
CNNs

● Fully-Connected Layers
● Convolutional Layers
Fully Connected Layer

32x32x3 image -> stretch to 3072 x 1

(Figure: the 3072 x 1 input vector is multiplied by a 10 x 3072 weight matrix W to
produce a 10 x 1 activation vector.)

Each activation is 1 number:
the result of taking a dot product
between a row of W and the input
(a 3072-dimensional dot product)
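
A shape-only NumPy sketch of this fully connected layer (random placeholder values):

import numpy as np

image = np.random.rand(32, 32, 3)    # 32x32x3 input image
x = image.reshape(3072)              # stretch to a 3072-dim vector
W = np.random.rand(10, 3072)         # 10 x 3072 weight matrix
b = np.zeros(10)
activation = W @ x + b               # 10 numbers, each a 3072-dim dot product
# activation.shape == (10,)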
Convolution Layer

32x32x3 image -> preserve spatial structure

(Figure: the input volume, 32 in width, 32 in height, 3 in depth.)
Convolution Layer

32x32x3 image, 5x5x3 filter

Convolve the filter with the image,
i.e. “slide over the image spatially,
computing dot products”

Filters always extend the full depth of the input volume.
Convolution Layer

32x32x3 image, 5x5x3 filter

At each spatial position the result is 1 number:
the result of taking a dot product between the
filter and a small 5x5x3 chunk of the image
(i.e. 5*5*3 = 75-dimensional dot product + bias)
Convolution Layer

32x32x3 image, 5x5x3 filter

Convolve (slide) over all spatial locations to produce a 28x28x1 activation map.

Consider a second (green) 5x5x3 filter: convolving it over all spatial locations
produces a second 28x28 activation map.
Convolution Layer

For example, if we had 6 5x5 filters, we’ll get 6 separate 28x28 activation maps.

We stack these up to get a “new image” of size 28x28x6!
Preview: a ConvNet is a sequence of Convolution Layers, interspersed with
activation functions

32x32x3 input
→ CONV, ReLU (e.g. 6 5x5x3 filters) → 28x28x6
→ CONV, ReLU (e.g. 10 5x5x6 filters) → 24x24x10
→ ….
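
A PyTorch sketch of the shapes in this preview (PyTorch, the random input, and the channels-first layout are assumptions, not from the slides):

import torch
import torch.nn as nn

x = torch.randn(1, 3, 32, 32)              # one 32x32x3 image, channels first
conv1 = nn.Conv2d(3, 6, kernel_size=5)     # 6 filters of size 5x5x3
conv2 = nn.Conv2d(6, 10, kernel_size=5)    # 10 filters of size 5x5x6
h1 = torch.relu(conv1(x))                  # shape (1, 6, 28, 28): six 28x28 activation maps
h2 = torch.relu(conv2(h1))                 # shape (1, 10, 24, 24)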
RNNs

● Review of RNNs
● RNN Language Models
● Vanishing Gradient Problem
● GRUs
● LSTMs
RNNs are good for:

● Learning representations for sequential data with temporal relationships


● Predictions can be made at every timestep, or at the end of a sequence
● Q: How do we incorporate past information to make predictions about the
future?
RNNs
Key points:

● Weights are shared (tied) across


timesteps
● Hidden state at time t depends on
previous hidden state and new input
● Backpropagation across timesteps
(use unrolled network)
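
One standard (vanilla/Elman) form of that recurrence; the slides leave the exact parameterization to the figure:

h_t = tanh(W_hh h_{t-1} + W_xh x_t + b),    ŷ_t = softmax(W_hy h_t + c)

with the same weight matrices reused (tied) at every timestep.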
RNN Language Model

● Language Modeling (LM): task of computing probability distributions over


sequence of words P(w_1, …, w_T)
● Important role in speech recognition, text summarization, etc.
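
The chain-rule factorization that an RNN language model approximates, one conditional (a softmax over the vocabulary given h_t) per timestep:

P(w_1, …, w_T) = ∏_{t=1}^{T} P(w_t | w_1, …, w_{t-1})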
Vanishing Gradient Problem

● Backprop in RNNs: recursive gradient call for hidden layer


● Magnitudes of the gradients of typical activation functions are between 0 and 1.

● When the terms are less than 1, the product can get small very quickly
● Vanishing gradients → RNNs fail to learn, since parameters barely update.
● GRUs and LSTMs to the rescue!
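
Concretely, the recursive gradient call multiplies one factor per timestep; for a vanilla RNN with h_j = tanh(W_hh h_{j-1} + W_xh x_j + b),

∂h_t / ∂h_k = ∏_{j=k+1}^{t} ∂h_j / ∂h_{j-1} = ∏_{j=k+1}^{t} diag(tanh'(·)) W_hh

and since tanh' ≤ 1, the product (and hence the gradient) can shrink exponentially as t − k grows.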
High-Level Idea

Gating mechanisms control information flow:

● How much do I care about the past?


● How much do I care about the present?
● How much do I want to output at the current timestep?

These questions are the underlying mechanisms behind GRUs/LSTMs.


Gated Recurrent Units (GRUs)

● z_t: Update gate


● r_t: Reset gate
● h_t: Cell memory content
○ Mixture of past memory and current
memory content
○ Also functions as cell output
● The reset and update gates control
long- and short-term dependencies
(mitigate vanishing gradients
problem!)
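
One common formulation of the GRU updates (the exact notation lives in the figure, and gate conventions vary slightly across references):

z_t = σ(W_z x_t + U_z h_{t-1})
r_t = σ(W_r x_t + U_r h_{t-1})
h̃_t = tanh(W x_t + U (r_t ⊙ h_{t-1}))
h_t = (1 − z_t) ⊙ h_{t-1} + z_t ⊙ h̃_t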
LSTMs

● f_t: forget gate


○ How much do I care about the past?
● i_t: input gate
○ How much do I care about the
present?
● o_t: output gate
○ How much information do I output?
● C_t: current cell state
● h_t: cell output
● Cell state + output are separate!
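
The standard LSTM equations, matching the gate names above (the concatenated [h_{t-1}, x_t] parameterization is one common convention):

f_t = σ(W_f [h_{t-1}, x_t] + b_f)
i_t = σ(W_i [h_{t-1}, x_t] + b_i)
o_t = σ(W_o [h_{t-1}, x_t] + b_o)
C̃_t = tanh(W_C [h_{t-1}, x_t] + b_C)
C_t = f_t ⊙ C_{t-1} + i_t ⊙ C̃_t
h_t = o_t ⊙ tanh(C_t)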
So What’s Missing?

● RNNs + variants very successful for variable-length representations/seqs


○ Gating (LSTMs) for long-range error propagation
● But what if we want context from a *really* long time back? (many
thousands of steps)
● Sequentiality prohibits parallelization within instances
● Long-range dependencies are still tricky
Transformer

● Architecture
● Self-Attention
● Multi-Attention Heads
● Prediction vs. Sampling
Transformer Architecture

● Encoder stack
○ 6 layers, each with 2 sublayers:
Multi-head Attention + FFN
● Decoder stack
○ Same as the encoder, but with an
additional encoder-decoder attention sublayer
● Positional encodings
○ Added to input embedding
Transformer (Simplified)
Self-Attention

Key Idea:

● Reference by context
● Attention: query with a vector, and look at similar things in your past
○ Find most similar key and get values that correspond to these similar things
● Softmax gives you probability distribution over keys
● Normalize by sqrt(d_k) for numerical stability
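
Putting those bullets together, scaled dot-product attention (as defined in the Transformer paper) is:

Attention(Q, K, V) = softmax(Q Kᵀ / sqrt(d_k)) V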
Self-Attention Example
3 Types of Self-Attention

● Encoder self-attention: attends everywhere in the input


● Encoder-decoder attention (from output, attends to input)
● Masked decoder attention: only attends to things before
Multi-Attention Heads

Multiple attention heads to get more “representational power”
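
In the Transformer paper this takes the form

MultiHead(Q, K, V) = Concat(head_1, …, head_h) W^O,  where head_i = Attention(Q W_i^Q, K W_i^K, V W_i^V),

with each head attending through its own learned projections.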


Transformer

● Final linear layer


○ Projection of decoder output into a logits vector → each cell corresponds to the score of a
unique word in the possible vocabulary
● Softmax layer
○ Turns logits into probabilities, which we use for prediction (training) or sampling
(testing/inference)
● Cross Entropy Loss
○ Compare two probability distributions
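
A small NumPy sketch of the logits → softmax → cross-entropy path (the 3-word vocabulary and scores are toy values):

import numpy as np

logits = np.array([2.0, 0.5, -1.0])     # one score per vocabulary word
probs = np.exp(logits - logits.max())
probs /= probs.sum()                    # softmax: logits -> probabilities
target = 0                              # index of the true next word
loss = -np.log(probs[target])           # cross-entropy against the one-hot target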
Transformer: Prediction vs. Sampling

● During training, our examples are “labeled” in the sense that we know the
true word that we are supposed to decode
● During sampling, we don’t know our target
○ Generate autoregressively, but using as input the previously generated token
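
A hedged sketch of that sampling loop; decode_step below is a stand-in for the trained decoder, not an API from the slides:

import numpy as np

def decode_step(prefix):
    # Stand-in for the Transformer decoder: returns a probability
    # distribution over a toy 5-token vocabulary given the prefix.
    rng = np.random.default_rng(len(prefix))
    p = rng.random(5)
    return p / p.sum()

BOS, EOS = 0, 4
tokens = [BOS]
for _ in range(20):                              # generate autoregressively
    probs = decode_step(tokens)                  # condition on previously generated tokens
    nxt = int(np.random.default_rng().choice(5, p=probs))  # sample (or argmax for greedy)
    tokens.append(nxt)
    if nxt == EOS:
        break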
Acknowledgements

● Slides adapted from:


○ Justin Johnson, Serena Yeung, and Fei-Fei Li (CS231N, Spring 2018) [slides]
○ Barak Oshri, Lisa Wang, and Juhi Naik (CS224N, Winter 2017) [slides]
○ Nish Chintala (CS236, Fall 2018) [slides]
● Chris Olah, OpenAI [blog]
● Lukasz Kaiser, Google Brain [talk]
● Anna Huang and Ashish Vaswani, Google Brain [slides] [paper]
● Jay Alammar, [blog]
