CS 236 Section 3

The document discusses neural network basics including forward and backpropagation, activation functions like ReLU and their properties, optimization algorithms like stochastic gradient descent, and common neural network architectures like convolutional neural networks. It provides details on convolutional layers, activation maps, and regularization techniques like dropout to prevent overfitting.

Section 3: Neural Networks

Outline

● Neural Network basics


● Backpropagation
● CNNs
● RNNs
● Transformers
Neural Network (NN) Basics
Dataset: (x, y) where x: inputs, y: labels

Steps to train a 1-hidden layer NN:

● Do a forward pass: ŷ = f(xW + b)


● Compute loss: loss(y, ŷ)
● Compute gradients using backprop
● Update weights using an optimization
algorithm, like SGD
● Do hyperparameter tuning on Dev set
● Evaluate on Test set
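
A minimal PyTorch sketch of these training steps (PyTorch, the layer sizes, and the random batch are illustrative assumptions, not part of the slides):

import torch
import torch.nn as nn

# Hypothetical sizes: 100-d inputs, 50 hidden units, 10 classes.
model = nn.Sequential(nn.Linear(100, 50), nn.ReLU(), nn.Linear(50, 10))
loss_fn = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

x = torch.randn(32, 100)           # a batch of inputs (placeholder data)
y = torch.randint(0, 10, (32,))    # labels

y_hat = model(x)                   # forward pass: ŷ = f(xW + b)
loss = loss_fn(y_hat, y)           # compute loss(y, ŷ)
optimizer.zero_grad()
loss.backward()                    # compute gradients using backprop
optimizer.step()                   # update weights using SGD

Hyperparameters (hidden size, learning rate, …) would then be tuned on the dev set before a single final evaluation on the test set.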
Activation Functions: Sigmoid

Properties:

● Squashes input between 0 and 1.

Problems:

● Saturation of neurons kills gradients.


● Output is not centered at 0.
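
For reference, the standard sigmoid and its derivative (not written out on the slide) make the saturation problem concrete:

σ(x) = 1 / (1 + e^(−x)),    σ'(x) = σ(x) (1 − σ(x)) ≤ 1/4

For large |x|, σ'(x) ≈ 0, so gradients flowing through a saturated neuron are effectively killed.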
Activation Functions: Tanh

Properties:

● Squashes input between -1 and 1.


● Output centered at 0.

Problems:

● Saturation of neurons kills gradients.


Activation Functions: ReLU

Properties:

● No saturation
● Computationally cheap
● Empirically known to converge faster

Problems:

● Output not centered at 0
● Dead ReLUs: when the input is < 0, the ReLU gradient is 0, so the
weights feeding that neuron never change.
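
A tiny NumPy sketch of that dead-ReLU behavior (NumPy and the example values are assumptions):

import numpy as np

x = np.array([-2.0, -0.5, 0.5, 3.0])
out = np.maximum(x, 0.0)          # ReLU forward: [0., 0., 0.5, 3.]
grad = (x > 0).astype(float)      # local gradient: [0., 0., 1., 1.]
# Any upstream gradient is multiplied by 0 for the negative inputs, so a
# neuron whose input is always < 0 never gets a weight update ("dead ReLU").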
Stochastic Gradient Descent (SGD)

● SGD update rule: θ ← θ − α ∇θ J(θ)
○ θ : weights/parameters
○ α : learning rate
○ J : loss function
● The SGD update happens after every training example.
● Minibatch SGD (sometimes also abbreviated as SGD) considers a small
batch of training examples at once, averages their loss, and updates θ.
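
A minimal NumPy sketch of one minibatch SGD step; the linear model with squared error is purely a stand-in gradient, and the data and sizes are made up:

import numpy as np

def grad_J(theta, X, y):
    # Average gradient of the squared error of a linear model over the batch.
    return X.T @ (X @ theta - y) / len(y)

def sgd_step(theta, X_batch, y_batch, alpha=0.1):
    return theta - alpha * grad_J(theta, X_batch, y_batch)  # θ ← θ − α ∇θ J

rng = np.random.default_rng(0)
X_batch, y_batch = rng.normal(size=(8, 3)), rng.normal(size=8)  # toy minibatch
theta = np.zeros(3)
theta = sgd_step(theta, X_batch, y_batch)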
Backpropagation

● Problem statement
● Simple example
Problem Statement

Given a function f with respect to inputs x, labels y, and parameters θ,
compute the gradient of the Loss with respect to θ.
Backpropagation

An algorithm for computing the gradient of a compound function as a
series of local, intermediate gradients.
Backpropagation

1. Identify intermediate functions (forward prop)


2. Compute local gradients
3. Combine with upstream error signal to get full gradient
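
A self-contained sketch of these three steps on the toy function L = (w·x + b − y)², with made-up scalar values:

# 1. Identify intermediate functions (forward prop).
x, y, w, b = 2.0, 1.0, 0.5, 0.1
z = w * x + b            # intermediate: z = wx + b
r = z - y                # intermediate: residual
L = r ** 2               # loss

# 2. Compute local gradients.
dL_dr = 2 * r
dr_dz = 1.0
dz_dw, dz_db = x, 1.0

# 3. Combine with the upstream error signal (chain rule).
dL_dz = dL_dr * dr_dz
dL_dw = dL_dz * dz_dw    # = 2 * (w*x + b - y) * x
dL_db = dL_dz * dz_db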
Modularity - Simple Example

(Figure: a compound function broken into intermediate variables for forward propagation.)

Modularity - Neural Network Example

(Figure: a neural network’s compound function broken into intermediate variables
(forward propagation) and the corresponding intermediate gradients (backward propagation).)
Chain Rule Behavior

Key chain rule intuition:


Slopes multiply
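
In symbols, for L = f(z) with z = g(x):

∂L/∂x = (∂L/∂z) · (∂z/∂x)

The local slope ∂z/∂x multiplies the upstream slope ∂L/∂z, and longer chains simply multiply more slopes together.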
Backprop Menu for Success

1. Write down variable graph

2. Compute derivative of cost function

3. Keep track of error signals

4. Enforce shape rule on error signals

5. Use matrix balancing when deriving over a linear transformation
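
As a concrete instance of steps 4 and 5: for a linear transformation Z = XW, with X of shape (n × d), W of shape (d × h), and upstream error signal δ = ∂L/∂Z of shape (n × h), the only shape-consistent combinations are

∂L/∂W = Xᵀ δ   (d × h),    ∂L/∂X = δ Wᵀ   (n × d),

so enforcing the shape rule essentially dictates how the matrices must be balanced.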


Regularization: Dropout

● Randomly drop neurons during the forward pass at training time.
● At test time, turn dropout off. Dropout prevents overfitting by forcing the
network to learn redundancies.
● Think of dropout as training an ensemble of networks.
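
A minimal NumPy sketch of one common (“inverted”) dropout implementation; the slide does not commit to a specific variant, so treat this as illustrative:

import numpy as np

def dropout(h, p_drop=0.5, train=True):
    if not train:
        return h                                         # dropout off at test time
    mask = (np.random.rand(*h.shape) > p_drop) / (1.0 - p_drop)
    return h * mask                                      # zero random neurons, rescale the rest

h = np.random.rand(4, 8)              # hypothetical hidden activations
h_train = dropout(h, p_drop=0.5)      # training: random neurons dropped
h_test = dropout(h, train=False)      # test: identity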
Training Tips and Tricks

● Learning rate:
○ If the loss curve seems to be unstable (jagged line), decrease the learning rate.
○ If the loss curve appears to be “linear”, increase the learning rate.

(Figure: loss curves over training for very high, high, low, and good learning rates.)

Training Tips and Tricks

● Regularization (Dropout, L2 Norm, …): if the gap between train and dev
accuracies is large (overfitting), increase the regularization constant.

DO NOT test your model on the test set until overfitting is no longer an issue.
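
For the L2-norm case, “increase the regularization constant” just means raising λ in a loss of the form sketched below (the data loss value and weight matrix are toy placeholders):

import numpy as np

def l2_regularized_loss(data_loss, weights, lam=1e-3):
    # Larger lam penalizes large weights more strongly, shrinking the train/dev gap.
    return data_loss + lam * sum(np.sum(W ** 2) for W in weights)

total = l2_regularized_loss(0.42, [np.ones((3, 2))], lam=1e-3)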
CNNs

● Fully-Connected Layers
● Convolutional Layers
Fully Connected Layer

32x32x3 image -> stretch to 3072 x 1

(Figure: the 3072 x 1 input vector is multiplied by a 10 x 3072 weight matrix W to
produce a 10 x 1 activation vector.)

Each activation is 1 number:
the result of taking a dot product
between a row of W and the input
(a 3072-dimensional dot product)
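
A shape-only NumPy sketch of this fully connected layer (random placeholder values):

import numpy as np

image = np.random.rand(32, 32, 3)    # 32x32x3 input image
x = image.reshape(3072)              # stretch to a 3072-dim vector
W = np.random.rand(10, 3072)         # 10 x 3072 weight matrix
b = np.zeros(10)
activation = W @ x + b               # 10 numbers, each a 3072-dim dot product
# activation.shape == (10,)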
Convolution Layer

32x32x3 image -> preserve spatial structure

(Figure: the input volume, 32 in width, 32 in height, 3 in depth.)
Convolution Layer

32x32x3 image, 5x5x3 filter

Convolve the filter with the image,
i.e. “slide over the image spatially,
computing dot products”

Filters always extend the full depth of the input volume.
Convolution Layer

32x32x3 image, 5x5x3 filter

At each spatial position the result is 1 number:
the result of taking a dot product between the
filter and a small 5x5x3 chunk of the image
(i.e. 5*5*3 = 75-dimensional dot product + bias)
Convolution Layer

32x32x3 image, 5x5x3 filter

Convolve (slide) over all spatial locations to produce a 28x28x1 activation map.

Consider a second (green) 5x5x3 filter: convolving it over all spatial locations
produces a second 28x28 activation map.
Convolution Layer

For example, if we had 6 5x5 filters, we’ll get 6 separate 28x28 activation maps.

We stack these up to get a “new image” of size 28x28x6!
Preview: a ConvNet is a sequence of Convolution Layers, interspersed with
activation functions

32x32x3 input
→ CONV, ReLU (e.g. 6 5x5x3 filters) → 28x28x6
→ CONV, ReLU (e.g. 10 5x5x6 filters) → 24x24x10
→ ….
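
A PyTorch sketch of the shapes in this preview (PyTorch, the random input, and the channels-first layout are assumptions, not from the slides):

import torch
import torch.nn as nn

x = torch.randn(1, 3, 32, 32)              # one 32x32x3 image, channels first
conv1 = nn.Conv2d(3, 6, kernel_size=5)     # 6 filters of size 5x5x3
conv2 = nn.Conv2d(6, 10, kernel_size=5)    # 10 filters of size 5x5x6
h1 = torch.relu(conv1(x))                  # shape (1, 6, 28, 28): six 28x28 activation maps
h2 = torch.relu(conv2(h1))                 # shape (1, 10, 24, 24)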
RNNs

● Review of RNNs
● RNN Language Models
● Vanishing Gradient Problem
● GRUs
● LSTMs
RNNs are good for:

● Learning representations for sequential data with temporal relationships


● Predictions can be made at every timestep, or at the end of a sequence
● Q: How do we incorporate past information to make predictions about the
future?
RNNs
Key points:

● Weights are shared (tied) across


timesteps
● Hidden state at time t depends on
previous hidden state and new input
● Backpropagation across timesteps
(use unrolled network)
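
One standard (vanilla/Elman) form of that recurrence; the slides leave the exact parameterization to the figure:

h_t = tanh(W_hh h_{t-1} + W_xh x_t + b),    ŷ_t = softmax(W_hy h_t + c)

with the same weight matrices reused (tied) at every timestep.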
RNN Language Model

● Language Modeling (LM): task of computing probability distributions over


sequence of words P(w_1, …, w_T)
● Important role in speech recognition, text summarization, etc.
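
The chain-rule factorization that an RNN language model approximates, one conditional (a softmax over the vocabulary given h_t) per timestep:

P(w_1, …, w_T) = ∏_{t=1}^{T} P(w_t | w_1, …, w_{t-1})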
Vanishing Gradient Problem

● Backprop in RNNs: recursive gradient call for hidden layer


● Magnitudes of the gradients of typical activation functions are between 0 and 1.

● When the terms are less than 1, the product can get small very quickly
● Vanishing gradients → RNNs fail to learn, since parameters barely update.
● GRUs and LSTMs to the rescue!
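
Concretely, the recursive gradient call multiplies one factor per timestep; for a vanilla RNN with h_j = tanh(W_hh h_{j-1} + W_xh x_j + b),

∂h_t / ∂h_k = ∏_{j=k+1}^{t} ∂h_j / ∂h_{j-1} = ∏_{j=k+1}^{t} diag(tanh'(·)) W_hh

and since tanh' ≤ 1, the product (and hence the gradient) can shrink exponentially as t − k grows.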
High-Level Idea

Gating mechanisms control information flow:

● How much do I care about the past?


● How much do I care about the present?
● How much do I want to output at the current timestep?

These questions are the underlying mechanisms behind GRUs/LSTMs.


Gated Recurrent Units (GRUs)

● z_t: Update gate


● r_t: Reset gate
● h_t: Cell memory content
○ Mixture of past memory and current
memory content
○ Also functions as cell output
● The reset and update gates control
long- and short-term dependencies
(mitigate vanishing gradients
problem!)
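
One common formulation of the GRU updates (the exact notation lives in the figure, and gate conventions vary slightly across references):

z_t = σ(W_z x_t + U_z h_{t-1})
r_t = σ(W_r x_t + U_r h_{t-1})
h̃_t = tanh(W x_t + U (r_t ⊙ h_{t-1}))
h_t = (1 − z_t) ⊙ h_{t-1} + z_t ⊙ h̃_t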
LSTMs

● f_t: forget gate


○ How much do I care about the past?
● i_t: input gate
○ How much do I care about the
present?
● o_t: output gate
○ How much information do I output?
● C_t: current cell state
● h_t: cell output
● Cell state + output are separate!
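
The standard LSTM equations, matching the gate names above (the concatenated [h_{t-1}, x_t] parameterization is one common convention):

f_t = σ(W_f [h_{t-1}, x_t] + b_f)
i_t = σ(W_i [h_{t-1}, x_t] + b_i)
o_t = σ(W_o [h_{t-1}, x_t] + b_o)
C̃_t = tanh(W_C [h_{t-1}, x_t] + b_C)
C_t = f_t ⊙ C_{t-1} + i_t ⊙ C̃_t
h_t = o_t ⊙ tanh(C_t)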
So What’s Missing?

● RNNs + variants very successful for variable-length representations/seqs


○ Gating (LSTMs) for long-range error propagation
● But what if we want context from a *really* long time back? (many
thousands of steps)
● Sequentiality prohibits parallelization within instances
● Long-range dependencies are still tricky
Transformer

● Architecture
● Self-Attention
● Multi-Attention Heads
● Prediction vs. Sampling
Transformer Architecture

● Encoder stack
○ 6 layers, each with 2 sublayers:
Multi-head Attention + FFN
● Decoder stack
○ Same as the encoder, but with an
additional encoder-decoder attention sublayer
● Positional encodings
○ Added to input embedding
Transformer (Simplified)
Self-Attention

Key Idea:

● Reference by context
● Attention: query with a vector, and look at similar things in your past
○ Find most similar key and get values that correspond to these similar things
● Softmax gives you probability distribution over keys
● Normalize by sqrt(d_k) for numerical stability
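
Putting those bullets together, scaled dot-product attention (as defined in the Transformer paper) is:

Attention(Q, K, V) = softmax(Q Kᵀ / sqrt(d_k)) V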
Self-Attention Example
3 Types of Self-Attention

● Encoder self-attention: attends everywhere in the input


● Encoder-decoder attention (from output, attends to input)
● Masked decoder attention: only attends to things before
Multi-Attention Heads

Multiple attention heads to get more “representational power”
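
In the Transformer paper this takes the form

MultiHead(Q, K, V) = Concat(head_1, …, head_h) W^O,  where head_i = Attention(Q W_i^Q, K W_i^K, V W_i^V),

with each head attending through its own learned projections.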


Transformer

● Final linear layer


○ Projection of decoder output into a logits vector → each cell corresponds to the score of a
unique word in the possible vocabulary
● Softmax layer
○ Turns logits into probabilities, which we use for prediction (training) or sampling
(testing/inference)
● Cross Entropy Loss
○ Compare two probability distributions
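
A small NumPy sketch of the logits → softmax → cross-entropy path (the 3-word vocabulary and scores are toy values):

import numpy as np

logits = np.array([2.0, 0.5, -1.0])     # one score per vocabulary word
probs = np.exp(logits - logits.max())
probs /= probs.sum()                    # softmax: logits -> probabilities
target = 0                              # index of the true next word
loss = -np.log(probs[target])           # cross-entropy against the one-hot target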
Transformer: Prediction vs. Sampling

● During training, our examples are “labeled” in the sense that we know the
true word that we are supposed to decode
● During sampling, we don’t know our target
○ Generate autoregressively, but using as input the previously generated token
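
A hedged sketch of that sampling loop; decode_step below is a stand-in for the trained decoder, not an API from the slides:

import numpy as np

def decode_step(prefix):
    # Stand-in for the Transformer decoder: returns a probability
    # distribution over a toy 5-token vocabulary given the prefix.
    rng = np.random.default_rng(len(prefix))
    p = rng.random(5)
    return p / p.sum()

BOS, EOS = 0, 4
tokens = [BOS]
for _ in range(20):                              # generate autoregressively
    probs = decode_step(tokens)                  # condition on previously generated tokens
    nxt = int(np.random.default_rng().choice(5, p=probs))  # sample (or argmax for greedy)
    tokens.append(nxt)
    if nxt == EOS:
        break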
Acknowledgements

● Slides adapted from:


○ Justin Johnson, Serena Yeung, and Fei-Fei Li (CS231N, Spring 2018) [slides]
○ Barak Oshri, Lisa Wang, and Juhi Naik (CS224N, Winter 2017) [slides]
○ Nish Chintala (CS236, Fall 2018) [slides]
● Chris Olah, OpenAI [blog]
● Lukasz Kaiser, Google Brain [talk]
● Anna Huang and Ashish Vaswani, Google Brain [slides] [paper]
● Jay Alammar, [blog]
