Lec6 RNN Attention Search
INTRODUCTION
TO ARTIFICIAL
INTELLIGENCE
Tianyi Zhou
09/18/2023
University of Maryland
Some slides are adapted from Song, Abbeel & Russell @ Berkeley and Fei-Fei @ Stanford
Seven Components of this course
[Course-overview diagram: the Agent at the center, connected to Human users (via a Language Models interface) and the World, with components Acting, Action & Prediction, Probabilistic Reasoning, Embodied & Multi-modal AI, and Neural Networks]
Plan
• Today
– CNN, RNN, Attention, Uninformed Search
– Reading:
• Russell and Norvig (4th edition): Ch 1-2, 3.1-3.4
• Goodfellow, Bengio, & Courville: Ch 9-10
• Coding:
– Numpy:
https://piazza.com/class/llup7xc8lm44w4/post/34
– PyTorch:
https://piazza.com/class/llup7xc8lm44w4/post/21
• Next lecture
– Uninformed Search
– Informed Search & A*
Project topics
• Vision-Language-Action (VLA) models
• Search and planning for Curriculum learning
• Mini-batch In-Context Learning on LLMs
• Mixture of Adapters for Personalized LLMs and VLMs
• Safety Verification and Validation of RL agents
• LLM agent controlling AI generation of images or videos
• In general:
– Input: n ✕ H ✕ W ✕ Nin
– Weights: Nout ✕ k ✕ k ✕ Nin
– Output: n ✕ H’ ✕ W’ ✕ Nout
– 𝐻’ = (H − k)/s + 1, 𝑊’ = (W − k)/s + 1 (kernel size k, stride s, assuming no padding)
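A minimal PyTorch shape check for these formulas (the batch size, image size, and channel counts below are assumptions for illustration; note that PyTorch orders tensors as n ✕ Nin ✕ H ✕ W rather than n ✕ H ✕ W ✕ Nin):

```python
import torch
import torch.nn as nn

x = torch.randn(8, 3, 32, 32)                      # n=8, Nin=3, H=W=32 (assumed)
conv = nn.Conv2d(in_channels=3, out_channels=16,   # Nout=16
                 kernel_size=3, stride=1)          # k=3, s=1, no padding
y = conv(x)
print(conv.weight.shape)   # torch.Size([16, 3, 3, 3]), i.e., Nout x Nin x k x k
print(y.shape)             # torch.Size([8, 16, 30, 30]); H' = (32 - 3)/1 + 1 = 30
```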
Depth-wise convolution (left) &
depth-wise separable convolution (right)
• Can we reduce the computation and number of parameters for multi-channel input and output?
https://eli.thegreenplace.net/2018/depthwise-separable-convolutions-for-machine-learning/
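Answering the question above, a hedged PyTorch sketch of a depthwise separable convolution (channel counts and spatial size are assumptions): a per-channel depthwise conv (groups=Nin) followed by a 1✕1 pointwise conv needs far fewer weights than a standard convolution.

```python
import torch
import torch.nn as nn

Nin, Nout, k = 32, 64, 3                  # assumed channel counts and kernel size
x = torch.randn(1, Nin, 56, 56)

standard  = nn.Conv2d(Nin, Nout, kernel_size=k, padding=1)             # Nout*Nin*k*k weights
depthwise = nn.Conv2d(Nin, Nin, kernel_size=k, padding=1, groups=Nin)  # Nin*k*k weights
pointwise = nn.Conv2d(Nin, Nout, kernel_size=1)                        # Nout*Nin weights

print(standard.weight.numel())                              # 64*32*3*3 = 18432
print(depthwise.weight.numel() + pointwise.weight.numel())  # 32*3*3 + 64*32 = 2336
print(pointwise(depthwise(x)).shape)                        # torch.Size([1, 64, 56, 56])
```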
CNN: Receptive field
• Each pixel in the output feature map of layer l is
produced from a region of the input image (and thus
captures the features of that region).
• The size (height x width) of this region is called the
receptive field (RF): k = kernel size, s = stride.
• Different RFs capture features at different scales.
• Multi-scale features can be helpful for many tasks.
https://www.baeldung.com/cs/cnn-receptive-field-size
https://theaisummer.com/receptive-field/
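A small pure-Python sketch of the RF recursion used in the linked references, RF_l = RF_{l-1} + (k_l − 1) ✕ (product of the strides of the earlier layers); the layer configuration in the example is an assumption.

```python
def receptive_field(layers):
    """layers: list of (kernel_size, stride), ordered from the first layer to layer l."""
    rf, jump = 1, 1                 # jump = product of strides of the layers seen so far
    for k, s in layers:
        rf += (k - 1) * jump        # each layer enlarges the RF by (k-1) input-pixel jumps
        jump *= s
    return rf

# Assumed example: three 3x3 convs, the second one with stride 2
print(receptive_field([(3, 1), (3, 2), (3, 1)]))   # 1 + 2 + 2 + 4 = 9, i.e., a 9x9 RF
```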
Atrous (Dilated) convolution
• RF increases exponentially with the number of layers, and lower-layer features have smaller RFs.
• How can we increase the RF without increasing the depth or changing the kernel configuration?
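One answer is dilation, sketched below in PyTorch (input size and channel count are assumptions): inserting gaps between kernel taps enlarges a layer's RF without adding depth or parameters.

```python
import torch
import torch.nn as nn

x = torch.randn(1, 8, 32, 32)                        # assumed input
conv   = nn.Conv2d(8, 8, kernel_size=3, dilation=1)  # this layer sees a 3x3 region
atrous = nn.Conv2d(8, 8, kernel_size=3, dilation=2)  # same 3x3 weights, but sees a 5x5 region
print(conv(x).shape, atrous(x).shape)                # (1, 8, 30, 30) and (1, 8, 28, 28)
print(conv.weight.shape == atrous.weight.shape)      # True: identical parameter count
```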
CNN: pooling
• Max pooling: take the maximum over each k×k block, max_{(i,j) ∈ k×k block} x_ij
• Average pooling: take the mean over each k×k block, (1/k²) Σ_{(i,j) ∈ k×k block} x_ij (e.g., 1/4 of the sum for a 2×2 block)
https://distill.pub/2017/feature-visualization/
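A short PyTorch sketch of both pooling operators on an assumed 2×2 block:

```python
import torch
import torch.nn as nn

x = torch.tensor([[[[3., 2.],
                    [4., 0.]]]])           # a 1 x 1 x 2 x 2 block with assumed values
print(nn.MaxPool2d(kernel_size=2)(x))      # max over the block -> 4
print(nn.AvgPool2d(kernel_size=2)(x))      # (1/4) * (3 + 2 + 4 + 0) -> 2.25
```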
CNN: flatten layer + MLP
How to compute the final prediction from the CNN featuremap?
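One common answer, sketched in PyTorch (feature-map and layer sizes are assumptions): flatten the final C ✕ H ✕ W feature map into a vector and feed it to an MLP that outputs class logits.

```python
import torch
import torch.nn as nn

feat = torch.randn(8, 64, 4, 4)        # assumed final CNN feature map: n x C x H x W
head = nn.Sequential(
    nn.Flatten(),                      # 64*4*4 = 1024 features per example
    nn.Linear(64 * 4 * 4, 128),
    nn.ReLU(),
    nn.Linear(128, 10),                # e.g., logits for 10 classes
)
print(head(feat).shape)                # torch.Size([8, 10])
```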
CNN: design
Any Questions?
Data Augmentation
Data augmentation: a technique used to increase the amount of data by adding slightly modified copies of already existing data.
• Flip
• Crop
• Rotation
• Translation: moving the image along the X or Y direction
• Cutout
• CutMix
• MixUp
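A hedged torchvision sketch of the geometric augmentations above, plus a minimal MixUp step (the specific parameter values and the Beta mixing coefficient are assumptions):

```python
import torch
import torchvision.transforms as T

augment = T.Compose([
    T.RandomHorizontalFlip(p=0.5),                     # Flip
    T.RandomResizedCrop(size=224),                     # Crop (and resize back)
    T.RandomRotation(degrees=15),                      # Rotation
    T.RandomAffine(degrees=0, translate=(0.1, 0.1)),   # Translation along X/Y
    T.ToTensor(),
])

def mixup(x, y, alpha=0.2):
    """Minimal MixUp sketch: blend a batch with a shuffled copy of itself."""
    lam = torch.distributions.Beta(alpha, alpha).sample()
    perm = torch.randperm(x.size(0))
    return lam * x + (1 - lam) * x[perm], (y, y[perm], lam)
```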
ResNet: practical CNN
• Residual block: output F(x) + x, where a skip connection ⨁ adds the input x to the residual branch F(x).
• Skip connections make it practical to train networks with 100+ layers.
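A minimal residual-block sketch (channel count assumed): the block outputs F(x) + x, so gradients can also flow through the identity skip connection.

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    def __init__(self, channels=64):
        super().__init__()
        self.F = nn.Sequential(                          # the residual branch F(x)
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.BatchNorm2d(channels),
            nn.ReLU(),
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.BatchNorm2d(channels),
        )

    def forward(self, x):
        return torch.relu(self.F(x) + x)                 # skip connection: add x back

x = torch.randn(2, 64, 8, 8)
print(ResidualBlock()(x).shape)                          # torch.Size([2, 64, 8, 8])
```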
Transposed convolution
Input = [[3, 2], [4, 0]], Kernel = [[1, 1], [1, 1]] (stride 1)
Each input entry scales the kernel and places it at the corresponding position of the 3×3 output; the placed maps are then summed:
= [[3, 3], [3, 3]] at (0,0) + [[2, 2], [2, 2]] at (0,1) + [[4, 4], [4, 4]] at (1,0) + [[0, 0], [0, 0]] at (1,1)
= [[3, 5, 2],
   [7, 9, 2],
   [4, 4, 0]]
Transposed convolution (practice)
Input = [[1, 2], [3, 0]], Kernel = [[1, 0], [0, 1]] (stride 1)
= ___ + ___ + ___ + ___
= ___
Transposed convolution (answer)
Input = [[1, 2], [3, 0]], Kernel = [[1, 0], [0, 1]]
= [[1, 0], [0, 1]] at (0,0) + [[2, 0], [0, 2]] at (0,1) + [[3, 0], [0, 3]] at (1,0) + [[0, 0], [0, 0]] at (1,1)
= [[1, 2, 0],
   [3, 1, 2],
   [0, 3, 0]]
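A hedged check of both worked examples with PyTorch's ConvTranspose2d (single channel, 2✕2 kernel, stride 1, no bias):

```python
import torch
import torch.nn as nn

def transposed_conv(inp, ker):
    tconv = nn.ConvTranspose2d(1, 1, kernel_size=2, stride=1, bias=False)
    tconv.weight.data = torch.tensor(ker).reshape(1, 1, 2, 2)   # set the kernel by hand
    return tconv(torch.tensor(inp).reshape(1, 1, 2, 2)).detach().squeeze()

print(transposed_conv([[3., 2.], [4., 0.]], [[1., 1.], [1., 1.]]))
# tensor([[3., 5., 2.],
#         [7., 9., 2.],
#         [4., 4., 0.]])
print(transposed_conv([[1., 2.], [3., 0.]], [[1., 0.], [0., 1.]]))
# tensor([[1., 2., 0.],
#         [3., 1., 2.],
#         [0., 3., 0.]])
```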
U-Net
Diffusion model
• Forward diffusion (image to
noise): progressively add noise to
an image for T steps.
• Backward diffusion (noise to
image): apply U-Net recursively
for T denoising steps.
• Train the U-Net to predict the noise
at each step t and then remove the
predicted noise.
• Diffusion model aims to learn a
generative model capturing the
distribution of images.
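A heavily simplified sketch of the noise-prediction objective (the linear schedule, step count, and `model` interface are assumptions; in practice `model` is the U-Net taking the noisy image and the step index):

```python
import torch
import torch.nn.functional as F

T_steps = 1000
betas = torch.linspace(1e-4, 0.02, T_steps)          # assumed linear noise schedule
alpha_bar = torch.cumprod(1.0 - betas, dim=0)        # cumulative product of (1 - beta_t)

def diffusion_loss(model, x0):
    """Train the network to predict the noise added at a random step t."""
    t = torch.randint(0, T_steps, (x0.size(0),))
    eps = torch.randn_like(x0)                       # forward diffusion: Gaussian noise
    ab = alpha_bar[t].view(-1, 1, 1, 1)
    x_t = ab.sqrt() * x0 + (1 - ab).sqrt() * eps     # noisy image at step t (closed form)
    return F.mse_loss(model(x_t, t), eps)            # predict eps so it can be removed
```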
Stable diffusion & AIGC
Any Questions?
Why Recurrent neural networks?
• How to model time-series data (speech, text, videos, house prices, trajectories, sensor signals), i.e., (x_1, x_2, …, x_T)?
– How to capture the time dependency?
– How to represent a sequence?
• What kinds of tasks do we expect to address?
– Denoising of a sequence: T inputs → T outputs
– Generation of sequences: T inputs → T′ outputs
– Sequential labeling: T inputs → T outputs
– Sequence classification: T inputs → 1 output
• Can we apply a neuron to the input at each time step?
• How to process input data with different lengths using the same DNN?
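On the variable-length question above, a common PyTorch recipe (the sequences below are made up) is to pad the batch and pack it before running the shared RNN:

```python
import torch
from torch import nn
from torch.nn.utils.rnn import pad_sequence, pack_padded_sequence

seqs = [torch.randn(5, 10), torch.randn(3, 10), torch.randn(7, 10)]   # three sequences, feature dim 10
lengths = torch.tensor([len(s) for s in seqs])
padded = pad_sequence(seqs, batch_first=True)                         # shape: 3 x 7 x 10
packed = pack_padded_sequence(padded, lengths, batch_first=True, enforce_sorted=False)

rnn = nn.RNN(input_size=10, hidden_size=16, batch_first=True)
_, h_T = rnn(packed)        # h_T: last hidden state of each sequence, ignoring the padding
print(h_T.shape)            # torch.Size([1, 3, 16])
```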
Sequential generation of non-sequential data
• Image captioning: image to a sequence of words
• Action recognition: video frames to an action class
• Machine translation: Chinese to English
• Speech recognition: word-level prediction
Sequence to sequence =
many to one + many to many
• Encoder-decoder architecture that is widely used in machine translation
RNN: update hidden state from input
RNN: generate output from hidden state
Unrolling RNN
Vanilla RNN
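A hedged sketch of the standard vanilla RNN update behind the three slides above (dimensions assumed): h_t = tanh(W_xh x_t + W_hh h_{t-1} + b) and y_t = W_hy h_t, with the same weights reused at every step.

```python
import torch
import torch.nn as nn

class VanillaRNNCell(nn.Module):
    def __init__(self, d_in=10, d_hidden=16, d_out=5):
        super().__init__()
        self.W_xh = nn.Linear(d_in, d_hidden)
        self.W_hh = nn.Linear(d_hidden, d_hidden, bias=False)
        self.W_hy = nn.Linear(d_hidden, d_out)

    def forward(self, x_t, h_prev):
        h_t = torch.tanh(self.W_xh(x_t) + self.W_hh(h_prev))   # update hidden state from input
        return h_t, self.W_hy(h_t)                             # generate output from hidden state

# Unrolling: apply the same cell at every time step
cell, h = VanillaRNNCell(), torch.zeros(1, 16)
for x_t in torch.randn(4, 1, 10):                              # T = 4 inputs (assumed)
    h, y = cell(x_t, h)
```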
Backpropagation through time
• “Many to One” task as an example.
Gradient vanishing & explosion
• Recurrent: send the previous step's hidden state to the next step's input, for T time steps.
• RNN: like an MLP with T layers, but all layers share the same weight matrix.
• Training RNNs can be a problem: backpropagation through time (BPTT) leads to gradient vanishing (or explosion).
https://towardsdatascience.com/the-exploding-and-
vanishing-gradients-problem-in-time-series-6b87d558d22
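A tiny numerical sketch of the problem (random gradient, idealized recurrent Jacobian; the scales are assumptions): multiplying by the same matrix T times shrinks or blows up the backpropagated gradient.

```python
import torch

torch.manual_seed(0)
d, T = 16, 50
grad = torch.randn(d)                      # gradient arriving at the last time step
for scale in (0.5, 1.5):                   # assumed spectral scale of the recurrent weights
    W = scale * torch.eye(d)
    g = grad.clone()
    for _ in range(T):
        g = W.T @ g                        # one BPTT step through the shared weights
    print(scale, g.norm().item())          # ~0.5**50: vanishes; ~1.5**50: explodes
```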
RNNs equipped with gates
LSTM: long short-term memory; GRU: gated recurrent unit
https://towardsdatascience.com/illustrated-guide-to-
lstms-and-gru-s-a-step-by-step-explanation-44e9eb85bf21
LSTM: forget gate
LSTM: input gate
LSTM: cell state
LSTM: output gate
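The four gate slides above correspond to the standard LSTM update; a minimal sketch with `torch` (the stacked parameter layout is an assumption; in practice `nn.LSTM` / `nn.LSTMCell` implement this):

```python
import torch

def lstm_step(x_t, h_prev, c_prev, W, U, b):
    """One LSTM step; W (4d x d_in), U (4d x d), b (4d) stack the four gates' parameters."""
    d = h_prev.size(-1)
    z = x_t @ W.T + h_prev @ U.T + b
    f = torch.sigmoid(z[..., 0*d:1*d])    # forget gate: what to drop from the cell state
    i = torch.sigmoid(z[..., 1*d:2*d])    # input gate: how much new content to write
    g = torch.tanh(z[..., 2*d:3*d])       # candidate cell content
    o = torch.sigmoid(z[..., 3*d:4*d])    # output gate: what to expose as h_t
    c_t = f * c_prev + i * g              # cell state update
    h_t = o * torch.tanh(c_t)             # new hidden state
    return h_t, c_t
```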
Seq2seq task using LSTM
From perceptron to attention
• Neurons: weighted sum of inputs +
nonlinear activation
http://jalammar.github.io/illustrated-transformer/
Decoder model
• No encoder, decoder
only.
• A generative model only; it cannot directly provide a sentence embedding.
• Most SOTA LLMs use
this architecture.
• But discriminative tasks
(e.g., classification) need
further processing.
Vision transformer
• Split the input image into 16x16 patches.
• Each image patch is treated as a token.
• Apply Transformer encoder to the tokens.
• Attention(task token, image tokens) produces
embedding for the whole image.
• Apply MLP to the image embedding to
produce class probabilities.
• ViT: global convolution with kernel weights
computed by attention for each token.
• All tokens share the same attention
parameters.
• More details in Perception section.
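A hedged sketch of the ViT front end (image size, embedding dimension, and head count are assumptions): split the image into 16×16 patches, embed each patch as a token, prepend a task token, and run a Transformer encoder layer.

```python
import torch
import torch.nn as nn

img = torch.randn(1, 3, 224, 224)                    # assumed input image
patch, d = 16, 192                                   # 16x16 patches, assumed embedding dim

# A strided conv is equivalent to "split into patches + shared linear projection"
to_tokens = nn.Conv2d(3, d, kernel_size=patch, stride=patch)
tokens = to_tokens(img).flatten(2).transpose(1, 2)   # 1 x 196 x 192 (14*14 patches)

cls = torch.zeros(1, 1, d)                           # task ("class") token
tokens = torch.cat([cls, tokens], dim=1)             # 1 x 197 x 192
encoder = nn.TransformerEncoderLayer(d_model=d, nhead=4, batch_first=True)
print(encoder(tokens).shape)                         # torch.Size([1, 197, 192])
```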
Any Questions?
Seven Components of this course
[Same course-overview diagram as above: the Agent connected to Human users (via a Language Models interface) and the World, with Acting, Action & Prediction, Probabilistic Reasoning, Embodied & Multi-modal AI, and Neural Networks]
Optimization vs. search