Lecture # 5-2 PixelCNN

The document presents a lecture on PixelRNN and PixelCNN, focusing on their architectures and functionalities as autoregressive models for generating images. It discusses the components of PixelRNN, including LSTM layers and residual connections, and describes how PixelCNN utilizes convolutional layers with masked convolutions to maintain spatial resolution. The lecture also covers the evaluation of these models using datasets like CIFAR-10 and MNIST.

National University of Computer and Emerging Sciences

PixelRNN and PixelCNN

AI-4009 Generative AI

Dr. Akhtar Jamil


Department of Computer Science

04/23/2025 Presented by Dr. AKHTAR JAMIL 1


Goals
• Review of Previous Lecture
• Today’s Lecture
– Pixel Recurrent Neural Networks (PixelRNN)
– Pixel Convolutional Neural Networks (PixelCNN)



Review of Previous Lecture



Autoregressive models
• Popular examples of autoregressive models: PixelRNN, PixelCNN, WaveNet, etc.
• Causal convolutions allow the model to generate output samples one timestep at a time, with each sample depending only on previous samples, mimicking how sequential data is produced in the real world.
• The advantage of causal convolutions is that they can be used in real-time applications:
– No need for the entire sequence in advance.
• They can process data as it arrives, making predictions for the next time step using only currently available information.
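The causal dependency described above can be sketched as a minimal 1-D causal convolution in numpy (an illustrative sketch, not code from the lecture; the function name causal_conv1d is ours):

```python
import numpy as np

def causal_conv1d(x, w):
    """1-D causal convolution: y[t] depends only on x[0..t].

    Equivalent to left-padding the sequence with k-1 zeros, so no
    future sample can leak into the current output.
    """
    k = len(w)
    y = np.zeros(len(x))
    for t in range(len(x)):
        for i in range(k):
            if t - i >= 0:          # indices t-i < 0 are the zero padding
                y[t] += w[i] * x[t - i]
    return y
```

Because y[t] never reads x[t+1:], such a layer can run in real time, updating its prediction as each new sample arrives.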



Autoregressive models

(Figure: Steffen et al., 2015)



Autoregressive models

The first CausalConv1D layer is of type A; the next two layers use CausalConv1D of type B.
Dilated convolutions
• Dilated convolutions are also known as atrous convolutions.
• A variation of the standard convolution operation.
• They introduce an additional parameter, the dilation rate, which defines the spacing between the kernel's elements.
• They can capture information over larger contexts without losing resolution or significantly increasing computational cost.
• A dilation rate of 1 corresponds to standard convolution.
• As the dilation rate increases, the kernel elements are spread further apart.
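As a sketch of the idea, a causal convolution can be generalized with a dilation rate (illustrative numpy code; dilated_causal_conv1d is our own name):

```python
import numpy as np

def dilated_causal_conv1d(x, w, dilation=1):
    """Causal convolution with dilation-1 skipped positions between
    kernel taps. dilation=1 is the standard causal convolution; larger
    rates widen the receptive field to 1 + (k-1)*dilation while the
    number of weights, and hence the cost, stays the same.
    """
    k = len(w)
    y = np.zeros(len(x))
    for t in range(len(x)):
        for i in range(k):
            j = t - i * dilation    # tap positions spread out by the rate
            if j >= 0:
                y[t] += w[i] * x[j]
    return y
```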
PixelRNN
• PixelRNN components:
– Twelve fast two-dimensional Long Short-Term Memory (LSTM) layers.
• Two types of these layers are designed:
– Row LSTM layer, where the convolution is applied along each row.
– Diagonal BiLSTM layer, where the convolution is applied in a novel fashion along the diagonals of the image.
• The networks also incorporate residual connections around the LSTM layers:
– These help with training PixelRNNs of up to twelve layers of depth.



PixelCNN
• A Convolutional Neural Network (CNN) is used as a sequence model with a fixed dependency range.
– This requires masked convolutions.
• The PixelCNN architecture is a fully convolutional network of
fifteen layers that preserves the spatial resolution of its input
throughout the layers and outputs a conditional distribution at
each location.



Model
• The network scans the image
one row at a time and one pixel
at a time within each row.
• For each pixel it predicts the
conditional distribution over the
possible pixel values given the
scanned context.
• The joint distribution over the
image pixels is factorized into a
product of conditional
distributions.
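The row-by-row scan can be sketched as a sampling loop; here cond_logits is a hypothetical stand-in for the trained network, which must return 256 logits for position (r, c) given the partially generated image:

```python
import numpy as np

def generate_image(cond_logits, n, rng=None):
    """Raster-scan generation: pixels are sampled one at a time, each
    conditioned only on the already-generated context above and to
    the left. (Sketch; cond_logits is a placeholder for the model.)
    """
    if rng is None:
        rng = np.random.default_rng(0)
    img = np.zeros((n, n), dtype=np.int64)
    for r in range(n):
        for c in range(n):
            logits = cond_logits(img, r, c)
            p = np.exp(logits - logits.max())   # stable softmax
            p /= p.sum()
            img[r, c] = rng.choice(256, p=p)    # draw the pixel value
    return img
```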
Generating an Image Pixel by Pixel
• The goal is to assign a probability p(x) to each image x formed of n × n pixels.
• Treat the image as a one-dimensional sequence of pixels x_1, …, x_{n²}.
• Pixels are taken from the image row by row.
• To estimate the joint distribution p(x), we write it as the product of the conditional distributions over the pixels:

p(x) = ∏_{i=1..n²} p(x_i | x_1, …, x_{i−1})


Generating an Image Pixel by Pixel
• For a color image, each pixel x_i is jointly determined by three values,
– one for each of the color channels: Red, Green and Blue (RGB).
• The distribution p(x_i | x_<i) is rewritten as the following product:

p(x_i | x_<i) = p(x_{i,R} | x_<i) · p(x_{i,G} | x_<i, x_{i,R}) · p(x_{i,B} | x_<i, x_{i,R}, x_{i,G})

• Each of the color values is conditioned on the other channels as well as on all the previously generated pixels.
• During training and evaluation the distributions over the pixel values are computed in parallel, while the generation of an image is sequential.
Pixels as Discrete Variables
• Previous approaches use a continuous distribution for the values of the pixels in the image.
• PixelRNN models p(x) as a discrete distribution:
– Every conditional distribution is a multinomial modeled with a softmax layer.
• Each channel variable x_i simply takes one of 256 distinct values.
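A 256-way softmax over discrete pixel values gives a per-pixel negative log-likelihood; a minimal sketch (our own helper, not from the lecture):

```python
import numpy as np

def pixel_nll(logits, target):
    """NLL of one discrete pixel value under a 256-way softmax.

    logits has shape (256,); target is an integer in [0, 255].
    With uniform logits the NLL is exactly log(256).
    """
    z = logits - logits.max()                  # numerical stability
    log_probs = z - np.log(np.exp(z).sum())    # log-softmax
    return -log_probs[target]
```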
Today’s Lecture



Architectural Components of PixelRNN
• Two types of LSTM layers that use convolutions to compute at
once the states along one of the spatial dimensions.
• Incorporate residual connections to improve the training of a
PixelRNN with many LSTM layers.
• Softmax layer that computes the discrete joint distribution of the
colors and the masking technique that ensures the proper
conditioning scheme.



LSTM Layers
• Row LSTM
• It is a unidirectional layer that processes the image row by row, from top to bottom, computing features for a whole row at once.
• The computation is performed with a one-dimensional convolution.
• For a pixel x_i the layer captures a roughly triangular context above the pixel, as shown in the figure.



LSTM Layers
• The kernel of the one-dimensional convolution has size k × 1
where k ≥ 3;
• The larger the value of k the broader the context that is captured.

[Figure: LSTM cell, with the cell state carried across steps.]
LSTM Layers
• Forget Gate: Decides what portions of the cell state should be
erased.
• Input Gate: Selects which values in the input should update the
cell state.
• Output Gate: Determines which parts of the cell state are passed
to the output.
• Cell State: Holds the LSTM unit's long-term memory across
sequence processing steps.
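The four gates above can be written as one numpy step (a generic LSTM sketch with dense weights, not the convolutional form used inside PixelRNN):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def lstm_step(x, h_prev, c_prev, W, U, b):
    """One LSTM step. W (4h x d) and U (4h x h) map the input and the
    previous hidden state to a stacked pre-activation; its four slices
    are the input, forget, output and content gates.
    """
    h = h_prev.shape[0]
    z = W @ x + U @ h_prev + b
    i = sigmoid(z[0*h:1*h])        # input gate: what to write
    f = sigmoid(z[1*h:2*h])        # forget gate: what to erase
    o = sigmoid(z[2*h:3*h])        # output gate: what to expose
    g = np.tanh(z[3*h:4*h])        # content gate: candidate values
    c_new = f * c_prev + i * g     # update long-term cell state
    h_new = o * np.tanh(c_new)     # new hidden state
    return h_new, c_new
```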





LSTM Layers
• An LSTM layer has two components:
– an input-to-state component
– a recurrent state-to-state component
• Together these determine the four gates inside the LSTM core.
• To enhance parallelization in the Row LSTM, the input-to-state component is first computed for the entire two-dimensional input map.
• For the input-to-state calculation, a k × 1 convolution is used.
• The convolution is masked to include only the valid context and produces a tensor of size 4h × n × n,
– where h is the number of output feature maps.



LSTM Layers
• The state-to-state component of the LSTM layer is calculated as:

[o_i, f_i, i_i, g_i] = σ(K^ss ⊛ h_{i−1} + K^is ⊛ x_i)
c_i = f_i ⊙ c_{i−1} + i_i ⊙ g_i
h_i = o_i ⊙ tanh(c_i)

• where x_i of size h × n × 1 is row i of the input map, ⊛ denotes the convolution operation and ⊙ the element-wise multiplication.
• K^ss and K^is are the kernel weights for the state-to-state and input-to-state components.


LSTM Layers
• For the output, forget and input gates o_i, f_i and i_i, the activation σ is the logistic sigmoid function; for the content gate g_i, σ is the tanh function.
• Each step computes the new state for an entire row of the input map.
• Because the Row LSTM has a triangular receptive field, it is unable to capture the entire available context.



Diagonal BiLSTM
• The Diagonal BiLSTM is designed both to parallelize the computation and to capture the entire available context for any image size.
• Each of the two directions of the layer scans the image in a diagonal fashion:
– from a top corner to the opposite bottom corner.
• Each step in the computation computes at once the LSTM state along a diagonal in the image.
• The resulting receptive field is shown in the figure.



Diagonal BiLSTM
• Working: first, skew the input map into a space where it is easy to apply convolutions along the diagonals.
• The skewing operation offsets each row of the input map by one position with respect to the previous row.
• This results in a map of size n × (2n − 1).
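The skew and its inverse can be written directly (illustrative numpy sketch; skew/unskew are our own names):

```python
import numpy as np

def skew(x):
    """Offset row r of an n x n map by r positions, producing an
    n x (2n - 1) map in which each diagonal of the original becomes
    a column, so a column-wise convolution runs along diagonals.
    """
    n = x.shape[0]
    out = np.zeros((n, 2 * n - 1), dtype=x.dtype)
    for r in range(n):
        out[r, r:r + n] = x[r]
    return out

def unskew(x):
    """Inverse of skew: drop the offset (padding) positions."""
    n = x.shape[0]
    return np.stack([x[r, r:r + n] for r in range(n)])
```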



Diagonal BiLSTM
• For each of the two directions, the input-to-state component is simply a 1 × 1 convolution K^is that contributes to the four gates in the LSTM core.
• The operation generates a 4h × n × n tensor.
• The state-to-state recurrent component is then computed with a column-wise convolution K^ss that has a kernel of size 2 × 1.
• It takes the previous hidden and cell states, combines the contribution of the input-to-state component, and produces the next hidden and cell states.



Diagonal BiLSTM
• The output feature map is then skewed back into an n × n map by removing the offset positions.
• This computation is repeated for each of the two directions.
• Each direction uses a convolutional kernel of size 2 × 1 that processes a minimal amount of information at each step.
• The two output maps, left and right, are then added together.
• Kernel sizes larger than 2 × 1 are not particularly useful, as they do not broaden the already global receptive field of the Diagonal BiLSTM.
Diagonal BiLSTM

[Figure: 4 × 4 example grids illustrating the two diagonal scan directions of the Diagonal BiLSTM.]


Residual Connections
• Residual connections (He et al., 2015):
– Residual connections were introduced in the ResNet architecture.
– They create shortcuts that allow the signal to skip one or more layers.
– They do this by adding the input of the current layer to its output, which helps to preserve the strength of the signal across the network.
• PixelRNNs are trained up to twelve layers of depth.
• To increase convergence speed and propagate signals more directly through the network, residual connections are used from one LSTM layer to the next.



Residual Connections
• The input map to the PixelRNN LSTM
layer has 2h features.
• The input-to-state component reduces the
number of features by producing h
features per gate.
• After applying the recurrent layer, the
output map is upsampled back to 2h
features per position via a 1 × 1
convolution and the input map is added to
the output map.
• We can also use learnable skip
connections from each layer to the output.
• The addition of residual and layer-to-
output skip connections is more effective.
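The residual wiring on this slide can be sketched as follows, treating the 1 × 1 convolution as a plain matrix W_proj of shape (h, 2h) applied per position (an illustrative sketch; layer stands in for the recurrent layer that maps 2h features to h):

```python
import numpy as np

def residual_lstm_block(x, layer, W_proj):
    """x has 2h features per position; layer reduces them to h;
    the 1x1 projection W_proj brings them back up to 2h, and the
    input is added so the signal is preserved across the block.
    """
    out = layer(x)            # (..., h) features from the recurrent layer
    out = out @ W_proj        # 1x1 conv: h -> 2h features
    return x + out            # residual shortcut
```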



Masked Convolution
• The h features for each input position at every layer in the network are split into three parts,
– each corresponding to one of the RGB channels.
• When predicting the R channel for the current pixel x_i, only the generated pixels left of and above x_i can be used as context.
• When predicting the G channel, the value of the R channel can also be used as context, in addition to the previously generated pixels.
• When predicting the B channel, the values of both the R and G channels can be used.
Masked Convolution
• To restrict connections in the network to these dependencies, we apply a mask to the input-to-state convolutions and to other convolutional layers.
• Two types of masks are used, indicated as mask A and mask B.
• Mask A is applied only to the first convolutional layer in a PixelRNN and restricts the connections to neighboring pixels and to those colors in the current pixel that have already been predicted.
• Mask B is applied to all the subsequent input-to-state convolutional transitions and relaxes the restrictions of mask A by also allowing the connection from a color to itself.
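These two mask types can be constructed explicitly; a minimal sketch (our own helper, assuming channel counts divisible by 3 and R→G→B ordering of the feature groups):

```python
import numpy as np

def conv_mask(k, in_ch, out_ch, mask_type):
    """Binary mask of shape (out_ch, in_ch, k, k) for a masked conv.

    Positions below the centre row, and right of the centre on that
    row, are zeroed. At the centre pixel the R->G->B ordering is
    enforced: mask 'A' (first layer) forbids a colour from seeing
    itself, mask 'B' (later layers) allows it.
    """
    mask = np.ones((out_ch, in_ch, k, k))
    mask[:, :, k // 2, k // 2 + 1:] = 0        # right of centre
    mask[:, :, k // 2 + 1:, :] = 0             # rows below centre
    ic, oc = in_ch // 3, out_ch // 3
    for o in range(3):                          # output colour group
        for i in range(3):                      # input colour group
            allowed = i < o if mask_type == 'A' else i <= o
            if not allowed:
                mask[o*oc:(o+1)*oc, i*ic:(i+1)*ic, k//2, k//2] = 0
    return mask
```

Multiplying a convolution's weights by this mask before each forward pass enforces the dependency scheme above.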



PixelCNN
• The Row and Diagonal LSTM layers have a potentially unbounded dependency range within their receptive field.
• This comes at a high computational cost, as each state needs to be computed sequentially.
• What if we make the receptive field large, but not unbounded?
• We can use standard convolutional layers to compute features for all pixel positions at once.
• The PixelCNN uses multiple convolutional layers that preserve the spatial resolution;
– pooling layers are not used.



PixelCNN
• Masks are adopted in the convolutions to avoid seeing the future context:

      R           G           B
    1 1 1       1 1 1       1 1 1
    1 0 0       1 1* 0      1 1** 0
    0 0 0       0 0 0       0 0 0


PixelRNN and PixelCNN Specification



Evaluation
• All models are trained and evaluated using the negative log-likelihood (NLL) loss.

• Two datasets are used: CIFAR-10 and MNIST.



Samples Generated



Research Questions
• All students will research and learn about the following topics:
1. Explain the working of 1 × 1 convolutions. What is the purpose of using 1 × 1 convolutions? Give use cases.
2. In the Diagonal BiLSTM, a 2 × 1 convolution kernel is used. Would it be useful to increase its size, e.g. to 3 × 1 or 5 × 1? Why?
3. What are skip connections? Why are they useful? How can we select the number of skip connections?



References
• van den Oord, A., Kalchbrenner, N., & Kavukcuoglu, K. (2016). Pixel Recurrent Neural Networks. ICML 2016.
• He, K., Zhang, X., Ren, S., & Sun, J. (2015). Deep Residual Learning for Image Recognition.



Thank You 
