
Fall’24 AE556 AI for Aerospace Applications

Topic 4. Convolutional Neural Networks (CNN)

Sept. 24 (T), 2024

Han-Lim Choi
The Problem with Fully-Connected Networks

• A 256x256 (RGB) image → a ~200K-dimensional input x


• A fully connected network would need a very large number of parameters and would very likely overfit the data
• A generic deep network also does not capture the "natural" invariances we expect in images (translation, scale)

2
Convolutional Neural Networks

• To create architectures that can handle large images, restrict the weights in two ways:
1. Require that activations between layers occur only in a "local" manner
2. Require that all activations share the same weights

These lead to an architecture known as a convolutional neural network

3
Convolutions

• Convolutions are a basic primitive in many computer vision and image processing algorithms
• The idea is to "slide" the weights w (called a filter) over the image to produce a new image, written y = z * w
• The network repeats this process, performing convolution operations with kernels (filters) to extract features; a small sketch follows below

[Figure: input image * kernel (= filter = weights) → output]
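A minimal NumPy sketch (not from the slides) of this sliding operation; note that deep-learning "convolution" usually skips the kernel flip (i.e. it is cross-correlation), and the 5x5 image and 3x3 filter below are arbitrary toy values:

```python
import numpy as np

def conv2d(image, kernel):
    """Slide the kernel over the image (no padding, stride 1).
    Deep-learning 'convolution' is usually cross-correlation: no kernel flip."""
    H, W = image.shape
    k, _ = kernel.shape
    out = np.zeros((H - k + 1, W - k + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + k, j:j + k] * kernel)
    return out

image = np.arange(25, dtype=float).reshape(5, 5)   # toy 5x5 "image"
kernel = np.array([[1., 0., -1.]] * 3)             # simple vertical-edge filter
print(conv2d(image, kernel).shape)                 # (3, 3)
```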

4
Additional Notes on Convolutions

• Pooling is a process that reduces the quantity of data, removing noise and keeping only the salient information, which makes judgment and learning easier
• When the receptive field is grown by convolution operations alone, the amount of computation grows and becomes inefficient; more significant feature maps can therefore be obtained by applying pooling to the feature maps extracted by the convolutions (see the sketch below)

[Figure: the pooled feature maps are flattened into a 1D vector before the fully-connected layers]
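A minimal NumPy sketch (not from the slides) of 2x2 max pooling followed by flattening into a 1D vector; the 4x4 feature map is an arbitrary toy example:

```python
import numpy as np

def max_pool2d(x, size=2):
    """2x2 max pooling with stride equal to the window size."""
    H, W = x.shape
    x = x[:H - H % size, :W - W % size]              # crop to a multiple of the window
    return x.reshape(H // size, size, W // size, size).max(axis=(1, 3))

fmap = np.arange(16, dtype=float).reshape(4, 4)      # toy 4x4 feature map
pooled = max_pool2d(fmap)                            # -> 2x2
flat = pooled.reshape(-1)                            # 1D vector for the FC layers
print(pooled.shape, flat.shape)                      # (2, 2) (4,)
```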

5
Additional Notes on Convolutions

• Considering the convolution operation, two questions naturally arise:

– 1) How do we adjust the interval at which the filter moves?
– 2) Repeated convolutions make the final output smaller and smaller

• The stride is the interval at which the filter moves when it is applied; adjusting it changes the filter movement interval

6
Additional Notes on Convolutions

• To maintain the output size and preserve the edge pixels, a technique called "padding" is used: additional values are placed around the edges of the image
• It is usual to "zero pad" the input image so that the output has the same size as the input

1) Zero Padding 2) Replicate Padding 3) Mirror Padding
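As an illustration of the three padding types named above, here is a small NumPy sketch (not from the slides); np.pad's "constant", "edge", and "reflect" modes correspond to zero, replicate, and mirror padding:

```python
import numpy as np

x = np.array([[1., 2.],
              [3., 4.]])

# Three common padding schemes (1-pixel border around a 2x2 array)
zero      = np.pad(x, 1, mode="constant")   # zero padding
replicate = np.pad(x, 1, mode="edge")       # repeat the border pixels
mirror    = np.pad(x, 1, mode="reflect")    # mirror the interior pixels

print(zero)
print(replicate)
print(mirror)
```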

7
Convolutions in Image Processing

• Convolutions (typically with prespecified filters) are a common operation in many computer vision applications

8
Convolutions in Image Processing

• For RGB images, the result is generated by executing a convolution operation on each channel and then adding the results (see the check below)
• These are 2D convolutions per channel, but together they work like a 3D convolution
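A quick PyTorch check (not from the slides) that a convolution over an RGB input is the sum of per-channel 2D convolutions; the tensor sizes are arbitrary:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
rgb = torch.randn(1, 3, 8, 8)          # (batch, channels, H, W)
kernel = torch.randn(1, 3, 3, 3)       # one 3x3 filter with depth 3

# A convolution over an RGB input: each channel is convolved with its own
# 3x3 slice of the kernel and the three results are summed.
out = F.conv2d(rgb, kernel)

per_channel = sum(
    F.conv2d(rgb[:, c:c + 1], kernel[:, c:c + 1]) for c in range(3)
)
print(torch.allclose(out, per_channel))   # True
```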

9
Number of Parameters

• Consider a convolutional network that takes as input color (RGB) 32x32 images and uses the following layers (all convolutional layers use zero-padding):
1. 5x5x64 conv
2. 2x2 maxpool
3. 3x3x128 conv
4. 2x2 maxpool
5. Fully-connected to 10-dimensional output

• How many parameters does this network have?


1.
2.
3.
4.

10
Number of Parameters

• The total number of parameters in the network is the sum of the number of parameters in each convolution layer (pooling, stride, and padding are hyperparameter choices and contribute no learnable parameters)

For a convolution layer:

$\text{params}_{\text{conv}} = N \times (K^2 \times C + 1)$

  $K$ : filter size
  $C$ : number of input channels
  $N$ : number of filters

• The depth of each kernel in a convolution layer is always equal to the number of channels in the input image
• All kernels have $K^2 \times C$ parameters (plus one bias each), and there are $N$ of them
• The number of parameters in the FC layer is as follows:

$\text{params}_{\text{FC}} = C_f \times (O^2 \times N + 1)$

  $O$ : size of the output image of the previous conv layer
  $N$ : number of filters (channels) of that output
  $C_f$ : number of neurons in the FC layer
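A small sketch (not part of the slides) applying these formulas to the network from slide 10; it assumes the +1 bias term per filter/neuron and that zero padding keeps the spatial size, so the two 2x2 maxpools take 32 → 16 → 8:

```python
def conv_params(K, C, N):
    """N filters of size K x K x C, each with one bias term."""
    return N * (K * K * C + 1)

def fc_params(O, N, Cf):
    """FC layer on a flattened O x O x N feature map, Cf output neurons."""
    return Cf * (O * O * N + 1)

# Network from slide 10 (32x32 RGB input, zero padding keeps spatial size,
# each 2x2 maxpool halves it: 32 -> 16 -> 8). Bias terms assumed included.
p1 = conv_params(K=5, C=3,  N=64)    #  4,864
p2 = conv_params(K=3, C=64, N=128)   # 73,856
p3 = fc_params(O=8, N=128, Cf=10)    # 81,930
print(p1, p2, p3, p1 + p2 + p3)      # total 160,650
```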

11
Number of Parameters

• The convolution layer's output tensor size is determined by the input image size, padding, stride, and kernel size; the number of channels in the output image is the same as the number of kernels ($N$)

$O = \left\lfloor \frac{I - K + 2P}{S} \right\rfloor + 1$

  $O$ : size of output image
  $I$ : size of input image
  $K$ : size of kernels
  $S$ : stride
  $P$ : padding size

• For the network in slide 10, what is the output size passed to the FC layer? (see the sketch below)
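A minimal check (not from the slides) of the output-size formula against an actual PyTorch layer, plus the answer to the question above under the stated zero-padding assumption; the layer sizes in the check are arbitrary:

```python
import torch
import torch.nn as nn

def out_size(I, K, S=1, P=0):
    """O = floor((I - K + 2P) / S) + 1"""
    return (I - K + 2 * P) // S + 1

# Check the formula against an actual conv layer (sizes chosen arbitrarily)
I, K, S, P = 32, 5, 2, 2
conv = nn.Conv2d(3, 8, kernel_size=K, stride=S, padding=P)
y = conv(torch.zeros(1, 3, I, I))
print(out_size(I, K, S, P), y.shape[-1])   # 16 16

# For the slide-10 network, zero padding preserves the conv output size and the
# two 2x2 maxpools take 32 -> 16 -> 8, so the FC layer sees 8*8*128 = 8192 values.
print(8 * 8 * 128)                         # 8192
```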

12
Improving model performance

• A neural network may fail to generalize to test data if it is overtrained on the training data; this is called "overfitting"
• Models with numerous parameters, or models trained with too little data, are prone to overfitting

• Dropout can help mitigate overfitting
• Part of the layer's input units are "dropped out" at random at each learning step (think of it like the way a random forest ensembles many decision trees)

[Figure: dropout turns off a few neurons at random]
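A minimal PyTorch sketch (not from the slides) showing dropout's behavior in training vs. evaluation mode; p=0.5 is an arbitrary choice:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
drop = nn.Dropout(p=0.5)          # each unit is zeroed with probability 0.5
x = torch.ones(1, 8)

drop.train()
print(drop(x))    # roughly half the units are zeroed (survivors scaled by 1/(1-p))

drop.eval()
print(drop(x))    # at test time dropout is a no-op
```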

13
Improving model performance

• Batch normalization aids in improving slow or unstable learning


• Every incoming batch is normalized by the batch normalization layer using its mean and standard deviation
• The data is then rescaled to a new scale using learnable rescale parameters; this can help prevent the model from being dominated by outliers (see the sketch below)
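A minimal PyTorch sketch (not from the slides) of a batch-normalization layer; the channel count and batch statistics are arbitrary:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
bn = nn.BatchNorm2d(num_features=16)       # one (gamma, beta) pair per channel
x = torch.randn(8, 16, 32, 32) * 5 + 3     # batch with a shifted, scaled distribution

y = bn(x)
# Each channel is normalized to ~zero mean / unit variance over the batch,
# then rescaled by the learnable parameters gamma (weight) and beta (bias).
print(y.mean().item(), y.std().item())     # approximately 0 and 1
print(bn.weight.shape, bn.bias.shape)      # torch.Size([16]) torch.Size([16])
```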

14
Learning with Convolutions

• How do we apply backpropagation to neural networks with convolutions?

$z_{i+1} = f_i(z_i * w_i + b_i)$

• Remember that for a dense layer $z_{i+1} = f_i(W_i z_i + b_i)$, the forward pass required multiplication by $W_i$ and the backward pass required multiplication by $W_i^T$

• We're going to show that convolution is a type of (highly structured) matrix multiplication, and show how to compute the multiplication by its transpose

15
Convolutions as Matrix Multiplication

• Consider initially a 1D convolution $z * w$ for $w \in \mathbb{R}^3$, $z \in \mathbb{R}^6$

• Then $z * w = W z$ for

$W = \begin{bmatrix} w_1 & w_2 & w_3 & 0 & 0 & 0 \\ 0 & w_1 & w_2 & w_3 & 0 & 0 \\ 0 & 0 & w_1 & w_2 & w_3 & 0 \\ 0 & 0 & 0 & w_1 & w_2 & w_3 \end{bmatrix}$

• So how do we multiply by $W^T$?
16
Convolutions as Matrix Multiplication

• Multiplication by the transpose is just

$W^T g = \begin{bmatrix} w_1 & 0 & 0 & 0 \\ w_2 & w_1 & 0 & 0 \\ w_3 & w_2 & w_1 & 0 \\ 0 & w_3 & w_2 & w_1 \\ 0 & 0 & w_3 & w_2 \\ 0 & 0 & 0 & w_3 \end{bmatrix} g = \tilde{g} * \tilde{w}$

• where $\tilde{w}$ is just the flipped version of $w$ (and $\tilde{g}$ is the zero-padded version of $g$)


• In other words, transpose of convolution is just (zero-padded) convolution by
flipped filter (correlations for signal processing people)

• Property holds for 2D convolutions, backprop just flips convolutions
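A small NumPy check (not from the slides) of this property for the 1D example above: it builds W explicitly, verifies z * w = Wz, and verifies that multiplying by W^T equals zero-padded correlation with the flipped filter:

```python
import numpy as np

np.random.seed(0)
w = np.random.randn(3)                      # filter, w in R^3
z = np.random.randn(6)                      # input,  z in R^6

def corr_valid(x, k):
    """'Convolution' as used in deep nets (no kernel flip), valid region only."""
    return np.array([x[i:i + len(k)] @ k for i in range(len(x) - len(k) + 1)])

# Build the 4x6 matrix W whose rows are shifted copies of w, so z * w = W z
W = np.stack([np.concatenate([np.zeros(i), w, np.zeros(6 - 3 - i)]) for i in range(4)])
assert np.allclose(W @ z, corr_valid(z, w))

# Multiplying a gradient g by W^T = zero-padding g and correlating with flipped w
g = np.random.randn(4)
g_padded = np.concatenate([np.zeros(2), g, np.zeros(2)])   # pad by len(w)-1 each side
assert np.allclose(W.T @ g, corr_valid(g_padded, w[::-1]))
print("transpose of convolution = zero-padded convolution with the flipped filter")
```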

17
Loss function

• The loss function value is used during training to determine how well the model fits the learning data; losses are usually divided into two categories, regression and classification
• "MSE" (Mean Squared Error) is commonly employed in regression problems where the NN output and target values are continuous (it is also used to measure the difference between images or masks in segmentation, and it shows how far the predictions are from the targets)

$\text{MSE} = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$

• "Cross Entropy" (Binary or Categorical) calculates the probability of belonging to a specific class

$\text{CE} = -\frac{1}{n} \sum_{i=1}^{n} y_i \log(\hat{y}_i)$   (multiple classes, softmax)

$\text{CE} = -\frac{1}{n} \sum_{i=1}^{n} \left[ y_i \log(\hat{y}_i) + (1 - y_i)\log(1 - \hat{y}_i) \right]$   (binary classes, sigmoid)
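Minimal NumPy versions of these losses (a sketch, not from the slides); y denotes the labels/targets and p or ŷ the model outputs:

```python
import numpy as np

def mse(y, y_hat):
    return np.mean((y - y_hat) ** 2)

def categorical_ce(y_onehot, p):
    """Multi-class cross entropy with softmax outputs p (rows sum to 1)."""
    return -np.mean(np.sum(y_onehot * np.log(p), axis=1))

def binary_ce(y, p):
    """Binary cross entropy with sigmoid outputs p."""
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

print(mse(np.array([1.0, 2.0, 3.0]), np.array([1.1, 1.9, 3.3])))

y_cls = np.array([[1, 0, 0], [0, 1, 0]])                 # one-hot labels, 2 samples
p_cls = np.array([[0.7, 0.2, 0.1], [0.3, 0.6, 0.1]])     # softmax outputs
print(categorical_ce(y_cls, p_cls))

print(binary_ce(np.array([1., 0.]), np.array([0.9, 0.2])))
```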

18
Loss function

• Cross Entropy (Binary or Categorical) calculates the probability of belonging to a specific class
• The softmax function converts the final outputs to the range [0, 1], and the loss is determined using cross entropy with the one-hot label
Prediction values (ŷ):                         Real labels (one-hot, y):
          Sample #1  Sample #2  Sample #3               Sample #1  Sample #2  Sample #3
Class 1     0.3        0.3        0.4                      1          0          0
Class 2     0.1        0.5        0.1                      0          1          0
Class 3     0.6        0.2        0.5                      0          0          1

CE = -(log 0.3 + log 0.5 + log 0.5) = 2.59

With increased prediction accuracy:
          Sample #1  Sample #2  Sample #3               Sample #1  Sample #2  Sample #3
Class 1     0.4        0.2        0.1                      1          0          0
Class 2     0.1        0.7        0.1                      0          1          0
Class 3     0.5        0.1        0.8                      0          0          1

CE = -(log 0.4 + log 0.7 + log 0.8) = 1.50
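A two-line NumPy check of the numbers above, using the probabilities each sample assigns to its true class (natural log, summed over the three samples as on the slide):

```python
import numpy as np

# Probabilities assigned to the true class of each sample (from the tables above)
before = np.array([0.3, 0.5, 0.5])
after  = np.array([0.4, 0.7, 0.8])

print(-np.sum(np.log(before)))   # ~2.59
print(-np.sum(np.log(after)))    # ~1.50  (better predictions -> lower loss)
```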

19
LeNet, Digit Classification

• The network that started it all (and then stopped for ~14 years)

20
AlexNet

• AlexNet is an 8-layer CNN model that trains two identical structures in parallel, utilizing GPUs
• It replaces the previously common tanh and sigmoid functions with the "ReLU" function, which converges about six times faster than those activation functions; following this, most subsequent models use the ReLU function

Krizhevsky, Alex, Ilya Sutskever, and Geoffrey E. Hinton. "ImageNet classification with deep convolutional neural networks." Communications
of the ACM 60.6 (2012): 84-90. 21
VGGNet

• The Oxford University research team VGG developed VGGNet, which won second place in the 2014 ImageNet image recognition competition
• The VGGNet model confirmed that the deeper the network, the better the performance, and it employed stacks of numerous tiny 3x3 kernels in place of large kernels (see the comparison sketched below)

• VGG-16, VGG-19, etc. are variants of the configuration that differ in depth
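A back-of-the-envelope comparison (not from the slides) of why stacked 3x3 kernels are cheaper than a single large kernel with the same receptive field; C = 64 input/output channels is an arbitrary choice and biases are ignored:

```python
# Weight count for one large kernel vs. a stack of 3x3 kernels with the same
# receptive field, with C input and C output channels, biases ignored.
C = 64
one_5x5   = 5 * 5 * C * C            # receptive field 5x5
two_3x3   = 2 * (3 * 3 * C * C)      # also receptive field 5x5, ~28% fewer weights
one_7x7   = 7 * 7 * C * C
three_3x3 = 3 * (3 * 3 * C * C)      # receptive field 7x7, ~45% fewer weights
print(one_5x5, two_3x3, one_7x7, three_3x3)
```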

Simonyan, Karen, and Andrew Zisserman. "Very deep convolutional networks for large-scale image recognition." arXiv preprint
arXiv:1409.1556 (2014). 22
GoogLeNet (Inception module)

• The inception module was proposed, using different kernel sizes and bottleneck structures to address the computational inefficiency of VGG
• It effectively extracts features by employing parallel convolutional layers at several kernel sizes
• Global average pooling (GAP) was used in place of the last FC layer to significantly reduce the size of the model and improve accuracy and computational efficiency (see the sketch below)
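A minimal PyTorch sketch (not GoogLeNet's actual code) of global average pooling replacing a large flatten-then-FC classifier head; the channel count and class count are arbitrary:

```python
import torch
import torch.nn as nn

x = torch.randn(1, 512, 7, 7)                 # final feature maps (e.g. 512 channels)
gap = nn.AdaptiveAvgPool2d(1)                 # global average pooling
head = nn.Linear(512, 1000)                   # small classifier head

logits = head(gap(x).flatten(1))
print(logits.shape)                           # torch.Size([1, 1000])
# GAP replaces a huge flatten->FC layer (512*7*7 inputs) with a 512-dim vector
```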

Szegedy, Christian, et al. "Going deeper with convolutions." Proceedings of the IEEE conference on computer vision and pattern recognition.
(2015). 23
ResNet

• ResNet uses residual connections in a block structure to handle the gradient vanishing and degradation problems efficiently
• GoogLeNet has only 22 layers because gradient vanishing worsens as the network becomes deeper, while ResNet can create a deep network with up to 152 layers

This is similar to an open-book test where you are given material that you have already learned.

In $H(x) = F(x) + x$, we train so that $F(x) = H(x) - x$, which corresponds to the additional amount to be learned, becomes 0; $H(x) - x$ is called the "residual"

• Residual blocks utilize shortcuts that add the input values to the output values
• By simply adding the input x, the network uses "skip connections" so that each layer learns only the small residual information beyond the existing information (a minimal block is sketched below)
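A minimal PyTorch sketch of a basic residual block in this spirit (not the official ResNet implementation); the channel count is arbitrary:

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Basic residual block: output = ReLU(F(x) + x), so the conv path only has
    to learn the residual F(x) = H(x) - x."""
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return self.relu(out + x)          # skip connection adds the input back

block = ResidualBlock(64)
print(block(torch.randn(1, 64, 32, 32)).shape)   # torch.Size([1, 64, 32, 32])
```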
He, Kaiming, et al. "Deep residual learning for image recognition." Proceedings of the IEEE conference on computer vision and pattern recognition.
(2016). 24
ResNet (bottleneck structure)

• Because such a deep network demands a significant amount of computation, a bottleneck block is used in deep models
• A 1x1 convolution is applied before the 3x3 convolution to reduce the number of channels, and another 1x1 convolution afterwards restores the channel count
• Thus, when utilizing a deep ResNet (or any other deep network), it is recommended to employ bottleneck blocks

[Figure: standard residual block vs. bottleneck block]
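A minimal PyTorch sketch of a bottleneck block in this spirit (not the official ResNet implementation); the 256 → 64 → 256 channel choice mirrors the common 4x reduction but is an assumption here:

```python
import torch
import torch.nn as nn

class BottleneckBlock(nn.Module):
    """1x1 conv reduces channels, 3x3 conv works on the reduced maps,
    and a final 1x1 conv restores the channel count before the skip add."""
    def __init__(self, channels, reduced):
        super().__init__()
        self.path = nn.Sequential(
            nn.Conv2d(channels, reduced, 1, bias=False),
            nn.BatchNorm2d(reduced), nn.ReLU(inplace=True),
            nn.Conv2d(reduced, reduced, 3, padding=1, bias=False),
            nn.BatchNorm2d(reduced), nn.ReLU(inplace=True),
            nn.Conv2d(reduced, channels, 1, bias=False),
            nn.BatchNorm2d(channels),
        )
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.relu(self.path(x) + x)

block = BottleneckBlock(channels=256, reduced=64)
print(block(torch.randn(1, 256, 16, 16)).shape)   # torch.Size([1, 256, 16, 16])
```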

He, Kaiming, et al. "Deep residual learning for image recognition." Proceedings of the IEEE conference on computer vision and pattern recognition.
(2016). 25
Classification model performance

26
