Lecture 5

Machine Learning for Computer Vision

Convolutional Neural Networks

Mehdi Zakroum

International University of Rabat

Acknowledgments for slides: Courtesy of Prof. Mounir Ghogho.


Outline

1. About The ImageNet Dataset and The ILSVRC Competition

2. Breakthrough Convolutional Neural Networks

3. Conclusion

1. About The ImageNet Dataset and The ILSVRC Competition

The ImageNet Dataset

ImageNet is an image dataset:


▶ Collected from the web
▶ It contains 14+ million high-resolution images
▶ Images are annotated across 21,000+ categories
▶ Images of each concept (category) are quality-controlled and
human-annotated
▶ Images are of variable resolution (size)

Website: https://www.image-net.org

About ILSVRC

The ImageNet Large Scale Visual Recognition Challenge (ILSVRC) evaluates algorithms for object detection and image classification at large scale.

It uses a subset of the ImageNet dataset:


▶ 1,281,167 training images
▶ 50,000 validation images
▶ 100,000 test images
▶ Images have been down-sampled to a fixed resolution of 256 × 256: each image is first rescaled so that its shorter side measures 256 pixels, then cropped to retain only the 256 × 256 central region (a minimal preprocessing sketch follows this list)
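
A minimal sketch of this preprocessing (an assumption for illustration, using torchvision, which the original competition pipeline predates):

```python
# Rescale the shorter side to 256 pixels, then keep only the 256 x 256
# central region, as described above.
from PIL import Image
from torchvision import transforms

preprocess = transforms.Compose([
    transforms.Resize(256),      # shorter side -> 256 px, aspect ratio kept
    transforms.CenterCrop(256),  # retain the central 256 x 256 region
])

img = preprocess(Image.open("example.jpg"))  # "example.jpg" is hypothetical
```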

Researchers around the world report their results and the most successful
and innovative teams are invited to present at the Computer Vision and
Pattern Recognition (CVPR) conference.

2. Breakthrough Convolutional Neural Networks

AlexNet

Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. “ImageNet classification with deep convolutional neural networks.” Advances in Neural Information Processing Systems 25. 2012.


AlexNet (2012)

AlexNet is one of the first deep CNNs to achieve high accuracy in the ILSVRC competition: in 2012 it reached a top-5 accuracy of 84.7%, compared to 73.8% for the second-best entry.

How many trainable parameters?
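
One way to estimate the answer (a sketch assuming torchvision is available; its AlexNet variant differs slightly from the original two-GPU model):

```python
# Count the trainable parameters of torchvision's AlexNet variant.
from torchvision.models import alexnet

model = alexnet()
print(sum(p.numel() for p in model.parameters() if p.requires_grad))  # ~61M
```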


AlexNet (2012): what made it win the challenge?

▶ ReLU Non-linearity: AlexNet uses Rectified Linear Units (ReLU) instead of the tanh function, which was standard at the time.
▶ Multiple GPUs: AlexNet allows for multi-GPU training by putting half of the model’s neurons on one GPU and the other half on another GPU. Not only does this mean that a bigger model can be trained, but it also cuts down on the training time.
▶ Overlapping Pooling: CNNs traditionally “pool” the outputs of neighboring groups of neurons with no overlap; AlexNet instead uses pooling windows that overlap (stride smaller than the window size), which slightly reduces the error rates.
▶ Methods to reduce overfitting: AlexNet has 60+ million parameters, which makes overfitting a serious concern; to alleviate it, AlexNet uses data augmentation and dropout (see the sketch after this list).
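
A minimal PyTorch sketch (an assumed modern translation, not the original implementation) of three of these ingredients:

```python
# ReLU activations, overlapping pooling (3x3 window, stride 2 < window size),
# and dropout in the classifier, in AlexNet style.
import torch.nn as nn

features = nn.Sequential(
    nn.Conv2d(3, 96, kernel_size=11, stride=4),  # AlexNet-style first conv
    nn.ReLU(inplace=True),                       # ReLU instead of tanh
    nn.MaxPool2d(kernel_size=3, stride=2),       # overlapping pooling
)

classifier = nn.Sequential(
    nn.Dropout(p=0.5),       # dropout to reduce overfitting
    nn.Linear(4096, 1000),   # final layer for ImageNet's 1000 classes
)
```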


VGG16 & VGG19

Karen Simonyan and Andrew Zisserman. “Very deep convolutional networks for large-scale image recognition.” arXiv preprint arXiv:1409.1556. 2014.


VGG16 (2014)

Figure 1: Architecture of VGG16

▶ Designed by the Visual Geometry Group (VGG) at Oxford University
▶ 13 convolutional layers, 5 max pooling layers, 3 fully connected layers (4096, 4096, 1000)
⇒ 16 weight layers (i.e. 16 layers of trainable parameters)

How many trainable parameters?
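
One way to check (a sketch assuming torchvision is available; the answer is roughly 138 million):

```python
# Count VGG16's trainable parameters (pretrained weights are not needed).
from torchvision.models import vgg16

model = vgg16()
n_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"{n_params:,}")  # 138,357,544
```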


Deeper CNN: VGG19

Figure 2: VGG19 Architecture


Compared VGG CNNs and Number of Parameters


VGG Accuracy


VGG: What Made It Win the Competition?


Quote from the VGG article (where the authors discuss improvements over the previous models):

“It is easy to see that a stack of two 3 × 3 conv. layers (without spatial pooling in between) has an effective receptive field of 5 × 5; three such layers have a 7 × 7 effective receptive field. So what have we gained by using, for instance, a stack of three 3 × 3 conv. layers instead of a single 7 × 7 layer? First, we incorporate three non-linear rectification layers instead of a single one, which makes the decision function more discriminative. Second, we decrease the number of parameters: assuming that both the input and the output of a three-layer 3 × 3 convolution stack has C channels, the stack is parametrised by 3(3²C²) = 27C² weights; at the same time, a single 7 × 7 conv. layer would require 7²C² = 49C² parameters, i.e. 81% more. This can be seen as imposing a regularisation on the 7 × 7 conv. filters, forcing them to have a decomposition through the 3 × 3 filters (with non-linearity injected in between).”
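
A quick check of the arithmetic in the quote (bias terms ignored):

```python
# Parameters of three stacked 3x3 conv layers vs one 7x7 layer, both
# mapping C channels to C channels.
C = 64  # any value works; the ratio does not depend on C
stack_3x3 = 3 * (3 * 3 * C * C)   # 27 C^2
single_7x7 = 7 * 7 * C * C        # 49 C^2
print(single_7x7 / stack_3x3 - 1)  # ≈ 0.81, i.e. 81% more parameters
```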


GoogLeNet (Inception)

Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed,
Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew
Rabinovich. “Going deeper with convolutions.” In Proceedings of the IEEE
conference on computer vision and pattern recognition, pp. 1-9. 2015.


Inception, a.k.a. GoogLeNet (2014)


▶ When designing a layer for a CNN, you normally need to pick one specific filter size (1 × 1, 3 × 3, 5 × 5, etc.) or a pooling layer. This might limit the information captured by that layer.
▶ The Inception model is characterized by its innovative use of “inception modules”. These modules combine filters of different sizes, including 1 × 1, 3 × 3, and 5 × 5 convolutions, within the same layer.
▶ This architectural choice allows the network to capture features at multiple scales simultaneously; i.e. the model gains the ability to adapt to different scales of features present in the input data (a sketch of the module follows the figure below).

Figure 3: Inception Module (naive version)
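
A minimal PyTorch sketch of the naive module (an assumed translation; the branch widths are illustrative, not prescribed by the paper):

```python
# Naive inception module: parallel 1x1, 3x3, and 5x5 convolutions plus 3x3
# max pooling. Padding keeps every branch at the same spatial size so the
# outputs can be concatenated along the channel dimension.
import torch
import torch.nn as nn

class NaiveInception(nn.Module):
    def __init__(self, in_ch):
        super().__init__()
        self.conv1 = nn.Conv2d(in_ch, 64, kernel_size=1)
        self.conv3 = nn.Conv2d(in_ch, 128, kernel_size=3, padding=1)
        self.conv5 = nn.Conv2d(in_ch, 32, kernel_size=5, padding=2)
        self.pool = nn.MaxPool2d(kernel_size=3, stride=1, padding=1)

    def forward(self, x):
        branches = [self.conv1(x), self.conv3(x), self.conv5(x), self.pool(x)]
        return torch.cat(branches, dim=1)  # stack branch outputs channel-wise

x = torch.randn(1, 192, 28, 28)
print(NaiveInception(192)(x).shape)  # torch.Size([1, 416, 28, 28])
```

Note that the pooling branch keeps all input channels, so the naive module's output width keeps growing with depth; the dimensionality-reduction version shown later addresses this.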


Inception Module Example

Figure 4: Example of an Inception Module: input size is 28 × 28 × 192 and output size is 28 × 28 × 256


Inception: Computational Cost Issue and Solution


▶ Issue: Large filters such as 5 × 5 convolutions bring the benefit of capturing spatial hierarchies and larger patterns; however, they are computationally expensive.
For a 5 × 5 convolution mapping a 28 × 28 × 192 input to a 28 × 28 × 32 output, the total number of floating-point multiplications is:
28 × 28 × 32 × 5 × 5 × 192 ≈ 120M
▶ Solution: dimensionality reduction using 1 × 1 convolutions. Reducing the input to 16 channels first, the total number of floating-point multiplications becomes (verified in the snippet below):
(28 × 28 × 16 × 1 × 1 × 192) + (28 × 28 × 32 × 5 × 5 × 16) ≈ 12.4M
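
A quick verification of these counts:

```python
# Multiplications for the direct 5x5 convolution vs the 1x1 bottleneck
# followed by the 5x5 convolution, exactly as computed above.
direct = 28 * 28 * 32 * 5 * 5 * 192
reduced = 28 * 28 * 16 * 1 * 1 * 192 + 28 * 28 * 32 * 5 * 5 * 16
print(f"{direct:,} vs {reduced:,}")  # 120,422,400 vs 12,443,648 (~10x fewer)
```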

Inception Module with Dimensionality Reduction

Figure 5: Inception Module (dimensionality reduction version)
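
A sketch of this version (an assumed PyTorch translation; the branch widths follow the 28 × 28 × 192 → 28 × 28 × 256 example above):

```python
# Inception module with 1x1 "bottleneck" convolutions before the expensive
# 3x3 and 5x5 branches, and a 1x1 projection after the pooling branch.
import torch
import torch.nn as nn

class InceptionReduced(nn.Module):
    def __init__(self, in_ch):
        super().__init__()
        self.b1 = nn.Conv2d(in_ch, 64, kernel_size=1)
        self.b2 = nn.Sequential(
            nn.Conv2d(in_ch, 96, kernel_size=1), nn.ReLU(inplace=True),
            nn.Conv2d(96, 128, kernel_size=3, padding=1))
        self.b3 = nn.Sequential(
            nn.Conv2d(in_ch, 16, kernel_size=1), nn.ReLU(inplace=True),
            nn.Conv2d(16, 32, kernel_size=5, padding=2))
        self.b4 = nn.Sequential(
            nn.MaxPool2d(kernel_size=3, stride=1, padding=1),
            nn.Conv2d(in_ch, 32, kernel_size=1))

    def forward(self, x):
        return torch.cat([self.b1(x), self.b2(x), self.b3(x), self.b4(x)], dim=1)

x = torch.randn(1, 192, 28, 28)
print(InceptionReduced(192)(x).shape)  # torch.Size([1, 256, 28, 28])
```

(64 + 128 + 32 + 32 = 256 output channels, matching the earlier example.)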


Inception

Figure 6: Full GoogLeNet Network


Inception: Structural Details


ResNet

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. “Deep residual
learning for image recognition.” In Proceedings of the IEEE conference on
computer vision and pattern recognition, pp. 770-778. 2016.


ResNet: Training Deeper Networks is Harder

Figure 7: Error curves on CIFAR-10 with 20-layer and 56-layer networks

▶ Network depth is of crucial importance: deeper networks can learn more hidden patterns.
▶ However, a degradation issue is observed: deeper neural networks are more difficult to train.
▶ The deeper network has higher training error, and thus higher test error.
▶ This degradation issue is not caused by overfitting!

ResNet: Degradation Should not Happen In Deeper Nets!


Performance should not degrade when more layers are added; here is why:

Quote from the ResNet article:

“The degradation (of training accuracy) indicates that not all systems are similarly easy to
optimize. Let us consider a shallower architecture and its deeper counterpart that adds more
layers onto it. There exists a solution by construction to the deeper model: the added layers
are identity mapping, and the other layers are copied from the learned shallower model. The
existence of this constructed solution indicates that a deeper model should produce no higher
training error than its shallower counterpart. But experiments show that our current solvers
on hand are unable to find solutions that are comparably good or better than the constructed
solution (or unable to do so in feasible time).”


ResNet: Introducing The Residual Block


▶ Assume that H is the mapping that takes x as input and produces the ideal predicted output (matching the ground truth).
▶ Instead of learning the mapping H directly, why not learn the function F that tells us what information to add to x to obtain the desired output H(x), i.e. H(x) = F(x) + x? F is called the residual function.
▶ Thus, we want to learn F(x) = H(x) − x. Then, to get the desired output, we compute H(x) = F(x) + x.
▶ Hypothesis: it is easier to optimize the residual mapping F than the original, unreferenced mapping H. In the extreme case, if an identity mapping were optimal, it would be easier to push the residual to zero than to fit an identity mapping with a stack of nonlinear layers. (A minimal sketch of a residual block follows the figure below.)

Figure 8: Example of a Residual Block
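
A minimal PyTorch sketch of such a block (an assumption for illustration; the paper's blocks also use batch normalization, omitted here):

```python
# Residual block: learn F(x) = W2 σ(W1 x), then output H(x) = F(x) + x.
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        residual = self.conv2(self.relu(self.conv1(x)))  # F(x)
        return self.relu(residual + x)                   # F(x) + x, then ReLU

x = torch.randn(1, 64, 56, 56)
print(ResidualBlock(64)(x).shape)  # torch.Size([1, 64, 56, 56])
```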


ResNet: Optimization Algorithms Struggle To Learn Identity


Quote from the ResNet article:

“The degradation problem suggests that the solvers might have difficulties in approximating identity mappings by multiple nonlinear layers. With the residual learning reformulation, if identity mappings are optimal, the solvers may simply drive the weights of the multiple nonlinear layers toward zero to approach identity mappings.”


ResNet: Shortcuts (Skip Connections) Formulation

Figure 9: Example of a Residual Block

▶ A ResNet building block is expressed as y = F(x, {Wi}) + x, where:
x and y are the input and output vectors
F represents the residual mapping to be learned
{Wi} are its trainable parameters
▶ In the figure above: F(x) = W2 σ(W1 x), where σ is the ReLU activation.
▶ To perform the element-wise addition F(x) + x, the dimensions of F(x) and x must be equal. If they are not, we can apply a linear projection Ws to the shortcut (a sketch follows this list):

y = F(x, {Wi}) + Ws x

▶ The form of the residual function F is flexible: in the illustrative figure F has two layers, but more layers are possible.
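
A sketch of the projection shortcut (an assumption: following common practice, Ws is realized as a strided 1 × 1 convolution when the spatial size and channel count change):

```python
# Residual block whose main path halves the spatial size and doubles the
# channel count, so the shortcut needs the projection Ws to match shapes.
import torch
import torch.nn as nn

class ProjectionBlock(nn.Module):
    def __init__(self, in_ch, out_ch, stride=2):
        super().__init__()
        self.conv1 = nn.Conv2d(in_ch, out_ch, kernel_size=3, stride=stride, padding=1)
        self.conv2 = nn.Conv2d(out_ch, out_ch, kernel_size=3, padding=1)
        self.proj = nn.Conv2d(in_ch, out_ch, kernel_size=1, stride=stride)  # Ws
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        out = self.conv2(self.relu(self.conv1(x)))  # F(x, {Wi})
        return self.relu(out + self.proj(x))        # F(x) + Ws x

x = torch.randn(1, 64, 56, 56)
print(ProjectionBlock(64, 128)(x).shape)  # torch.Size([1, 128, 28, 28])
```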


ResNet: Architecture

Figure 10: Architecture of ResNet-34 vs Plain-34


ResNet: Results after Training on ImageNet

Figure 11: Training on ImageNet. Thin curves denote training error; bold curves denote validation error.

Experiments show that ResNet:

▶ Converges faster at the early stage of training.
▶ Provides accuracy gains from increased depth.


ResNet: Comparative Performance

Figure 12: Error rates (%) of single-model results on the ImageNet validation set.

Figure 13: Error rates (%) of ensembles. The top-5 error is on the test set of ImageNet.
3. Conclusion

Conclusion
▶ Multi-Layer Perceptrons (Fully-Connected Neural Networks):
Do not scale well to images (see the parameter-count sketch below)
Ignore the information carried by pixel position and correlation with neighbors
Cannot handle translations
▶ Convolutional Neural Networks:
Leverage sparse interactions and parameter sharing to reduce the number of parameters to learn
Use the convolution operation to make object detection and classification robust to shifts of objects in the image
Have demonstrated remarkable results in image classification on benchmark tasks and in practical applications
However, CNNs have limited robustness to other geometric transformations such as scaling and rotation: scaling or rotating an image changes the spatial relationships between pixels and can result in a loss of relevant features that the CNN was trained to recognize
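
An illustration of the scaling argument (an illustrative comparison, not from the slides): even one fully connected layer on a modest image dwarfs a convolutional layer in parameter count:

```python
# Parameter count: a dense layer on a flattened 224x224 RGB image vs a
# convolutional layer with 64 shared 3x3 filters on the same image.
import torch.nn as nn

fc = nn.Linear(224 * 224 * 3, 4096)                # one dense hidden layer
conv = nn.Conv2d(3, 64, kernel_size=3, padding=1)  # 64 shared 3x3 filters

count = lambda m: sum(p.numel() for p in m.parameters())
print(f"fully connected: {count(fc):,}")   # 616,566,784
print(f"convolutional:   {count(conv):,}") # 1,792
```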
