Lecture 5
Mehdi Zakroum
International University of Rabat

1. About The ImageNet Dataset and The ILSVRC Competition
Website: https://www.image-net.org
About ILSVRC
Researchers around the world report their results, and the most successful
and innovative teams are invited to present at the Computer Vision and
Pattern Recognition (CVPR) conference.
AlexNet
AlexNet (2012)
VGG16 (2014)
VGG Accuracy
“It is easy to see that a stack of two 3 × 3 conv. layers (without spatial pooling in between) has an effective
receptive field of 5 × 5; three such layers have a 7 × 7 effective receptive field. So what have we gained by
using, for instance, a stack of three 3 × 3 conv. layers instead of a single 7 × 7 layer? First, we incorporate
three non-linear rectification layers instead of a single one, which makes the decision function more
discriminative. Second, we decrease the number of parameters: assuming that both the input and the
output of a three-layer 3 × 3 convolution stack has C channels, the stack is parametrised by
3(3² C²) = 27C² weights; at the same time, a single 7 × 7 conv. layer would require 7² C² = 49C²
parameters, i.e. 81% more. This can be seen as imposing a regularisation on the 7 × 7 conv. filters, forcing
them to have a decomposition through the 3 × 3 filters (with non-linearity injected in between).”
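The arithmetic in this quote is easy to check numerically. Below is a minimal sketch (assuming PyTorch; C = 64 is an arbitrary choice, and biases are omitted to match the paper's count) comparing a single 7 × 7 convolution with a stack of three 3 × 3 convolutions.

```python
import torch
import torch.nn as nn

C = 64  # arbitrary channel count for illustration

# A single 7x7 convolution mapping C channels to C channels
single_7x7 = nn.Conv2d(C, C, kernel_size=7, padding=3, bias=False)

# A stack of three 3x3 convolutions with ReLUs in between
stack_3x3 = nn.Sequential(
    nn.Conv2d(C, C, kernel_size=3, padding=1, bias=False), nn.ReLU(inplace=True),
    nn.Conv2d(C, C, kernel_size=3, padding=1, bias=False), nn.ReLU(inplace=True),
    nn.Conv2d(C, C, kernel_size=3, padding=1, bias=False),
)

def num_params(m):
    return sum(p.numel() for p in m.parameters())

print(num_params(single_7x7))  # 49 * C^2 = 200704
print(num_params(stack_3x3))   # 27 * C^2 = 110592

# Both preserve the spatial size, and each output unit of the 3x3 stack
# sees a 7x7 region of the input (the effective receptive field).
x = torch.randn(1, C, 56, 56)
print(single_7x7(x).shape, stack_3x3(x).shape)  # both (1, 64, 56, 56)
```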
GoogLeNet (Inception)
Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed,
Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew
Rabinovich. “Going deeper with convolutions.” In Proceedings of the IEEE
conference on computer vision and pattern recognition, pp. 1-9. 2015.
Inception
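The building block of GoogLeNet is the Inception module: parallel 1 × 1, 3 × 3, and 5 × 5 convolutions (the latter two preceded by 1 × 1 reductions) plus a pooled branch, all concatenated along the channel dimension. The following is a minimal sketch, assuming PyTorch; the branch widths used below are illustrative rather than the exact GoogLeNet configuration.

```python
import torch
import torch.nn as nn

class InceptionModule(nn.Module):
    """Illustrative Inception-style module (branch widths are illustrative)."""
    def __init__(self, in_ch, c1, c3_reduce, c3, c5_reduce, c5, pool_proj):
        super().__init__()
        # Branch 1: 1x1 convolution
        self.b1 = nn.Sequential(nn.Conv2d(in_ch, c1, 1), nn.ReLU(inplace=True))
        # Branch 2: 1x1 reduction followed by 3x3 convolution
        self.b2 = nn.Sequential(
            nn.Conv2d(in_ch, c3_reduce, 1), nn.ReLU(inplace=True),
            nn.Conv2d(c3_reduce, c3, 3, padding=1), nn.ReLU(inplace=True),
        )
        # Branch 3: 1x1 reduction followed by 5x5 convolution
        self.b3 = nn.Sequential(
            nn.Conv2d(in_ch, c5_reduce, 1), nn.ReLU(inplace=True),
            nn.Conv2d(c5_reduce, c5, 5, padding=2), nn.ReLU(inplace=True),
        )
        # Branch 4: 3x3 max-pooling followed by 1x1 projection
        self.b4 = nn.Sequential(
            nn.MaxPool2d(3, stride=1, padding=1),
            nn.Conv2d(in_ch, pool_proj, 1), nn.ReLU(inplace=True),
        )

    def forward(self, x):
        # Concatenate all branch outputs along the channel dimension
        return torch.cat([self.b1(x), self.b2(x), self.b3(x), self.b4(x)], dim=1)

x = torch.randn(1, 192, 28, 28)
m = InceptionModule(192, 64, 96, 128, 16, 32, 32)
print(m(x).shape)  # (1, 64 + 128 + 32 + 32, 28, 28) = (1, 256, 28, 28)
```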
ResNet
Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. “Deep residual
learning for image recognition.” In Proceedings of the IEEE conference on
computer vision and pattern recognition, pp. 770-778. 2016.
“The degradation (of training accuracy) indicates that not all systems are similarly easy to
optimize. Let us consider a shallower architecture and its deeper counterpart that adds more
layers onto it. There exists a solution by construction to the deeper model: the added layers
are identity mapping, and the other layers are copied from the learned shallower model. The
existence of this constructed solution indicates that a deeper model should produce no higher
training error than its shallower counterpart. But experiments show that our current solvers
on hand are unable to find solutions that are comparably good or better than the constructed
solution (or unable to do so in feasible time).”
“The degradation problem suggests that the solvers might have difficulties in approximating
identity mappings by multiple nonlinear layers. With the residual learning reformulation, if
identity mappings are optimal, the solvers may simply drive the weights of the multiple
nonlinear layers toward zero to approach identity mappings.”
y = F(x, {Wᵢ}) + Wₛ x, where Wₛ is a linear projection used only to match dimensions (the shortcut is the identity when the dimensions already agree).
▶ The form of the residual function F is flexible; in the illustrative figure, F has two layers, while
more layers are possible.
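As a concrete illustration of the formula above, here is a minimal residual-block sketch, assuming PyTorch and batch-normalized 3 × 3 convolutions for F (the exact form of F varies across ResNet variants).

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Basic residual block: y = F(x, {Wi}) + Ws x (illustrative sketch)."""
    def __init__(self, in_ch, out_ch, stride=1):
        super().__init__()
        # F(x, {Wi}): two 3x3 convolutions with batch norm and ReLU in between
        self.f = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, 3, stride=stride, padding=1, bias=False),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, 3, padding=1, bias=False),
            nn.BatchNorm2d(out_ch),
        )
        # Ws: identity when shapes match, otherwise a 1x1 projection
        if stride != 1 or in_ch != out_ch:
            self.shortcut = nn.Sequential(
                nn.Conv2d(in_ch, out_ch, 1, stride=stride, bias=False),
                nn.BatchNorm2d(out_ch),
            )
        else:
            self.shortcut = nn.Identity()

    def forward(self, x):
        # Add the shortcut to the residual branch, then apply the final ReLU
        return torch.relu(self.f(x) + self.shortcut(x))

x = torch.randn(1, 64, 56, 56)
print(ResidualBlock(64, 128, stride=2)(x).shape)  # (1, 128, 28, 28)
```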
ResNet: Architecture
Conclusion
▶ Multi-Layer Perceptrons (Fully-Connected Neural Networks):
  • Do not scale well to images
  • Ignore the information carried by pixel positions and their correlation with neighboring pixels
  • Are not robust to translations of objects in the image
▶ Convolutional Neural Networks:
  • Leverage sparse interactions and parameter sharing to reduce the number of parameters to learn (see the parameter-count sketch after this list)
  • Use the convolution operation to make object detection and classification robust to shifts of objects in the image
  • Have demonstrated remarkable results in image classification, both on benchmark tasks and in practical applications
  • However, CNNs have limited robustness to other geometric transformations such as scaling and rotation: scaling or rotating an image changes the spatial relationships between pixels and can result in a loss of relevant features that the CNN was trained to recognize.
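To make the parameter-sharing point concrete, the sketch below (assuming PyTorch; the layer sizes are arbitrary) compares the number of weights in a fully-connected layer and in a convolutional layer applied to a 224 × 224 RGB image.

```python
import torch.nn as nn

# A single fully-connected layer mapping the flattened image to 100 hidden units
fc = nn.Linear(224 * 224 * 3, 100)

# A convolutional layer with 100 filters of size 3x3 over the same input
conv = nn.Conv2d(3, 100, kernel_size=3)

def num_params(m):
    return sum(p.numel() for p in m.parameters())

print(num_params(fc))    # 15,052,900 parameters
print(num_params(conv))  # 2,800 parameters
```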