VGG Net

The document discusses Convolutional Neural Networks (CNNs) and their application in the ImageNet Large Scale Visual Recognition Challenge (ILSVRC), which aims to classify images into 1,000 categories using a large dataset. It highlights the VGG architecture, developed by the Visual Geometry Group at Oxford University, which utilizes small 3x3 convolution kernels to improve accuracy while managing parameters. The VGG models, ranging from VGG11 to VGG19, emphasize the benefits of deeper networks and multiple convolution layers for enhanced image classification performance.


A Convolutional Neural Network (CNN, or ConvNet) is a special kind of multi-layer neural network designed to recognize visual patterns directly from pixel images with minimal preprocessing. The ImageNet project is a large visual database designed for use in visual object recognition research. The project runs an annual software contest, the ImageNet Large Scale Visual Recognition Challenge (ILSVRC), in which programs compete to correctly classify and detect objects and scenes. Here I will talk about the CNN architecture of one of ILSVRC's top competitors, VGGNet.

What is ImageNet?

ImageNet is formally a project aimed at (manually) labeling and categorizing images into almost 22,000 separate object categories for the purpose of computer vision research.

However, when we hear the term “ImageNet” in the context of deep learning and
Convolutional Neural Networks, we are likely referring to the ImageNet Large Scale
Visual Recognition Challenge, or ILSVRC for short.
The goal of this image classification challenge is to train a model that can correctly
classify an input image into 1,000 separate object categories.

Models are trained on ~1.2 million training images with another 50,000 images for
validation and 100,000 images for testing.

These 1,000 image categories represent object classes that we encounter in our day-
to-day lives, such as species of dogs, cats, various household objects, vehicle types,
and much more. When it comes to image classification, the ImageNet challenge is
the de facto benchmark for computer vision classification algorithms — and the
leaderboard for this challenge has been dominated by Convolutional Neural
Networks and deep learning techniques since 2012.

VGGNet
Introduction-

The full name of VGG is the Visual Geometry Group, which belongs to the Department of Engineering Science at Oxford University. The group has released a series of convolutional network models beginning with "VGG", which can be applied to face recognition and image classification, ranging up to VGG16 and VGG19. The original purpose of VGG's research on the depth of convolutional networks was to understand how depth affects the accuracy of large-scale image classification and recognition. In VGG16 (the 16-layer configuration), in order to deepen the network while avoiding too many parameters, small 3x3 convolution kernels are used in all layers.

Network structure-

The input to VGG is a fixed-size 224x224 RGB image. The mean RGB value, computed over the training set, is subtracted from each pixel, and the resulting image is fed into the VGG convolutional network. Filters of size 3x3 (or, in one configuration, 1x1) are used, and the convolution stride is fixed at 1 pixel.

There are 3 fully connected layers in VGG, and the variants range from VGG11 to VGG19 according to the total number of convolutional plus fully connected layers. The smallest, VGG11, has 8 convolutional layers and 3 fully connected layers; the largest, VGG19, has 16 convolutional layers plus 3 fully connected layers. In addition, VGG does not place a pooling layer after every convolutional layer; there are 5 pooling layers in total, distributed after different convolutional layers. The figure below shows the VGG structure diagram.
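Before walking through the diagram, here is a rough, hypothetical sketch of the preprocessing described above (per-channel mean computed over the training set and subtracted from every image), assuming NumPy; the function and variable names are my own illustration, not from the original text:

import numpy as np

# `train_images` is assumed to be an array of shape (N, 224, 224, 3) holding
# the training set already resized/cropped to 224x224 RGB.
def compute_mean_rgb(train_images: np.ndarray) -> np.ndarray:
    # Average over all images and all pixel positions -> one value per channel.
    return train_images.mean(axis=(0, 1, 2))          # shape (3,)

def preprocess(image: np.ndarray, mean_rgb: np.ndarray) -> np.ndarray:
    # Subtract the training-set mean from each pixel before feeding the
    # image to the VGG convolutional network.
    return image.astype(np.float32) - mean_rgb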

VGG16 contains 16 weight layers and VGG19 contains 19. All VGG variants share exactly the same last three fully connected layers. The overall structure consists of 5 groups of convolutional layers, each followed by a max-pooling layer; the difference between the variants is that more and more cascaded convolutional layers are included in the five groups.
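A minimal sketch of this overall structure in PyTorch (my own illustration in the style of the torchvision implementation, not the authors' code; dropout and other training details are omitted):

import torch
import torch.nn as nn

# VGG16 convolutional configuration: 5 groups of 3x3 convolutions, each group
# ending in a 2x2 max-pooling layer. Numbers are output channels; 'M' marks a pool.
cfg_vgg16 = [64, 64, 'M', 128, 128, 'M', 256, 256, 256, 'M',
             512, 512, 512, 'M', 512, 512, 512, 'M']

def make_features(cfg, in_channels=3):
    layers = []
    for v in cfg:
        if v == 'M':
            layers.append(nn.MaxPool2d(kernel_size=2, stride=2))
        else:
            layers += [nn.Conv2d(in_channels, v, kernel_size=3, padding=1),
                       nn.ReLU(inplace=True)]
            in_channels = v
    return nn.Sequential(*layers)

features = make_features(cfg_vgg16)
classifier = nn.Sequential(            # the 3 fully connected layers shared by all variants
    nn.Linear(512 * 7 * 7, 4096), nn.ReLU(inplace=True),
    nn.Linear(4096, 4096), nn.ReLU(inplace=True),
    nn.Linear(4096, 1000),             # 1,000 ILSVRC object categories
)

x = torch.randn(1, 3, 224, 224)        # a dummy 224x224 RGB input
out = classifier(torch.flatten(features(x), 1))
print(out.shape)                       # torch.Size([1, 1000])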
Each convolutional layer in AlexNet contains only a single convolution and uses relatively large kernels (11x11 and 5x5 in the early layers). In VGGNet, each group of convolutional layers contains 2 to 4 convolution operations. The convolution kernel size is 3x3, the convolution stride is 1, the pooling kernel is 2x2, and the pooling stride is 2. The most obvious improvement of VGGNet is therefore to reduce the kernel size while increasing the number of convolutional layers.
Using multiple convolutional layers with smaller kernels instead of a single layer with a larger kernel reduces the number of parameters on the one hand, and, the authors argue, is equivalent to more non-linear mappings, which increases the network's expressive capacity.

Two consecutive 3x3 convolutions have the same receptive field as one 5x5 convolution, and three consecutive 3x3 convolutions have the same receptive field as one 7x7 convolution. The advantages of using three 3x3 convolutions instead of one 7x7 convolution are twofold: first, three ReLU layers instead of one make the decision function more discriminative; second, the number of parameters is reduced. For example, if the input and output both have C channels, 3 convolutional layers using 3x3 kernels require 3 x (3 x 3 x C x C) = 27 x C x C parameters, while 1 convolutional layer using a 7x7 kernel requires 7 x 7 x C x C = 49 x C x C. This can be seen as imposing a kind of regularization on the 7x7 convolution, forcing it to be decomposed through three 3x3 convolutions.
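To check this arithmetic, here is a small, hypothetical PyTorch snippet of my own (bias terms omitted so that the counts match the formula exactly):

import torch.nn as nn

C = 64  # example channel count; input and output both have C channels

# Three stacked 3x3 convolutions (with a ReLU after each) ...
stacked_3x3 = nn.Sequential(
    nn.Conv2d(C, C, 3, padding=1, bias=False), nn.ReLU(),
    nn.Conv2d(C, C, 3, padding=1, bias=False), nn.ReLU(),
    nn.Conv2d(C, C, 3, padding=1, bias=False), nn.ReLU(),
)
# ... versus a single 7x7 convolution with the same receptive field.
single_7x7 = nn.Conv2d(C, C, 7, padding=3, bias=False)

count = lambda m: sum(p.numel() for p in m.parameters())
print(count(stacked_3x3), 27 * C * C)   # 110592 110592
print(count(single_7x7), 49 * C * C)    # 200704 200704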

The 1x1 convolution layer is used mainly to increase the non-linearity of the decision function without affecting the receptive field of the convolutional layer. Although the 1x1 convolution operation itself is linear, the ReLU that follows it adds non-linearity.
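As an illustrative sketch (assuming PyTorch; the channel count 256 and spatial size 56x56 are arbitrary choices of mine), a 1x1 convolution followed by ReLU mixes channels at each position while leaving the spatial size, and hence the receptive field, unchanged:

import torch
import torch.nn as nn

block = nn.Sequential(
    nn.Conv2d(256, 256, kernel_size=1),   # 1x1 convolution, stride 1
    nn.ReLU(inplace=True),                # the non-linearity comes from the ReLU
)

x = torch.randn(1, 256, 56, 56)
print(block(x).shape)   # torch.Size([1, 256, 56, 56]) -- spatial size unchanged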
