
2018 Fourth International Conference on Research in Computational Intelligence and Communication Networks (ICRCICN)

Advancements in Image Classification using Convolutional Neural Network

Farhana Sultana, Department of Computer Science, University of Gour Banga, West Bengal, India
Abu Sufian, Department of Computer Science, University of Gour Banga, West Bengal, India
Paramartha Dutta, Department of CSS, Visva-Bharati University, West Bengal, India

Abstract—Convolutional Neural Network (CNN) is the state-of-the-art for the image classification task. Here we have briefly discussed different components of CNN. In this paper, we have explained different CNN architectures for image classification. Through this paper, we have shown advancements in CNN from LeNet-5 to the latest SENet model. We have discussed the model description and training details of each model. We have also drawn a comparison among those models.

Keywords—AlexNet, CapsNet, Convolutional Neural Network, Deep learning, DenseNet, Image classification, ResNet, SENet.

I. INTRODUCTION

Computer vision consists of different problems such as image classification, localization, segmentation and object detection. Among those, image classification can be considered the fundamental problem, and it forms the basis for other computer vision problems. Until the '90s, only traditional machine learning approaches were used to classify images, and the accuracy and scope of the classification task were bounded by several challenges, such as the hand-crafted feature extraction process. In recent years, the deep neural network (DNN), also known as deep learning [1][2], has been able to find complex structure in large data sets using the backpropagation [3] algorithm. Among DNNs, the convolutional neural network has demonstrated excellent achievement in problems of computer vision, especially in image classification.

The Convolutional Neural Network (CNN or ConvNet) is a special type of multi-layer neural network inspired by the mechanism of the optical system of living creatures. Hubel and Wiesel [4] discovered that animal visual cortex cells detect light in small receptive fields. Motivated by this work, in 1980 Kunihiko Fukushima introduced the neocognitron [5], a multi-layered neural network capable of recognizing visual patterns hierarchically through learning. This network is considered the theoretical inspiration for CNN. In 1990 LeCun et al. introduced the practical model of CNN [6][7] and developed LeNet-5 [8]. Training by the backpropagation [9] algorithm helped LeNet-5 recognize visual patterns from raw pixels directly without using any separate feature engineering mechanism. The fewer connections and parameters of CNN compared to conventional feedforward neural networks of similar network size also made model training easier. But at that time, in spite of these advantages, the performance of CNN on intricate problems such as classification of high-resolution images was limited by the lack of large training data, the lack of better regularization methods and inadequate computing power.

Nowadays we have larger datasets with millions of high-resolution labelled images of thousands of categories, like ImageNet [10] and LabelMe [11]. With the advent of powerful GPU machines and better regularization methods, CNN delivers outstanding performance on image classification tasks. In 2012 a large deep convolutional neural network called AlexNet [12], designed by Krizhevsky et al., showed excellent performance on the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) [13]. The success of AlexNet became the inspiration for different CNN models such as ZFNet [14], VGGNet [15], GoogLeNet [16], ResNet [17], DenseNet [18], CapsNet [19] and SENet [20] in the following years.

In this study, we have tried to give a review of the advancements of CNN in the area of image classification. We have given a general view of CNN architectures in Section II. Section III describes the architecture and training details of different models of CNN. In Section IV we have drawn a comparison between various CNN models. Finally, we have concluded our paper in Section V.

II. CONVOLUTIONAL NEURAL NETWORK

A typical CNN is composed of single or multiple blocks of convolution and sub-sampling layers, followed by one or more fully connected layers and an output layer, as shown in figure 1.

Fig. 1: Building block of a typical CNN

A. Convolutional Layer

The convolutional layer (conv layer) is the central part of a CNN. Images are generally stationary in nature, which means that the formation of one part of the image is the same as that of any other part. So, a feature learnt in one region can match a similar

pattern in another region. In a large image, we take a small section and pass it over all the points in the large image (input). While passing over any point, we convolve it into a single position (output). Each small section of the image that passes over the large image is called a filter (kernel). The filters are later configured based on the backpropagation technique. Figure 2 shows a typical convolutional operation.

Fig. 2: Convolutional Layer
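To make the sliding-filter picture concrete, here is a minimal sketch assuming Python with the PyTorch library (the paper itself uses no particular framework). Six 5 × 5 filters are applied to a 32 × 32 single-channel image; the shapes happen to match the C1 layer of LeNet-5 described in Section III.

import torch
import torch.nn as nn

# Six learnable 5x5 filters slide over every position of a 1-channel input.
conv = nn.Conv2d(in_channels=1, out_channels=6, kernel_size=5, stride=1)
x = torch.randn(1, 1, 32, 32)   # (batch, channels, height, width)
y = conv(x)
print(y.shape)                  # torch.Size([1, 6, 28, 28]) -- one 28x28 map per filter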
B. Sub-sampling or Pooling Layer

Pooling simply means down-sampling of an image. It takes a small region of the convolutional output as input and sub-samples it to produce a single output. Different pooling techniques exist, such as max pooling, mean pooling and average pooling. Max pooling takes the largest of the pixel values of a region, as shown in figure 3. Pooling reduces the number of parameters to be computed and makes the network invariant to translations in shape, size and scale.

Fig. 3: Max Pooling operation
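As an illustration of the operation in figure 3, a small sketch (again assuming PyTorch; the input values are made up) shows 2 × 2 max pooling keeping one value per region.

import torch
import torch.nn as nn

pool = nn.MaxPool2d(kernel_size=2, stride=2)   # keep the largest value of each 2x2 region
x = torch.tensor([[[[1., 3., 2., 4.],
                    [5., 6., 1., 2.],
                    [7., 2., 9., 1.],
                    [3., 4., 6., 8.]]]])
print(pool(x))   # tensor([[[[6., 4.], [7., 9.]]]])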
C. Fully-connected Layer (FC Layer)

The last section of a CNN is basically composed of fully connected layers, as depicted in figure 4. Such a layer takes input from all the neurons in the previous layer and performs an operation with each individual neuron of the current layer to generate the output.

Fig. 4: Fully-connected layer
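A minimal sketch of such a layer (assuming PyTorch): the feature maps are flattened and every resulting activation connects to every output neuron, here with the 5 × 5 × 16 to 84 sizes that appear later in Table I.

import torch
import torch.nn as nn

fc = nn.Sequential(
    nn.Flatten(),               # 16 maps of 5x5 -> a 400-dimensional vector
    nn.Linear(5 * 5 * 16, 84),  # each of the 400 inputs feeds all 84 neurons
)
print(fc(torch.randn(1, 16, 5, 5)).shape)   # torch.Size([1, 84])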
III. DIFFERENT MODELS OF CNN FOR IMAGE CLASSIFICATION

A. LeNet-5 (1998)

In 1998 LeCun et al. introduced the CNN to classify handwritten digits. Their CNN model, called LeNet-5 [8] and shown in figure 5, has 7 weighted (trainable) layers. Among them are three convolutional layers (C1, C3, C5), two average pooling layers (S2, S4), one fully connected layer (F6) and one output layer. The sigmoid function was used to include non-linearity before a pooling operation. The output layer used Euclidean Radial Basis Function (RBF) units [21] to classify the 10 digits.

Fig. 5: Architecture of LeNet-5 [8]
In table I we have shown the different layers, the size of the filter used in each convolution layer, the output feature map size and the total number of parameters required per layer of LeNet-5.

TABLE I: Architecture of LeNet-5

Layer                 filter size/stride   # filters   output size    # parameters
Convolution (C1)      5 × 5 / 1            6           28 × 28 × 6    156
Sub-sampling (S2)     2 × 2 / 2            -           14 × 14 × 6    12
Convolution (C3)      5 × 5 / 1            16          10 × 10 × 16   1516
Sub-sampling (S4)     2 × 2 / 2            -           5 × 5 × 16     32
Convolution (C5)      5 × 5                120         1 × 1 × 120    48120
Fully connected (F6)  -                    -           84             10164
Output                -                    -           10             -
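The whole network of Table I can be approximated in a few lines. This is a sketch assuming PyTorch; we substitute a plain linear output layer for the original Euclidean RBF units, so it is not an exact reproduction of LeNet-5.

import torch
import torch.nn as nn

lenet5 = nn.Sequential(
    nn.Conv2d(1, 6, kernel_size=5), nn.Sigmoid(),     # C1: 32x32x1 -> 28x28x6
    nn.AvgPool2d(2, stride=2),                        # S2: -> 14x14x6
    nn.Conv2d(6, 16, kernel_size=5), nn.Sigmoid(),    # C3: -> 10x10x16
    nn.AvgPool2d(2, stride=2),                        # S4: -> 5x5x16
    nn.Conv2d(16, 120, kernel_size=5), nn.Sigmoid(),  # C5: -> 1x1x120
    nn.Flatten(),
    nn.Linear(120, 84), nn.Sigmoid(),                 # F6
    nn.Linear(84, 10),                                # stand-in for the RBF output layer
)
print(lenet5(torch.randn(1, 1, 32, 32)).shape)        # torch.Size([1, 10])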
1) Dataset used: To train and test LeNet-5, LeCun et al. used the MNIST [22] database of handwritten digits. The database contains 60k training and 10k test images. The input image size of this model is basically 32 × 32 pixels, which is larger than the largest character in the database (20 × 20 pixels), as the center part of the receptive field is rich in features. Input images are size-normalized and centred in a 28 × 28 field. They have used data augmentation like horizontal translation, vertical translation, scaling, squeezing and horizontal shearing.

2) Training Details: The authors trained several versions of LeNet-5 using the stochastic gradient descent (SGD) [23] approach with 20 iterations over the entire training data per session, a decreasing global learning rate and a momentum of 0.02. In the 1990s LeNet-5 was sufficiently good: LeNet-5 and LeNet-5 (with distortion) achieved test error rates of 0.95% and 0.8% respectively on the MNIST data set.

But as the amount of data, the resolution of images and the number of classes of classification problems increased with time, deeper convolutional networks and powerful GPU machines were needed to train the models.

B. AlexNet (2012)

In 2012 Krizhevsky et al. designed a large deep CNN, called AlexNet [12], to classify ImageNet [10] data. The architecture of AlexNet is the same as that of LeNet-5 but much bigger. It is made up of 8 trainable layers. Among them, 5 convolutional

layers (conv layers) and 3 fully connected layers are present. Using the rectified linear unit (ReLU) [24] non-linearity after the convolutional and FC layers helped their model to be trained faster than similar networks with tanh units. They have used local response normalization (LRN), called "brightness normalization", after the first and second convolutional layers, which aids generalization. They have used a max-pooling layer after each LRN layer and after the fifth convolutional layer. In figure 6 the architectural details of AlexNet are shown, and in table II we have shown the different elements of AlexNet.

Fig. 6: Architecture of AlexNet [12]
TABLE II: Details of different layers of AlexNet

Layer    filter size/stride   padding   # filters   output size     # parameters
Conv-1   11 × 11 / 4          0         96          55 × 55 × 96    34848
pool-1   3 × 3 / 2            -         -           27 × 27 × 96    -
Conv-2   5 × 5 / 1            2         256         27 × 27 × 256   614400
pool-2   3 × 3 / 2            -         -           13 × 13 × 256   -
Conv-3   3 × 3 / 1            1         384         13 × 13 × 384   981504
Conv-4   3 × 3 / 1            1         384         13 × 13 × 384   1327104
Conv-5   3 × 3 / 1            1         256         13 × 13 × 256   884736
pool-3   3 × 3 / 2            -         -           6 × 6 × 256     -
FC6      -                    -         -           1 × 1 × 4096    37748736
FC7      -                    -         -           1 × 1 × 4096    16777216
FC8      -                    -         -           1 × 1 × 1000    4096000
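Several rows of the table can be reproduced by hand, which suggests that, at least for these rows, the listed counts follow a filters-only convention with bias terms excluded (this reading of the table is ours).

# Conv-1: 96 filters of size 11x11 spanning 3 RGB channels
print(11 * 11 * 3 * 96)     # 34848
# FC6: a dense map from the 6x6x256 pooled volume to 4096 units
print(6 * 6 * 256 * 4096)   # 37748736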

1) Dataset used: Krizhevsky et al. designed AlexNet for the classification of 1.2 million high-resolution images of 1000 classes for ILSVRC-2010 and ILSVRC-2012 [25]. There are around 1.2 million/50K/150K training/validation/testing images. On ILSVRC, competitors submit two kinds of error rates: top-1 and top-5.

2) Training Details: From the variable-resolution images of ImageNet, AlexNet used down-sampled and centred 256 × 256 pixel images. To reduce overfitting they have used runtime data augmentation as well as a regularization method called dropout [26]. In data augmentation, they have extracted translated and horizontally reflected 10 random patches of 224 × 224 images and also used principal component analysis (PCA) [27] for RGB channel shifting of training images. The authors trained AlexNet using stochastic gradient descent (SGD) with a batch size of 128, weight decay of 0.0005 and momentum of 0.9. The weight decay works as a regularizer and it also reduces the training error. Their initial learning rate of 0.01 was reduced manually three times by 1/10 when the validation accuracy plateaued. AlexNet was trained on two NVIDIA GTX-580 3 GB GPUs using cross-GPU parallelization for five to six days.

The authors have noticed that removing any middle layer degrades the network's performance, so the result depends on the depth of the network. Also, they have used a purely supervised learning approach to simplify the experiment, but they expected that unsupervised pre-training would help, given adequate computational power to remarkably increase the network size without increasing the amount of the corresponding labelled data.
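The reported optimizer settings translate directly into a modern framework. The following is a sketch assuming PyTorch, with a hypothetical placeholder model standing in for AlexNet.

import torch
import torch.nn as nn

model = nn.Linear(10, 2)   # placeholder; any model's parameters work the same way
optimizer = torch.optim.SGD(model.parameters(),
                            lr=0.01, momentum=0.9, weight_decay=0.0005)
# Mimic the manual schedule: divide the learning rate by 10 when the
# validation accuracy (mode="max") stops improving.
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, mode="max", factor=0.1)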
C. ZFNet

In 2014 Zeiler and Fergus presented a CNN called ZFNet [14]. The architectures of AlexNet and ZFNet are almost similar, except that the authors reduced the first-layer filter size to 7 × 7 instead of 11 × 11 and used stride 2 convolutional layers in both the first and second layers to retain more information in those layers' features. In their paper, the authors tried to explain the reason behind the outstanding performance of large deep CNNs. They have used a novel visualization technique, a deconvolutional network called deconvnet [28], to map activations at higher layers back to the space of input pixels and recognize which pixels of the input layer are accountable for a given activation in the feature map. Basically, a deconvnet is a reversely ordered convnet. It accepts a feature map as input and applies unpooling using a switch. A switch is basically the position of the maximum within a pooling region, recorded during convolution. Then they rectify it using ReLU non-linearity and use transposed versions of the filters to rebuild the activity in the layer below which activated the chosen activation.

Fig. 7: Architecture of ZFNet [14]

1) Training Details: ZFNet used the ImageNet dataset of 1.3 million/50k/100k training/validation/testing images. The authors trained their model following [12]. The slight difference is that they substituted the sparse connections of layers 3, 4 and 5 of AlexNet with dense connections in their model and trained it on a single GTX-580 GPU for 12 days with 70 epochs. They have also experimented with their model with different depths and different filter sizes on the Caltech-101 [29], Caltech-256 [30] and PASCAL-2012 [31] data sets and shown that their model also generalizes well to these datasets.

During training, their visualization technique discovers different properties of CNN. The projections from each layer, in ascending order, show that the nature of the features in the network is hierarchical. For this reason, firstly, the upper layers need a higher number of epochs than the lower layers to converge, and secondly, the network output is stable to translation and scaling. They have used a set of occlusion experiments to check whether the model is sensitive to local or global information.
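The "switch" idea has a direct analogue in PyTorch's pooling API, which the sketch below uses only to illustrate the mechanism; this is not the authors' implementation.

import torch
import torch.nn as nn

pool = nn.MaxPool2d(2, stride=2, return_indices=True)   # records the "switches"
unpool = nn.MaxUnpool2d(2, stride=2)

x = torch.randn(1, 1, 4, 4)
y, switches = pool(x)         # switches: where each maximum came from
x_hat = unpool(y, switches)   # maxima restored to their positions, the rest zero
print(x_hat.shape)            # torch.Size([1, 1, 4, 4])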

D. VGGNet

Simonyan and Zisserman used a deeper configuration of AlexNet [12], and they proposed it as VGGNet [15]. They have used small filters of size 3 × 3 for all layers and made the network deeper, keeping the other parameters fixed. They have used a total of 6 different CNN configurations: A, A-LRN, B, C, D (VGG16) and E (VGG19) with 11, 11, 13, 16, 16 and 19 weighted layers respectively. Figure 8 shows the configuration of model D.

Fig. 8: Architecture of VGGNet (configuration D, VGG16)

The authors have used three 1 × 1 filters in the sixth, ninth and twelfth convolution layers in model C to increase non-linearity. Also, a stack of three 3 × 3 convolution layers (with stride 1) has the same effective receptive field as one 7 × 7 convolution layer. So they substituted a single 7 × 7 layer with a stack of three 3 × 3 convolution layers; this change increases non-linearity and decreases the number of parameters of the network.
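The saving is easy to verify: for C input and output channels (biases ignored), three 3 × 3 layers cost 27C² weights against 49C² for one 7 × 7 layer. A quick check, with an illustrative channel count of our choosing:

C = 256                            # illustrative channel count
stack_3x3 = 3 * (3 * 3 * C * C)    # three stacked 3x3 conv layers: 27*C^2
single_7x7 = 7 * 7 * C * C         # one 7x7 conv layer:            49*C^2
print(stack_3x3, single_7x7)       # 1769472 3211264 -- about 45% fewer weights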
1) Training Details: The training procedure of VGGNet follows AlexNet, except for the cropping and scaling sizes of the input images for training and testing. Pre-initialization of certain layers and the use of small filters helped their model converge after 74 epochs in spite of the large number of parameters and the greater depth. They trained configuration A with random initialisation. Then, using its first 4 convolution layers and last 3 FC layers as pre-initialised layers, they gradually increased the number of weighted layers up to 19 and trained configurations A-LRN to E. They randomly cropped images to 224 × 224 from isotropically rescaled training images. They performed horizontal flipping, random RGB colour shifting and scale jittering as data augmentation techniques. Scale jittering in the train/test phase and the blending of cropped (multi-crop) and uncropped (dense) test images result in better accuracy.

The authors found that a deep network with small filters performs better than a shallower one with larger filters. So the depth of the network is important in visual representation.
E. GoogLeNet

The architecture of GoogLeNet [16], proposed by Szegedy et al., is different from conventional CNNs. They increased the number of units in each layer using parallel filters, called the inception module [32], of size 1 × 1, 3 × 3 and 5 × 5 in each convolution layer (conv layer). They also increased the depth to 22 layers. Figure 10 shows the 22-layer GoogLeNet. While designing this model, they considered the computational budget fixed, so that the model can be used in mobile and embedded systems. They have used a series of weighted Gabor filters [33] of various sizes in the inception architecture to handle multiple scales. To make the architecture computationally efficient, they used the inception module with dimensionality reduction instead of the naive version of the inception module. Figures 9a and 9b show both inception modules. Despite its 22 layers, the number of parameters used in GoogLeNet is 12 times smaller than in AlexNet, but its accuracy is significantly better. All the convolution, reduction and projection layers use ReLU non-linearity. They used an average pooling layer instead of the fully connected layers. On top of some inception modules, they used auxiliary classifiers, which are basically smaller CNNs, to combat the vanishing gradient problem and overfitting.

Fig. 9: (a) Naive version Inception Module (b) Dimensionality reduction version [16]

Fig. 10: The architecture of GoogLeNet [16]

1) Training Details: GoogLeNet, a CPU-based implementation, was trained using the DistBelief [34] distributed machine learning system with a moderate amount of model and data parallelization. They used asynchronous SGD with momentum 0.9 and a constant learning rate schedule. Using different sampling and random ordering of input images, they trained an ensemble of 7 GoogLeNets with the same initialization. Unlike AlexNet, they used resized images at 4 scales with the shorter dimension being 256, 288, 320 and 352 respectively. The total number of crops per image is 4 (scales) × 3 (left, right and centre square per scale) × 6 (4 corners and the centre 224 × 224 crop, plus the square resized to 224 × 224) × 2 (mirror images of all six crops) = 144.

The results of the inception architecture have proved that moving towards sparser architectures is a realistic and competent idea.
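A sketch of a dimensionality-reduction inception block in the spirit of figure 9b (assuming PyTorch; the branch widths here are illustrative choices, not GoogLeNet's exact configuration):

import torch
import torch.nn as nn

class Inception(nn.Module):
    def __init__(self, in_ch):
        super().__init__()
        self.b1 = nn.Conv2d(in_ch, 64, 1)                          # 1x1 branch
        self.b3 = nn.Sequential(nn.Conv2d(in_ch, 96, 1), nn.ReLU(),
                                nn.Conv2d(96, 128, 3, padding=1))  # 1x1 reduce, then 3x3
        self.b5 = nn.Sequential(nn.Conv2d(in_ch, 16, 1), nn.ReLU(),
                                nn.Conv2d(16, 32, 5, padding=2))   # 1x1 reduce, then 5x5
        self.bp = nn.Sequential(nn.MaxPool2d(3, stride=1, padding=1),
                                nn.Conv2d(in_ch, 32, 1))           # pool, then 1x1 projection
    def forward(self, x):
        # All branches keep the spatial size, so their outputs concatenate by channel.
        return torch.cat([self.b1(x), self.b3(x), self.b5(x), self.bp(x)], dim=1)

x = torch.randn(1, 192, 28, 28)
print(Inception(192)(x).shape)   # torch.Size([1, 256, 28, 28])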
F. ResNet

He et al. found that a deeper CNN stacked up with more layers suffers from the vanishing gradient problem. Though this problem is handled by normalized and intermediate initialization, the deeper model shows worse performance on both train and test errors, and this is not caused by overfitting. This indicates that the optimization of a deeper network is hard. To solve this problem, the authors used a pre-trained shallower model


with additional layers to perform identity mapping, so that the performance of the deeper network and the shallower network should be similar. They have proposed the deep residual learning framework [17] as a solution to this degradation problem. They have included the residual mapping (H(x) = F(x) + x) instead of the desired underlying mapping (H(x)) in their network and named their model ResNet [17].

Fig. 11: (a) Plain layer (b) Residual block [17]

The ResNet architecture consists of stacked residual blocks of 3 × 3 convolutional layers. They have periodically doubled the number of filters and used a stride of 2. Figures 11a and 11b show a plain layer and a residual block. As a first layer, they have used a 7 × 7 conv layer. They have not used any fully connected layers at the end. They have used ResNets of different depths (34, 50, 101 and 152) in the ILSVRC-2015 competition. For the CNNs with depth more than 50, they have used a 'bottleneck' layer for dimensionality reduction and to improve efficiency, as in GoogLeNet. Their bottleneck design consists of 1 × 1, 3 × 3 and 1 × 1 convolution layers. Although the 152-layer ResNet is 8 times deeper than the VGG nets, it has lower complexity than the VGG nets (16/19).
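A basic residual block in the spirit of figure 11b (a sketch assuming PyTorch): the stacked layers learn F(x), and the identity shortcut adds x back so the block outputs H(x) = F(x) + x.

import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    def __init__(self, ch):
        super().__init__()
        self.f = nn.Sequential(                       # F(x): two 3x3 conv layers
            nn.Conv2d(ch, ch, 3, padding=1, bias=False), nn.BatchNorm2d(ch), nn.ReLU(),
            nn.Conv2d(ch, ch, 3, padding=1, bias=False), nn.BatchNorm2d(ch),
        )
        self.relu = nn.ReLU()
    def forward(self, x):
        return self.relu(self.f(x) + x)               # H(x) = F(x) + x

x = torch.randn(1, 64, 56, 56)
print(ResidualBlock(64)(x).shape)                     # torch.Size([1, 64, 56, 56])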
1) Training Details: To train ResNet, He et al. used SGD with a batch size of 128, weight decay of 0.0001 and momentum of 0.9. They used a learning rate of 0.1, reduced manually by 1/10 two times, at 32k and 48k iterations, when the validation accuracy plateaued, and stopped training at 64k iterations. They used weight initialization and Batch Normalization after every conv layer. They did not use the dropout regularization method.

The experiments on ResNet show the ability to train deeper networks without degrading the performance. The authors have also shown that with increased depth, ResNet is easier to optimize and gains accuracy.

G. DenseNet

Huang et al. introduced Dense Convolutional Networks (DenseNet) [18], which include dense blocks in a conventional CNN. The input of a certain layer in a dense block is the concatenation of the outputs of all the previous layers, as shown in figure 12. Here, each layer reuses the features of all previous layers, strengthening feature propagation and reducing the vanishing gradient problem. The use of a small number of filters also reduced the number of parameters.

Fig. 12: A typical dense block with 5 layers [18]

Fig. 13: A DenseNet with 3 dense blocks

Figure 13 shows a DenseNet with three dense blocks. In a dense block, the non-linear transformation function is a composite function of batch normalization, ReLU and a 3 × 3 convolution operation. They have also used a 1 × 1 bottleneck layer to reduce dimensionality.
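The concatenation pattern can be sketched as follows (assuming PyTorch; the growth rate k and the layer count are illustrative values, not DenseNet's published configuration):

import torch
import torch.nn as nn

class DenseBlock(nn.Module):
    def __init__(self, in_ch, k=12, n_layers=4):
        super().__init__()
        self.layers = nn.ModuleList([
            nn.Sequential(nn.BatchNorm2d(in_ch + i * k), nn.ReLU(),
                          nn.Conv2d(in_ch + i * k, k, 3, padding=1))
            for i in range(n_layers)])
    def forward(self, x):
        for layer in self.layers:
            x = torch.cat([x, layer(x)], dim=1)   # every layer sees all earlier features
        return x

x = torch.randn(1, 16, 32, 32)
print(DenseBlock(16)(x).shape)   # torch.Size([1, 64, 32, 32]): 16 + 4*12 channels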

1) Training Details: Huang et al. trained DenseNet on the CIFAR [35], SVHN [36] and ImageNet datasets using SGD, with batch size 64 on both the CIFAR and SVHN datasets and batch size 256 on the ImageNet dataset. The initial learning rate was 0.1 and was decreased two times by 1/10. They used a weight decay of 0.0001, Nesterov momentum [37] of 0.9 and dropout of 0.2.

On the C10 [38], C100 [39] and SVHN datasets, DenseNet and DenseNet-BC outperform the error rates of previous CNN architectures. A DenseNet twice as deep as a ResNet gives similar accuracy on the ImageNet dataset with far fewer (by a factor of 2) parameters. The authors found that DenseNet can be scaled to hundreds of layers without optimization difficulty. It also gives consistent improvement as the number of parameters increases, without degrading performance or overfitting. Also, it requires comparatively fewer parameters and less computational power for better performance.
H. CapsNet

Conventional CNNs, described above, suffer from two problems. Firstly, sub-sampling loses the spatial information between higher-level features. Secondly, such networks face difficulty in generalizing to novel viewpoints: they can deal with translation but cannot detect other dimensions of affine transformation. In 2017, Geoffrey E. Hinton and his team proposed CapsNet [19] to handle these problems. CapsNet has components called capsules. A capsule is a group of neurons, so a layer of CapsNet is basically composed of nested neurons. Unlike in a typical neural network, a capsule is squashed as a whole vector rather than each output unit being squashed individually, so the scalar output feature detectors of CNN are replaced by vector output capsules. Also, max-pooling is replaced by "dynamic routing by agreement", which makes each capsule in each layer route its output to the most relevant capsules of the next layer at the time of forward propagation. The architecture of a simple CapsNet is shown in figure 14.

Fig. 14: A 3 layer CapsNet, used for handwritten digit recognition [19]
The CapsNet, proposed by Sabour et al., is composed of three layers: two conv layers and one FC layer. The first conv layer consists of 256 convolutional units (CU) with 9 × 9 kernels of stride 1 and uses ReLU as the activation function. This layer detects local features and then sends them to the primary capsules of the second layer as input. Each primary capsule contains 8 CUs with a 9 × 9 kernel of stride 2. In total, the primary capsule layer has 32 × 6 × 6 8D capsules. The final layer (DigitCaps) has one 16D capsule per digit class. The authors have used routing between the primary capsule layer and the DigitCaps layer. As the first convolutional layer is a 1D layer, no routing is used between this layer and the primary capsule layer.

1) Training details: Training of CapsNet is performed on MNIST images. To compare the test accuracy, they have used one standard CNN (baseline) and two CapsNets with 1 and 3 routing iterations respectively. They have used reconstruction loss as a regularization method. Using a 3-layer CapsNet with 3 routing iterations and the added reconstruction loss, the authors get a test error of 0.25%.

Though CapsNet has shown outstanding performance on MNIST, it may not perform as well on large-scale image datasets like ImageNet. It may also suffer from the vanishing gradient problem.
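The whole-vector squashing mentioned above follows the formulation of [19], v = (‖s‖² / (1 + ‖s‖²)) · (s / ‖s‖); a sketch assuming PyTorch:

import torch

def squash(s, dim=-1, eps=1e-8):
    # Shrinks short vectors towards zero and long vectors towards unit length.
    sq_norm = (s ** 2).sum(dim=dim, keepdim=True)
    return (sq_norm / (1.0 + sq_norm)) * s / torch.sqrt(sq_norm + eps)

u = torch.randn(32 * 6 * 6, 8)           # the 32x6x6 8D primary capsules of the text
print(squash(u).norm(dim=-1).max())      # all squashed lengths stay below 1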
I. SENet

In 2017, Hu et al. designed the "Squeeze-and-Excitation network" (SENet) [20] and became the winners of ILSVRC-2017. They reduced the top-5 error rate to 2.25%. Their main contribution is the "Squeeze-and-Excitation" (SE) block, shown in figure 15. Here, F_tr : X → U is a convolutional operation. A squeeze function (F_sq) performs average pooling on each individual channel of the feature map U and produces a 1 × 1 × C dimensional channel descriptor. An excitation function (F_ex) is a self-gating mechanism made up of three layers: two fully connected layers with a ReLU non-linearity layer in between. It takes the squeezed output as input and produces per-channel modulation weights. By applying the excitation output to the feature map U, U is scaled (F_scale) to generate the final output (X̃) of the SE block.

Fig. 15: A Squeeze-and-Excitation block [20]

This SE block can be stacked together to make a SENet, which generalises across different data sets very well. The authors developed different SENets by including these blocks in several complex CNN models such as VGGNet [15], GoogLeNet [16], ResNeXt (a variant of ResNet) [40], Inception-ResNet [41], MobileNet [42] and ShuffleNet [43].

1) Training Details: The authors have trained and tested their model variants on ImageNet, CIFAR-10 and CIFAR-100. They have trained the original CNN models and those models with SE blocks, and compared the speed-accuracy trade-off. They have shown that their models outperform the original models at the cost of a small increase in training/testing time.
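A sketch of the SE block (assuming PyTorch; r is the reduction ratio of the two FC layers, set here to the default of 16 used in [20]):

import torch
import torch.nn as nn

class SEBlock(nn.Module):
    def __init__(self, ch, r=16):
        super().__init__()
        self.fc = nn.Sequential(nn.Linear(ch, ch // r), nn.ReLU(),   # excitation F_ex
                                nn.Linear(ch // r, ch), nn.Sigmoid())
    def forward(self, u):
        s = u.mean(dim=(2, 3))                     # squeeze F_sq: per-channel average
        w = self.fc(s)                             # per-channel modulation weights
        return u * w.view(u.size(0), -1, 1, 1)     # F_scale: reweight each channel of U

u = torch.randn(1, 64, 32, 32)
print(SEBlock(64)(u).shape)   # torch.Size([1, 64, 32, 32])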
IV. COMPARATIVE RESULT

In table III, we have shown the comparative performance of different CNNs (AlexNet to SENet) on the ImageNet dataset. Top-1 and top-5 error rates on the validation dataset and top-5 error rates on the test dataset are also shown.

TABLE III: Comparative performance of different CNN configurations. The + indicates DenseNet with bottleneck layer and compression (10-crop testing results). All models use the ImageNet dataset. Columns: type of CNN, # trained layers, top-1 (val), top-5 (val), top-5 (test).

AlexNet (2012)
  1 CNN                                                          8    40.7%   18.2%   -
  5 CNN                                                          -    38.1%   16.4%   16.4%
  1 CNN                                                          -    39.0%   16.6%   -
  7 CNN                                                          -    36.7%   15.4%   15.3%

ZFNet (2013)
  1 CNN                                                          8    38.4%   16.5%   -
  5 CNN (a)                                                      -    36.7%   15.3%   15.3%
  1 CNN with layers 3, 4, 5: 512, 1024, 512 maps (b)             -    37.5%   16.0%   16.1%
  6 CNN, combination of (a) & (b)                                -    36.0%   14.7%   14.8%

VGGNet (2014)
  ensemble of 7 ConvNets (3-D, 2-C & 2-E)                        -    24.7%   7.5%    7.3%
  ConvNet-D (multi-crop & dense)                                 16   24.4%   7.2%    -
  ConvNet-E (multi-crop & dense)                                 19   24.4%   7.1%    -
  ConvNet-E (multi-crop & dense)                                 19   24.4%   7.1%    7.0%
  ensemble of multi-scale ConvNets D & E (multi-crop & dense)    -    23.7%   6.8%    6.8%

GoogLeNet (2014)
  1 CNN with 1 crop                                              22   -       -       10.07%
  1 CNN with 10 crops                                            -    -       -       9.15%
  1 CNN with 144 crops                                           -    -       -       7.89%
  7 CNN with 1 crop                                              -    -       -       8.09%
  7 CNN with 10 crops                                            -    -       -       7.62%
  7 CNN with 144 crops                                           -    -       -       6.67%

ResNet (2015)
  plain layer                                                    18   27.94%  -       -
  ResNet-18                                                      18   27.88%  -       -
  plain layer                                                    34   28.54%  10.02%  -
  ResNet-34 (zero-padding shortcuts), 10-crop testing (a)        34   25.03%  7.76%   -
  ResNet-34 (projection shortcuts to increase dimension,
    others are identity shortcuts), 10-crop testing (b)          34   24.52%  7.46%   -
  ResNet-34 (all shortcuts are projection), 10-crop testing (c)  34   24.19%  7.40%   -
  ResNet-50 (with bottleneck layer), 10-crop testing             50   22.85%  6.71%   -
  ResNet-101 (with bottleneck layer), 10-crop testing            101  21.75%  6.05%   -
  ResNet-152 (with bottleneck layer), 10-crop testing            152  21.43%  5.71%   -
  1 ResNet-34 (b)                                                34   21.84%  5.71%   -
  1 ResNet-34 (c)                                                34   21.53%  5.60%   -
  1 ResNet-50                                                    50   20.74%  5.25%   -
  1 ResNet-101                                                   101  19.87%  4.60%   -
  1 ResNet-152                                                   152  19.38%  4.49%   -
  ensemble of 6 models                                           -    -       -       3.57%

DenseNet (2016)
  DenseNet-121+                                                  121  23.61%  6.66%   -
  DenseNet-169+                                                  169  22.80%  5.92%   -
  DenseNet-201+                                                  201  22.58%  5.54%   -
  DenseNet-264+                                                  264  20.80%  5.29%   -

SENet (2017)
  SE-ResNet-50                                                   50   23.29%  6.62%   -
  SE-ResNeXt-50                                                  50   21.10%  5.49%   -
  SENet-154 (crop size 320 × 320 / 299 × 299)                    -    17.28%  3.79%   -
  SENet-154 (crop size 320 × 320)                                -    16.88%  3.58%   -

V. CONCLUSION

In this study, we have discussed the advancements of CNN in image classification tasks. We have shown here that although AlexNet, ZFNet and VGGNet followed the architecture of a conventional CNN model such as LeNet-5, their networks are larger and deeper. We have seen that by combining inception modules and residual blocks with the conventional CNN model, GoogLeNet and ResNet gained better accuracy than stacking the same building blocks again and again. DenseNet focused on feature reuse to strengthen feature propagation. Though CapsNet reached state-of-the-art achievement on MNIST, it is yet to perform as well as the previous CNNs on high-resolution image datasets such as ImageNet. The result of SENet on the ImageNet dataset gives us hope that it may turn out useful for other tasks which require strong discriminative features.

REFERENCES

[1] Y. LeCun, Y. Bengio, and G. Hinton, "Deep learning," Nature, vol. 521, no. 7553, pp. 436-444, May 2015.
[2] I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning. MIT Press, 2016, http://www.deeplearningbook.org.
[3] R. Hecht-Nielsen, "Theory of the backpropagation neural network," in International 1989 Joint Conference on Neural Networks, 1989, pp. 593-605, vol. 1.
[4] D. H. Hubel and T. N. Wiesel, "Receptive fields and functional architecture of monkey striate cortex," Journal of Physiology (London), vol. 195, pp. 215-243, 1968.
[5] K. Fukushima, "Neocognitron: A self-organizing neural network model for a mechanism of pattern recognition unaffected by shift in position," Biological Cybernetics, vol. 36, no. 4, pp. 193-202, Apr 1980. [Online]. Available: https://doi.org/10.1007/BF00344251
[6] Y. LeCun, B. Boser, J. S. Denker, D. Henderson, R. E. Howard, W. Hubbard, and L. D. Jackel, "Backpropagation applied to handwritten zip code recognition," Neural Computation, vol. 1, no. 4, pp. 541-551, Dec 1989.

[7] Y. LeCun, B. E. Boser, J. S. Denker, D. Henderson, R. E. Howard, W. E. Hubbard, and L. D. Jackel, "Handwritten digit recognition with a back-propagation network," in Advances in Neural Information Processing Systems 2, D. S. Touretzky, Ed. Morgan-Kaufmann, 1990, pp. 396-404. [Online]. Available: http://papers.nips.cc/paper/293-handwritten-digit-recognition-with-a-back-propagation-network.pdf
[8] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, "Gradient-based learning applied to document recognition," Proceedings of the IEEE, vol. 86, no. 11, pp. 2278-2324, Nov 1998.
[9] Y. Le Cun, "A theoretical framework for back-propagation," 1988.
[10] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, "ImageNet: A large-scale hierarchical image database," in CVPR, 2009.
[11] B. C. Russell, A. Torralba, K. P. Murphy, and W. T. Freeman, "LabelMe: A database and web-based tool for image annotation," International Journal of Computer Vision, vol. 77, no. 1, pp. 157-173, May 2008. [Online]. Available: https://doi.org/10.1007/s11263-007-0090-8
[12] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "ImageNet classification with deep convolutional neural networks," in Advances in Neural Information Processing Systems 25, F. Pereira, C. J. C. Burges, L. Bottou, and K. Q. Weinberger, Eds. Curran Associates, Inc., 2012, pp. 1097-1105. [Online]. Available: http://papers.nips.cc/paper/4824-imagenet-classification-with-deep-convolutional-neural-networks.pdf
[13] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei, "ImageNet large scale visual recognition challenge," Int. J. Comput. Vision, vol. 115, no. 3, pp. 211-252, Dec. 2015. [Online]. Available: http://dx.doi.org/10.1007/s11263-015-0816-y
[14] M. D. Zeiler and R. Fergus, "Visualizing and understanding convolutional networks," in Computer Vision - ECCV 2014, D. Fleet, T. Pajdla, B. Schiele, and T. Tuytelaars, Eds. Cham: Springer International Publishing, 2014, pp. 818-833.
[15] K. Simonyan and A. Zisserman, "Very deep convolutional networks for large-scale image recognition," CoRR, vol. abs/1409.1556, 2014. [Online]. Available: http://arxiv.org/abs/1409.1556
[16] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich, "Going deeper with convolutions," in The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2015.
[17] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2016.
[18] G. Huang, Z. Liu, and K. Q. Weinberger, "Densely connected convolutional networks," CoRR, vol. abs/1608.06993, 2016. [Online]. Available: http://arxiv.org/abs/1608.06993
[19] S. Sabour, N. Frosst, and G. E. Hinton, "Dynamic routing between capsules," CoRR, vol. abs/1710.09829, 2017. [Online]. Available: http://arxiv.org/abs/1710.09829
[20] J. Hu, L. Shen, and G. Sun, "Squeeze-and-excitation networks," CoRR, vol. abs/1709.01507, 2017. [Online]. Available: http://arxiv.org/abs/1709.01507
[21] M. D. Buhmann, "Radial basis functions," Acta Numerica, vol. 9, pp. 1-38, 2000.
[22] Y. LeCun and C. Cortes, "MNIST handwritten digit database," 2010. [Online]. Available: http://yann.lecun.com/exdb/mnist/
[23] L. Bottou, "Large-scale machine learning with stochastic gradient descent," in Proceedings of COMPSTAT'2010, Y. Lechevallier and G. Saporta, Eds. Heidelberg: Physica-Verlag HD, 2010, pp. 177-186.
[24] V. Nair and G. E. Hinton, "Rectified linear units improve restricted Boltzmann machines," in Proceedings of the 27th International Conference on Machine Learning, ser. ICML'10. USA: Omnipress, 2010, pp. 807-814. [Online]. Available: http://dl.acm.org/citation.cfm?id=3104322.3104425
[25] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei, "ImageNet Large Scale Visual Recognition Challenge," International Journal of Computer Vision (IJCV), vol. 115, no. 3, pp. 211-252, 2015.
[26] G. E. Hinton, N. Srivastava, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, "Improving neural networks by preventing co-adaptation of feature detectors," CoRR, vol. abs/1207.0580, 2012. [Online]. Available: http://arxiv.org/abs/1207.0580
[27] I. Jolliffe, Principal Component Analysis. Berlin, Heidelberg: Springer Berlin Heidelberg, 2011, pp. 1094-1096. [Online]. Available: https://doi.org/10.1007/978-3-642-04898-2_455
[28] M. D. Zeiler, G. W. Taylor, and R. Fergus, "Adaptive deconvolutional networks for mid and high level feature learning," in 2011 International Conference on Computer Vision, Nov 2011, pp. 2018-2025.
[29] L. Fei-Fei, R. Fergus, and P. Perona, "Learning generative visual models from few training examples: An incremental Bayesian approach tested on 101 object categories," in 2004 Conference on Computer Vision and Pattern Recognition Workshop, June 2004, pp. 178-178.
[30] G. Griffin, A. Holub, and P. Perona, "Caltech-256 image dataset," 2006. [Online]. Available: http://www.vision.caltech.edu/Image_Datasets/Caltech256/
[31] M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman, "Pascal visual object classes challenge 2012 (VOC2012) complete dataset."
[32] M. Lin, Q. Chen, and S. Yan, "Network in network," CoRR, vol. abs/1312.4400, 2013. [Online]. Available: http://arxiv.org/abs/1312.4400
[33] T. Serre, L. Wolf, S. Bileschi, M. Riesenhuber, and T. Poggio, "Robust object recognition with cortex-like mechanisms," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 29, no. 3, pp. 411-426, March 2007.
[34] J. Dean, G. Corrado, R. Monga, K. Chen, M. Devin, M. Mao, M. Ranzato, A. Senior, P. Tucker, K. Yang, Q. V. Le, and A. Y. Ng, "Large scale distributed deep networks," in Advances in Neural Information Processing Systems 25, F. Pereira, C. J. C. Burges, L. Bottou, and K. Q. Weinberger, Eds. Curran Associates, Inc., 2012, pp. 1223-1231. [Online]. Available: http://papers.nips.cc/paper/4687-large-scale-distributed-deep-networks.pdf
[35] A. Krizhevsky and G. Hinton, "Learning multiple layers of features from tiny images," Tech. Rep., 2009.
[36] Y. Netzer, T. Wang, A. Coates, A. Bissacco, B. Wu, and A. Y. Ng, "Reading digits in natural images with unsupervised feature learning," in NIPS Workshop on Deep Learning and Unsupervised Feature Learning 2011, 2011. [Online]. Available: http://ufldl.stanford.edu/housenumbers/nips2011_housenumbers.pdf
[37] I. Sutskever, J. Martens, G. Dahl, and G. Hinton, "On the importance of initialization and momentum in deep learning," in Proceedings of the 30th International Conference on Machine Learning, ser. Proceedings of Machine Learning Research, S. Dasgupta and D. McAllester, Eds., vol. 28, no. 3. Atlanta, Georgia, USA: PMLR, 17-19 Jun 2013, pp. 1139-1147. [Online]. Available: http://proceedings.mlr.press/v28/sutskever13.html
[38] A. Krizhevsky, V. Nair, and G. Hinton, "CIFAR-10 (Canadian Institute for Advanced Research)." [Online]. Available: http://www.cs.toronto.edu/~kriz/cifar.html
[39] A. Krizhevsky, V. Nair, and G. E. Hinton, "CIFAR-100 (Canadian Institute for Advanced Research)." [Online]. Available: http://www.cs.toronto.edu/~kriz/cifar.html
[40] S. Xie, R. B. Girshick, P. Dollár, Z. Tu, and K. He, "Aggregated residual transformations for deep neural networks," CoRR, vol. abs/1611.05431, 2016. [Online]. Available: http://arxiv.org/abs/1611.05431
[41] C. Szegedy, S. Ioffe, and V. Vanhoucke, "Inception-v4, Inception-ResNet and the impact of residual connections on learning," CoRR, vol. abs/1602.07261, 2016. [Online]. Available: http://arxiv.org/abs/1602.07261
[42] A. G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang, T. Weyand, M. Andreetto, and H. Adam, "MobileNets: Efficient convolutional neural networks for mobile vision applications," CoRR, vol. abs/1704.04861, 2017. [Online]. Available: http://arxiv.org/abs/1704.04861
[43] X. Zhang, X. Zhou, M. Lin, and J. Sun, "ShuffleNet: An extremely efficient convolutional neural network for mobile devices," CoRR, vol. abs/1707.01083, 2017. [Online]. Available: http://arxiv.org/abs/1707.01083
