
Lightweight Deep Convolutional Network for Tiny Object Recognition

Thanh-Dat Truong1, Vinh-Tiep Nguyen2 and Minh-Triet Tran1


1 University of Science, Vietnam National University, HCMC, Vietnam
2 University of Information Technology, Vietnam National University, HCMC, Vietnam

Keywords: Object Recognition, Lightweight Deep Convolutional Neural Network, Tiny Images, Global Average Pooling.

Abstract: Object recognition is an important problem in Computer Vision with many applications such as image search, autonomous cars, and image understanding. In recent years, Convolutional Neural Network (CNN) based models have achieved great success in object recognition, especially VGG, ResNet, and Wide ResNet. However, these models involve a large number of parameters that must be trained with large-scale datasets on powerful computing systems. Thus, it is not appropriate to train a heavy CNN on a small-scale dataset with only thousands of samples, as it easily over-fits. Furthermore, it is not efficient to use an existing heavy CNN to recognize small images, such as those in CIFAR-10 or CIFAR-100. In this paper, we propose a Lightweight Deep Convolutional Neural Network architecture for tiny images, codenamed "DCTI", which significantly reduces the number of parameters for such datasets. Additionally, we use batch normalization to deal with the change in the distribution of each layer's inputs. To demonstrate the efficiency of the proposed method, we conduct experiments on two popular datasets: CIFAR-10 and CIFAR-100. The results show that the proposed network not only significantly reduces the number of parameters but also improves performance. Our method uses only 21.33% of the number of parameters of Wide ResNet, yet achieves up to 94.34% accuracy on CIFAR-10, compared to 96.11% for Wide ResNet. Our method also achieves an accuracy of 73.65% on CIFAR-100.

1 INTRODUCTION

Object recognition is one of the important tasks in computer vision; its objective is to automatically classify images into many classes. The result of image classification is an essential precondition of many tasks, such as image understanding and image search engines. Current approaches to image classification are based on machine learning.

Yann LeCun et al. proposed the Convolutional Neural Network (LeCun et al., 1989) in the early 1990s, demonstrating excellent performance on recognition tasks. Several papers have shown that CNNs can also deliver outstanding performance on more challenging visual recognition tasks: Ciresan et al. (2012) demonstrate state-of-the-art performance on the CIFAR-10 dataset. CNNs have recently enjoyed great success in large-scale image recognition, e.g., the architecture proposed by Krizhevsky et al. (2017a). Karen Simonyan and Andrew Zisserman later proposed an architecture (Simonyan and Zisserman, 2014) which improves on the original architecture of Krizhevsky.

In reality, objects captured by cameras are often small or even tiny. Object detection is the task of determining object locations in an image, and the detection pipeline involves an object recognition module: for each object proposal region, we need to recognize the object in that region, and some regions are quite small or tiny. For autonomous cars, detecting and recognizing objects from far away is really challenging because they appear very small.

In recent years, lifelogging has rapidly become a mainstream research topic. With the rich data captured over a long period of time, it requires both advanced methods that can provide insight into the activities of an individual and systems capable of managing this huge amount of data. In the lifelog scenario, several objects of small size will be present, and the object detection part is equally important in this scenario.

The common recent approach to tiny object recognition is to resize a small image into a larger one and use networks with the best performance, such as VGG (Simonyan and Zisserman, 2014), Inception (Szegedy et al., 2015), or ResNet (He et al., 2016a), to recognize it.


However, the computational cost of these methods is really large, so they cannot be employed in real time. Although a high-performance GPU can deal with this problem, GPUs are expensive and not suitable for small devices. Furthermore, resizing a tiny image into a large one does not add information to the image. Additionally, training a large network takes a lot of time and requires really powerful hardware.

A good solution for this problem is to keep the size of the image and to build a network with fewer parameters that still recognizes with high accuracy. On this basis, we propose a new method employing a very deep CNN, called Lightweight Deep Convolutional Network for Tiny Object Recognition (DCTI). Our proposed network has not only fewer parameters but also high performance on tiny images, with both good accuracy and minimal computational cost. Through experiments, we achieved good results that are effective for multiple purposes. This motivates us to continue developing our method and to build systems that make use of object recognition, such as image understanding systems and image search engine systems.

Contributions. In our work, we consider tiny images of size 32 × 32. We focus on exploiting local features with small convolutional filter sizes. Therefore, we use 3 × 3 convolutional filters: they fit tiny images, help extract local features, reduce parameters, and allow the network to go deeper.

In traditional approaches, the last layers use fully connected layers to turn feature maps into feature vectors. However, this adds more parameters and leads to over-fitting. Our network instead uses global average pooling (Lin et al., 2013). The purpose is to let the network directly project significant feature maps into the feature vectors. Additionally, global average pooling layers have no parameters, so over-fitting is avoided.

In deep networks, small changes can amplify layer by layer, changing the distribution of each layer's inputs. This problem is called Internal Covariate Shift. To tackle it, we use Batch Normalization, proposed by Ioffe et al. (Ioffe and Szegedy, 2015). Once again, through experiments, we show that batch normalization is effective and also helps the network learn faster.

Additionally, to prevent over-fitting, we use dropout. Commonly, dropout is placed after fully connected layers, but in our network we place it after convolutional layers. Through experiments, this helps improve accuracy and avoid over-fitting.

We also use data augmentation and whitening to improve accuracy. Our method uses only 21.33% of the number of parameters of the state-of-the-art method (Zagoruyko and Komodakis, 2016), yet achieves accuracy up to 94.34% on CIFAR-10 and 73.65% on CIFAR-100. These results show that our method not only reaches high accuracy but also reduces parameters significantly.

The rest of this paper is organized as follows. Section 2 presents related works. The proposed architecture of our network is presented in Section 3. Section 4 presents our experimental configuration on CIFAR-10 and CIFAR-100. We compare our results to other methods in Section 5. Finally, Section 6 concludes the paper.

2 RELATED WORKS

An early method for object recognition, the Convolutional Neural Network, was proposed by Yann LeCun et al. (LeCun et al., 1989) and demonstrates high performance on the MNIST dataset. Many current architectures used for object recognition are based on Convolutional Neural Networks (Graham, 2014), (Krizhevsky et al., 2017a), (Zeiler and Fergus, 2013).

Very Deep Convolutional Neural Networks: a method proposed by Simonyan and Zisserman (2014), with good performance on the ImageNet dataset. Very deep convolutional neural networks have two main architectures, VGG-16 and VGG-19, meaning that they have 16 and 19 layers with parameters, respectively. The main contribution of the paper is a thorough evaluation of networks of increasing depth using an architecture with very small (3 × 3) convolution filters, which shows that a significant improvement on the prior-art configurations can be achieved by pushing the depth to 16-19 weight layers.

Network In Network: noticing the limitations of fully connected layers, a novel network structure called Network In Network (NIN) was proposed to enhance model discriminability for local receptive fields (Lin et al., 2013). Global average pooling is used in this network instead of fully connected layers, in order to reduce parameters and enforce correspondences between feature maps and categories. It was further improved by Batch-normalized Maxout and has good performance on the CIFAR-10 dataset (Chang and Chen, 2015). In our work, we also use the global average pooling approach.


Deep Residual Learning for Image Recognition: one of the limitations of networks with more layers is that the gradient vanishes in the back-propagation process. To avoid this problem, a method was proposed by Kaiming He et al. (He et al., 2016a). It presents a residual learning framework to ease the training of networks that are substantially deeper than those used previously. Deep residual networks get high performance on the ImageNet and CIFAR datasets.

Going Deeper with Convolutions: one of the important decisions when designing a network is selecting the kernels for each layer. Should the kernel size of a convolutional layer be 1 × 1, 3 × 3, or 5 × 5? To solve this problem, a group of Google's authors proposed a method codenamed Inception that achieved the new state-of-the-art for classification and detection in the ImageNet Large-Scale Visual Recognition Challenge 2014 (Szegedy et al., 2015). The idea of the method is to use all three kernel sizes in each convolutional layer. By a carefully crafted design, they increase the depth and width of the network while keeping the computational budget constant.

Deep Networks with Internal Selective Attention through Feedback Connections: traditional CNNs are stationary and feed-forward. They neither change their parameters during evaluation nor use feedback from higher to lower layers. Real brains, however, do. So does the Deep Attention Selective Network (DasNet) architecture. DasNet's feedback structure can dynamically alter its convolutional filter sensitivities during classification. It harnesses the power of sequential processing to improve classification performance, by allowing the network to iteratively focus its internal attention on some of its convolutional filters (Stollenga et al., 2014).

Recurrent Convolutional Neural Network for Object Recognition: a prominent difference is that a CNN is typically a feed-forward architecture, while in the visual system recurrent connections are abundant. Inspired by this fact, the paper proposes a recurrent CNN (RCNN) for object recognition by incorporating recurrent connections into each convolutional layer (Liang and Hu, 2015).

3 PROPOSED ARCHITECTURE

3.1 Overall Architecture

DCTI has 5 phases of convolutional layers (see Figure 1). We use filters with a 3 × 3 receptive field for all convolutional layers. All hidden layers are equipped with the rectification (ReLU (Krizhevsky et al., 2017b)) non-linearity. We use dropout and batch normalization after each convolutional layer.

Figure 1: Overall architecture of Lightweight Deep Convolutional Network for Tiny Object Recognition (top for CIFAR-10, bottom for CIFAR-100).

Starting from the original image of size 32 × 32 with 3 color channels, we process multiple phases; after each phase, we use max pooling with pool size 2 × 2 to halve the size of the feature maps. The purposes are to reduce variance, reduce computational complexity (as 2 × 2 max pooling discards 75% of the data), and extract low-level features from a neighborhood. Through four max pooling layers, we obtain feature maps of size 2 × 2.

In the first phase, the input size is 32 × 32, so we want 5 × 5 convolutional filters to deal with detailed local features. Instead of using one convolutional layer with a 5 × 5 kernel, we use two convolutional layers with 3 × 3 kernels: stacking two 3 × 3 convolutional filters gives a receptive field equivalent to one 5 × 5 filter. This way, we reduce parameters and push the network deeper.

In the second phase, the input size is 16 × 16. We continue processing local features with an effective 5 × 5 filter over all feature maps, implemented as two convolutional layers with 3 × 3 kernels. As in the first phase, using two convolutional layers reduces parameters and increases the number of layers.


In the third phase, the input size is 8 × 8. We want to capture more global feature maps, so we use an effective 7 × 7 filter, implemented as three convolutional layers with 3 × 3 kernels, which is equivalent to one 7 × 7 layer. Incorporating three non-linear rectification layers instead of a single one makes the decision function more discriminative. Second, we use fewer parameters than a single convolutional layer with a 7 × 7 kernel. This can be seen as imposing a regularization on the 7 × 7 convolutional filters, forcing them to have a decomposition through the 3 × 3 filters (with non-linearity injected in between).

In the fourth and fifth phases, the input sizes are just 4 × 4 and 2 × 2. We continue dealing with global features, using an effective filter size of 5 × 5 for the fourth phase and 3 × 3 for the fifth; this guarantees that the convolutional filters still fit the feature maps. Again we use two 3 × 3 convolutional filters instead of one 5 × 5 filter. Finally, we obtain feature maps of size 2 × 2, which are used to directly produce the feature vectors.

At the end, we use a global average pooling layer to feed the feature maps directly into feature vectors. From the feature vectors, we apply a fully connected layer and softmax to calculate the probability of each class.
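The text above does not list exact channel widths, so the following PyTorch sketch is only an illustration of the described structure; the channel counts (64 to 512) and the phase-5 dropout rate are assumptions. Everything else follows the description: five phases of 3 × 3 convolutions, batch normalization and dropout after each convolution, 2 × 2 max pooling between phases, and global average pooling over the final 2 × 2 maps.

```python
import torch.nn as nn

class DCTI(nn.Module):
    def __init__(self, num_classes=10):
        super().__init__()
        # (in_ch, out_ch, number of 3x3 convs, dropout rate); widths are assumptions
        cfg = [(3, 64, 2, 0.3), (64, 128, 2, 0.3), (128, 256, 3, 0.4),
               (256, 512, 2, 0.4), (512, 512, 2, 0.4)]
        layers = []
        for i, (cin, cout, n, p) in enumerate(cfg):
            for j in range(n):
                layers += [nn.Conv2d(cin if j == 0 else cout, cout, 3, padding=1),
                           nn.BatchNorm2d(cout), nn.ReLU(inplace=True),
                           nn.Dropout2d(p)]
            if i < 4:                        # four 2x2 pools: 32 -> 16 -> 8 -> 4 -> 2
                layers.append(nn.MaxPool2d(2))
        self.features = nn.Sequential(*layers)
        self.drop = nn.Dropout(0.5)          # rate 0.5 on the 512-D feature vector
        self.classifier = nn.Linear(512, num_classes)

    def forward(self, x):
        x = self.features(x)                 # (N, 512, 2, 2)
        x = x.mean(dim=(2, 3))               # global average pooling -> (N, 512)
        return self.classifier(self.drop(x))  # softmax is applied inside the loss
```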
hood of vanishing gradient. But first recall the defini-
tion of a ReLUs is h = max (0, a).
3.2 Data Normalization One major benefit is the reduced likelihood of the
gradient to vanish. This arises when a > 0. In this
Since the range of values of raw data varies widely, regime, the gradient has a constant value. In contrast,
in some machine learning algorithms, objective func- the gradient of sigmoids becomes increasingly small
tions will not work properly without normalization. as the absolute value of x increases. The constant gra-
Another reason why data normalization is applied is dient of ReLUs results in faster learning.
that gradient descent converges much faster with data The other benefit of ReLUs is sparsity. Sparsity
normalization than without it. arises when a ≤ 0. The more such units that exist in
Data normalization makes the values of each fea- a layer the more sparse the resulting representation.
ture in the data have zero-mean (when subtracting Sigmoids, on the other hand, are always likely to gen-
the mean in the numerator) and unit-variance. This erate some non-zero value resulting in dense repre-
method is widely used for normalization in many sentations. Sparse representations seem to be more
machine learning algorithms (e.g., support vector beneficial than dense representations.
machines, logistic regression, and neural networks)
(Grus and Joel, 2015). This is typically done by cal- 3.5 Batch Normalization
culating standard scores. (Mohamad et al., 2013) The
general method of calculation is to determine the dis- Batch normalization potentially helps in two ways:
tribution mean and standard deviation for each fea- faster learning and higher overall accuracy. The im-
ture. Next, we subtract the mean from each fea- proved method also allows you to use a higher learn-
ture. Then we divide the values (mean is already sub- ing rate, potentially providing another boost in speed.
tracted) of each feature by its standard deviation. For very deep networks, small changes in the pre-
vious layers will amplify layer by layer and finally
3.3 Whitening Transformation cause some problem. The change in the distribu-
tion of layer inputs causes problems since the pa-
A whitening transformation or sphering transforma- rameter should adapt to new distribution with itera-
tion is a linear transformation that transforms a vector tions, which is called Internal Covariate Shift (Ioffe


To solve this problem, Google researcher Ioffe proposed Batch Normalization (Ioffe and Szegedy, 2015), which normalizes every layer's inputs; this makes the network converge much faster and at a lower error rate, and reduces over-fitting to some degree:

Y = γ · (X − µ) / σ + β

where µ is the mini-batch mean of X, σ^2 is the mini-batch variance, and γ, β are learnable parameters. In our experiments, we set γ = 1 and β = 0.
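A NumPy sketch of this forward normalization; the eps stability term is the usual addition and is not shown in the formula above.

```python
import numpy as np

def batch_norm_forward(X, gamma=1.0, beta=0.0, eps=1e-5):
    """Y = gamma * (X - mu) / sigma + beta over a mini-batch; X is (batch, features)."""
    mu = X.mean(axis=0)                   # mini-batch mean
    var = X.var(axis=0)                   # mini-batch variance (sigma^2)
    return gamma * (X - mu) / np.sqrt(var + eps) + beta
```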
3.6 Global Average Pooling

This method was proposed by Min Lin et al. (Lin et al., 2013) as a strategy to replace the traditional fully connected layers in a CNN. The idea is to generate one feature map for each corresponding category of the classification task in the last convolutional block. Instead of adding fully connected layers on top of the feature maps, they take the average of each feature map, and the resulting vector is fed directly into the softmax layer. One advantage of global average pooling over fully connected layers is that it is more native to the convolution structure, enforcing correspondences between feature maps and categories; the feature maps can thus easily be interpreted as category confidence maps. Another advantage is that there are no parameters to optimize in global average pooling, so over-fitting is avoided at this layer.

In our architecture, instead of feeding the maps directly into the softmax layer, we use global average pooling to extract 512-dimensional feature vectors. Global average pooling sums out the spatial information, so it is more robust to spatial translations of the input.
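A one-function NumPy sketch: each of the 512 final 2 × 2 maps is averaged to a single value, yielding the 512-dimensional feature vector with no learned parameters.

```python
import numpy as np

def global_average_pooling(feature_maps):
    """Average each map; feature_maps is (batch, channels, height, width)."""
    return feature_maps.mean(axis=(2, 3))   # e.g. (N, 512, 2, 2) -> (N, 512)
```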
3.7 Dropout

When a very deep CNN is used on a small dataset, it easily over-fits. The most common method for reducing over-fitting is dropout (Srivastava et al., 2014). The standard way of using dropout is to place it at the fully connected layers, where most parameters are. However, since our model is very deep and the dataset is relatively small, we use a different dropout setting: we set dropout for convolutional layers too. Specifically, we set the dropout rate to 0.3 for the first and second groups of convolutional layers and 0.4 for the third and fourth groups. The dropout rate for the 512-D feature vector layer was set to 0.5.
4 EXPERIMENTS

We evaluate our method on the CIFAR-10 dataset. For each image we obtain a 10-dimensional result vector, where each element represents the probability of the corresponding class. To rate our architecture, we use the cross-entropy cost function with L2 regularization as the objective:

L = −(1/N) Σ_{j=1}^{N} Σ_{i=1}^{10} [ t_i^j ln(y_i^j) + (1 − t_i^j) ln(1 − y_i^j) ] + (λ/2) Σ ‖W‖^2

where t^j is the target for the j-th input, y^j is the corresponding output, W are the parameters, and λ is a regularization parameter.
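A direct NumPy transcription of this objective; the eps log-guard is our addition for numerical safety.

```python
import numpy as np

def dcti_loss(y, t, weights, lam):
    """Cross-entropy over N samples and 10 classes plus L2 penalty.

    y: (N, 10) predicted probabilities; t: (N, 10) binary targets;
    weights: iterable of parameter arrays; lam: regularization strength.
    """
    N, eps = y.shape[0], 1e-12
    ce = -(t * np.log(y + eps) + (1 - t) * np.log(1 - y + eps)).sum() / N
    l2 = 0.5 * lam * sum((W ** 2).sum() for W in weights)
    return ce + l2
```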


We use stochastic gradient descent (SGD) to optimize the parameters. To reduce training time, we train on a GPU instead of a CPU, implementing the network in MatConvNet. The CIFAR-10 dataset has 60000 images, 6000 per class. We divide the dataset into a training set of 50000 images and a test set of 10000 images. The computer used for training has a Core i3 4150 CPU, 8 GB RAM, and a GTX 1060 6GB GPU; training takes about 10 hours.

We train the model for 500 epochs, using mini-batch SGD with a batch size of 64. The momentum, base learning rate, and weight decay are set to 0.9, 0.1, and 0.0005, respectively, and the learning rate is lowered every epoch by a factor of 0.9817.
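A small sketch of the stated exponential learning-rate schedule; the function name is ours.

```python
def learning_rate(epoch, base_lr=0.1, decay=0.9817):
    """Base learning rate 0.1, multiplied by 0.9817 after every epoch."""
    return base_lr * decay ** epoch

# learning_rate(0) == 0.1; by epoch 500 the rate has decayed to about 1e-5.
```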

To accelerate the training process and prevent over-fitting, we use data augmentation while training: we flip images left to right with probability 0.5 and randomly crop with a padding of 4 (padding the array with mirror reflections of itself). We also experiment on the CIFAR-100 dataset, using the same experimental configuration as for CIFAR-10.
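A NumPy sketch of this augmentation for a single (H, W, C) image; the crop-offset arithmetic is ours.

```python
import numpy as np

def augment(img, rng):
    """Random horizontal flip (p = 0.5) plus random crop with 4-pixel
    mirror-reflection padding, keeping the original image size."""
    if rng.random() < 0.5:
        img = img[:, ::-1, :]                        # flip left to right
    h, w = img.shape[:2]
    padded = np.pad(img, ((4, 4), (4, 4), (0, 0)), mode="reflect")
    top, left = rng.integers(0, 9, size=2)           # offsets in [0, 8]
    return padded[top:top + h, left:left + w, :]

# usage: augment(image, np.random.default_rng(0))
```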
In Figure 2, the red line represents training and the green line testing. In the first 250 epochs, the objective and accuracy converge quickly to a stable result; after that, the result does not change much. This shows that our method converges quickly to a stable result on a regular computer configuration and takes little time to train (500 epochs take 10 hours). The same behavior is observed when we train our method on CIFAR-100 (Figure 3).

Figure 2: Objective (top) and Accuracy (bottom) training plot (CIFAR-10).

Figure 3: Objective (top) and Accuracy (bottom) training plot (CIFAR-100).

5 RESULT

The accuracy we achieve on the CIFAR-10 test set is 94.34%. See some correct predictions in Figure 4: the input images are tiny, but our method classifies them correctly. In some cases, classes are similar, such as dog and cat or automobile and truck, which can lead our method to classify wrongly (see some wrongly predicted images in Figure 5). Looking at the confusion matrix (Table 1), dogs are easily mispredicted as cats, trucks as automobiles, and deer as cats: at an abstract feature level, there is not much difference between these classes.

Figure 4: Correctly predicted examples: (a) bird, (b) truck, (c) horse, (d) dog.

Figure 5: Some examples of wrongly predicted images: (a) cats wrongly predicted as dogs; (b) dogs wrongly predicted as cats; (c) trucks wrongly predicted as automobiles; (d) birds wrongly predicted as airplanes.

We compare our architecture to the VGG architecture. We use fewer parameters: the number of parameters of our architecture is about 7.6 million, while VGG-16 and VGG-19 have more than 100 million. Our result is even better than the modified VGG-16 applied to the CIFAR-10 dataset (Liu and Deng, 2015).

We also compare the accuracy of our method with other methods on CIFAR-10; see Table 2. Our method uses only 21.33% of the number of parameters of the state-of-the-art method but achieves accuracy up to 94.34%. Compared with other methods such as VGG, NiN, and DasNet, our method outperforms them. Using only about 7.6M parameters, we reduce parameters significantly and achieve convincing results.

We also compare our method with other methods on CIFAR-100; see Table 3. Our method performs better than DasNet and NiN. Our results are close to those of ResNet-1001, but we use fewer parameters. Although our result does not reach the state-of-the-art, our method reduces parameters significantly: specifically, it has 4.5 times fewer parameters than the state-of-the-art method, while still reaching good accuracy at low computational cost.


Table 1: Confusion matrix (CIFAR-10).

Table 2: Some results of the comparative experiments (CIFAR-10).

Method                                                    Accuracy  Params
Wide Residual Networks (Zagoruyko and Komodakis, 2016)    96.11%    36.5M
DCTI                                                      94.34%    7.63M
CIFAR-VGG-BN-DROPOUT (Liu and Deng, 2015)                 91.55%    14.7M
NiN (Lin et al., 2013)                                    91.20%    1.00M
DasNet (Stollenga et al., 2014)                           90.78%    1.00M
Table 3: Some results of the comparative experiments (CIFAR-100).

Method                                                    Accuracy  Params
Wide Residual Network (Zagoruyko and Komodakis, 2016)     81.15%    36.5M
ResNet 1001 (He et al., 2016b)                            77.29%    10.2M
DCTI                                                      73.65%    7.68M
DasNet (Stollenga et al., 2014)                           66.22%    1.00M
NiN (Lin et al., 2013)                                    64.32%    1.00M

To achieve this result, we use very small convolutional filter sizes to deal with local features and push the network deeper, so that it can learn high-level features. Furthermore, global average pooling helps reduce parameters significantly and is more native to the convolution structure, enforcing correspondences between feature maps and feature vectors. The new approach of putting dropout after convolutional layers helps improve accuracy, as the results we achieved show.

6 CONCLUSION

In our research, we proposed a new method for object recognition on tiny images. By using very small convolutional filters, we pushed our network deeper and dealt with local features; this also helped our network learn high-level features. By using global average pooling instead of fully connected layers, we reduced parameters significantly; moreover, it helped the network directly project significant feature maps into the feature vectors. Besides that, using batch normalization and dropout helped accelerate the learning process and prevent over-fitting; it also improved performance and reduced computational cost. Although we do not reach the state-of-the-art, the results we achieved show that our method is promising. It was also demonstrated that representation depth is beneficial for recognition, and that very deep models can fit small datasets as long as the input does not vanish as the model goes deeper. Our results yet again confirm the performance of very deep CNNs for the pattern recognition task. In the future, we will continue improving our method to achieve higher performance with fewer parameters.

REFERENCES

Chang, J. and Chen, Y. (2015). Batch-normalized maxout network in network. CoRR, abs/1511.02583.

Ciresan, D. C., Meier, U., and Schmidhuber, J. (2012). Multi-column deep neural networks for image classification. In CVPR, pages 3642–3649. IEEE Computer Society.

Graham, B. (2014). Fractional max-pooling. CoRR, abs/1412.6071.

Grus, J. (2015). Data Science from Scratch. O'Reilly, pp. 99–100. ISBN 978-1-491-90142-7.


He, K., Zhang, X., Ren, S., and Sun, J. (2016a). Deep resid-
ual learning for image recognition. In CVPR, pages
770–778. IEEE Computer Society.
He, K., Zhang, X., Ren, S., and Sun, J. (2016b). Identity
mappings in deep residual networks. In ECCV (4),
volume 9908 of Lecture Notes in Computer Science,
pages 630–645. Springer.
Ioffe, S. and Szegedy, C. (2015). Batch normalization: Ac-
celerating deep network training by reducing internal
covariate shift. In ICML, volume 37 of JMLR Work-
shop and Conference Proceedings, pages 448–456.
JMLR.org.
Kessy, A., Lewin, A., and Strimmer, K. (2015). Optimal whitening and decorrelation. arXiv preprint.
Krizhevsky, A. (2009). Learning multiple layers of features from tiny images. Technical report, Appendix A.
Krizhevsky, A., Sutskever, I., and Hinton, G. E. (2017a).
Imagenet classification with deep convolutional neu-
ral networks. Commun. ACM, 60(6):84–90.
Krizhevsky, A., Sutskever, I., and Hinton, G. E. (2017b).
Imagenet classification with deep convolutional neu-
ral networks. Commun. ACM, 60(6):84–90.
LeCun, Y., Boser, B. E., Denker, J. S., Henderson, D.,
Howard, R. E., Hubbard, W. E., and Jackel, L. D.
(1989). Backpropagation applied to handwritten zip
code recognition. Neural Computation, 1(4):541–551.
Liang, M. and Hu, X. (2015). Recurrent convolutional neu-
ral network for object recognition. In CVPR, pages
3367–3375. IEEE Computer Society.
Lin, M., Chen, Q., and Yan, S. (2013). Network in network.
CoRR, abs/1312.4400.
Liu, S. and Deng, W. (2015). Very deep convolutional
neural network based image classification using small
training sample size. In ACPR, pages 730–734. IEEE.
Mohamad, I. B. and Usman, D. (2013). Standardization and its effects on k-means clustering algorithm.
Simonyan, K. and Zisserman, A. (2014). Very deep con-
volutional networks for large-scale image recognition.
CoRR, abs/1409.1556.
Srivastava, N., Hinton, G. E., Krizhevsky, A., Sutskever, I.,
and Salakhutdinov, R. (2014). Dropout: a simple way
to prevent neural networks from overfitting. Journal
of Machine Learning Research, 15(1):1929–1958.
Stollenga, M. F., Masci, J., Gomez, F. J., and Schmidhuber,
J. (2014). Deep networks with internal selective at-
tention through feedback connections. In NIPS, pages
3545–3553.
Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S.,
Anguelov, D., Erhan, D., Vanhoucke, V., and Rabi-
novich, A. (2015). Going deeper with convolutions.
In Computer Vision and Pattern Recognition (CVPR).
Zagoruyko, S. and Komodakis, N. (2016). Wide residual
networks. In BMVC.
Zeiler, M. D. and Fergus, R. (2013). Stochastic pooling for
regularization of deep convolutional neural networks.
CoRR, abs/1301.3557.

