JCTN Avinash Rohini 417 425

Download as pdf or txt
Download as pdf or txt
You are on page 1of 10

See discussions, stats, and author profiles for this publication at: https://www.researchgate.

net/publication/337082426

Object Recognition Using Deep Learning

Article in Journal of Computational and Theoretical Nanoscience · September 2019


DOI: 10.1166/jctn.2019.8291

CITATIONS READS

14 4,063

3 authors:

Rohini Goel Avinash Sharma


Maharishi Markandeshwar (Deemed to be University), Mullana, Ambala, Haryana, … Maharishi Markandeshwar University, Mullana
9 PUBLICATIONS 41 CITATIONS 123 PUBLICATIONS 354 CITATIONS

SEE PROFILE SEE PROFILE

Rajiv Kapoor
Delhi Technological University
259 PUBLICATIONS 2,734 CITATIONS

SEE PROFILE

Some of the authors of this publication are also working on these related projects:

Anomaly detection in group activities based on fuzzy lattices using Schrödinger equation View project

Emotion Recognition View project

All content following this page was uploaded by Avinash Sharma on 23 December 2020.

The user has requested enhancement of the downloaded file.


Copyright © 2019 American Scientific Publishers Journal of
All rights reserved Computational and Theoretical Nanoscience
Printed in the United States of America Vol. 16, 4044–4052, 2019

Object Recognition Using Deep Learning


Rohini Goel1 ∗ , Avinash Sharma2 , and Rajiv Kapoor3
1
Research Scholar, Department of Computer Science and Engineering, Maharishi Markandeshwar (Deemed to be University),
Mullana 133203, Ambala, India
2
Department of Computer Science and Engineering, Maharishi Markandeshwar (Deemed to be University), Mullana 133203, Ambala, India
3
Department of Electronics and Communication Engineering, Delhi Technical University, Delhi 110042, India

The deep learning approaches have drawn much focus of the researchers in the area of
object recognition because of their implicit strength of conquering the shortcomings of classical
approaches dependent on hand crafted features. In the last few years, the deep learning techniques
have been made many developments in object recognition. This paper indicates some recent and
efficient deep learning frameworks for object recognition. The up to date study on recently devel-
oped a deep neural network based object recognition methods is presented. The various bench-
mark datasets that are used for performance evaluation are also discussed. The applications of the
object recognition approach for specific types of objects (like faces, buildings, plants etc.) are also
highlighted. We conclude up with the merits and demerits of existing methods and future scope in
this area.
Keywords: Convolutional Neural Network (CNN), Faster R-CNN, Network on Convolution
Feature Map (NoC), Deep Expectation (DEX), Deep Residual Conv-Deconv
Network, A-ConvNet.
RESEARCH ARTICLE

1. INTRODUCTION is used as texture-based descriptors to provide the spectral


Due to the rapid growth in computer technology, computer variation information required for efficient image classifi-
can play an important role to complete routine daily tasks cation. The extended morphological profiles proposed by
of everyday’s life [1]. For a human, visual object classi- Benediktsson et al. [5]. To extract spatial features for high
fication and recognition are usual and offhand biological resolution urban image classification. In high resolution
visual system process, but for a computer it is not easy. images, the Gabor filter [6] and wavelet transform [7]
To imitate because of high mutability in object images of were also used for spatial feature extraction. The intra-
the same class with different viewing conditions. Object class variation of the building database, the handcrafted
recognition is a crucial defiance in the field of computer feature is not an efficient solution. Therefore, handcrafted
vision. As the implementation of object recognition on features are replaced by the feature extracted by sparse
machines is labyrinthine task, so potent and less compli- coding scheme proposed by Chenyadat [8]. The sparse-
cated object recognition is to be designed [2]. The digital constrained support vector machine (SVM) is another fea-
database of visual information is growing day by day to ture learning model presented by Tuia et al. [9]. The deep
manage and analysis this huge mass of visual information, features [10] are more efficient and powerful than low
those image analysis approaches are required which can level feature in scene classification, image classification
automatically get its semantics contexts. The objects in the and face recognition.
images are one of the most crucial contexts for the object The mostly used object proposal approaches are based
recognition task. Good image feature descriptions are the on super-pixel grouping (e.g., MCG [11], CPMC [12]
and Selective Search [13]) and based of sliding window
backbone of the good object recognition system [3].
(e.g., edge boxes [14], objectness in windows [15]). Other
In the previous decennary, the comprehensive study
than this, there are some object proposal methods which
of high resolution image classification has been carried
are taken up as detector independent external elements
out with handcrafted features from spatial and spectral
(e.g., selective search object detection). The R-CNN [16]
domains. The gray-level co-occurrence matrix (GLCM) [4]
method is used as an object detector to segregate the pro-
posal region in object categories or backgrounds. The insti-

Author to whom correspondence should be addressed. gating work by Viola and Jones utilizes Haar [17] features

4044 J. Comput. Theor. Nanosci. 2019, Vol. 16, No. 9 1546-1955/2019/16/4044/009 doi:10.1166/jctn.2019.8291
Goel et al. Object Recognition Using Deep Learning

and boosted classifies on sliding window. The HOG fea- an important contribution in leaf identification. Gabor
tures [18] are combined with linear SVMs [19] a sliding co-occurrence in texture classification was proposed by
window classifier and DPM [20] to generate deformable Cope et al. [40] Learning vector quantization (LVQ) with
graphical models. In overfeat approach [21], every slid- radial basis function (RBF) was proposed by Rashad
ing window of convolutional feature map is used with a et al. [41] to recognize texture features. The merger of
fully connected layer for efficient detection and classifi- gray level co-occurrence matrix (GLCM) and LBP was
cation. In SPP based detection method [22], the features proposed by Tang et al. [42] to extract the texture based
are merged from the proposed region on the convolution features. The most important features for leaf identifica-
features map and initialize to fully connected layer for tion is venation features. The legume classes based on leaf
classification. venation were proposed by Larese et al. [43]. The fea-
The typical supervised classification models are like a ture extracted from the vein pattern by using hit or missed
decision tree [23], random forest [24] and support vector transform (UHMT) and then trained CNN for recogni-
machine (SVM) [25]. A random forest approach is based tion. The age reckoning can be considered as regres-
on the construction of several decision trees during train- sion or classification issue. Support Vector Regression
ing and the integration of prediction of all the trees is (SVR) [44], Canonical Correlation Analysis (CCA) [45]
used for classification. SVM uses finite training samples is the famous regression techniques, whereas the classi-
to tackle high dimensional data. The random forest and cal Nearest Neighbor (NN) and support vector machines
SVM are the shallow models; they have limited ability (SVMs) [46] are used as classification approaches. Chen
to handle the nonlinear data as compared to deep net- et al. [47] proposed CA-SVR techniques. Hureta et al. [48]
works. For image classification, Chen et al. [26] proposed image texture and local presentation descriptor and Guo
a stacked auto encoder to predict the hierarchal feature of and Mu [49] utilize CCA and PBS to real age estimation.
hyperspectral image in the spectral domain. A deep belief Ye et al. [50] proposed a multireal CNN, Wang et al. [51]
network (DBN) [27] represents spectral based features for deployed deep learning features (DLA) where as Rotre
hyperspectral data classification. Mou et al. [28] intro- et al. [52] with CNN and SVR for efficient original age
duced recurrent neural network for classification of hyper- reckoning. For apparent age estimation, other than our
spectral images. The afore mentioned methods like auto work Liu et al. proposed a technique based on deep trans-
encoders, RBN, DBM are 1-D deep architectures. The 1-D fer learning and GoogleNet framework. Zihu et al. [53]
architecture processing may cause the loss of hyperspectral deployed GoogleNet with random forest and SVR. Yang

RESEARCH ARTICLE
imagery structural information. The CNN has the capa- et al. [54] used face and landmark detection and VGG-16
bility to automatically discover the contextual 2-D spatial framework for face alignment and modeling respectively.
features for image classification. There are various super- The standard framework of SAR-ATR has three phases:
vised CNN-based models used for spectral-spatial clas- detection discrimination and classification. The extraction
sification of hyperspectral remote sensing images. Chen phase extracts the targets from the SAR using CFAR detec-
et al. [29] proposed a supervised l2 regularized 3-D CNN tion [55]. Then the discriminator stage removes the false
based feature extraction model used for classification pur- alarms and selects the features necessary to detect the tar-
pose. Ghamisi et al. [30] proposed self improving CNN get. The last phase as classifier is used to classify each
model. Zhao and Du et al. [31] introduced a spectral, spa- input as one of the two classes (target and clutter) [56].
tial features based classification framework. In the transi- The classification stage may have any of three prototypes:
tion from supervised CNN for unsupervised CNN, Romero template matching, model based methods and machine
et al. [32] represent an unsupervised convolutional net- learning. The semiautomated image intelligence process-
work for spatial-spectral feature extraction adopting sparse ing (SAIP) [57] system is the popularly used template
learning to predict the network weights. base system. But the performance of this system degrades
The various types of feature extractor were used in in extended operating conditions (EOC) [58]. To over-
the past based on shape, texture and venation. Shape come the present issue, model based moving and stationary
based Elliptic Fourier and discriminant analysis to dis- target acquisition and recognition (MSTAR) [59] system
criminate different plant types was proposed by Neto was grown with the evaluation of trainable classifier such
et al. [33]. Shape based approaches were based on invari- as: artificial neural network (ANN) [60], SVM [61] and
ant moments and centroid-radii models [34]. The combi- adaboost etc. [62]. The machine learning prototypes have
nation of geometrical and invariant moment features was been adapted to SAR. Now days, the deep convolutional
introduced by Du et al. [35] to extract features. Shape networks (ConvNet) [63] presented a remarkable perfor-
context and HOG [36] have been initialized as shape mance object detection and recognition. The remaining
descriptors. Fourier descriptor, shape defining features paper is structured as: In Sections 2 and 3, Two-Step
(SDF) [37], hand crafted shape (HCS) [38] and histogram and Single-Step architecture object recognition techniques
of curvature over scale (HoCS) [39] is some important using deep learning are (see Fig. 1) discussed. Section 4
shape base feature descriptor. Texture features have also illustrates the various object recognition applications and

J. Comput. Theor. Nanosci. 16, 4044–4052, 2019 4045


Object Recognition Using Deep Learning Goel et al.

Deep Learning Based Object high redundancy of region proposal, it is a time consuming
Recognition Techniques process.
Due to the fixed size input, the R-CNN warps or
crops the region proposal to the required size. Either the
incomplete information exists due to cropping or distor-
Two Step One Step tion occurs due to the warping operation. These effects
Architecture Architecture can weaken the recognition accuracy. To rectify this prob-
lem, He et al. presented a new CNN architecture called
SPPnet [64]. In SPPnet architecture, unlike R-CNN the
5th convolutional layer (conv 5) is reused to represent the
random sized region proposals to fixed size feature vec-
R- CNN

YOLO
Faster R-CNN

Deep Residual Conv-

SSD
Fast R-CNN
SPP Net

DEX

Deconv Network
Object-Based CNN

tors. Due to the strength of local responses and spatial


position of the feature maps, the reusability of these fea-
ture maps is feasible. The layer next to final convolutional
layer is attributed as spatial pyramid pooling layer (SPP
layer). If the conv 5 layer has 256 feature map, then after
SPPnet layer each region proposal has final feature vec-
tor dimension of 256 × (12 + 22 + 42) = 5376 due to a
Fig. 1. Overview of deep learning based object recognition techniques. 3 level pyramid. SPPnet shows better results in correct
region proposal estimations as well as enhances the detec-
Section 5 reveals the various types of datasets used in dif- tion efficiency during the testing phase due to sharing of
ferent object recognition techniques using deep learning, computation cost before SPP layer.
Section 6 shows comparative analysis and we conclude in SPPnet has some drawbacks although it has shown
Section 7. remarkable improvements in accuracy and efficiency in
comparision to R-CNN. SPPnet expenses the additional
storage space due to its multi stage-pipeline architecture
2. TWO-STEP ARCHITECTURE similar to R-CNN. The fine tuning algorithm [22] is not
In 2014, Ross Girshick [16] proposed R-CNN for the able to update the convolutional layer before SPP layer.
quality enhancement of the candidate bounding box and Due to this, an unsurprising drop in accuracy occurs in
RESEARCH ARTICLE

extraction of high level features using deep architecture. deep network. To avoid these problems, Girshick [65]
R-CNN indicated the better result over the previous best introduced a novel architecture of CNN that is known as
result using PASCAL VOC. 2012 dataset. The R-CNN two Fast R-CNN. Like SPPnet, in the Fast R-CN, the com-
stages: in Generation of Region Proposal stage, in R-CNN, plete image is handled by conv layer to generate feature
about 2K region proposals are produced using the selective map. The ROI pooling layer extract feature vector of fixed-
search method. The accurate bounding boxes of the arbi- length from every region proposal. Every feature vector
trary sizes are generated very fast with reduced searching is passed through a number of fc layers to reach into the
space using selective search. It is due to the fact that selec- two output layers. One layer produces the probability of
tive search uses a bottom up grouping and saliency cues. C+1 catagories and another layer generates the bounding
In Deep Feature Extraction using CNN, the deep CNN box position with four real-value numbers. The pipelining
extracted the deep features from the cropped or warped of Fast R-CNN is fastened by sampling the mini-batches
region proposal. These final and robust 4096-dimensional hierarchically and comprising layer fc layer by using trun-
features are obtained due to high learning capacity and cated singular value decomposition (SVD).
strong expressive power of the CNN’s structures. The In Faster R-CNN, the region proposal algorithms [13]
region proposals are recorded as positive and negative are used to predict the object location in various object
(background) regions with help of category specific linear detection networks. Fast R-CNN [65] and SPPnet are used
SVMs. The bounding box regression adjusts these regions as detection network with reduced running time, but the
and then greedy non-maximum suppression (NMS) fil- problem is with computation of region proposals. The pro-
ter them to produce final bounding boxes for secured posed RPN contributes full-image convolutional features
object locations. Although R-CNN has various improve- towards detection network [22]. The object boundaries and
ments over traditional approaches still it has some dis- objectness score are predicted by this fully convolutional
advantages. Due to the requirement of fixed size input network [66], known as RPN. The Fast R-CNN is used
image, the re-computation of the CNN takes more time in as the detection network. Then, the Fast R-CNN and RPN
testing period. R-CNN training is a multi-stage pipeline are fixed together as a unified network by using convolu-
process. More storage space and time is required for train- tional feature with ‘attention’ mechanism. Ultimately RPN
ing of R-CNN because features of different region pro- tells the detection network, where to view and then detec-
posals are extracted and stored on the disk. Due to the tion network detects objects in that particular region. The

4046 J. Comput. Theor. Nanosci. 16, 4044–4052, 2019


Goel et al. Object Recognition Using Deep Learning

deep VGG-16 is used as detection network. Region pro- 40% of the height and width of the face is taken on all
posals are engendered by drifting a small network over sides. The aligned face is squeezed to 256 × 256 pixel
feature map of last convolutional layer. The n ∗ n spatial before initializing to deep CNN. VGG-16 architecture is
window of input convolutional feature map is produced by deployed as CNN used to anticipate the age of a person.
the small network. The lower dimension feature (256-d for VGG-16 has 13-convolutional layer, with convolutional fil-
ZF [67] and 512-d for VGG) map each sliding window. ter of 3 × 3 and 3 fully connected layer. The CNN is the
The two sliding layers-a box regression layer (reg) and a fine tuned using new dataset IMDB-WIKI. The age reck-
box classification layer (cls) are fed by this feature. The oning is a regression issue if last layer is replaced by only
both networks (Region Proposal and Object Detection) one output neuron. The CNN for regression is relatively
had shown a common convolution layer. In this method, unstable, so the age prediction is considered as a classifi-
work is implemented on the ZF net (having 5 layers) and cation problem. The age value is represented by ranges of
VGG-16 [68] which has 13 sharable convolutional layers. age. The number of age range depends on the size of the
In the field of disaster rescue and urban planning, the training set.
classification and interpretation [69] accuracy and speed In Deep Residual Conv-Deconv Network, a novel
play a critical role for high resolution images [70]. The architecture of the neural network is proposed that is
recognition of complex pattern becomes challenging if the used for unsupervised spectral spatial feature learning
resolution of the images gets finer. In object based CNN, of hyperspectral image. The contemplated network, i.e.,
deep learning provides a potent way to efficiently reduce fully Conv-Deconv network, is based on encoder–decoder
the semantic gap of the complex patterns. The boundaries mechanism. The 3-D hyperspectral data is primarily
of the different objects are not captured by the deep learn- mutated into the lower dimensional space using encoder
ing techniques. To overcome this problem, the merger of (convolution sub network) and reproduced the original data
deep feature learning method and object-based classifica- by expansion through decoder (deconvolutional network).
tion strategy is proposed. This proposed method improves The residual learning and new unpooling technique is used
the accuracy of the high-resolution image classification. to fine tune the proposed network. This work presents that
The method involves two steps: extraction of deep features few neurons in the initial residual block have good capabil-
through CNN and then object based classification with the ities to predict the visual pattern in the objects. The unsu-
help of deep features. pervised spectral-spatial features are extracted using the
proposed network for remote sensing imagery. The con-

RESEARCH ARTICLE
3. SINGLE-STEP ARCHITECTURE volution subnetwork has a number of convolutional block
Szegedy et al. [71] proposed DNN based approach in where each block is made up of stack of convolutional lay-
which simple bounding box interference abstracts the ers with convolutional filter of size 3 × 3. The convolution
objects with the help of binary mask. This method is not layers have similar feature map size and the similar num-
so comfortable to extract overlapping objects. Pinheiro bers of filters. The count of channels of the feature map
et al. [72] formulated a CNN model which has two boosts with deeper convolutional block. The convolutional
branches. The first branch generates masks and the sec- layer is equipped with ReLU [76] activation function.
ond branch predicts the likelihood of the given patch of After each convolutional block, there is a max-pooling
an object. Erhan et al. [73] presented a multibox based layer to spatially shrink the feature maps. The deconvolu-
on regression to produce region proposals where as Yoo tional network used to reconstruct the input data from the
et al. [74] proposed a classification technique for object deep features. The deconvolutional network comprises of
detection using a CNN architecture named AttentionNet. unpooling [77] and convolutional layer. The convolutional
The Deep Expectation (DEX) approach for age esti- block configuration of the deconvolutional subnet is same
mation without using facial landmark and IMDB-WIKI as convolutional subnetwork. When the network is trained,
database of face images having age and gender label the learning of the network slow down due to the fact that
is introduced [75]. The actual age and apparent age the network converges to the high value of error. In this
reckoning are handled by CNN of VGG-16 architecture situation the optimization of the two networks is not easy.
pretrained on ImageNet. The age reckoning issue is con- The other problem is unpooling operation in the deconvo-
sidered as a deep classification issue with softmax. The lution subnetwork. This unpooling method simply avoids
main factor of the work: deep learned model, potent face the position of highest value which leads to depletion of
alignment and normal value calculation for age regres- edge data at the time of decoding. To settle these problems,
sion. Firstly, the face is aligned using angle and crops it the fully Conv-Deconv network is refined by including
for the successive strides. This is the potent method for residual learning and unpooling using max-pooling indices
face alignment rather than using facial landmarks because to retain a location of max. value.
its failure rate (∼1%). The cases where the face is not YOLO framework, proposed by Redmon et al. [78], pre-
detected, the full image is considered as the face. The con- dicts the confidence for multiple categories and bound-
text around face enhances the performance. So appended ing boxes. In this framework, initially the whole image is

J. Comput. Theor. Nanosci. 16, 4044–4052, 2019 4047


Object Recognition Using Deep Learning Goel et al.

divided into N × N grids and each grid cell predicts object like security, transport and medicine. The development
centered in that grid cell. It also anticipates the bounding in the field of artificial intelligence (AI) enhances the use
boxes along with their confidence score. The contributions of the deep learning technique for accurate age estimation.
of grid cells are only calculated for those that contain an The deep learning approaches [75] show the effectiveness
object. The YOLO contain 24 conv layers and 2 fc layers. and the robustness for age estimation in comparison to
Some of the conv layers build up a group of outset mod- traditional age estimation approaches.
ules having 1 ×1 reduction layer succceeded by 3 ×3 conv
layers. The network can handle image at rate 45 FPS in 4.3. Target Classification for SAR Images
real time and the Fast YOLO version can handle 155 FPS. The SAR-ATR (synthetic aperture radar automatically tar-
Which is better than other detectors. Due to strong spa- get recognition) algorithm comprises of a feature extractor
tial restraints on bounding box predictions [79], YOLO and a trainable classifier. The hand designed features are
is uncomfortable in dealing with groups of small objects. often extracted and impact the accuracy of the system.
YOLO generates coarse features due to multiple down- The deep convolutional networks achieved most advanced
sample operation because it is not so relaxed to speculate outcomes in many computer vision and speech recogni-
objects in new configuration. tion assignments by automatic learning of feature from
To avoid these problems, a novel approach single shot the huge data. As the use of convolutional networks
multibox detector (SSD) [80] is proposed by Liu et al. This for SAR-ATR encounter with serious overfitting prob-
approach is motivated by anchors used in multibox [81], lem. To overcome this issue, a new All-convolutional net-
RPN [74] and multiscale representation [82]. Unlike fixed work (A-ConvNet) is proposed. The A-ConvNet comprises
grid used in YOLO, the anchor boxes of distinct aspect of sparsely connected layers, without using only fully
ratio and scales are used in SSD to discretize the output connected layer. The A-ConvNet demonstrated superior
space of bounding boxes. The prediction from various fea- performance than traditional ConvNet on the classifica-
ture maps having different resolutions is combined by the tion [84] of target in the SAR image dataset.
network to tackle different size objects. In SSD architec-
ture, some feature layers are attached at the end of VGG16 4.4. Face Recognition
network to predict the offset of the default boxes hav- The identification of an individual using face from an
ing distinct scales and aspect ratio. The weighted sum of image or a database is known as face recognition [85].
confidence loss and localization loss is used to train the Due to the increasing volume of the dataset, machine
network. NMS is adopted on multiscale refined bounding
RESEARCH ARTICLE

learning approaches like deep neural network is used


boxes to get the final object detection. for the face recognition problem. The deep learning
approaches perform significantly with large datasets. Espe-
4. APPLICATIONS cially convolutional neural networks (CNN) attain a
4.1. Plant Identification tremendous recognition rate for face recognition problem.
The plant identification system is a sector of computer
vision which helps the botanists to identify the unknown 5. DATASET
plant species rapidly and easily. Various studies have been There are datasets having complex urban conditions of
carried out to increase the use of leaf data for prediction three very distinct cities, i.e., Beijing, Pavia and Vaihingen.
of plant species. In this method, the useful features of The Beijing view was captured by Worldview-2 in 2010.
the leaves are extracted by convolutional neural network The ROSIS sensor captured the scenes at the time of the
and the yield perception of extracted features based on flight over Pavaia and Italy. The PASCAL VOC 2007 [86]
the deconvolutional network. This method provides differ- dataset (consist of 5k trainval images and 5k test images of
ent orders of venation [83] that are better than shape [33] 20 object classes), PASCAL VOC 2012 [86] dataset, MS
information. The multilevel representation of leaf data COCO DATASET [87] (contains 80 object classes with
(from lower level to a higher level) has been observed 80k images on the training set and 40k images on valida-
according to species class. This work is helpful to design tion set and 20k images on test set) are used to assess the
a hybrid feature extraction model which enhances the per- performance of Faster R-CNN approach. The benchmark
formance of the plant classification system. of SegNet is performed on CamVid 11 road class segmen-
tation dataset [88] and SUN [89]. RGB-D indoor scene
4.2. Age Estimation database. The new leaf dataset Malayakew leaf dataset and
One of the most important attribute of identity and social the well known Flavia [90] leaf datasets are used to ana-
interaction is age. The prediction of age depends upon lyze the performance of the plant identification approach.
the various factors like posture, facial wrinkles, vocabu- In the DEX work, five distinct datasets for original
lary and information. Age estimation is used in the devel- and apparent age estimation are used. The IMDB-WIKI is
opment of numerous applications like intelligent human the new largest data of age reckoning. The MORPH and
machine interface, safety and protection in different areas FG-NET dataset are used for real age estimation whereas

4048 J. Comput. Theor. Nanosci. 16, 4044–4052, 2019


Goel et al. Object Recognition Using Deep Learning

Table I. Comparative analysis. received in detailed. Firstly, the two step framework has
Database Result been received which familiarize the architectures used for
Researchers Techniques used (% accuracy) object recognition. Then, one step frameworks such as:
YOLO, SSD etc. are also reviewed. The various bench-
R. Girshick R-CNN PASCAL VOC, 798
et al. [16] ILSIRC
mark datasets and different application areas of object
X. Bai et al. [22] SPPnet PASCAL VOC 9342 recognition are also discussed. Finally, we conclude with
2007, ILSIRC promising future scope to get an intensive perspective of
R. Girshick Fast R-CNN PASCAL VOC 893 the object recognition. This paper provides worthwhile
et al. [65] wisdom and guidance for future progress in the field of
S. Ren et al. [66] Faster R-CNN PASCAL VOC 918
deep learning based object recognition. Based on litera-
2007, PASCAL
VOC 2012 ture review, there is scope for future improvement. The
W. Zhao et al. [69] Object Beijing, Pavia, 9904 object-based CNN for high-resolution imagery classifica-
based CNN VaiHingen tion method has no contextual information on a global
Rasmus Rothe DEX IMDB-Wiki 966 level. In future, the main focus will be on the con-
et al. [75]
textual information to further improve the performance
L. Mou [76] Deep residual Pavia, Indian 8739
conv-deconv pines
because the information about the relationship between
J. Redmon Yolo PASCAL VOC 906 image object affect the classification efficiency.
quad et al. [78] 2012, COCO In Segnet, the estimation model can be design to cal-
W. Liu et al. [80] SSD PASCAL VOC 832 culate uncertainty for prediction from deep segmentation
2012, COCO network.
The training dataset can be increase in future for the age
estimation approach DEX. More robust landmark detectors
LAP dataset used for apparent age estimation. The experi- can lead to better face alignment.
ments are carried out using two mainly used hyperspectral The possible future work is to explore the capability of
data i.e., Indian Pines and Pavia University. The Indian Deep Residual Conv-Deconv Network for Hyperspectral
Pines dataset is gathered over the Indian pine sites of Image Classification approach using APs and estimation
northwestern. India using airborne visible/infrared imag- profiles that extract spatial information in a robust and
ing spectrometer (AVIRIS) sensor. The Pavia University adaptive way.
dataset is acquired over university of Pavia by reflective

RESEARCH ARTICLE
optics system imaging spectrometer (ROSIS). The perfor-
mance of hyperspectral image classification approaches is References
1. Shokoufandeh, A., Keselman, Y., Demirci, M. F., Macrini, D. and
assessed using overall accuracy (OA), average accuracy Dickinson, S.J., 2012. Many to many feature matching in object
(AA) and Kappa coefficient. The experiments are carried recognition: A Review of three approaches. IET Computer Vision,
out with A-ConvNet using moving and stationary target 6(6), pp.500–513.
acquisition and recognition (MSTAR) [59] criterion dataset 2. Lillywhite, K. and Archibald, J., 2013. A feature construc-
under standard operation conditions (SOC) and extended tion method for general object recognition. Pattern Recognition,
New York, NY, USA, Elsevier Science Inc. Vol. 46, pp.3300–
operating conditions (EOC) [58]. 3314.
3. Martin, L., Tuysuzojlu, A., Karl, W. C. and Ishwa, P., 2015. Learning
based object identification and segmentation using dual energy CT
6. COMPARATIVE ANALYSIS images for security. IEEE Transaction on Image Processing, 24(11),
Some of the research work in the field of object recogni- pp.4069–4081.
tion using deep learning summarizes in Table I. The vari- 4. Puissant, A., Hirsch, J. and Weber, C., 2005. The utility of texture
analysis to improve perpixel classification for high to very high spa-
ous techniques used by different researchers are mentioned
tial resolution imagery. International Journal Remote Sensing, 26(4),
in the table. The resulted reported by those researchers pp.733–745.
are very encouraging but calculated for particular type of 5. Benediktsson, J.A., Palmason, J.A. and Sveinsson, J.R., 2005.
database. The important question is how these methods Classification of hyperspectral data from urban areas based on
will perform when used with other databases. Therefore extended morphological profiles. IEEE Transactions on Geoscience
and Remote Sensing, 43(3), pp.480–491.
the comparative analysis between techniques mentioned in 6. Bau, M.T.C., Sarkar, S. and Healey, G., 2010. Hyperspectral
literature is desirable. region classification using a three-dimensional gabor filterbank.
IEEE Transactions on Geoscience and Remote Sensing, 48(9),
pp.3457–346.
7. CONCLUSION 7. Huang, X., Zhang, L. and Li, P., 2008. A multiscale feature fusion
In contemporary object recognition approaches, the deep approach for classification of very high resolution satellite imagery
neural network based object recognition techniques have based on wavelet transform. International Journal Remote Sensing,
29(20), pp.5923–5941.
remarkable performance due to its powerful learning capa- 8. Cheriyadat, A.M., 2014. Unsupervised feature learning for aerial
bility. In this paper, the recent developments of deep neu- scene classification. IEEE Transactions on Geoscience and Remote
ral network based object recognition framework have been Sensing, 52(1), pp.439–451.

J. Comput. Theor. Nanosci. 16, 4044–4052, 2019 4049


Object Recognition Using Deep Learning Goel et al.

9. Volpi, D.M., Mura, M.D., Rakotomamonjy, A. and Flamary, R., 27. Chen, Y., Zhao, X. and Jia, X., 2015. Spectral–spatial classification
2014. Automatic feature learning for spatio-spectral image of hyperspectral data based on deep belief network. IEEE Journal
classification with sparse svm. IEEE Transactions on Geoscience Selected Topics Applied Earth Observations Remote Sensing, 8(6),
and Remote Sensing, 52(10), pp.6062–6074. pp.2381–2392.
10. Hinton, G.E., Osindero, S. and Teh, Y.W., 2006. A fast learning 28. Mou, L., Ghamisi, P. and Zhu, X.X., 2017. Deep recurrent neu-
algorithm for deep belief nets. Neural Computation, 18(7), pp.1527– ral networks for hyperspectral image classification. IEEE Trans-
1554. actions on Geoscience and Remote Sensing, 55(7), pp.3639–
11. Arbelaez, P., Pont-Tuset, J., Barron, J.T., Marques, F. and Malik, J., 3655.
2014. Multiscale combinatorial grouping. Computer Vision and Pat- 29. Chen, Y., Jiang, H., Li, C., Jia, X. and Ghamisi, P., 2016. Deep
tern Recognition (CVPR), pp.328–335. feature extraction and classification of hyperspectral images based
12. Carreira, J. and Sminchisescu, C., 2012. CPMC: Automatic object on convolutional neural networks. IEEE Transactions on Geoscience
segmentation using constrained parametric min-cuts. IEEE Transac- and Remote Sensing, 54(10), pp.6232–6251.
tions on Pattern Analysis and Machine Intelligence, 34(7), pp.1312– 30. Ghamisi, Y., Chen and Zhu, X.X., 2016. A self-improving convolu-
1328. tion neural network for the classification of hyperspectral data. IEEE
13. Uijlings, J.R., van de Sande, K.E., Gevers, T. and Smeulders, A.W., Transactions on Geoscience and Remote Sensing, 13(10), pp.1537–
2013. Selective search for object recognition. International Journal 1541.
of Computer Vision, 104(2), pp.154–171. 31. Zhao, W. and Du, S., 2016. Spectral–spatial feature extraction for
14. Zitnick, C.L. and Dollar, P., 2014. Edge Boxes: Locating Object hyperspectral image classification: A dimension reduction and deep
Proposals from Edges. European Conference on Computer Vision, learning approach. IEEE Transactions on Geoscience and Remote
pp.391–405. Sensing, 54(8), pp.4544–4554.
15. Alexe, B., Deselaers, T. and Ferrari, V., 2012. Measuring the object- 32. Romero, A., Gatta, C. and Camps-Valls, G., 2016. Unsupervised
ness of image windows. IEEE Transactions on Pattern Analysis and deep feature extraction for remote sensing image classification. IEEE
Machine Intelligence, 34(11), pp.2189–2202. Transactions on Geoscience and Remote Sensing, 54(3), pp.1349–
16. Girshick, R., Donahue, J., Darrell, T. and Malik, J., 2014. Rich Fea- 1362.
ture Hierarchies for Accurate Object Detection and Semantic Seg- 33. Neto, J.C., Meyer, G.E., Jones, D.D. and Samal, A.K., 2006. Plant
mentation. IEEE Conference on Computer Vision and Pattern Recog- species identification using elliptic fourier leaf shape analysis. Com-
nition, pp.580–587.
puters and Electronics in Agriculture, 50(2), pp.121–134.
17. Viola, P. and Jones, M., 2001. Rapid Object Detection Using a
34. Chaki, J. and Parekh, R., 2011. Plant leaf recognition using shape
Boosted Cascade of Simple Features. Proceeding of IEEE Computer
based features and neural network classifiers. International Journal
Society Conference on Computer Vision and Pattern Recognition,
of Advanced Computer Science and Applications, 2(10).
pp.I-511–I-518.
35. Du, J.X., Wang, X.F. and Zhang, G.J., 2007. Leaf shape based plant
18. Dalal, N. and Triggs, B., 2005. Histograms of Oriented Gradients
species recognition. Applied Mathematics and Computation, 185(2),
for Human Detection. Proceeding of IEEE Computer Society Con-
pp.883–893.
ference on Computer Vision and Pattern Recognition, pp.886–893.
RESEARCH ARTICLE

36. Mouine, S., Yahiaoui, I. and Verroust-Blondet, A., 2012. Advanced


19. Uijlings, J.R., van de Sande, K.E.T., Gevers and Smeulders, A.W.,
Shape Context for Plant Species Identification Using Leaf Image
2013. Selective search for object recognition. International Journal
Retrieval. Proceedings of the 2nd ACM International Conference on
of Computer Vision, 104, pp.154–171.
Multimedia Retrieval, p.49.
20. Felzenszwalb, P.F., Girshick, R.B., McAllester, D. and Ramanan, D.,
37. Aakif, A. and Khan, M.F., 2015. Automatic classification of
2010. Object detection with discriminatively trained part-based mod-
plants based on their leaves. Biosystems Engineering, 139,
els. IEEE Transactions on Pattern Analysis and Machine Intelli-
gence, 32(9), pp.1627–1645. pp.66–75.
21. Sermanet, P. Eigen, D., Zhang, X., Mathieu, M., Fergus, R. and 38. Hall, D., McCool, C., Dayoub, F., Sunderhauf, N. and Upcroft, B.,
LeCun, Y., 2013. OverFeat: Integrated Recognition, Localization and 2015. Evaluation of Features for Leaf Classification in Challenging
Detection Using Convolutional Networks. International Conference Conditions. IEEE Winter Conference on Applications of Computer
on Learning Representations (ICLR 2014), p.16. Vision, pp.797–804.
22. He, K., Zhang, X., Ren, S. and Sun, J., 2014. Spatial Pyramid 39. Kumar, N. Belhumeur, P.N., Biswas, A. Kress, W.J., Lopez,
Pooling in Deep Convolutional Networks for Visual Recognition. I.C. and Soares, J.V., 2012. Leafsnap: A computer vision sys-
Proceeding in 13th European Conference on Computer Vision, tem for automatic plant species identification. ECCV Springer,
pp.346–361. pp.502–516.
23. Delalieux, S. Somers, B. Haest, B., Spanhove, T., Borre, J.V. and 40. Cope, J.S., Remagnino, P., Barman, S. and Wilkin, P., 2010. Plant
Mücher, C.A., 2012. Heathland conservation status mapping through Texture Classification Using Gabor Co-Occurrences. International
integration of hyperspectral mixture analysis and decision tree clas- Symposium on Visual Computing, Springer. pp.669–677.
sifiers. Journal Article in Remote Sensing of Environment, 126, 41. Rashad, M., ElDesouky, B. and Khawasik, M.S., 2011. Plants images
pp.222–231. classification based on textural features using combined classifier.
24. Ham, J., Chen, Y., Crawford, M.M. and Ghosh, J., 2005. Investiga- International Journal of Computer Science & Information Technol-
tion of the random forest framework for classification of hyperspec- ogy, 3(4), pp.93–100.
tral data. IEEE Transactions on Geoscience and Remote Sensing, 42. Tang, Z., Su, Y., Er, M. J., Qi, Zhang, F. L. and Zhou, J., 2015.
43(3), pp.492–501. A local binary pattern based texture descriptors for classification of
25. Melgani, F. and Bruzzone, 2004. L.Classification of hyperspec- tea leaves. Neurocomputing, 168, pp.1011–1023.
tral remote sensing images with support vector machines. IEEE 43. Larese, M.G., Namías, R., Craviotto, R.M., Arango, M.R., Gallo,
Transactions on Geoscience and Remote Sensing, 42(8), pp.1778– C. and Granitto, P.M., 2014. Automatic classification of legumes
1790. using leaf vein image features. Pattern Recognition, 47(1),
26. Chen, Y., Lin, Z., Zhao, X., Wang, G. and Gu, Y., 2014. Deep pp.158–168.
learning-based classification of hyperspectral data. IEEE Journal 44. Drucker, H., Burges, C.J.C., Kaufman, L., Smola, A.J. and
Selected Topics Applied Earth Observations Remote Sensing, 7(6), Vapnik, V., 1997. Support vector regression machines. Advances in
pp.2094–2107. Neural Information Processing Systems, 9, pp.155–161.

4050 J. Comput. Theor. Nanosci. 16, 4044–4052, 2019


Goel et al. Object Recognition Using Deep Learning

45. Hardoon, D.R., Szedmak, S. and Shawe-Taylor, 2004. Canonical cor- IEEE Transactions on Pattern Analysis and Machine Intelligence,
relation analysis: An overview with application to learning methods. 39(6).
Neural Computation, 16(12), pp.2639–2664. 67. Zeiler, M.D. and Fergus, R., 2014. Visualizing and Understanding
46. Cortes, C. and Vapnik, 1995. Support-vector networks. Machine Convolutional Neural Networks. Proceeding 13th European Confer-
Learning, 20(3), pp.273–297. ence on Computer Vision, pp.818–833.
47. Chen, K., Gong, S., Xiang, T., and Loy, C., 2013. Cumula- 68. Simonyan, K. and Zisserman, A., 2015. Very Deep Convolutional
tive Attribute Space for Age and Crowd Density Estimation. Networks for Large-Scale Image Recognition. Proceeding in Inter-
IEEE Conference on Computer Vision and Pattern Recognition national Conference Learning Representations.
(CVPR). 69. Zhao, W., Du, S. and Emery, W.I., 2017. Object based convo-
48. Huerta, I., Fernández, C. and Prati, 2014. Facial Age Estimation lutional neural network for high resolution imagery classification.
Through the Fusion of Texture and Local Appearance Descriptors. IEEE Journal of Selected Topics in Applied Earth Observation and
IEEE European Conference on Computer Vision (ECCV). Remote Sensing, 10(7).
49. Guo, G. and Mu, 2014. A framework for joint estimation of age, 70. Zhao, W. and Du, S., 2016. Spectral-spatial feature extraction for
gender and ethnicity on a large database. Image and Vision Comput- hyperspectral image classification: A dimension reduction and deep
ing, 32(10), pp.761–770. learning approach. IEEE Transactions on Geoscience and Remote
50. Yi, D., Lei, Z. and Li, S.Z., 2014. Age Estimation by Multi- Sensing, 54(8), pp.4544–4554.
Scale Convolutional Network. Asian Conference on Computer Vision 71. Szegedy, C., Toshev, A. and Erhan, D., 2013. Deep neural networks
(ACCV). for object detection. Advances in Neural Information Processing Sys-
51. Wang, X., Guo, R. and Kambhamettu, 2015. Deeply-Learned Fea- tems 26 (NIPS 2013).
ture for Age Estimation. IEEE Winter Conference on Applications 72. Pinheiro, P.O., Collobert, R. and Dollár, P., 2015. Learning to seg-
of Computer Vision (WACV). ment object candidates. Advances in Neural Information Processing
52. Rothe, R., Timofte, R. and Van Gool, 2016. Some Like It Hot-Visual Systems 26.
Guidance for Preference Prediction. IEEE Conference on Computer 73. Szegedy, C., Reed, S., Erhan, Anguelov, D. and Ioffe, S., 2014. Scal-
Vision and Pattern Recognition (CVPR). able, High-Quality Object Detection. ArXiv:1412.1441.
53. Zhu, Y., Li, Y., Mu, G. and Guo, 2015. A Study on Apparent 74. Yoo, D., Park, S., Lee, J.-Y., Paek, A.S. and Kweon, I.S., 2015.
Age Estimation. IEEE International Conference on Computer Vision Attentionnet: Aggregating Weak Directions for Accurate Object
(ICCV) Workshops. Detection. IEEE Conference on Computer Vision and Pattern Recog-
54. Yang, X., Gao, B.B., Xing, C., Huo, Z.W., Wei, X.S., Zhou, Y., nition (CVPR).
Wu, J. and Geng, 2015. Deep Label Distribution Learning for Appar- 75. Rothe, R., Timofte, R. and Goo, L.V., 2018. Deep expectation of
ent Age Estimation. IEEE International Conference on Computer real and apparent age from a single image without facial landmarks.
Vision (ICCV) Workshops. International Journal of Computer Vision, 126, pp.144–157.
55. Cui, Y., Zhou, G., Yang, J. and Yamaguchi, Y., 2011. On the iterative 76. Mau, L., Ghamisi, P. and Zhu, X.X., 2018. Unsupervised spectral-
censoring for target detection in SAR image. IEEE Transactions on spatial feature learning via deep residual conv-deconv network for
Geoscience and Remote Sensing, 8(4), pp.641–645. hyperspectral image classification. IEEE Transactions on Geoscience

RESEARCH ARTICLE
56. Park, J.I. and Kim, K.T., 2014. Modified polar mapping classifier for and Remote Sensing, 56(1).
SAR automatic target recognition. IEEE Transaction on Aerospace 77. Dosovitskiy, A., Springenberg, J.T. and Brox, T., 2015. Learning
and Electronic Systems, 50(2), pp.1092–1107. to Generate Chairs, Tables and Cars with Convolutional Networks.
57. Novak, L.M., Owirka, G.J., Brower, W.S. and Weaver, A.L., 1997. Proceeding in IEEE Conference on Computer Vision Pattern Recog-
The automatic target-recognition system in SAIP. The Lincoln Lab- nition (CVPR), pp.1538–1546.
oratory Journal, 10(2), pp.187–201. 78. Redmon, J. and Farhadi, A., 2016. Yolo9000: Better, Faster,
58. Ross, T.D., Bradley, J.J., Hudson, L.J. and O’Connor, M.P., 1999. Stronger. ArXiv:1612.08242.
SAR ATR: So What’s the Problem? An MSTAR Perspective. Pro- 79. Redmon, J., Divvala, S., Girshick, R. and Farhadi, A. 2016. You
ceeding in 6th SPIE Conference Algorithms SAR Imagery, Vol. 3721, Only Look Once: Unified, Real-Time Object Detection. IEEE Con-
pp.566–579. ference on Computer Vision and Pattern Recognition (CVPR).
59. Keydel, E.R., Lee, S.W. and Moore, J.T., 1996. MSTAR Extended 80. Liu, W., Anguelov, D., Erhan, D., Szegedy, C., Reed, S., Fu, C.-Y.
Operating Conditions: A Tutorial. Proceeding in 6th SPIE Confer- and Berg, A.C., 2016. Ssd: Single Shot Multibox Detector. IEEE
ence Algorithms SAR Imagery, Vol. 2757, pp.228–242. European Conference on Computer Vision (ECCV).
60. Hirose, A., ed., 2013. Complex-Valued Neural Networks: Advances 81. Erhan, D., Szegedy, C., Toshev, A. and Anguelov, D., 2014. Scalable
and Applications. Hoboken, NJ, USA. Wiley-IEEE Press. Object Detection Using Deep Neural Networks. IEEE Conference
61. Zhao, Q. and Principe, J.C., 2001. Support vector machines for SAR on Computer Vision and Pattern Recognition (CVPR).
automatic target recognition. IEEE Transaction on Aerospace and 82. Bell, S., Zitnick, C.L., Bala, K. and Girshick, R., 2016. Inside-
Electronic Systems, 37(2), pp.643–654. outside net: Detecting Objects in Context with Skip Pooling and
62. Sun, Y.J., Liu, Z.P., Todorovic, S. and Li, J., 2007. Adaptive boost- Recurrent Neural Networks. IEEE Conference on Computer Vision
ing for SAR automatic target recognition. IEEE Transaction on and Pattern Recognition (CVPR).
Aerospace and Electronic Systems, 43(1), pp.112–125. 83. Charters, J., Wang, Z., Chi, Z., Tsoi, A.C. and Feng, D.D., 2014.
63. Krizhevsky, A., Sutskever, I. and Hinton, G.E., 2012. Imagenet EAGLE: A Novel Descriptor for Identifying Plant Species Using
classification with deep convolutional neural networks. Proceedings Leaf Lamina Vascular Features. ICME-Workshop, pp.1–6.
of Advances in Neural Information Processing Systems, pp.1097– 84. Dudgeon, D.E. and Lacoss, R.T., 1993. An overview of automatic
1105. target recognition. The Lincoln Laboratory Journal, 6(1), pp.3–10.
64. Bai, X., Wang, X., Latecki, L.J., Liu, W. and Tu, Z., 2010. Active 85. Wang, W., Yang, J., Xiao, J., Li, S. and Zhou, D., 2015. Face Recog-
Skeleton for Non-Rigid Object Detection. IEEE 12th International nition Based on Deep Learning. Springer International Publishing
Conference on Computer Vision (ICCV). Switzerland. pp.812–820.
65. Girshick, R., 2015. Fast R-CNN. IEEE International Conference on 86. Everingham, M., Van Gool, L., Williams, C.K.I., Winn, J. and
Computer Vision, pp.1440–1448. Zisserman, A., 2007. The PASCAL visual object classes chal-
66. Ren, S., He., K., Girshick, R. and Sun, J., 2017. Faster R-CNN: lenge results. International Journal of Computer Vision, 88(2),
Towards real-time object detection with region proposal networks. pp.303–338.

J. Comput. Theor. Nanosci. 16, 4044–4052, 2019 4051


Object Recognition Using Deep Learning Goel et al.

87. Lin, T.-Y., Maire, M., Belongie, S., Hays, J., Perona, P., 89. Zitnick, C.L. and Dollar, P., 2014. Edge Boxes: Locating Object
Ramanan, D., Dollar, P. and Zitnick, C.L., 2014. Microsoft COCO: Proposals from Edges. Proceeding in 13th European Conference on
Common Objects in Context. Proceeding in European Conference Computer Vision, pp.391–405.
on Computer Vision, pp.740–755. 90. Wu, S.G., Bao, F.S., Xu, E.Y., Wang, Y.-X., Chang, Y.-F. and
88. Song, S., Lichtenberg, S.P. and Xiao, J., 2015. SUN RGB-D: Xiang, Q.-L., 2007. A Leaf Recognition Algorithm for Plant Clas-
A RGB-D Scene Understanding Benchmark Suite. Proceeding sification Using Probabilistic Neural Network. IEEE International
in IEEE Conference on Computer Vision Pattern Recognition, Symposium on Signal Processing and Information Technology,
pp.567–576. pp.11–16.

Received: 20 April 2019. Accepted: 10 May 2019.


RESEARCH ARTICLE

4052 J. Comput. Theor. Nanosci. 16, 4044–4052, 2019

View publication stats

You might also like