
A Scalable Video Analytics System for Orientation Fusion based Visual Object Recognition
Muhammad Usman Yaseen, Ashiq Anjum, and Nikolaos Antonopoulos

Abstract—Visual object recognition from live video streams comes with numerous challenges such as variation in illumination conditions and poses. Convolutional neural networks have been used by emerging multimedia and computer vision applications to perform intelligent visual object recognition, however, their accuracy severely degrades when they are applied to illumination variant video datasets. To address this problem, we propose an orientation fusion based visual object recognition system using convolutional neural networks. The proposed cloud based video analytics system leverages bi-dimensional empirical mode decomposition (BEMD) to split an image into intrinsic mode functions (IMFs). These intrinsic mode functions undergo the Riesz transform to produce monogenic object components. These components are then used for the training of convolutional neural networks. It has been observed that the orientation component of the object leads to a higher accuracy of 93 percent. We further propose a feature fusion strategy for the orientation components which improves the visual recognition accuracy to 97 percent. The proposed cloud based video analytics system has been demonstrated to process a large number of video streams, and the underlying infrastructure is able to scale based on the number and size of the video stream(s) being processed. Extensive experimentation on publicly available image and video datasets reveals that the proposed system is significantly more accurate and scalable than AlexNet and LeNet, the two most commonly used deep learning models for visual object recognition.

Index Terms—Multimedia Data Analytics; Convolutional Neural Networks; Scalable Object Recognition; Cloud Computing

I. INTRODUCTION

Visual object recognition is a vital component of any multimedia data analytics system and aids in a number of applications including medical image processing, visual object tracking, interactive virtual reality games and many others. To build a highly accurate visual object recognition system, multimedia applications have exploited knowledge from different domains including machine learning, image processing and distributed systems.

The most common challenges that current visual object recognition systems experience include pose and illumination variations, facial expressions, aging conditions and, above all, scalability, since the databases can contain hundreds of millions of images and videos. Multimedia data analytics systems suffer serious accuracy degradation when evaluated on challenging video datasets. Challenging datasets are those captured under uncontrolled conditions and are therefore closer to real-life situations. These datasets contain a number of challenges including illumination, blur and noise. The expression and illumination challenges, when combined, can particularly vary the appearance of an object. These variations can be so intense that it becomes difficult even for humans to recognize the objects. Such challenges need to be tackled with intelligent video processing methods to implement highly accurate visual object recognition systems.

Convolutional Neural Networks (CNNs) have recently been used to perform visual object recognition on video datasets. CNNs have proved successful on a number of object detection and classification tasks on large video datasets. They also have generalization capability and can be trained on large scale video datasets belonging to different classes. However, CNNs struggle to perform well on challenging datasets and their accuracy drops severely, especially in the case of expression and illumination variant datasets.

In order to achieve high visual object recognition accuracy on challenging datasets, we propose an empirical mode decomposition (EMD) [1] based implementation of CNNs. We split the input video dataset, comprising images and videos, into its intrinsic mode functions (IMFs) [2] by using EMD. The Riesz transform [3] is then applied on the resulting IMFs to generate the monogenic components. The local monogenic components, including phase, orientation and amplitude, are then analyzed to determine which of these components contributes to a higher accuracy rate with the CNNs. Figure 1 depicts the workflow of our proposed visual object recognition system. It has been observed through the experiments in Section V that the orientation component of the visual object contributes to the higher accuracy rates. Inspired by this fact, we further propose a feature fusion strategy based on the orientation component of the IMFs. The higher IMFs are relatively noise free and contain the higher frequency components. We have fused the higher IMFs into a single IMF to produce a high quality image, which leads to further improvements in visual object recognition rates.

To achieve high scalability and video processing throughput, the proposed video analytics system has been deployed on cloud nodes running the Spark distributed framework [4]. An iterative map-reduce paradigm has been used to perform parallel training on multiple compute nodes. The parallel and distributed training processes large amounts of data rapidly and efficiently. The underlying cloud infrastructure is iteratively tuned for maximum resource utilization to support large scale visual object recognition. It is customizable in terms of scalability, i.e. compute nodes can be added or removed with the addition or deletion of data.

Muhammad Usman Yaseen, Ashiq Anjum and Nikolaos Antonopoulos are with the Department of Computing and Mathematics, University of Derby, Derby, UK, e-mail: (m.yaseen, a.amjum, [email protected]).

Fig. 1: Workflow of the Proposed System

The following are the major contributions of this paper. Firstly, we pioneer the use of EMD with CNNs to improve visual object recognition accuracy on challenging video datasets. We study the orientation, phase and amplitude components and report their performance in terms of visual recognition accuracy. We show that the orientation component is a good candidate for achieving high object recognition accuracy on illumination and expression variant video datasets. Secondly, we propose a feature fusion strategy for the orientation components to further improve the accuracy rate. We show that the orientation fusion approach significantly improves the visual recognition accuracy under challenging conditions. Thirdly, we scale and optimize the underlying cloud based infrastructure to improve the visual object recognition time of the system so that it can be deployed on large video datasets.

The rest of the paper is organized as follows. Section II reviews the related work on visual object recognition and highlights its strengths and weaknesses. Section III explains the approach and implementation of our visual object recognition system. Section IV details the experimental setup, Section V presents the evaluation results, and Section VI concludes the paper and outlines future work.

II. RELATED WORK

Object recognition has been an active area of research for the last few decades. Previously, researchers were very active in applying shallow networks to object recognition problems. Object recognition systems based on shallow networks use hand-crafted features and can be broadly divided into two categories. Global feature based systems provide a coarse representation of an image but also produce a high dimensional feature matrix. Local feature based systems, which work on smaller image areas, provide more detail about a specific local patch within an image. These are robust to noise and occlusion but require more computation time and resources. Yaseen et al. [5] proposed a cloud based video analytics system using GPUs to reduce the computational complexity involved in the feature extraction process. However, all the object recognition systems based on shallow networks produce high dimensional feature vectors and are not suitable for large scale data processing.

With the advent and recent tremendous success of deep networks, more and more research has been going on in object detection and recognition using deep networks. These deep networks can perform recognition of objects on large scale data as compared to shallow networks, but require more computation resources and training time. Li et al. [6] proposed a scale-aware fast R-CNN model to detect pedestrians in natural scenes. They used multiple subnetworks for the detection of pedestrians at different scales. Each subnetwork generated an output, and these were combined to produce a final detection. Attentive contexts were proposed by Li et al. [7] to perform object detection. They used local and global surrounding contexts which were fused together to make accurate decisions for object detection. Liu et al. [8] proposed a vehicle re-identification system using deep neural networks. They used multimodal data such as visual features and contextual information to identify vehicles progressively from large-scale surveillance data.

A number of CNN models were also proposed for object classification. Krizhevsky et al. [9] proposed a deep convolutional neural network to perform image classification on the ImageNet dataset. The proposed CNN model consisted of 60 million parameters with five convolutional layers and three fully connected layers. Similarly, LeCun et al. [10] proposed another convolutional neural network model with less than one million parameters to perform classification of handwritten digits [11]. The proposed CNN model contained two convolutional layers, each followed by a subsampling layer, leading to fully connected layers.

CNNs have also been used to recognize objects from RGB-D data. Wang et al. [12] proposed a multi-modal CNN with separate layers for color and depth. These layers were then connected to discover features from each modality. Tang et al. [13] proposed to use the underlying data structure and prior knowledge of the data to recognize objects from RGB-D data. A CNN based model for the estimation of human head pose in RGB-D data was proposed by Mukherjee et al. [14]. CNN

Fig. 2: Proposed Approach

and regressor based models were fine-tuned and combined to estimate the confidence for the regression.

Recent research showed that convolutional neural networks can work well on image or video data if the data is of good quality. However, the accuracy of a convolutional neural network severely degrades if it is applied to a challenging dataset containing issues such as illumination variation and noise. We have tackled this issue by using empirical mode decomposition [1] and shifting the data from the time domain to the spatial-frequency domain. EMD has been used in the past to perform recognition and classification. Ehsan et al. [15] proposed an image fusion and enhancement technique using EMD to decompose non-stationary signals into IMFs [2]. Linderhed et al. [16] proposed image empirical mode decomposition (IEMD) to locally separate superimposed spatial frequencies from the image. Liu et al. [17] presented 2DEMD to extract the local features of the two-dimensional intrinsic mode functions for edge detection. Yaseen et al. [18] pioneered the use of EMD on video data in a parallel and distributed system. They used the first three IMFs and proposed a stack based hierarchy to perform object classification on a challenging dataset. However, all these works used EMD with shallow networks and did not exploit the use of EMD for deep networks.

III. OBJECT RECOGNITION APPROACH AND IMPLEMENTATION

This section describes the approach of our proposed visual object recognition system, as shown in Figure 2. The input training dataset X is represented by:

Training dataset X = x1, x2, . . . , xn   (1)

Here, x1, x2, . . . , xn represent the individual subjects present in the training database. Each individual subject in the training database consists of a number of training samples, given by:

x1 = i1, i2, . . . , in
x2 = i1, i2, . . . , in
x3 = i1, i2, . . . , in   (2)
. . .
xn = i1, i2, . . . , in

Here, i1, i2, . . . , in represent the individual images of each subject present in the training dataset. Each training sample i from each individual subject x undergoes bi-dimensional empirical mode decomposition (BEMD) to be decomposed into its frequency components. EMD generates these frequency components through a sifting process in which the highest frequency components of the training sample are extracted in each cycle or mode. Each mode stores the high frequencies as an IMF. The IMFs are stored in decreasing order of their frequencies, so the lowest IMF contains the lowest frequencies.

The sifting process [19] first determines the extrema points of the training sample k(i, j), where i, j are the dimensions of the training sample. These extrema points are then connected to form upper and lower envelopes. The average of the upper and lower envelopes is calculated to produce the mean envelope mean(i, j), as shown in Figure 3, given by:

mean(i, j) = (e_upper(i, j) + e_lower(i, j)) / 2   (3)
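To make the sifting procedure concrete, the following is a minimal NumPy/SciPy sketch of the BEMD loop just described. It is illustrative only, not the authors' implementation: the 3x3 neighbourhood used for extrema detection, the cubic surface interpolation and the mean-envelope stopping threshold are all assumptions.

```python
import numpy as np
from scipy.ndimage import maximum_filter, minimum_filter
from scipy.interpolate import griddata

def envelope(img, extrema_mask):
    """Interpolate a smooth surface through the selected extrema points."""
    h, w = img.shape
    ys, xs = np.nonzero(extrema_mask)
    grid_y, grid_x = np.mgrid[0:h, 0:w]
    surf = griddata((ys, xs), img[ys, xs], (grid_y, grid_x), method='cubic')
    # Cubic interpolation is undefined outside the convex hull of the
    # extrema; fall back to nearest-neighbour values there
    near = griddata((ys, xs), img[ys, xs], (grid_y, grid_x), method='nearest')
    return np.where(np.isnan(surf), near, surf)

def sift_once(img):
    """One sifting pass: build the Eq. (3) mean envelope, then subtract it."""
    maxima = img == maximum_filter(img, size=3)
    minima = img == minimum_filter(img, size=3)
    mean_env = (envelope(img, maxima) + envelope(img, minima)) / 2.0
    return img - mean_env, mean_env

def bemd(img, n_imfs=3, max_sift=8, tol=1e-2):
    """Split an image into n_imfs IMFs plus a residual."""
    residual = np.asarray(img, dtype=float).copy()
    imfs = []
    for _ in range(n_imfs):
        proto = residual.copy()
        for _ in range(max_sift):
            proto, mean_env = sift_once(proto)
            if np.abs(mean_env).mean() < tol:  # mean envelope close to zero
                break
        imfs.append(proto)
        residual = residual - proto            # repeat on the residual
    return imfs, residual
```

Each returned IMF holds progressively lower spatial frequencies, matching the decreasing-frequency ordering described above.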

Fig. 3: Averaged Extrema Surfaces

Fig. 4: Amplitude, Phase and Orientation of first three IMFs

The mean envelope mean(i, j) is then subtracted from the training sample k(i, j) to produce T1_k, given by:

T1_k = k(i, j) − mean(i, j)   (4)

The whole process is repeated until T1_k is a two dimensional IMF: when the mean envelope mean(i, j) comes close to zero, the process is stopped, otherwise it keeps on reiterating. The residual is obtained by removing T1_k from the original training sample k(i, j). If the residual is represented by Res(i, j), it is given by:

Res(i, j) = k(i, j) − T1_k   (5)

In order to obtain the next IMF, the whole procedure is repeated on the residual Res(i, j) by considering it as a training sample. Repetition of this process on all the subsequent residuals results in a number of IMFs in decreasing order of their frequencies, as shown in Figure 4. All the resultant IMFs and the residual can be grouped together to reconstruct the original training sample. This procedure is shown in Algorithm 1.

Algorithm 1 Empirical Mode Decomposition
Input:
    Input dataset x1, x2, . . . , xn, 192 x 168 image size
    Width of each image W
    Height of each image H
    Number of iterations m
    Number of IMFs n
Output:
    result: IMFs
while !residue do
    Let the proto-IMF be x̂(x,y) = x(w,h)
    while IMF <= 3 do
        while !criteria do
            Identify local maxima and minima of (w,h)
            Find envelope e_lower(x, y)
            Find envelope e_upper(x, y)
            Mean m(x,y) = (e_upper(x, y) + e_lower(x, y)) / 2
            Extract detail h1 = x̂(x,y) − m(x,y)
            x̂(x,y) = h1
        end
        x̂(x,y) = x̂(x,y) − Σ_{j=1..3} h_j(x, y)
    end
end

Algorithm 2 Training weight vectors on local components
Input:
    Input dataset x1, x2, . . . , xn, 192 x 168 image size
    Output target label 1-in-k vectors y1, y2, . . . , yt
    Number of back-propagation epochs R
    Number of convolution masks J
    Activation function of convolution and subsampling g(.)
Output:
    result: Recognition Labels
while epoch r: 1 → R do
    while training image number x: 1 → X do
        Compute J hidden activation matrices z1, z2, . . . , zJ
            g(x_{k,l} ∗ w_{k,l} + B_{k,l})
        Downsample matrices z1, z2, . . . , zJ by a factor of 2
            g(↓2 x_{k,l} ∗ w_{k,l} + b_{k,l})
        Calculate weight and bias deltas
            ΔW_{t,k} = lr Σ_{i=1..N} (x_i ∗ D_i^h) + m ΔW_{t−1,k}
            ΔB_{t,k} = lr Σ_{i=1..N} D_i^h + m ΔB_{t−1,k}
        Calculate softmax activation vector a
            l(i, x_i^T) = M(e_i, f(x_i^T))
        Compute error y_x − a
        Back propagate and update network weights
            W_{t+1} = W_t − α ∇L(θ_t)
    end
    f_co ⇐ FullyConnected
    result ⇐ Softmax(f_co)
end

After obtaining all the required IMFs of the input data, the Riesz transform is applied to produce the monogenic data. Monogenic data aids in studying the local components of the input data; the local components are calculated from each IMF. The Riesz transform in the frequency domain is given as:

f_R(v) = i (v / |v|) × f(v) = h2(v) × f(v)   (6)

The transfer function h2 in the spatial domain is given as:

f_R(x) = i (x / (2π |x|³)) × f(x) = h2(x) × f(x)   (7)
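Equations (6) and (7) can be evaluated efficiently in the frequency domain. The sketch below computes the Riesz transform of an IMF via the FFT and reads the local amplitude, phase and orientation off the resulting monogenic signal; it assumes the standard monogenic-signal formulation rather than the authors' exact code.

```python
import numpy as np

def monogenic(imf):
    """Riesz transform via FFT; returns amplitude, phase, orientation maps."""
    h, w = imf.shape
    u = np.fft.fftfreq(h).reshape(-1, 1)    # vertical frequency grid
    v = np.fft.fftfreq(w).reshape(1, -1)    # horizontal frequency grid
    radius = np.sqrt(u**2 + v**2)
    radius[0, 0] = 1.0                      # avoid division by zero at DC
    F = np.fft.fft2(imf)
    # Riesz kernels i*u/|w| and i*v/|w| (frequency-domain form of Eq. 6)
    r1 = np.real(np.fft.ifft2(F * (1j * u / radius)))
    r2 = np.real(np.fft.ifft2(F * (1j * v / radius)))
    amplitude = np.sqrt(imf**2 + r1**2 + r2**2)
    phase = np.arctan2(np.sqrt(r1**2 + r2**2), imf)
    orientation = np.arctan2(r2, r1)
    return amplitude, phase, orientation

# Usage: amp, pha, ori = monogenic(imf) for each IMF produced by BEMD
```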

The monogenic data is given by:

f_m(x) = f(x) (i, j) × f_R(x)   (8)

Let X_amp, X_pha and X_ori be the amplitude, phase and orientation spectrums of all the training samples present in the database. These can be represented as:

X1_amp = i_amp1, i_amp2, . . . , i_ampn
X1_pha = i_pha1, i_pha2, . . . , i_phan
X1_ori = i_ori1, i_ori2, . . . , i_orin
. . .   (9)
X34_amp = i_amp1, i_amp2, . . . , i_ampn
X34_pha = i_pha1, i_pha2, . . . , i_phan
X34_ori = i_ori1, i_ori2, . . . , i_orin

Here, x_amp, x_pha and x_ori represent the amplitude, phase and orientation spectrums of the individual subjects present in the training database. We created a fused orientation spectrum [20] from the orientation spectrums of the first two IMFs. Since most of the illumination effects and noise present in the images reside in the lowest frequency bands, we have discarded the lower IMFs and retained only the first two IMFs, which contain the high frequency components. The fused spectrum is a composite spectrum which contains elements from both spectrums. It merges the orientation spectrums of the first two IMFs into a single surface, carrying the meaning of the original two orientation spectrums. The fused spectrum shows the original two orientation spectrums overlaid in different color bands, as shown in Figure 5. The gray regions in the fused spectrum show where the two original orientation spectrums possess the same intensities. The colored regions, on the other hand, show where the two spectrums have different intensities. These colored regions play an important role in enhancing the discriminative capabilities during the feature extraction process. It can be visualized from the figure that the fused intrinsic mode function contains significant information in terms of data points, which leads to further improvements in the accuracy rates. Let X_Fusedori represent the fused orientation spectrum and ∗ represent the fusion operation. The fused orientation spectrum of all the subjects is then given as:

X_Fusedori = Σ_{i=1..N} (x(i)_ori1 ∗ x(i)_ori2)   (10)

Fig. 5: Orientation Fusion

The network is trained on these datasets separately in different experiments and their effects are studied on the overall performance of the system, establishing which dataset gives the best performance in terms of accuracy to discriminate and classify among the different classes.

The convolutional layers and sub-sampling layers of the convolutional neural network used in our system are represented as:

Convol_{i,j} = g(x_{i,j} ∗ W_{i,j} + B_{i,j})   (11)

Subsamp_{i,j} = g(↓ x_{i,j} ∗ w_{i,j} + b_{i,j})   (12)

The weight and bias deltas for the convolutional and sub-sampling layers are calculated as:

ΔW_{t,i} = lr Σ_{i=1..N} (x_i ∗ D_i^h) + m ΔW_{t−1,i}   (13)

ΔB_{t,i} = lr Σ_{i=1..N} D_i^h + m ΔB_{t−1,i}   (14)

We have used ReLU as the activation function in our framework; it is represented by g(.) in the above equations. The weight and bias vectors are represented by W and B. The inputs are convolved with the weight vectors of the network using a two dimensional convolution operation, represented by ∗ in the equations. The sub-sampling layer down-samples the given input. The range of the ReLU activation function goes from 0 to infinity, so it can model positive real numbers. It works much better for convolutional neural networks than the sigmoid function because its gradient does not vanish as the value of x increases.

The stochastic gradient descent and momentum terms used in the training of the network are given by:

W_{t+1} = W_t − α ∇L(θ_t)   (15)

V_{t+1} = ρ V_t − α ∇L(θ_t)   (16)

W_{t+1} = W_t + V_{t+1}   (17)

The softmax layer, which is the last layer of the network, is given by:

l(i, x_i^T) = M(e_i, f(x_i^T))   (18)

The proposed visual object recognition system is compute intensive as it is built upon convolutional neural networks, which require large training times. We have optimized the code and tuned the hyper-parameters to perform training in a reasonable amount of time.

The training process initiates by loading the dataset into memory. The initial parameters are initialized and the network configurations are loaded to start the training process, as shown in Algorithm 2. The dataset is divided into a number of mini-batches, as loading all the data into memory at once is not feasible. The size of the mini-batch depends on the settings of the network configuration. The mini-batches help in tackling the memory requirement issue. A mini-batch size of 12 is used in the proposed system, selected on the basis of experimentation.
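Returning to Equation (10), the fusion operator ∗ is left abstract in the text. Purely as an illustration, and not necessarily the authors' operator, the sketch below overlays the orientation surfaces of the first two IMFs by element-wise averaging, reusing the bemd and monogenic sketches given earlier.

```python
import numpy as np

def fuse_orientation(ori_imf1, ori_imf2):
    """Fuse the orientation spectrums of the first two IMFs (Eq. 10).
    Element-wise averaging is assumed here purely for illustration."""
    return (ori_imf1 + ori_imf2) / 2.0

def fused_training_set(images):
    """Build the fused orientation training set X_Fusedori for all subjects."""
    fused = []
    for img in images:
        imfs, _ = bemd(img, n_imfs=2)      # from the BEMD sketch above
        _, _, ori1 = monogenic(imfs[0])    # from the Riesz sketch above
        _, _, ori2 = monogenic(imfs[1])
        fused.append(fuse_orientation(ori1, ori2))
    return np.stack(fused)
```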

Fig. 6: Convolutional Neural Network Architecture

We have made use of N-dimensional arrays through Nd4j for Java [21]. Nd4j performs fast numerical computing for Java and consumes less memory. It also supports loading data into memory and training the network as two separate processes.

After loading the data into memory, it is normalized. A CNN works much better if the data is normalized, especially when it is trained with the stochastic gradient descent approach. In order to iterate over the data present in memory, a dataset iterator is created which draws the data from memory in a vectorized format. The vectorized format is necessary for CNN training. The dataset objects contain multiple training examples along with their labels. These examples and their labels are stored in the n-dimensional arrays.

We have adopted Local Response Normalization to aid generalization. Local Response Normalization simulates the behavior of actual neurons and generates a competition amongst neuron outputs. We have utilized max pooling in the pooling layer to perform sample based discretization, or downsampling, of an input representation (feature maps from the convolutional layer in our case). Max pooling decreases the dimensionality, reduces the number of parameters to learn and also cuts down the overall computational cost.

The value of the learning rate is selected to be 0.0001. It has been selected after a number of experiments. It was observed during the experimentation that a high learning rate can cause the model to diverge away from the minimum error, which can halt the learning process.

The input training examples are first filtered by 50 kernels having a dimension and stride of 192 x 168 x 1 and 1 x 1 respectively. Stride controls how depth columns around the spatial dimensions are assigned. The next layer then filters with 100 kernels having a dimension and stride of 5 x 5 and 1 x 1 respectively. The subsequent and preceding layers have kernels associated with each other through a nonZeroBias. The max-pooling layer which follows the convolutional layers has a dimension of 2 x 2. All these layers lead to a fully connected layer, whose neurons are connected to the neurons of the previous layer. The proposed CNN model architecture is shown in Figure 6.

IV. EXPERIMENTAL SETUP

The details of the experimental setup used to evaluate the proposed system are presented in this section. The proposed system is evaluated on the following performance characterizations: Accuracy, Precision, Recall and F1 Score. The generated results are further discussed with the help of a confusion matrix in terms of FalseNegatives, FalsePositives, TruePositives and TrueNegatives.

Fig. 7: Example Faces from Cropped Yale

TABLE I: Model Specifications

Number of Layers        5
Number of Epochs        40
Number of Iterations    1
OptimizationAlgo        SGD
Activation              RELU
Batch Size              12
Seed                    42
LearningRate            0.0001
Regularization          L2
Updater                 RMSPROP
Momentum                0.9

The self generated video dataset and the Cropped Yale [22] face database have been used to measure the efficiency of the proposed system. The Yale face database has been captured to mimic real world situations. Particular importance has been given to illumination effects, which occur frequently in real life scenarios. The database consists of a variety of human faces with diverse facial expressions, poses and illumination effects, as shown in Figure 7. There are 34 subjects, each with 60 samples for training and testing. The images are gray scale and captured at a resolution of 168x192 pixels. Every subject present in the database demonstrates illumination variations as well as variations in expressions. The self generated video dataset also contains the illumination, pose and facial expression challenges and is very similar to the Yale face database.
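The load-normalize-iterate pipeline described in this section can be mimicked with a small generator. The following Python sketch is a stand-in for the ND4J dataset iterator; the zero-mean/unit-variance scaling and the function name are assumptions, while the mini-batch size of 12 follows Table I.

```python
import numpy as np

def minibatch_iterator(images, labels, batch_size=12, shuffle=True):
    """Yield normalized, vectorized mini-batches with their labels,
    mirroring the role of the ND4J dataset iterator."""
    images = np.asarray(images, dtype=np.float32)
    # Zero mean / unit variance normalization, which suits SGD-based training
    images = (images - images.mean()) / (images.std() + 1e-8)
    idx = np.arange(len(images))
    if shuffle:
        np.random.shuffle(idx)
    for start in range(0, len(idx), batch_size):
        batch = idx[start:start + batch_size]
        # Vectorized format: one flattened 192*168 row per training example
        yield images[batch].reshape(len(batch), -1), np.asarray(labels)[batch]
```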

TABLE II: Model Configuration


Layer Info. CNN 1 CNN 2
Input Size 1 50
Layer Size 50 100
# Parameters 1300 125100
Weight Init. XAVIER XAVIER
Updater RMSPROP RMSPROP
Kernel Size [5,5] [5,5]
Stride [1,1] [5,5]
Padding [0,0] [1,1]
Activation relu relu

Fig. 8: Model Score vs. Iteration

TABLE III: Layer Configuration


Layer Info. Dense Layer Output Layer
Input Size 7200 500
Layer Size 500 34
# Parameters 3600500 17034
Weight Init. XAVIER XAVIER
Updater RMSPROP RMSPROP
Activation relu softmax

The architecture of the proposed convolutional neural network model is as follows. The network is built upon a total of five layers. There are two convolutional layers in the network, with 50 and 100 kernels respectively, and a nonZeroBias. There are two max pooling layers, one following each convolutional layer in the network. There is one dense layer followed by one output layer, and a momentum of 0.9 is selected. More detailed specifications of the model are shown in Table I.

Fig. 9: Parameter Ratios

The two convolutional layers have a layer size of 50 and 100 respectively. Each layer uses the root mean square prop (RMSPROP) updater and the rectified linear unit (ReLU) as its activation function. ReLU is the most commonly used activation function and is also known as the half rectified function. It has a range from zero to infinity and helps to train the network much faster as compared to other activation functions. The kernel size of both convolutional layers is [5,5]. More detailed specifications of both layers are listed in Table II.

The dense and output layers have a layer size of 500 and 34 respectively. Each layer uses the RMSPROP updater. The dense layer has ReLU as its activation function, but the output layer uses the softmax function for classification. More detailed specifications of these layers are listed in Table III.

In order to evaluate the proposed system, we have compared it with two well-known models, AlexNet and LeNet. The AlexNet model consists of 13 layers in total. There are five convolutional layers. Two local response normalization layers follow the first two convolutional layers. There are three subsampling layers, two of which follow the local response normalization layers. There are also two dense layers and an output layer. The LeNet model, on the other hand, has five layers with two convolutional layers, two max pooling layers and an output layer.

V. EXPERIMENTAL RESULTS

The results and discussion of the proposed system are presented in this section. The section is divided into three subsections. (i) Firstly, we describe the training of the proposed model and visualize the performance of the training parameters during model training. The visualization of weight vectors and other parameters during the training helps in tuning the parameters properly. (ii) Secondly, we present and discuss the results of the proposed system. We then make a comparison with the two existing models and measure the improvements in terms of Accuracy, Precision, Recall and F1 Score. A discussion of the performance characterization of the resultant confusion matrix in terms of True Positives, False Positives, True Negatives and False Negatives is also provided. (iii) Thirdly, we present the results on the scalability and performance of the system on the cloud based infrastructure.

Deep Learning Model Training: Figure 8 depicts the loss function value L(x) = LR Σ_{x_i ∈ X} Σ_{x_i ∈ T_i} l(i, x_i^T) over an increasing number of iterations per unit time. These values are computed on the current mini-batch. A decreasing trend in the graph can be observed over multiple iterations. A particularly rapid decline is observed between the completion of 1500 iterations and 3500 iterations. The graph then keeps on decreasing until the value of the loss function approaches very close to zero. This converging trend in the loss function graph shows that the model parameters, including the network weights, learning rate and regularization, are tuned properly.
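Tables I-III specify the model closely enough to sketch it end to end. The following PyTorch rendering is a stand-in for the paper's ND4J/Java implementation (the LRN window size, in particular, is assumed); note that for 192x168 inputs the two 5x5 convolutions and 2x2 poolings produce 100 x 9 x 8 = 7200 features, which matches the dense-layer input size reported in Table III.

```python
import torch
import torch.nn as nn

class OrientationFusionCNN(nn.Module):
    """Sketch of the five-layer model in Tables I-III: two 5x5 convolutions
    (50 and 100 kernels) each followed by 2x2 max pooling, a 500-unit dense
    layer and a 34-way softmax output."""
    def __init__(self, n_classes=34):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 50, kernel_size=5, stride=1, padding=0),    # CNN 1 (Table II)
            nn.ReLU(),
            nn.LocalResponseNorm(size=5),   # LRN per the text; window size assumed
            nn.MaxPool2d(2),
            nn.Conv2d(50, 100, kernel_size=5, stride=5, padding=1),  # CNN 2 (Table II)
            nn.ReLU(),
            nn.MaxPool2d(2),
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(7200, 500),          # 100 x 9 x 8 = 7200 (Table III)
            nn.ReLU(),
            nn.Linear(500, n_classes),     # softmax is applied inside the loss
        )

    def forward(self, x):
        return self.classifier(self.features(x))

# One training step with the hyper-parameters of Table I (RMSPROP, lr 0.0001,
# momentum 0.9, L2 regularization, mini-batch 12); data here is random filler.
model = OrientationFusionCNN()
opt = torch.optim.RMSprop(model.parameters(), lr=1e-4, momentum=0.9,
                          weight_decay=5e-4)
x = torch.randn(12, 1, 192, 168)
y = torch.randint(0, 34, (12,))
loss = nn.CrossEntropyLoss()(model(x), y)
opt.zero_grad(); loss.backward(); opt.step()
```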
We have selected the learning rate for the proposed system on the basis of a number of experiments. We have tested and visualized the model training for different learning rate values. The learning rate value of 0.0001 helped the network to converge more rapidly than the other values tested, such as 1e-6. The normalization of the data is performed properly with L2 normalization and stochastic gradient descent W_{t+1} = W_t − α ∇L(θ_t), which is depicted by the decreasing trend of the graph.

Figure 9 depicts the parameter ratios for each weight vector of each layer. These ratios are the mean magnitudes of the layer parameters, i.e. the average value of the parameters over a number of iterations, shown on the y-axis of the graph. These mean magnitude values are suggested to lie between -3.0 and -4.0 on a log10 chart during network training. Figure 9 shows that the values remain within the suggested range, indicating appropriate initialization of all network hyper-parameters. If the graph diverges away from the suggested range during the training, it indicates that the parameter initialization and selection is unstable and the model remained unable to learn the required distinguishing features from the training dataset.

Fig. 10: Gradients

Fig. 11: Activations

Figure 10 and Figure 11 depict the layer gradients and layer activations for all the weight vectors of each layer. It is observed during training from both graphs that the layer gradients and activations tend to stabilize after some iterations. This indicates that the network is not exposed to the problem of exploding gradients. A stable trend in both graphs is also a depiction of proper initialization of weights and the selection of a good regularization scheme, i.e. (λ/2) Σ_i θ_i². The values of λ ranged between 5 * 1e-2 and 5 * 1e-8, but 5 * 1e-4 remained vital while training.

We have also generated histograms of the layer parameters and layer updates. The layer updates are obtained when the learning rate, regularization and momentum V_{t+1} = ρ v_t − α ∇L(θ_t) (ρ is varied from 0.6 to 0.9, with 0.9 being the best for training) are applied. We have observed a normal Gaussian distribution in both of the histograms, depicting that there is sufficient regularization in the network and that it is free from the exploding gradient problem. We believe that this is due to the addition of gradient normalization in the network.

Performance of the Deep Learning Model: The classifier's performance is measured using the following performance characterizations: Accuracy, Recall, Precision and F1 score. We have calculated these performance measures on the first three IMFs for the amplitude, phase and orientation components and compared them to identify the best performing components. The component which contributes most to improving the accuracy of the classifier is then further used for fusion. The number of epochs during the training of the classifier has been varied from 5 to 40. We have also calculated the training time of the classifier for each epoch to have an estimate of the total training time. A detailed discussion of all the results is presented with the help of a confusion matrix. To show the efficacy of the proposed classifier, it has been compared with two of the most famous state-of-the-art CNN models.

Table IV shows the performance of the classifier for the amplitude component. The accuracy, recall, precision and F1 scores are tabulated for the first three IMFs. The number of epochs has been varied from 5 to 40. It can be seen from the table that the amplitude component could not perform well and remained unable to classify the test patterns. Even with the increasing number of epochs, it remained unable to improve the accuracy of the classifier significantly. The same performance trend was observed for precision, recall and F1 score for all the epochs. We believe that the reason for such poor performance is the scarce availability of data points in the amplitude component. It can be observed from Figure 4 that the amplitude component does not provide the significant number of data points required to perform an accurate classification. Along with the reduction in the noisy components, it also discards the useful data points which could aid in performing an accurate classification. This results in the drop of the overall accuracy rate of the classifier.
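The parameter-ratio diagnostic of Figure 9 is easy to reproduce for any model. The sketch below, written against the PyTorch stand-in given earlier and assuming an SGD-style update magnitude, reports log10 of the mean update-to-parameter magnitude per layer; healthy values should sit in the -3.0 to -4.0 band cited above.

```python
import torch

def update_ratios(model, lr):
    """log10 of mean |update| / mean |parameter| per layer, the diagnostic
    plotted in Figure 9. Call after loss.backward(), before opt.step()."""
    ratios = {}
    for name, p in model.named_parameters():
        if p.grad is None:
            continue
        update = lr * p.grad.abs().mean()   # SGD-style update magnitude
        ratios[name] = torch.log10(update / (p.abs().mean() + 1e-12)).item()
    return ratios
```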

TABLE IV: Performance Measures of the Amplitude Component


IMF1 Amplitude IMF2 Amplitude IMF3 Amplitude
Epochs Acc Pre Rec F1 Acc Pre Rec F1 Acc Pre Rec F1
5 0.0191 0.0088 0.0191 0.0121 0.0206 0.0117 0.0206 0.0149 0.0162 0.0051 0.0162 0.0878
10 0.0147 0.0087 0.0147 0.0109 0.0235 0.0149 0.0235 0.0183 0.0235 0.0105 0.0235 0.0145
15 0.0162 0.0068 0.0162 0.0096 0.025 0.1108 0.025 0.0408 0.0279 0.0887 0.0279 0.0425
20 0.0294 0.1538 0.0294 0.0494 0.025 0.0119 0.025 0.0161 0.0309 0.057 0.0309 0.0401
25 0.0441 0.2107 0.0441 0.073 0.0279 0.0217 0.0279 0.0244 0.0324 0.0363 0.0324 0.0342
30 0.0588 0.2448 0.0588 0.0949 0.0471 0.152 0.0471 0.0719 0.0324 0.1225 0.0324 0.0512
35 0.075 0.2534 0.075 0.1157 0.0662 0.2121 0.0662 0.1009 0.0324 0.0396 0.0324 0.0356
40 0.0941 0.317 0.0941 0.1451 0.0794 0.1992 0.0794 0.1136 0.0338 0.0393 0.0338 0.0364

TABLE V: Performance Measures of the Phase Component


IMF1 Phase IMF2 Phase IMF3 Phase
Epochs Acc Pre Rec F1 Acc Pre Rec F1 Acc Pre Rec F1
5 0.0485 0.0569 0.0485 0.0524 0.0868 0.279 0.0868 0.1324 0.1191 0.3468 0.1191 0.1773
10 0.0912 0.3333 0.0912 0.1432 0.2441 0.6559 0.2441 0.3558 0.2353 0.4541 0.2353 0.31
15 0.1868 0.6105 0.1868 0.286 0.4721 0.7504 0.4721 0.5796 0.2971 0.4449 0.2971 0.3562
20 0.3309 0.7582 0.3309 0.4607 0.6412 0.7871 0.6412 0.7867 0.3338 0.4481 0.3338 0.3826
25 0.4588 0.7824 0.4588 0.5784 0.725 0.81 0.725 0.7651 0.3559 0.4513 0.3559 0.3979
30 0.5838 0.8142 0.5838 0.68 0.7691 0.8225 0.7691 0.7949 0.3853 0.4785 0.3853 0.4269
35 0.6676 0.8135 0.6676 0.7334 0.8015 0.835 0.8015 0.8179 0.4132 0.5164 0.4132 0.4591
40 0.7235 0.8132 0.7235 0.7658 0.8191 0.8436 0.8191 0.8312 0.4235 0.5307 0.4235 0.4711

Table V shows the performance of the classifier for the phase component. The accuracy, recall, precision and F1 scores are again calculated for the first three IMFs with the number of epochs varying from 5 to 40. It can be seen from the table that the phase component performed much better in classifying the test patterns as compared to the amplitude component. The increasing number of epochs improved the accuracy of the classifier. Especially after 20 epochs the accuracy improved significantly, and at epoch number 40 it reached an accuracy of 72 percent. Improved performance rates were observed for precision, recall and F1 scores as well. One reason for the better performance could be the availability of a much higher number of data points for the phase component as compared to the amplitude component, as can be seen from Figure 4. Although the phase component of the third intrinsic mode function does not contain enough data points, the first two remained enough to achieve a decent classification rate.

The performance of the classifier for the orientation component is depicted in Table VI. The same performance measures, accuracy, precision, recall and F1 scores, are tabulated for the first three IMFs of the orientation component. The number of epochs is again varied from 5 to 40 for this set of experiments. A significant improvement in the overall accuracy rate of the classifier has been observed as compared to both the amplitude and phase components. Even at epoch number 10 the classification accuracy started at a reasonable rate of 0.3015 as compared to amplitude and phase, and kept on improving to 0.84 till the 40th epoch. Improved performance rates were observed for precision, recall and F1 scores as well. The precision was recorded to be 0.9047 at the 40th epoch. Similarly, the recall and F1 scores are recorded to be 0.84 and 0.93 respectively, which are quite good rates as compared to the other two components. We found that these improvements are due to the maximum availability of data points and the minimum presence of noisy frequencies in the orientation component. It can be seen clearly in Figure 4 that even the third intrinsic mode function kept a reasonable number of data points, which contributed towards the improvements in the accuracy rates.

Inspired by this fact, we have performed a feature fusion strategy for the orientation component to further improve the accuracy rates. The feature fusion strategy is performed on the first two intrinsic mode functions of the orientation component. The two intrinsic mode functions are fused together in order to obtain a composite intrinsic mode function which holds the properties of both of the IMFs. The fused IMF is a numeric matrix which represents the combined orientation component and is used for classification.

Table VII shows the performance of the classifier on the fused intrinsic mode function. The number of epochs is varied from 5 to 40, and the accuracy, precision, recall and F1 scores are recorded and tabulated in the table. It can be seen that the overall accuracy rate of the fused IMF is greater than all the previous accuracy numbers for the amplitude, phase and orientation components. At epoch number 10 the classification accuracy was recorded to be 0.53, which is significantly better than the amplitude, phase and orientation components. The accuracy kept on improving to 0.97 till the 40th epoch. Improvements were also observed for precision, recall and F1 scores. The precision was recorded to be 0.9806 at the 40th epoch. Similarly, the recall and F1 scores are recorded to be 0.97 and 0.98 respectively. We believe that these improvements are due to the further addition of data points in the orientation component and the reduced presence of noisy frequencies. It can be visualized in Figure 5 that the fused intrinsic mode function contains significant information in terms of data points, which leads to further improvements in the accuracy rates.
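For reference, the four measures tabulated in this section all follow from the confusion matrix in the standard way. A compact NumPy sketch is given below; it macro-averages over classes, since the authors' exact averaging scheme is not stated.

```python
import numpy as np

def scores(cm):
    """Accuracy, precision, recall and F1 from a confusion matrix
    (rows = true class, columns = predicted class), macro-averaged."""
    tp = np.diag(cm).astype(float)
    precision = tp / np.maximum(cm.sum(axis=0), 1)   # TP / (TP + FP)
    recall = tp / np.maximum(cm.sum(axis=1), 1)      # TP / (TP + FN)
    f1 = 2 * precision * recall / np.maximum(precision + recall, 1e-12)
    accuracy = tp.sum() / cm.sum()
    return accuracy, precision.mean(), recall.mean(), f1.mean()
```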

TABLE VI: Performance Measures of Orientation Component


IMF1 Orientation IMF2 Orientation IMF3 Orientation
Epochs Acc Pre Rec F1 Acc Pre Rec F1 Acc Pre Rec F1
5 0.1368 0.2477 0.1368 0.1762 0.2397 0.3485 0.2397 0.284 0.1574 0.3061 0.1574 0.2079
10 0.3015 0.6058 0.3015 0.4026 0.5838 0.6945 0.5838 0.6344 0.2956 0.4697 0.2956 0.3628
15 0.5162 0.747 0.5162 0.6105 0.7544 0.8146 0.7544 0.7834 0.4044 0.5224 0.4044 0.4559
20 0.6868 0.8166 0.6868 0.7461 0.8426 0.8668 0.8426 0.8546 0.4529 0.5561 0.4529 0.4992
25 0.7882 0.8542 0.7882 0.8199 0.8786 0.8974 0.8786 0.8838 0.4985 0.5641 0.4985 0.5293
30 0.825 0.8983 0.825 0.8601 0.9015 0.9212 0.9015 0.9112 0.5309 0.5734 0.5309 0.5513
35 0.8426 0.9054 0.8426 0.8729 0.925 0.9377 0.925 0.9313 0.5426 0.5857 0.5426 0.5633
40 0.8441 0.9047 0.8441 0.8734 0.9338 0.9414 0.9338 0.9376 0.5765 0.6116 0.5765 0.5935

TABLE VII: Performance Measures of Orientation Fusion

Epochs  Accuracy  Precision  Recall  F1 Score
5       0.1897    0.5317     0.1897  0.2796
10      0.5338    0.7673     0.5338  0.6296
15      0.8044    0.8611     0.8044  0.8318
20      0.9103    0.9272     0.9103  0.9187
25      0.9544    0.9605     0.9544  0.9574
30      0.9676    0.9705     0.9676  0.9691
35      0.9779    0.9791     0.9779  0.9785
40      0.9794    0.9806     0.9794  0.98

Table VIII shows the overall performance of the proposed system with the help of a confusion matrix in terms of Accuracy, Precision, Recall and F1 score. The overall accuracy of the system with the fused features is recorded to be 0.9794. The confusion matrix depicts the precision of the system to be 0.9806, which shows that the proposed system is accurate as well as precise. The recall and F1 scores of the system are observed to be 0.9794 and 0.98 respectively. Most of the test samples from all the subjects were classified correctly by the classifier, as depicted in the confusion matrix. There were a few samples from some subjects that were misclassified. We believe that this misclassification is due to the severe illumination effects in those test samples. We have also compared the performance of the proposed system with the state-of-the-art, well known deep learning models AlexNet [9] and LeNet [23]. Figures 12, 13, 14 and 15 demonstrate and compare the performance improvements of the proposed system over the AlexNet and LeNet models. As can be seen from Figure 12, the proposed orientation fusion approach provides much higher accuracy rates as compared to AlexNet and LeNet. From the very start, at epoch number 5, the accuracy of the system is recorded to be 0.2 while the accuracy of AlexNet and LeNet was below 0.1. The accuracy kept on improving with the increasing number of epochs. At epoch number 25, a significant improvement can be observed in the graph as compared to the other two models, and it continued till the last epoch.

A similar kind of behavior can be observed in the precision, recall and F1 scores, as shown in Figures 13, 14 and 15. The precision of the system started from 0.55 at epoch 5 and showed a linear improvement over the increasing number of epochs. Rapid improvements were observed till epoch number 25 in the precision curve, which then kept on improving gradually till the last epoch. The recall and F1 score curves depict a similar trend and show significant improvements as compared to AlexNet and LeNet. AlexNet performed a bit better than LeNet till epoch number 30, but it could not get better than a score of 0.3. A drop in the recall curve for AlexNet has been observed after 30 epochs. The same trend was observed for the F1 score curve.

The proposed orientation fusion approach showed significant improvements over the two models on a challenging dataset. The images present in the publicly available Yale Face Database have significant variations in expressions, pose and illumination conditions. The illumination conditions impose effects from different angles. Furthermore, the expressions of the individuals present in the database varied from normal to happy, sad and sleepy. It was observed from the results that the proposed system is superior to the state of the art in tackling these challenges.

We believe that the reason behind these improvements is that the illumination effects are present in the low frequency components of the spectrum. The EMD separates the images into individual intrinsic mode functions in decreasing order. It then becomes easy to discard the low frequency components from the image and retain only the high frequency components. The fusion of the two intrinsic mode functions containing the highest frequencies is sufficient to correctly classify most of the training samples with a high accuracy rate and precision.

Scalability of the System: In order to test the scalability and performance of the proposed system, we have executed it on a Spark based in-memory cloud infrastructure. The cloud infrastructure helps to parallelize the proposed approach by executing chunks of data (subsets) on multiple nodes of the cloud. These subsets of data are transferred to neural network models executing on each node of the cloud. This helps to perform training of the deep convolutional network in parallel on multiple nodes of the cloud.

The cloud infrastructure used in this work to perform the training of the proposed system has one master node working with eight workers. The total dataset size used to test the scalability of the system is varied from 10 GB to 100 GB. The dataset is divided into subsets, and each subset of data is further divided into a number of mini-batches. Each worker works on each mini-batch.

TABLE VIII: Confusion Matrix


Classification Scores
Accuracy 0.9794
Precision 0.9806
Recall 0.9794
F1 Score 0.98
{0=[0 x 19, 18], 1=[1 x 20], 2=[16, 2 x 18, 26], 3=[3 x 19, 31], 4=[4 x 20], 5=[5 x 19, 13], 6=[6 x 20], 7=[7 x 20],
8=[8 x 19, 30], 9=[9 x 20], 10=[16, 10 x 19], 11=[11 x 19, 27], 12=[12 x 20], 13=[26, 13 x 19], 14=[14 x 20],
15=[15 x 20], 16=[16 x 20], 17=[17 x 20], 18=[18 x 20], 19=[19 x 20], 20=[20 x 20], 21=[20, 21 x 19],
22=[22 x 20], 23=[23 x 20], 24=[24 x 20], 25=[25 x 20], 26=[26 x 20], 27=[27 x 20], 28=[28 x 20],
29=[24, 29 x 19], 30=[30 x 20], 31=[12 x 2, 31 x 18], 32=[32 x 20], 33=[33 x 19, 7]}

Fig. 12: Accuracy

Fig. 13: Precision

Fig. 14: Recall

Fig. 15: F1 Score

The dataset is exported to the distributed file system in a batched and serialized form.

The iterative map-reduce framework has been utilized to perform the training of the system. Since the training of the convolutional neural network is an iterative process, we have made use of iterative map-reduce instead of simple map-reduce. Iterative map-reduce executes multiple passes of map-reduce operations and is much more suitable for convolutional neural networks. The iterative nature of the network allows a sequence of map-reduce operations to be performed in a cascaded way. Furthermore, N-dimensional arrays have been utilized to store the pixel values. As the convolutional neural network is based on numerical computations, the use of N-dimensional arrays helps to perform the numerical computations much quicker and also requires less memory.

The data is normalized before performing the training. The convolutional neural networks perform better with normalized data as the training is based on the stochastic gradient descent learning approach. The stochastic gradient descent approach keeps the activations in the normalized range during training, which in turn improves the performance. In order to iterate over the dataset objects, a dataset iterator is created. The dataset iterator iterates over the dataset objects during training and fetches the data in vectorized form along with the labels. These are then stored in the N-dimensional arrays along with their labels.

During the training process, the master node of the cloud loads the initial parameters and configurations. The master node then distributes the subsets of data to the worker nodes along with the initialization parameters. Each worker node then trains a partial model on its subset. Each model on a worker node is trained on different shards of the data in the form of mini-batches.

Fig. 16: Data Bundle Time

Fig. 17: Data Transfer Time

The results from each partial model are then collected on the master node and are averaged using parameter averaging. This approach is quite useful in our case as the number of worker nodes is small and the parameters to be estimated are also small. The parameter averaging is performed by obtaining the gradient of each mini-batch from all the nodes. After the completion of the training process, the master node holds the fully trained model.
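The collect-and-average loop described above is classic iterative parameter averaging. Below is a hedged PySpark sketch of one way to express it; the actual system uses the Java Spark API with ND4J, and train_partial is a hypothetical worker-side routine that trains on one shard and returns its per-layer weight arrays.

```python
import numpy as np
from pyspark import SparkContext

def parameter_averaging(sc, shards, init_weights, epochs, train_partial):
    """Iterative map-reduce training: the master broadcasts the current
    weights, each worker trains a partial model on its shard, and the
    resulting weights are averaged back on the master (one pass per epoch)."""
    weights = init_weights
    rdd = sc.parallelize(shards, numSlices=8)   # eight workers in this setup
    for _ in range(epochs):
        bc = sc.broadcast(weights)
        partials = rdd.map(lambda shard: train_partial(shard, bc.value)).collect()
        # Average each layer's weights across the partial models
        weights = [np.mean(layer, axis=0) for layer in zip(*partials)]
    return weights

# Usage assumes an existing context, e.g. sc = SparkContext(appName="cnn-training")
```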
An important parameter to adjust is the rate at which parameter averaging is performed. This is a critical parameter and can severely degrade the performance of the network training if it is selected too low: a low parameter averaging rate causes parameter initialization overhead and also delays the network communication. On the other hand, setting it to a very high value can also degrade the performance. Another critical parameter is the data repartitioning rate. It is a key parameter and, if selected correctly, helps to use the cloud resources efficiently. We have selected a mini-batch size of 16 and a repartition rate of 0.6 on the basis of experimentation.
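The repartitioning rate maps naturally onto an explicit Spark repartition. The snippet below is one illustrative interpretation of the 0.6 repartition rate, read as a fraction of the default parallelism; the paper does not define the rate precisely, so this reading is an assumption.

```python
def repartition_for_workers(rdd, sc, rate=0.6):
    """Illustrative reading of the repartition rate: keep roughly
    rate * defaultParallelism partitions so shards map evenly onto workers."""
    target = max(1, int(sc.defaultParallelism * rate))
    return rdd.repartition(target)

# Example: shards_rdd = repartition_for_workers(shards_rdd, sc)
```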
We have mainly focused on three parameters to measure the performance of the system in terms of scalability: i) the total time needed to transfer different sizes of dataset to the cloud storage; ii) the average execution time with varying dataset sizes; and iii) the average execution time with varying numbers of cloud nodes. The total size of the dataset is varied from 5GB to 100GB to measure the scalability of the system. This dataset consists of an image database as well as a number of video streams containing multiple subjects. The video streams are decoded in order to obtain the individual video frames from each video stream. The number of decoded frames depends on the size of each video stream. We have further utilized a batch process in order to bundle the large number of individual decoded video frames and images. The iterative map-reduce framework works better with large bundled data as compared to individual small chunks of data. The time required by the batch process to bundle the data depends on the size of the total dataset. Figure 16 shows the time taken by the batch process on different sizes of the dataset. It can be seen from the figure that for a dataset size ranging from ten to hundred gigabytes, it takes almost 0.25 to 3.8 hours.

The time required to transfer data from local storage to cloud storage depends on the amount of data being transferred. There are two main factors that contribute to the total data transfer time, i.e. the bandwidth of the network and the block size of the cloud data storage. In order to estimate the data transfer time to the cloud nodes, we have measured this time for different sizes of data varying from 20GB to 100GB. The data transfer time for each dataset size is plotted in Figure 17. It can be seen from the plot that the data transfer time is directly proportional to the size of the data being transferred. For a dataset size of 20 GB, it requires almost 0.36 hours. It takes almost two hours and eighteen minutes to transfer 100 GB of data to the cloud storage. So the total data transfer time for a dataset size ranging from 20 GB to 100 GB is 0.36 to 2.18 hours. An increase in the dataset size will also increase the data transfer time, but this is a one-time process as the data can be retained in the cloud storage for later use.

Fig. 18: Analysis Time with Varying Cloud Nodes

To estimate the average execution time of the system on multiple nodes of the cloud infrastructure, we have varied the number of nodes and measured the total execution time. Multiple experiments have been performed with an increasing number of nodes in the cloud. The average execution time of the system with an increasing number of nodes gives an estimate of the amount of time required for system execution with multiple nodes. The analysis time taken by the system with multiple nodes is plotted in Figure 18. This helps to establish how many nodes are required to execute the system in a reasonable time frame. A decreasing trend in the analysis time is observed when increasing the number of workers on each node. More nodes can be added to or removed from the cluster to reduce or increase the total execution time. However, decreasing the number of nodes in the cloud increases the amount of analysis tasks on each node and can degrade the overall performance.

We have also measured the average execution time of the system with multiple dataset sizes on the cloud infrastructure.

Fig. 19: Average Execution Time

The average execution time taken by the system for each dataset size is plotted in Figure 19. This plot helps to understand how much time the system requires for execution on a specific dataset size. It can be seen from the figure that it takes almost 1.45 hours for the system to execute on a dataset size of 20GB. An increase in the dataset size has a direct impact on the execution time. The execution time increases to 7.29 hours for a dataset size of 100GB. We have also measured the execution time of the system with an increased block size. This experiment was performed to estimate how much effect the block size has on the total execution time. The default block size of 128MB was changed to 256MB and the same set of experiments was repeated. It can be seen from Figure 19 that it then took 1.43 hours for the system to execute on a dataset size of 20GB. The execution time decreased to 6.8 hours for a dataset size of 100GB. However, this is not a very big difference in the execution time, which shows that the variation in block size has a minor impact on the execution time of the system.

VI. CONCLUSION AND FUTURE WORK

An illumination and expression invariant video analytics system for visual object recognition has been proposed. It tackles the problem of illumination and expression variance by proposing a feature fusion strategy based on the orientation component of the intrinsic mode functions (IMFs). The IMFs are generated by leveraging bi-dimensional empirical mode decomposition, and the first order Riesz transform is exploited to produce the orientation component. The fused IMFs are further analyzed using convolutional neural networks.

The performance of the proposed system is demonstrated by experimentation on publicly available datasets and compared with two state-of-the-art models, namely AlexNet and LeNet. It has been observed that the orientation component of the objects leads to a higher accuracy of 93 percent. The proposed feature fusion strategy of the orientation spectrums further improves the accuracy to 97 percent. The precision and recall of the system are recorded to be 98 and 97 percent respectively. The system proved to be highly accurate and precise and outperformed AlexNet and LeNet under uncontrolled conditions.

We aim to enhance the capabilities of the proposed system in the future so that it can cope with other challenges including rotation and translation variance. We also aim to execute the proposed system on multiple nodes of a GPU-based cloud infrastructure. Such an infrastructure will help to manage the complexity of the model and will facilitate experiments on a much bigger dataset.

REFERENCES

[1] Z. Wu and N. E. Huang, "Ensemble empirical mode decomposition: A noise-assisted data analysis method," Advances in Adaptive Data Analysis, vol. 01, no. 01, pp. 1-41, 2009. [Online]. Available: https://www.worldscientific.com/doi/abs/10.1142/S1793536909000047
[2] R. C. Sharpley and V. Vatchev, "Analysis of the intrinsic mode functions," Constructive Approximation, vol. 24, no. 1, pp. 17-47, 2006.
[3] M. Unser and D. Van De Ville, "Wavelet steerability and the higher-order Riesz transform," IEEE Transactions on Image Processing, vol. 19, no. 3, pp. 636-652, 2010.
[4] "Spark," accessed: 2018-06-21.
[5] M. U. Yaseen, A. Anjum, O. Rana, and R. Hill, "Cloud-based scalable object detection and classification in video streams," Future Generation Computer Systems, vol. 80, pp. 286-298, 2018. [Online]. Available: http://www.sciencedirect.com/science/article/pii/S0167739X17301929
[6] J. Li, X. Liang, S. Shen, T. Xu, J. Feng, and S. Yan, "Scale-aware fast R-CNN for pedestrian detection," IEEE Transactions on Multimedia, vol. 20, no. 4, pp. 985-996, April 2018.
[7] J. Li, Y. Wei, X. Liang, J. Dong, T. Xu, J. Feng, and S. Yan, "Attentive contexts for object detection," IEEE Transactions on Multimedia, vol. 19, no. 5, pp. 944-954, May 2017.
[8] X. Liu, W. Liu, T. Mei, and H. Ma, "PROVID: Progressive and multimodal vehicle reidentification for large-scale urban surveillance," IEEE Transactions on Multimedia, vol. 20, no. 3, pp. 645-658, March 2018.
[9] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "ImageNet classification with deep convolutional neural networks," in Advances in Neural Information Processing Systems, 2012, pp. 1097-1105.
[10] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, "Gradient-based learning applied to document recognition," Proceedings of the IEEE, vol. 86, no. 11, pp. 2278-2324, 1998.
[11] Y. LeCun, C. Cortes, and C. Burges, "MNIST handwritten digit database," AT&T Labs [Online]. Available: http://yann.lecun.com/exdb/mnist, vol. 2, 2010.
[12] A. Wang, J. Lu, J. Cai, T. J. Cham, and G. Wang, "Large-margin multi-modal deep learning for RGB-D object recognition," IEEE Transactions on Multimedia, vol. 17, no. 11, pp. 1887-1898, Nov 2015.
[13] J. Tang, L. Jin, Z. Li, and S. Gao, "RGB-D object recognition via incorporating latent data structure and prior knowledge," IEEE Transactions on Multimedia, vol. 17, no. 11, pp. 1899-1908, Nov 2015.
[14] S. S. Mukherjee and N. M. Robertson, "Deep head pose: Gaze-direction estimation in multimodal video," IEEE Transactions on Multimedia, vol. 17, no. 11, pp. 2094-2107, Nov 2015.
[15] S. Ehsan, S. M. U. Abdullah, M. J. Akhtar, D. P. Mandic, K. D. McDonald-Maier et al., "Multi-scale pixel-based image fusion using multivariate empirical mode decomposition," Sensors, vol. 15, no. 5, pp. 10923-10947, 2015.
[16] A. Linderhed, "Image empirical mode decomposition: A new tool for image processing," Advances in Adaptive Data Analysis, vol. 1, no. 02, pp. 265-294, 2009.
[17] Z. Liu and S. Peng, "Directional EMD and its application to texture segmentation," Science in China Series F: Information Sciences, vol. 48, no. 3, p. 354, 2005.
[18] M. U. Yaseen, A. Anjum, and N. Antonopoulos, "Spatial frequency based video stream analysis for object classification and recognition in clouds," in 2016 IEEE/ACM 3rd International Conference on Big Data Computing Applications and Technologies (BDCAT), Dec 2016, pp. 18-26.
[19] E. Deléchelle, J. Lemoine, and O. Niang, "Empirical mode decomposition: an analytical approach for sifting process," IEEE Signal Processing Letters, vol. 12, no. 11, pp. 764-767, 2005.
[20] U. G. Mangai, S. Samanta, S. Das, and P. R. Chowdhury, "A survey of decision fusion and feature fusion strategies for pattern classification," IETE Technical Review, vol. 27, no. 4, pp. 293-307, 2010.
[21] "Nd4j," accessed: 2018-01-19.
[22] Yale, "The extended Yale face database B (cropped)," 2001, accessed: 2018-01-19. [Online]. Available: http://vision.ucsd.edu/~iskwak/ExtYaleDatabase/ExtYaleB.html
[23] Y. LeCun et al., "LeNet-5, convolutional neural networks," URL: http://yann.lecun.com/exdb/lenet, p. 20, 2015.