Scalable Video Analytics
Abstract—Visual object recognition from live video streams comes with numerous challenges such as variations in illumination conditions and poses. Convolutional neural networks have been used by emerging multimedia and computer vision applications to perform intelligent visual object recognition; however, their accuracy severely degrades when they are applied to illumination-variant video datasets. To address this problem, we propose an orientation fusion based visual object recognition system using convolutional neural networks. The proposed cloud based video analytics system leverages bi-dimensional empirical mode decomposition (BEMD) to split an image into intrinsic mode functions (IMFs). These intrinsic mode functions undergo the Riesz transform to produce monogenic object components, which are then used for the training of convolutional neural networks. It has been observed that the orientation component of the object leads to a higher accuracy of 93 percent. We further propose a feature fusion strategy for the orientation components which improves the visual recognition accuracy to 97 percent. The proposed cloud based video analytics system has been demonstrated to process a large number of video streams, and the underlying infrastructure is able to scale based on the number and size of the video stream(s) being processed. Extensive experimentation on publicly available image and video datasets reveals that the proposed system is significantly more accurate and scalable than AlexNet and LeNet, two of the most commonly used deep learning models for visual object recognition.

Index Terms—Multimedia Data Analytics; Convolutional Neural Networks; Scalable Object Recognition; Cloud Computing

I. INTRODUCTION

are closer to real-life situations. These datasets contain a number of challenges including illumination, blur and noise. The expression and illumination challenges, when combined, can particularly vary the appearance of an object. These variations can be so intense that it becomes difficult even for humans to recognize the objects. Such challenges need to be tackled with intelligent video processing methods to implement highly accurate visual object recognition systems.

Convolutional Neural Networks (CNNs) have recently been used to perform visual object recognition on video datasets. CNNs have proved successful on a number of object detection and classification tasks on large video datasets. They also have good generalization capability and can be trained on large-scale video datasets belonging to different classes. However, CNNs struggle to perform well on challenging datasets, and their accuracy drops severely, especially in the case of expression and illumination variant datasets.

In order to achieve high visual object recognition accuracy on challenging datasets, we propose an empirical mode decomposition (EMD) [1] based implementation of CNNs. We split the input video dataset, comprising images and videos, into its intrinsic mode functions (IMFs) [2] by using EMD. The Riesz transform [3] is then applied to the resulting IMFs to generate the monogenic components. The local monogenic components, including phase, orientation and amplitude, are then analyzed to determine which of these components contribute to a higher accuracy rate with the CNNs. Figure 1 depicts the workflow of our proposed visual object recognition system. It has been observed through experiments in Section VI
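As an illustrative sketch of this decomposition step, the Python fragment below derives the three local monogenic components of a single IMF through a frequency-domain first-order Riesz transform. The function name and filter discretization are our own assumptions; the fragment follows one standard formulation of the monogenic signal rather than our exact implementation.

    import numpy as np

    def monogenic_components(imf):
        # Frequency grids for the Riesz transfer functions
        rows, cols = imf.shape
        u = np.fft.fftfreq(rows).reshape(-1, 1)
        v = np.fft.fftfreq(cols).reshape(1, -1)
        radius = np.sqrt(u ** 2 + v ** 2)
        radius[0, 0] = 1.0                 # avoid division by zero at DC
        F = np.fft.fft2(imf)
        # First-order Riesz transform pair of the IMF
        r1 = np.real(np.fft.ifft2(F * (-1j * u / radius)))
        r2 = np.real(np.fft.ifft2(F * (-1j * v / radius)))
        amplitude = np.sqrt(imf ** 2 + r1 ** 2 + r2 ** 2)
        phase = np.arctan2(np.sqrt(r1 ** 2 + r2 ** 2), imf)
        orientation = np.arctan2(r2, r1)   # component later fed to the CNN
        return amplitude, phase, orientation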
and regressor based models were fine-tuned and combined to estimate the confidence for regression.

Recent research has shown that convolutional neural networks work well on image or video data of good quality. However, the accuracy of a convolutional neural network severely degrades when it is applied to a challenging dataset containing variations such as illumination and noise. We have tackled this issue by using empirical mode decomposition [1] and shifting the data from the time domain to the spatial-frequency domain. EMD has been used in the past to perform recognition and classification. Ehsan et al. [15] proposed an image fusion and enhancement technique using EMD to decompose non-stationary signals into IMFs [2]. Linderhed [16] proposed image empirical mode decomposition (IEMD) to locally separate superimposed spatial frequencies from an image. Liu and Peng [17] presented a two-dimensional EMD to extract the local features of the two-dimensional intrinsic mode functions for edge detection. Yaseen et al. [18] pioneered the use of EMD on video data in a parallel and distributed system. They used the first three IMFs and proposed a stack based hierarchy to perform object classification on a challenging dataset. However, all of these works used EMD with shallow networks and did not exploit EMD for deep networks.

III. OBJECT RECOGNITION APPROACH AND IMPLEMENTATION

This section describes the approach of our proposed visual object recognition system, as shown in Figure 2. The input training dataset “X” is represented by:

X = x1, x2, . . . , xn    (1)

Here, “x1, x2, . . . , xn” represent the individual subjects present in the training database. Each individual subject in the training database consists of a number of training samples, given by:

x1 = i1, i2, . . . , in
x2 = i1, i2, . . . , in
        ...
xn = i1, i2, . . . , in    (2)

where “i1, i2, . . . , in” represent the individual images of each subject present in the training dataset. Each training sample “i” of each individual subject “x” undergoes bi-dimensional empirical mode decomposition (BEMD) to be decomposed into its frequency components. EMD generates these frequency components through a sifting process in which the highest frequency components of the training sample are extracted in each cycle or mode. Each mode stores the extracted high frequencies as an IMF. The IMFs are stored in decreasing order of frequency, and the last IMF contains the lowest frequencies.

The sifting process [19] first determines the extrema points of the training sample “k(i, j)”, where “i, j” index the dimensions of the training sample. These extrema points are then connected to form upper and lower envelopes, “upper(i, j)” and “lower(i, j)”. The average of the two envelopes is calculated to produce the mean envelope “mean(i, j)”, as shown in Figure 3:

mean(i, j) = (upper(i, j) + lower(i, j)) / 2    (3)

The mean envelope “mean(i, j)” is then subtracted from the training sample “k(i, j)” to produce “T1”:

T1(i, j) = k(i, j) − mean(i, j)    (4)
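For concreteness, a minimal sketch of one such sifting iteration is given below in Python. It assumes the envelopes are interpolated from the local extrema with SciPy; practical BEMD implementations typically use radial basis function surfaces and repeat this step until a stopping criterion is met, so this is an illustration rather than our exact implementation.

    import numpy as np
    from scipy.interpolate import griddata
    from scipy.ndimage import maximum_filter, minimum_filter

    def sift_once(k):
        # Locate local maxima and minima of the training sample k(i, j)
        maxima = (k == maximum_filter(k, size=3))
        minima = (k == minimum_filter(k, size=3))
        grid = np.indices(k.shape).reshape(2, -1).T
        # Interpolate the extrema to form the upper and lower envelopes
        upper = griddata(np.argwhere(maxima), k[maxima], grid,
                         method='linear', fill_value=k.mean()).reshape(k.shape)
        lower = griddata(np.argwhere(minima), k[minima], grid,
                         method='linear', fill_value=k.mean()).reshape(k.shape)
        mean_env = (upper + lower) / 2.0   # mean(i, j), Eq. (3)
        return k - mean_env                # candidate IMF T1, Eq. (4)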
Fig. 3: Averaged Extrema Surfaces

Fig. 4: Amplitude, Phase and Orientation of first three IMFs
classifier. Especially after 20 epochs, the accuracy improved significantly, and at epoch number 40 it reached an accuracy of 72 percent. Improved performance rates were observed for precision, recall and F1 scores as well. One reason for the better performance could be the availability of a much higher number of data points for the phase component as compared to the amplitude component, as can be seen from Figure 4. Although the phase component of the third intrinsic mode function does not contain enough data points, the first two remained sufficient for a decent classification rate.

The performance of the classifier for the orientation component is depicted in Table VI. The same performance measures, accuracy, precision, recall and F1 scores, are tabulated for the first three IMFs of the orientation component. The number of epochs is again varied from 5 to 40 for this set of experiments. A significant improvement in the overall accuracy rate of the classifier has been observed as compared to both the amplitude and phase components. Even at epoch number 10, the classification accuracy started at a reasonable rate of 0.3015 as compared to amplitude and phase, and it kept improving to reach 0.84 at the 40th epoch. Improved performance rates were observed for precision, recall and F1 scores as well. The precision was recorded to be 0.9047 at the 40th epoch. Similarly, the recall and F1 scores were recorded to be 0.84 and 0.93 respectively, which are quite good rates as compared to the other two components. We determined that these improvements are due to the maximum availability of data points and the minimum presence of noisy frequencies in the orientation component. It can be seen clearly in Figure 4 that even the third intrinsic mode function kept a reasonable number of data points, which contributed towards the improvements in accuracy rates.

Inspired by this fact, we have applied a feature fusion strategy to the orientation component to further improve the accuracy rates. The feature fusion strategy is performed on the first two intrinsic mode functions of the orientation component. The two intrinsic mode functions are fused together to obtain a composite intrinsic mode function which holds the properties of both IMFs. The fused IMF is a numeric matrix which represents the combined orientation component and is used for classification.

Table VII shows the performance of the classifier on the fused intrinsic mode function. The number of epochs is varied from 5 to 40, and accuracy, precision, recall and F1 scores are recorded and tabulated. It can be seen that the overall accuracy rate of the fused IMF is greater than all the previous accuracy numbers for the amplitude, phase and orientation components. At epoch number 10 the classification accuracy was recorded to be 0.53, which is significantly better than the amplitude, phase and orientation components. The accuracy kept improving to reach 0.97 at the 40th epoch. Improvements were also observed for precision, recall and F1 scores. The precision was recorded to be 0.9806 at the 40th epoch. Similarly, the recall and F1 scores were recorded to be 0.97 and 0.98 respectively. We believe that these improvements are due to the further addition of data points in the orientation component and the reduced presence of noisy frequencies. It can be seen in Figure 5 that the fused intrinsic mode function contains significant information in terms of data points, which leads to further improvements in accuracy rates.
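A minimal sketch of this fusion step is shown below. The equal-weight pixel-wise combination and the min-max rescaling are assumptions made for illustration; other fusion rules could be substituted without changing the surrounding pipeline.

    import numpy as np

    def fuse_orientation_imfs(imf1, imf2, w1=0.5, w2=0.5):
        # Pixel-wise weighted combination of the first two orientation IMFs
        assert imf1.shape == imf2.shape
        fused = w1 * imf1 + w2 * imf2
        # Rescale so the composite matrix can be fed to the CNN directly
        return (fused - fused.min()) / (fused.max() - fused.min() + 1e-12)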
TABLE VII: Performance Measures of Orientation Fusion

Epochs   Accuracy   Precision   Recall   F1 Score
5        0.1897     0.5317      0.1897   0.2796
10       0.5338     0.7673      0.5338   0.6296
15       0.8044     0.8611      0.8044   0.8318
20       0.9103     0.9272      0.9103   0.9187
25       0.9544     0.9605      0.9544   0.9574
30       0.9676     0.9705      0.9676   0.9691
35       0.9779     0.9791      0.9779   0.9785
40       0.9794     0.9806      0.9794   0.9800

Table VIII shows the overall performance of the proposed system with the help of a confusion matrix in terms of accuracy, precision, recall and F1 score. The overall accuracy of the system with the fused features is recorded to be 0.9794. The confusion matrix shows the precision of the system to be 0.9806, indicating that the proposed system is accurate as well as precise. The recall and F1 scores of the system are observed to be 0.9794 and 0.98 respectively. Most of the test samples from all the subjects were classified correctly by the classifier, as depicted in the confusion matrix. A few samples from some subjects were misclassified. We believe that this misclassification is due to the severe illumination effects in those test samples.

We have also compared the performance of the proposed system with the well-known state-of-the-art deep learning models AlexNet [9] and LeNet [23]. Figures 12, 13, 14 and 15 demonstrate and compare the performance improvements of the proposed system over the AlexNet and LeNet models. It can be seen from Figure 12 that the proposed orientation fusion approach provides much higher accuracy rates than AlexNet and LeNet. From the very start, at epoch number 5, the accuracy of the system is recorded to be 0.2, while the accuracy of AlexNet and LeNet was below 0.1. The accuracy kept improving with the increasing number of epochs. At epoch number 25, a significant improvement can be observed in the graph as compared to the other two models, and it kept improving until the last epoch.

A similar behavior can be observed in the precision, recall and F1 scores, as shown in Figures 13, 14 and 15. The precision of the system started from 0.55 at epoch 5 and showed a linear improvement over the increasing number of epochs. Rapid improvements were observed until epoch number 25 in the precision curve, which then kept improving gradually until the last epoch. The recall and F1 score curves depict a similar trend and show significant improvements as compared to AlexNet and LeNet. AlexNet performed somewhat better than LeNet until epoch number 30, but it could not exceed a score of 0.3. A drop in the recall curve for AlexNet was observed after 30 epochs. The same trend was observed for the F1 score curve.

The proposed orientation fusion approach showed significant improvements over the two models on a challenging dataset. The images present in the publicly available Yale Face Database have significant variations in expressions, pose and illumination conditions. The illumination conditions impose effects from different angles. Furthermore, the expressions of the individuals present in the database vary from normal to happy, sad and sleepy. It was observed from the results that the proposed system is superior to the state of the art in tackling these challenges.

We believe that the reason behind these improvements is that the illumination effects are present in the low frequency components of the spectrum. The EMD separates the images into individual intrinsic mode functions in decreasing order of frequency. It then becomes easy to discard the low frequency components from the image and retain only the high frequency components. The fusion of the two intrinsic mode functions containing the highest frequencies is sufficient to correctly classify most of the training samples with a high accuracy rate and precision.

Scalability of the System: In order to test the scalability and performance of the proposed system, we have executed it on a Spark based in-memory cloud infrastructure. The cloud infrastructure helps to parallelize the proposed approach by executing chunks of data (subsets) on multiple nodes of the cloud. These subsets of data are transferred to neural network models executing on each node of the cloud. This helps to perform training of the deep convolutional network in parallel on multiple nodes of the cloud.

The cloud infrastructure used in this work to perform the training of the proposed system has one master node working with eight workers. The total dataset size used to test the scalability of the system is varied from 10 GB to 100 GB. The dataset is divided into subsets, and each subset of data is further divided into a number of mini-batches. Each worker works on a mini-batch.
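The sketch below illustrates, in plain Python rather than Spark, how the dataset could be split into per-worker subsets and mini-batches of 16 (the batch size selected later in this section); in the actual system this partitioning is carried out by the Spark runtime.

    def partition(samples, num_workers=8, batch_size=16):
        # One interleaved subset per worker node
        subsets = [samples[w::num_workers] for w in range(num_workers)]
        # Each subset is further divided into mini-batches
        return [[s[i:i + batch_size] for i in range(0, len(s), batch_size)]
                for s in subsets]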
The dataset is exported to the distributed file system in a batched and serialized form.

The iterative map-reduce framework has been utilized to perform training of the system. Since the training of a convolutional neural network is an iterative process, we have made use of iterative map-reduce instead of simple map-reduce. Iterative map-reduce executes multiple passes of map-reduce operations and is much more suitable for convolutional neural networks. The iterative nature of the framework allows a sequence of map-reduce operations to be performed in a cascaded way. Furthermore, N-dimensional arrays have been utilized to store the pixel values. As the convolutional neural network is based on numerical computations, the use of N-dimensional arrays helps to perform numerical computations much more quickly and also requires less memory.

The data is normalized before training. Convolutional neural networks perform better with normalized data as they are based on the stochastic gradient descent learning approach. Stochastic gradient descent during training keeps the activations in the normalized range, which in turn improves the performance. In order to iterate over the dataset objects, a dataset iterator is created. The dataset iterator iterates over the dataset object during training and fetches the data in vectorized form along with the labels. These are then stored in N-dimensional arrays along with their labels.

During the training process, the master node of the cloud loads the initial parameters and configurations. The master node then distributes the subsets of data to the worker nodes along with the initialization parameters. Each worker node then trains a partial model on its subset. Each model on a worker node is trained on a different shard of the data in the form of mini-batches.
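A simplified sketch of such a dataset iterator is given below; it assumes the dataset is held as a list of (image, label) pairs and mimics, in plain Python, the normalization and vectorization behavior described above rather than the iterator API of the actual framework.

    import numpy as np

    def dataset_iterator(dataset, batch_size=16):
        for start in range(0, len(dataset), batch_size):
            batch = dataset[start:start + batch_size]
            # Normalize pixel values and stack into an N-dimensional array
            images = np.stack([img.astype(np.float32) / 255.0
                               for img, _ in batch])
            labels = np.array([label for _, label in batch])
            # Yield the vectorized mini-batch along with its labels
            yield images.reshape(len(batch), -1), labels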
Fig. 16: Data Bundle Time

Fig. 17: Data Transfer Time
The results from each partial model are then collected on the master node and averaged using parameter averaging. This approach is quite useful in our case as the number of worker nodes is small and the number of parameters to be estimated is also small. The parameter averaging is performed by obtaining the gradient of each mini-batch from all the nodes. After the completion of the training process, the master node holds the fully trained model.

An important parameter to be adjusted is the rate at which parameter averaging is performed. This is a critical parameter and can severely degrade the performance of the network training if it is set too low. A low value of the parameter averaging rate causes parameter initialization overhead and also causes delays in network communication. On the other hand, setting it to a very high value can also degrade the performance. Another critical parameter is the data repartitioning rate. It is a key parameter and, if selected correctly, helps to use the cloud resources efficiently. We have selected a mini-batch size of 16 and a repartition rate of 0.6 on the basis of experimentation.
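The parameter averaging step can be summarized by the schematic sketch below, which assumes each worker returns the parameters of its partial model as named arrays; the names and structure are illustrative rather than the API of the underlying framework.

    import numpy as np

    def average_parameters(worker_params):
        # worker_params: one dict per worker, mapping layer name -> array
        averaged = {}
        for name in worker_params[0]:
            averaged[name] = np.mean([p[name] for p in worker_params], axis=0)
        return averaged   # the master node keeps the averaged model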
We have mainly focused on three parameters to measure the performance of the system in terms of scalability: i) the total time needed to transfer datasets of different sizes to the cloud storage; ii) the average execution time with varying dataset sizes; and iii) the average execution time with varying numbers of cloud nodes. The total size of the dataset is varied from 5 GB to 100 GB to measure the scalability of the system. This dataset consists of an image database as well as a number of video streams containing multiple subjects. These video streams are decoded in order to obtain individual video frames from each video stream. The number of decoded frames depends on the size of each video stream. We have further utilized a batch process to bundle the large number of individual decoded video frames and images. The iterative map-reduce framework works better with large bundled data than with individual small chunks of data. The time required by the batch process to bundle the data depends on the size of the total dataset. Figure 16 shows the time taken by the batch process on different sizes of the dataset. It can be seen from the figure that for a dataset size ranging from ten to one hundred gigabytes, it takes approximately 0.25 to 3.8 hours.

The time required to transfer data from local storage to cloud storage depends on the amount of data being transferred. There are two main factors that contribute to the total data transfer time, i.e. the bandwidth of the network and the block size of the cloud data storage. In order to estimate the data transfer time to the cloud nodes, we have measured this time for different sizes of data varying from 20 GB to 100 GB. The data transfer time for each dataset size is plotted in Figure 17. It can be seen from the plot that the data transfer time is directly proportional to the size of the data being transferred. For a dataset size of 20 GB, it requires almost 0.36 hours, while it takes almost 2.18 hours to transfer 100 GB of data to the cloud storage. So the total data transfer time for dataset sizes ranging from 20 GB to 100 GB is between 0.36 and 2.18 hours. An increase in dataset size will also increase the data transfer time, but this is a one-time cost as the data can be retained in the cloud storage for later use.

Fig. 18: Analysis Time with Varying Cloud Nodes

To estimate the average execution time of the system on multiple nodes of the cloud infrastructure, we have varied the number of nodes and measured the total execution time. Multiple experiments have been performed with an increasing number of nodes in the cloud. The average execution time of the system with an increasing number of nodes gives an estimate of the time required for system execution with multiple nodes. The analysis time taken by the system with multiple nodes is plotted in Figure 18. This helps to estimate how many nodes are required to execute the system within a reasonable time frame. A decreasing trend in the analysis time is observed when increasing the number of workers on each node. Nodes can be added to or removed from the cluster to reduce or increase the total execution time. However, decreasing the number of nodes in the cloud increases the amount of analysis work on each node and can degrade the overall performance.

We have also measured the average execution time of the system with multiple dataset sizes on the cloud infrastructure.
Fig. 19: Average Execution Time

The average execution time taken by the system for each dataset size is plotted in Figure 19. This plot helps to understand how much time the system requires for execution on a specific dataset size. It can be seen from the figure that it takes almost 1.45 hours for the system to execute on a dataset size of 20 GB. An increase in the dataset size has a direct impact on the execution time. The execution time increases to 7.29 hours for a dataset size of 100 GB.

We have also measured the execution time of the system with an increased block size. This experiment was performed to estimate the effect of block size on the total execution time. The default block size of 128 MB was changed to 256 MB and the same set of experiments was repeated. It can be seen from Figure 19 that it now took 1.43 hours for the system to execute on a dataset size of 20 GB. The execution time decreased to 6.8 hours for a dataset size of 100 GB. However, this is not a very big difference in the execution time, which shows that the variation in block size has a minor impact on the execution time of the system.

VI. CONCLUSION AND FUTURE WORK

An illumination and expression invariant video analytics system for visual object recognition has been proposed. It tackles the problem of illumination and expression variance through a feature fusion strategy based on the orientation component of the intrinsic mode functions (IMFs). The IMFs are generated by leveraging bi-dimensional empirical mode decomposition, and the first order Riesz transform is exploited to produce an orientation component. The fused IMFs are further analyzed using convolutional neural networks.

The performance of the proposed system is demonstrated by experimentation on publicly available datasets and compared with two state-of-the-art models, namely AlexNet and LeNet. It has been observed that the orientation component of the objects leads to a higher accuracy of 93 percent. The proposed feature fusion strategy on the orientation spectrums further improves the accuracy to 97 percent. The precision and recall of the system are recorded to be 98 and 97 percent respectively. The system proved to be highly accurate and precise, and outperformed AlexNet and LeNet under uncontrolled conditions.

We aim to enhance the capabilities of the proposed system in the future so that it can cope with other challenges including rotation and translation variance. We also aim to execute the proposed system on multiple nodes of a GPU-based cloud infrastructure. Such an infrastructure will help to manage the

REFERENCES

[1] Z. Wu and N. E. Huang, “Ensemble empirical mode decomposition: A noise-assisted data analysis method,” Advances in Adaptive Data Analysis, vol. 1, no. 1, pp. 1–41, 2009. [Online]. Available: https://www.worldscientific.com/doi/abs/10.1142/S1793536909000047
[2] R. C. Sharpley and V. Vatchev, “Analysis of the intrinsic mode functions,” Constructive Approximation, vol. 24, no. 1, pp. 17–47, 2006.
[3] M. Unser and D. Van De Ville, “Wavelet steerability and the higher-order Riesz transform,” IEEE Transactions on Image Processing, vol. 19, no. 3, pp. 636–652, 2010.
[4] “Spark,” accessed: 2018-06-21.
[5] M. U. Yaseen, A. Anjum, O. Rana, and R. Hill, “Cloud-based scalable object detection and classification in video streams,” Future Generation Computer Systems, vol. 80, pp. 286–298, 2018. [Online]. Available: http://www.sciencedirect.com/science/article/pii/S0167739X17301929
[6] J. Li, X. Liang, S. Shen, T. Xu, J. Feng, and S. Yan, “Scale-aware fast R-CNN for pedestrian detection,” IEEE Transactions on Multimedia, vol. 20, no. 4, pp. 985–996, April 2018.
[7] J. Li, Y. Wei, X. Liang, J. Dong, T. Xu, J. Feng, and S. Yan, “Attentive contexts for object detection,” IEEE Transactions on Multimedia, vol. 19, no. 5, pp. 944–954, May 2017.
[8] X. Liu, W. Liu, T. Mei, and H. Ma, “PROVID: Progressive and multimodal vehicle reidentification for large-scale urban surveillance,” IEEE Transactions on Multimedia, vol. 20, no. 3, pp. 645–658, March 2018.
[9] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “ImageNet classification with deep convolutional neural networks,” in Advances in Neural Information Processing Systems, 2012, pp. 1097–1105.
[10] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, “Gradient-based learning applied to document recognition,” Proceedings of the IEEE, vol. 86, no. 11, pp. 2278–2324, 1998.
[11] Y. LeCun, C. Cortes, and C. Burges, “MNIST handwritten digit database,” AT&T Labs [Online]. Available: http://yann.lecun.com/exdb/mnist, vol. 2, 2010.
[12] A. Wang, J. Lu, J. Cai, T. J. Cham, and G. Wang, “Large-margin multi-modal deep learning for RGB-D object recognition,” IEEE Transactions on Multimedia, vol. 17, no. 11, pp. 1887–1898, Nov 2015.
[13] J. Tang, L. Jin, Z. Li, and S. Gao, “RGB-D object recognition via incorporating latent data structure and prior knowledge,” IEEE Transactions on Multimedia, vol. 17, no. 11, pp. 1899–1908, Nov 2015.
[14] S. S. Mukherjee and N. M. Robertson, “Deep head pose: Gaze-direction estimation in multimodal video,” IEEE Transactions on Multimedia, vol. 17, no. 11, pp. 2094–2107, Nov 2015.
[15] S. Ehsan, S. M. U. Abdullah, M. J. Akhtar, D. P. Mandic, K. D. McDonald-Maier et al., “Multi-scale pixel-based image fusion using multivariate empirical mode decomposition,” Sensors, vol. 15, no. 5, pp. 10923–10947, 2015.
[16] A. Linderhed, “Image empirical mode decomposition: A new tool for image processing,” Advances in Adaptive Data Analysis, vol. 1, no. 2, pp. 265–294, 2009.
[17] Z. Liu and S. Peng, “Directional EMD and its application to texture segmentation,” Science in China Series F: Information Sciences, vol. 48, no. 3, p. 354, 2005.
[18] M. U. Yaseen, A. Anjum, and N. Antonopoulos, “Spatial frequency based video stream analysis for object classification and recognition in clouds,” in 2016 IEEE/ACM 3rd International Conference on Big Data Computing Applications and Technologies (BDCAT), Dec 2016, pp. 18–26.
[19] E. Deléchelle, J. Lemoine, and O. Niang, “Empirical mode decomposition: An analytical approach for sifting process,” IEEE Signal Processing Letters, vol. 12, no. 11, pp. 764–767, 2005.
[20] U. G. Mangai, S. Samanta, S. Das, and P. R. Chowdhury, “A survey of decision fusion and feature fusion strategies for pattern classification,” IETE Technical Review, vol. 27, no. 4, pp. 293–307, 2010.
[21] “Nd4j,” accessed: 2018-01-19.
[22] Yale, “The extended Yale Face Database B (cropped),” 2001, accessed: 2018-01-19. [Online]. Available: http://vision.ucsd.edu/~iskwak/ExtYaleDatabase/ExtYaleB.html
[23] Y. LeCun et al., “LeNet-5, convolutional neural networks,” URL: http://yann.lecun.com/exdb/lenet, p. 20, 2015.