
SN Computer Science (2020) 1:138

https://doi.org/10.1007/s42979-020-00143-w

ORIGINAL RESEARCH

Performance Analysis of Inception-v2 and Yolov3-Based Human Activity Recognition in Videos

Tanveer Mustafa1 · Sunita Dhavale1 · M. M. Kuber2

Received: 28 March 2020 / Accepted: 1 April 2020 / Published online: 24 April 2020
© Springer Nature Singapore Pte Ltd 2020

Abstract
Recently, many deep learning solutions have been proposed for human activity recognition (HAR) in videos. However, the HAR accuracies obtained using these models are not adequate. In this work, we propose two HAR architectures that use Faster RCNN Inception-v2 and YOLOv3 as object detection models. In particular, we consider walking, jogging and running as the activities to be recognized by our proposed architectures. We use the pretrained Faster RCNN Inception-v2 and YOLOv3 object detection models and analyze the performance of the proposed architectures on the benchmarked UCF-ARG video dataset. The experimental results show that the YOLOv3-based HAR architecture outperforms Inception-v2 in all scenarios.

Keywords  HAR · RCNN · YOLO · UCF-ARG · CNN · CCTV · GPU · RGB

This article is part of the topical collection "Advances in Computational Approaches for Artificial Intelligence, Image Processing, IoT and Cloud Applications" guest edited by Bhanu Prakash K N and M. Shivakumar.

* Tanveer Mustafa, [email protected]
Sunita Dhavale, [email protected]
M. M. Kuber, [email protected]
1 Defence Institute of Advanced Technology (DIAT), Pune, India
2 R&DE, DRDO, Pune, India

Introduction

The recent improvements in deep learning techniques, viz., CNNs, have proved their capability in efficient classification and recognition [1–4] of objects in images. Object localization has been achieved accurately [5], and object detection models [6, 7] have proved efficient at detecting objects in images. However, activity recognition [8–11] in videos remains a highly focused area in current research. The objective of activity recognition includes the analysis of video surveillance data and the generation of behavior models. Unlike image processing, video processing adds another dimension to the processing, wherein we need to consider the temporal aspect along with the image frames. Therefore, video processing is a 4D processing and a challenging task. In the context of activity recognition, the videos recorded by CCTVs are often analyzed by a human operator to detect the potential threats associated with them. However, manual intervention fails to provide real-time threat detection due to the massive video data analysis required across multiple cameras, and human error adds a further restriction to such manual systems. Hence, there is a need to develop automated video surveillance solutions. HAR covers activities such as walking, jogging and running. In this paper, we propose two architectures, viz., Faster RCNN Inception-v2- and YOLOv3-based architectures, for HAR. The Faster RCNN Inception-v2 and YOLOv3 models are used for detecting objects in images or video frames. Subsequently, we pass the detected objects along with their bounding boxes through background noise subtraction, a threshold for determining the HAR type, and a centroid computation for HAR. We also compared the performance of the two proposed architectures on the benchmarked UCF-ARG video dataset.

The organization of this paper is as follows. In the "Related Work" section, we present an overview of related work. In the "HAR Approaches for Videos" section, we describe the detailed working of the proposed architectures. In the "Experiments" section, we explain the experimental setup and the datasets used. In the "Results and Analysis" section, we present the results and analysis of the outcome generated by our architectures. The "Conclusion" section contains the conclusion and future work.

Related Work

In recent years, activity recognition in video processing has become a high-thrust research area, and HAR in particular has become a prominent topic. In HAR, the objective is to recognize various human activities like walking, jogging, running, opening a door, playing football, etc. Michalis Raptis et al. [8] proposed a hidden Markov model-based HAR for six classes of activities, viz., hugging, kicking, etc., and achieved an accuracy of 93% on full video. Fernando Moya Rueda et al. [12] proposed sensor-based HAR for walking, searching, picking, etc., and achieved an accuracy of 92.3% using CNNs. Zeng et al. [9] proposed a CNN-based HAR on mobile sensor data and achieved an accuracy of 95%. Akram Bayat et al. [13] used acceleration data generated by the user's cell phone [14, 15] and then applied MLP, SVM and random forest-based machine learning techniques for HAR; among these models, the SVM classifier achieved an accuracy of 91%. Valentin Radu et al. [16] proposed a multimodal RBM-based HAR using accelerometer and gyroscope signals and achieved an accuracy of 81%. These accuracies are low compared to the human level of accuracy in activity recognition. Therefore, we felt that the HAR accuracy [17, 18] may be improved further. Keeping this view in mind, we propose two approaches which are based on Faster RCNN Inception-v2 and YOLOv3 as object detection models.

Object detection architectures [6, 7] have proved their capability in efficient detection of objects in images. The object detection architecture in [7] has been used in real-time video processing, where a large number of events were captured; Shinde et al. [19] achieved an accuracy of 87%. Deep learning-based movement detection [9] and activity recognition tasks have been explored by many authors [20–22].

HAR Approaches for Videos

In this work, we propose two architectures for HAR based on Faster RCNN Inception-v2 and YOLOv3 as object detection models for video frames. The description of the proposed architectures is given below.

Faster RCNN Inception-v2-Based HAR Architecture

Here, we describe HAR using the Inception-v2 model, wherein Inception-v2 is used for detecting objects in video frames. The working procedure of the proposed architecture is shown in Fig. 1.

Fig. 1  Faster RCNN Inception-v2 model-based HAR in videos

Working of Proposed Methodology

In this work, we achieve HAR by using the following sequence of steps.

Pretrained Faster RCNN Inception-v2  We have used the pretrained Faster RCNN model trained on the COCO dataset as the object detection model. This model is trained on 80 classes of images and detects various objects, viz., car, human, bikes, etc., in an image or video frame. It also produces a bounding box for each of the classes in the image. In our work, this model has been used to detect human objects in a video frame along with an associated green bounding box. The architecture of the Inception-v2 module is shown in Fig. 2.

Fig. 2  Inception-v2 module
Video Processing and Human Object Detection  In this step, we describe the video processing and object detection in video frames. This module takes the video as input and samples it into a sequence of frames such that the number of frames processed per second is 36, i.e., 36 FPS.
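As a rough illustration of this frame-sampling step, the sketch below reads a video with OpenCV and yields frames at approximately the stated rate. The paper only gives the 36 FPS figure; the resampling logic and function name here are assumptions.

```python
# Sketch: read an input video with OpenCV and hand frames to the detector.
# The 36 FPS target comes from the paper; how the rate is enforced when the
# source FPS differs is an assumption.
import cv2

def iter_frames(video_path, target_fps=36):
    cap = cv2.VideoCapture(video_path)
    src_fps = cap.get(cv2.CAP_PROP_FPS) or target_fps
    step = max(src_fps / target_fps, 1.0)   # keep roughly target_fps frames per second
    idx, next_keep = 0, 0.0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx >= next_keep:
            yield idx, frame
            next_keep += step
        idx += 1
    cap.release()

# Usage: for i, frame in iter_frames("walking_01.mp4"): humans = detect_humans(frame, sess)
```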

The video frames are then passed to the Faster RCNN Inception-v2 model. The Faster RCNN module in turn detects the objects in a frame along with their bounding boxes. It associates a color-coded bounding box with each object detected; in particular, it associates the human with a green-colored bounding box. We are interested in HAR only; therefore, we applied a filter to segregate the human objects (green bounding box) from the rest. Knowing that the model generates a green-colored bounding box, we applied a color-code-based threshold technique such that only the objects with a green bounding box are selected. A color-code-based threshold with RGB values ranging from [55, 74, 174] to [76, 255, 255] has been applied so that only human objects with green bounding boxes are considered and the rest of the objects are discarded.
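A minimal sketch of this color-code filter is given below. The numeric bounds are the ones quoted above; the paper calls them RGB values, so the choice of color space in which they are applied (OpenCV HSV, where a hue near 60 corresponds to green, versus plain RGB) is an assumption.

```python
# Sketch: keep only pixels that fall inside the color range quoted in the paper
# ([55, 74, 174] to [76, 255, 255]). The color space is not stated precisely,
# so applying the bounds in HSV (or RGB) is an assumption.
import cv2
import numpy as np

LOWER = np.array([55, 74, 174], dtype=np.uint8)
UPPER = np.array([76, 255, 255], dtype=np.uint8)

def green_box_mask(frame_bgr, use_hsv=True):
    """Binary mask that is non-zero only on the green bounding-box pixels."""
    img = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2HSV if use_hsv else cv2.COLOR_BGR2RGB)
    return cv2.inRange(img, LOWER, UPPER)
```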
Background Subtraction  The objects in a frame may be associated with background noise; therefore, we need to suppress the background noise so that the bounding box contains only the object of interest. This helps in efficient detection of objects and avoids confusion. To remove the background noise, we initially converted the frame into a binary image [23, 24], wherein the object of interest is coded white and the rest black. Therefore, the summation of the pixels in a frame shall be greater than zero for any object to exist in the frame. However, it is possible that the frame may contain noise pixels, and hence we applied a threshold of 20,000. Keeping this threshold in the model, we are able to retain the frames containing a human object (which have a pixel summation greater than 20,000) and discard the rest.
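The following sketch illustrates one way this existence check could be implemented. The binarization threshold and the interpretation of the 20,000 figure as a foreground-pixel count are assumptions; the paper only states the threshold value.

```python
# Sketch: decide whether a frame actually contains a human object by summing the
# binary mask and comparing against the 20,000-pixel threshold from the paper.
import cv2
import numpy as np

PIXEL_SUM_THRESHOLD = 20000  # value reported in the paper

def binarize(frame_bgr, thresh=127):
    """Binary image: object of interest white (255), background black (0)."""
    gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)
    _, binary = cv2.threshold(gray, thresh, 255, cv2.THRESH_BINARY)
    return binary

def contains_human(binary_mask):
    # Counting foreground pixels rather than summing raw intensities is an
    # assumption about how the paper's summation is meant to be interpreted.
    return int(np.count_nonzero(binary_mask)) > PIXEL_SUM_THRESHOLD
```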
Centroiding Approach  In order to proceed with human object tracking, we applied a centroid-based method for obtaining the centroid of the bounding box. The Canny edge detection algorithm [25] has been applied to obtain the edges of the bounding box, and Hough lines [26] are used to get the coordinates of the edges. From the list of edges, we considered the top-left and bottom-right coordinates and computed the centroid of the object:

Centroid = ((x1 + x2)/2, (y1 + y2)/2)    (1)
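A sketch of this centroiding step using OpenCV's Canny detector and probabilistic Hough transform follows. The Canny and Hough parameters are illustrative choices, not values taken from the paper.

```python
# Sketch: Canny edges of the bounding-box mask, Hough lines for the edge
# coordinates, then the centroid from Eq. (1). Parameter values are illustrative.
import cv2
import numpy as np

def bbox_centroid(mask):
    """Return the centroid (cx, cy) of the detected bounding box, or None."""
    edges = cv2.Canny(mask, 50, 150)
    lines = cv2.HoughLinesP(edges, rho=1, theta=np.pi / 180, threshold=30,
                            minLineLength=20, maxLineGap=5)
    if lines is None:
        return None
    pts = lines.reshape(-1, 4)                       # each row: x1, y1, x2, y2
    xs = np.concatenate([pts[:, 0], pts[:, 2]])
    ys = np.concatenate([pts[:, 1], pts[:, 3]])
    x1, y1 = xs.min(), ys.min()                      # top-left corner of the box
    x2, y2 = xs.max(), ys.max()                      # bottom-right corner of the box
    return ((x1 + x2) / 2.0, (y1 + y2) / 2.0)        # Eq. (1)
```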
Activity Threshold  In order to recognize the various activities of a human, viz., walking, jogging and running on a level surface, we derived a threshold policy for walking, jogging and running. Under this assumption, we know that there exist horizontal movements only, i.e., in 2D, the object moves along the x-axis only. Therefore, we applied threshold values in terms of the number of pixels along the horizontal direction for walking, jogging and running. To obtain a reasonable threshold value for discriminating the human activities (walking, jogging and running), we conducted experiments using the UCF-ARG dataset [27]. We considered 48 videos of walking, 48 videos of jogging and 48 videos of running as the validation set [28]. We also considered different numbers of pixels as thresholds for each of the activities. After a sequence of experiments, we finalized the threshold values for the number of pixels which correctly discriminate the human activities. The threshold values shown in Table 1 were applied on the walking videos of the UCF-ARG dataset [27] taken using the ground camera.

Table 1  HAR accuracies of our architecture for different pixel thresholds

Iteration     Walking   Jogging   Running   Accuracy (%)
Iteration 1   1–15      15–40     >40       0
Iteration 2   1–20      20–40     >40       42.1
Iteration 3   1–30      30–40     >40       100
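The sketch below maps a horizontal centroid displacement (in pixels) to an activity label using the final thresholds of Table 1 (iteration 3). How the displacement window is measured between frames is an assumption, as is the function name.

```python
# Sketch: classify the activity from the horizontal centroid displacement using the
# iteration-3 thresholds of Table 1 (walking 1-30 px, jogging 30-40 px, running >40 px).
def classify_activity(prev_centroid, curr_centroid):
    if prev_centroid is None or curr_centroid is None:
        return "unknown"
    dx = abs(curr_centroid[0] - prev_centroid[0])   # horizontal movement only
    if dx <= 30:
        return "walking"
    elif dx <= 40:
        return "jogging"
    else:
        return "running"
```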

YOLOv3-Based HAR in Videos

In this section, we bring out our proposed architecture for HAR using YOLOv3 as the object detection model. The proposed architecture is shown in Fig. 3.

Pretrained YOLOv3  We have used the pretrained YOLOv3 model trained on the COCO dataset, having 80 classes, as the object detection model. This model detects various objects, viz., car, human, bikes, etc., and produces a bounding box for each of the classes. In particular, the human object is associated with a bounding box (Fig. 4).

YOLOv3: The YOLOv3 architecture [29, 30] has 53 layers trained on ImageNet. For the task of detection, 53 more layers are stacked onto the existing layers, leading to a total of 106 convolutional layers.

The newer architecture YOLOv3 contains residual skip connections and upsampling. In YOLOv3, Softmax is not used; instead, independent logistic classifiers and a binary cross-entropy loss are used. YOLOv3 predicts 10x more bounding boxes than YOLOv2. YOLOv3 uses nine anchor boxes, and K-means clustering is used to generate the anchors, which are assigned in descending order.

Fig. 3  Human activities (running) detection using Inception-v2 model
Fig. 4  YOLOv3 object detection architecture

Video Processing and Human Object Detection  We processed the video, sampled it at 36 FPS and then passed the frames to the YOLOv3 model. This model detects various objects in a frame along with their bounding boxes. However, we are interested in human objects only and hence extracted the human objects along with their bounding boxes using the class labels.
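A minimal sketch of this step using OpenCV's DNN module to run a pretrained YOLOv3 and keep only "person" detections is given below. The configuration and weights file names, the 416x416 input size and the confidence threshold are assumptions, not details from the paper.

```python
# Sketch: run a pretrained YOLOv3 (COCO) with OpenCV's DNN module and keep only
# detections whose class label is "person". File names and thresholds are assumptions.
import cv2
import numpy as np

net = cv2.dnn.readNetFromDarknet("yolov3.cfg", "yolov3.weights")
out_names = net.getUnconnectedOutLayersNames()
PERSON_CLASS_ID = 0  # index of "person" in the COCO class list used by Darknet

def detect_persons(frame, conf_thresh=0.5):
    h, w = frame.shape[:2]
    blob = cv2.dnn.blobFromImage(frame, 1 / 255.0, (416, 416), swapRB=True, crop=False)
    net.setInput(blob)
    boxes = []
    for output in net.forward(out_names):
        for det in output:                      # det = [cx, cy, bw, bh, objectness, class scores...]
            scores = det[5:]
            class_id = int(np.argmax(scores))
            if class_id == PERSON_CLASS_ID and scores[class_id] > conf_thresh:
                cx, cy, bw, bh = det[0] * w, det[1] * h, det[2] * w, det[3] * h
                boxes.append((int(cx - bw / 2), int(cy - bh / 2), int(bw), int(bh)))
    return boxes
```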
Centroiding  In order to proceed with HAR, we applied a centroid-based method for obtaining the centroid of the bounding box. As YOLOv3 detects the object and produces a bounding box around it, we extracted the top-left and bottom-right coordinates and computed the centroid of the object using Eq. (2):

Centroid = ((x1 + x2)/2, (y1 + y2)/2)    (2)

Activity Threshold  We applied the procedure described in the "Faster RCNN Inception-v2-Based HAR Architecture" section for activity thresholding and tracking.

Experiments

In this section, we describe the software implementation, the hardware, and the training and test datasets considered in the experiments.

Implementation  We implemented our proposed architectures using the TensorFlow and Torch frameworks, in the Python language, on a Windows machine.

Hardware Details  The hardware used for running the developed software and conducting the experiments is as follows: Windows 10; RAM: 8 GB; Processor: i7; GPU: NVIDIA GeForce GTX 1050 Ti.

Training  Pretrained object detection models, viz., Faster RCNN Inception-v2 and YOLOv3 trained on the COCO dataset, are used.

Testing the Proposed Architectures

Test Videos for Human Activities  For testing the HAR capability of our architectures, we considered the benchmarked UCF-ARG dataset [27] of labeled human activities. It contains ten types of human activities, and each activity has 48 videos taken from aerial, rooftop and ground cameras. In this experiment, we considered three types of human activities, viz., walking, jogging and running, taken from the ground camera. The duration of each video is 3 to 8 s.

We fed the 48 videos labeled walking, jogging and running to the architectures. The activity recognition is done periodically with a period of 30 frames, and the same process continues till the end of the video. We then consolidate the type of activity which our model recognized at the end of 30 frames, 60 frames, 90 frames, and so on. The final activity recognized by our model is composed of the activities recognized at the end of each 30-frame window while processing the video. The final output depends on the majority of the activities recognized by our model while processing the video.
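A sketch of this majority-vote consolidation is shown below; the helper name and the handling of windows with no recognized activity are illustrative assumptions.

```python
# Sketch: consolidate the per-window labels (one label every 30 frames) into the
# final video-level activity by majority vote, as described above.
from collections import Counter

def consolidate(window_labels):
    """window_labels: list like ['walking', 'walking', 'jogging', ...] collected
    at the end of frame 30, 60, 90, ... of the video."""
    votes = Counter(lbl for lbl in window_labels if lbl != "unknown")
    return votes.most_common(1)[0][0] if votes else "unknown"

# Example: consolidate(['running', 'running', 'jogging']) -> 'running'
```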
Under similar experimental conditions, we ran the two architectures and captured the human activities recognized by the two proposed approaches.

Figure 5 shows a video frame labeled running that is passed to our architectures; the objects detected in the frame are shown with their corresponding bounding boxes. The human object along with its bounding box is further processed by our model to recognize the human activities in the videos (see Fig. 6).

Fig. 5  Faster RCNN Inception-v2-based HAR recognition
Fig. 6  YOLOv3-based HAR

Results and Analysis

In this section, we present the results obtained by our architectures. The output activity recognized by our architectures, viz., Faster RCNN Inception-v2 and YOLOv3, for walking, jogging and running, is compared against the labeled activities given in the UCF-ARG dataset [27]. We validated the proposed architectures using 48 test videos having labeled human activities, viz., walking, jogging and running, taken from the UCF-ARG dataset [27]. The results obtained by the two architectures are listed in Table 2.

Table 2  HAR accuracies of our architectures

Model          Walking (%)   Jogging (%)   Running (%)
Inception-v2   62.5          35.4          43.7
YOLOv3         100           100           100

The Faster RCNN Inception-v2-based architecture obtained an accuracy of 62.5% for walking, 35.4% for jogging and 43.7% for running, whereas the YOLOv3-based architecture correctly recognized all three human activities, with accuracies of 100% for all three. In Fig. 7, we plot the HAR accuracies of the Inception-v2- and YOLOv3-based architectures.

Fig. 7  Comparison of HAR accuracies of Inception-v2 vs YOLOv3

Analysis  On analysis, we found that the results produced by the YOLOv3-based architecture match the labeled activities of the videos. We attribute this highly accurate activity recognition capability of our architecture to the object detection capability of YOLOv3 and the centroid approach which we proposed (see Fig. 8).

Fig. 8  Human activities, viz., walking, jogging and running detection

Conclusion

In this work, we proposed two architectures, viz., Faster RCNN Inception-v2-based and YOLOv3-based architectures, for HAR in videos.

We also developed a bounding-box threshold for human object detection, a background noise suppression technique that checks for the existence of objects in a frame, and a centroid method for HAR. We considered the pretrained Faster RCNN Inception-v2 and YOLOv3 architectures for object detection in videos. We fed the HAR videos from the UCF-ARG dataset [27] to our models and performed a comparative study of their HAR capability. On analysis, we found that the results produced by the YOLOv3-based architecture match the labeled activities of the videos. We attribute this highly accurate activity recognition capability of our architecture to the object detection capability of YOLOv3 and the centroid approach which we proposed [14, 15, 17, 18, 23, 24, 28–30].

Acknowledgements  We would like to thank Nvidia for granting us a TITAN V GPU for carrying out deep learning-based research work.

Compliance with Ethical Standards

Conflict of interest  The authors declare that they have no conflict of interest.

References

1. Krizhevsky A, Sutskever I, Hinton GE. ImageNet classification with deep convolutional neural networks. In: Advances in neural information processing systems. 2012. pp. 1097–1105.
2. Simonyan K, Zisserman A. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556. 2014.
3. Szegedy C, Liu W, Jia Y, Sermanet P, Reed S, Anguelov D, Rabinovich A. Going deeper with convolutions. In: Proceedings of the IEEE conference on computer vision and pattern recognition. 2015. pp. 1–9.
4. He K, Zhang X, Ren S, Sun J. Deep residual learning for image recognition. In: Proceedings of the IEEE conference on computer vision and pattern recognition. 2016. pp. 770–778.
5. Tompson J, Goroshin R, Jain A, LeCun Y, Bregler C. Efficient object localization using convolutional networks. In: Proceedings of the IEEE conference on computer vision and pattern recognition. 2015. pp. 648–656.
6. Redmon J, Farhadi A. YOLOv3: an incremental improvement. arXiv preprint arXiv:1804.02767. 2018.
7. Girshick R. Fast R-CNN. In: Proceedings of the IEEE international conference on computer vision. 2015. pp. 1440–1448.

8. Raptis M, Sigal L. Poselet key-framing: a model for human activity recognition. In: Proceedings of the IEEE conference on computer vision and pattern recognition. 2013. pp. 2650–2657.
9. Zeng M, Nguyen LT, Yu B, Mengshoel OJ, Zhu J, Wu P, Zhang J. Convolutional neural networks for human activity recognition using mobile sensors. In: 6th International conference on mobile computing, applications and services. IEEE; 2014. pp. 197–205.
10. Caba Heilbron F, Carlos Niebles J, Ghanem B. Fast temporal activity proposals for efficient detection of human actions in untrimmed videos. In: Proceedings of the IEEE conference on computer vision and pattern recognition. 2016. pp. 1914–1923.
11. Niu W, Long J, Han D, Wang YF. Human activity detection and recognition for video surveillance. In: 2004 IEEE international conference on multimedia and expo (ICME), Vol. 1. IEEE; 2004. pp. 719–722.
12. Moya Rueda F, Grzeszick R, Fink G, Feldhorst S, ten Hompel M. Convolutional neural networks for human activity recognition using body-worn sensors. In: Informatics, Vol. 5, No. 2. Multidisciplinary Digital Publishing Institute; 2018. p. 26.
13. Bayat A, Pomplun M, Tran DA. A study on human activity recognition using accelerometer data from smartphones. Procedia Comput Sci. 2014;34:450–7.
14. Vishwakarma S, Agrawal A. A survey on activity recognition and behavior understanding in video surveillance. Vis Comput. 2013;29(10):983–1009.
15. Kastrinaki V, Zervakis M, Kalaitzakis K. A survey of video processing techniques for traffic applications. Image Vis Comput. 2003;21(4):359–81.
16. Radu V, Lane ND, Bhattacharya S, Mascolo C, Marina MK, Kawsar F. Towards multimodal deep learning for activity recognition on mobile devices. In: Proceedings of the 2016 ACM international joint conference on pervasive and ubiquitous computing: adjunct. ACM; 2016. pp. 185–188.
17. Shinde S, Kothari A, Gupta V. YOLO based human action recognition and localization. Procedia Comput Sci. 2018;133:831–8.
18. Bulling A, Blanke U, Schiele B. A tutorial on human activity recognition using body-worn inertial sensors. ACM Comput Surv (CSUR). 2014;46(3):33.
19. Redmon J, Farhadi A. YOLO9000: better, faster, stronger. In: Proceedings of the IEEE conference on computer vision and pattern recognition. 2017. pp. 7263–7271.
20. Yang CM, Liu HW. U.S. Patent No. 9,214,987. Washington, DC: U.S. Patent and Trademark Office; 2015.
21. Ronao CA, Cho SB. Deep convolutional neural networks for human activity recognition with smartphone sensors. In: Arik S, Huang T, Lai W, Liu Q (eds) International conference on neural information processing. Springer, Cham; 2015. pp. 46–53.
22. Rad NM, Bizzego A, Kia SM, Jurman G, Venuti P, Furlanello C. Convolutional neural network for stereotypical motor movement detection in autism. arXiv preprint arXiv:1511.01865. 2015.
23. Albukhary N, Mustafah YM. Real-time human activity recognition. In: IOP conference series: materials science and engineering, Vol. 260, No. 1. IOP Publishing; 2017. p. 012017.
24. Ann OC, Theng LB. Human activity recognition: a review. In: 2014 IEEE international conference on control system, computing and engineering (ICCSCE 2014). IEEE; 2014. pp. 389–393.
25. Canny J. A computational approach to edge detection. In: Readings in computer vision. Morgan Kaufmann; 1987. pp. 184–203.
26. Duda RO, Hart PE. Use of the Hough transformation to detect lines and curves in pictures. Technical Note SRI-TN-36. Menlo Park, CA: SRI International Artificial Intelligence Center; 1971.
27. https://www.crcv.ucf.edu/data/UCF-ARG.php. Accessed 10 Jun 2019.
28. Goodfellow I, Bengio Y, Courville A. Deep learning. USA: MIT Press; 2016.
29. Burghouts GJ, Schutte K. Spatio-temporal layout of human actions for improved bag-of-words action detection. Pattern Recogn Lett. 2013;34(15):1861–9.
30. Huynh T, Schiele B. Analyzing features for activity recognition. In: Proceedings of the 2005 joint conference on smart objects and ambient intelligence: innovative context-aware services: usages and technologies. ACM; 2005. pp. 159–163.

Publisher's Note  Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
