Performance Analysis of Inception v2 and YOLOv3 Based Human Activity Recognition in Videos
https://doi.org/10.1007/s42979-020-00143-w
ORIGINAL RESEARCH
Received: 28 March 2020 / Accepted: 1 April 2020 / Published online: 24 April 2020
© Springer Nature Singapore Pte Ltd 2020
Abstract
Recently, many deep learning solutions have been proposed for human activity recognition (HAR) in videos. However, the HAR accuracy obtained using these models is not adequate. In this work, we propose two HAR architectures that use Faster RCNN Inception-v2 and YOLOv3 as object detection models. In particular, we consider walking, jogging and running as the activities to be recognized by our proposed architectures. We use the pretrained Faster RCNN Inception-v2 and YOLOv3 object detection models and analyze the performance of the proposed architectures on the benchmarked UCF-ARG video dataset. The experimental results show that the YOLOv3-based HAR architecture outperforms the Inception-v2-based one in all scenarios.
the outcome generated by our architectures. The "Conclusions" section contains the conclusion and future work.

In recent years, activity recognition in video processing has become a high-thrust research area; HAR in particular has become a prominent topic. In HAR, the objective is to recognize various human activities such as walking, jogging, running, opening a door, playing football, etc. Michalis Raptis et al. [8] proposed a hidden Markov model-based HAR for six classes of activities, viz., jogging, kicking, etc., and achieved an accuracy of 93% on full video. Fernando Moya Rueda et al. [12] proposed sensor-based HAR for walking, searching, picking, etc., and achieved an accuracy of 92.3% using CNNs. Zeng et al. [9] proposed a CNN-based HAR on mobile sensor data and achieved an accuracy of 95%. Akram Bayat et al. [13] used acceleration data generated by a user's cell phone [14, 15] and then applied MLP, SVM and random forest-based machine learning techniques for HAR; among these models, the SVM classifier achieved an accuracy of 91%. Valentin Radu et al. [16] proposed a multimodal RBM-based HAR using accelerometer and gyroscope signals and achieved an accuracy of 81%. These accuracies are low compared to the human level of accuracy in activity recognition. Therefore, we felt that the HAR accuracy [17, 18] could be improved further. Keeping this in mind, we propose two approaches based on Faster RCNN Inception-v2 and YOLOv3 as object detection models.

Object detection architectures [6, 7] have proved their capability in efficiently detecting objects in images. The object detection architecture in [7] has been used in real-time video processing, where a lot of events are captured; Shinde et al. [19] achieved an accuracy of 87%. Deep learning-based movement detection [9] and activity recognition tasks have also been explored by many authors [20–22].

In this work, we propose two architectures for HAR based on Faster RCNN Inception-v2 and YOLOv3 as object detection models in video frames. The description of the proposed architectures is given below.

Faster RCNN Inception-v2-Based HAR Architecture

Here, we describe HAR using the Inception-v2 model, wherein Inception-v2 is used for detecting objects in video frames. The working procedure of the proposed architecture is shown in Fig. 1.

Working of the Proposed Methodology

In this work, we achieve HAR by using the following sequence of steps.

Pretrained Faster RCNN Inception-v2: We used the pretrained Faster RCNN model trained on the COCO dataset as the object detection model. This model is trained on 80 classes of images and detects various objects, viz., cars, humans, bikes, etc., in an image or video frame. It also produces a bounding box for each of the detected classes. In our work, this model is used to detect human objects in a video frame along with an associated green bounding box. The architecture of the Inception-v2 module is shown in Fig. 2.

Video Processing and Human Object Detection: In this step, we describe the video processing and object detection in video frames. This module takes the video as input and samples it into a sequence of frames such that 36 frames are processed per second, i.e., 36 FPS.
Fig. 1 Faster RCNN Inception-v2 model-based HAR in videos
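The following is a minimal sketch of this frame-sampling and human-detection step, written against the PyTorch side of our tool stack. It is illustrative only: torchvision's COCO-pretrained fasterrcnn_resnet50_fpn stands in for the Faster RCNN Inception-v2 model described above, and the video path, score threshold and helper names are assumptions rather than part of the published pipeline.

```python
# Illustrative sketch: sample frames from a test video and keep only the
# "person" detections of a COCO-pretrained Faster R-CNN, drawing the green
# bounding boxes described in the text. torchvision's fasterrcnn_resnet50_fpn
# is used here as a stand-in for Faster RCNN Inception-v2.
import cv2
import torch
from torchvision.models.detection import fasterrcnn_resnet50_fpn

PERSON_CLASS_ID = 1      # COCO "person" label index in torchvision detection models
SCORE_THRESHOLD = 0.5    # assumed confidence cut-off

model = fasterrcnn_resnet50_fpn(pretrained=True).eval()

def detect_humans(frame_bgr):
    """Return [x1, y1, x2, y2] boxes of persons detected in one frame."""
    rgb = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2RGB)
    tensor = torch.from_numpy(rgb).permute(2, 0, 1).float() / 255.0
    with torch.no_grad():
        out = model([tensor])[0]
    return [box.int().tolist()
            for box, label, score in zip(out["boxes"], out["labels"], out["scores"])
            if label.item() == PERSON_CLASS_ID and score.item() >= SCORE_THRESHOLD]

cap = cv2.VideoCapture("ucf_arg_walking_sample.avi")  # hypothetical test video
while True:
    ok, frame = cap.read()  # frames are read sequentially (36 FPS in our setup)
    if not ok:
        break
    for x1, y1, x2, y2 in detect_humans(frame):
        cv2.rectangle(frame, (x1, y1), (x2, y2), (0, 255, 0), 2)  # green box
cap.release()
```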
YOLOv3 contains residual skip connections and upsampling. In YOLOv3, Softmax is not used; instead, independent logistic classifiers with a binary cross-entropy loss are used. YOLOv3 predicts about 10x more bounding boxes than YOLOv2 and uses nine anchor boxes; k-means clustering is used to generate the anchors, which are assigned in descending order.
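As background on how these nine anchors are typically obtained, the sketch below clusters bounding-box dimensions with k-means. It is a simplified illustration under assumed data: the original YOLOv3 recipe clusters the training boxes with a (1 - IoU) distance, whereas here plain Euclidean k-means from scikit-learn is run on synthetic (width, height) pairs.

```python
# Simplified illustration of deriving nine YOLO-style anchors by k-means
# clustering of bounding-box dimensions. Synthetic (width, height) data and
# Euclidean distance are used for brevity; YOLOv3 itself uses a (1 - IoU)
# distance on real training boxes.
import numpy as np
from sklearn.cluster import KMeans

# Synthetic normalised box dimensions standing in for a labelled dataset.
box_dims = np.random.RandomState(0).uniform(0.05, 0.9, size=(500, 2))

kmeans = KMeans(n_clusters=9, n_init=10, random_state=0).fit(box_dims)
anchors = kmeans.cluster_centers_

# Order the anchors by area, largest first, matching the descending assignment
# mentioned above.
anchors = anchors[np.argsort(anchors[:, 0] * anchors[:, 1])[::-1]]
print(np.round(anchors, 3))
```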
Video Processing and Human Object Detection: We processed the video, sampled it at 36 FPS and then passed the frames to the YOLOv3 model. This model detects various objects in a frame along with their bounding boxes. However, we are interested in human objects only and hence extracted the human objects along with their bounding boxes using the class labels.

Centroiding: In order to proceed with HAR, we applied a centroid-based method for obtaining the centroid of the bounding box. As YOLOv3 detects an object and produces a bounding box around it, we extracted the top-left and bottom-right coordinates and computed the centroid of the object using Eq. (2):

Centroid = ((x1 + x2)/2, (y1 + y2)/2)   (2)

Activity Threshold: We applied the procedure mentioned in the "Faster RCNN Inception-v2-Based HAR Architecture" section for the activity threshold and tracking.
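A minimal sketch of this centroiding step follows, assuming the person boxes are available as (x1, y1, x2, y2) corner coordinates (for example, from the class-label filtering described above); the helper names and the single-person assumption are illustrative.

```python
# Eq. (2) in code: the centroid of a detected person is the midpoint of the
# top-left and bottom-right corners of its bounding box.
def centroid(box):
    """box = (x1, y1, x2, y2) -> centre point of the bounding box."""
    x1, y1, x2, y2 = box
    return ((x1 + x2) / 2.0, (y1 + y2) / 2.0)

def centroid_track(person_boxes_per_frame):
    """One centroid per frame (assuming a single person of interest),
    i.e., the trajectory that the activity-threshold step operates on."""
    return [centroid(boxes[0]) for boxes in person_boxes_per_frame if boxes]

# Example: centroid((120, 80, 180, 240)) -> (150.0, 160.0)
```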
Experiments

In this section, we present the software implementation, hardware details, and the training and test datasets considered in the experiments.
Implementation: We implemented our proposed architectures using the TensorFlow and Torch frameworks, in the Python language, on a Windows machine.

Hardware Details: The hardware used for running the developed software and conducting the experiments is as follows: Windows 10; RAM: 8 GB; Processor: i7; GPU: NVIDIA GeForce GTX 1050 Ti.

Training: Pretrained object detection models, viz., Faster RCNN Inception-v2 and YOLOv3 trained on the COCO dataset, are used.

Testing the Proposed Architectures

Test Videos for Human Activities: For testing the HAR capability of our architectures, we considered the benchmarked UCF-ARG dataset [27] of labeled human activities. It has ten types of human activities, and each activity has 48 videos taken from aerial, rooftop and ground cameras. In this experiment, we considered three types of human activities, viz., walking, jogging and running, taken from the ground camera. The duration of each video is 3 to 8 s.

We fed the 48 videos labeled walking, jogging and running to the architectures. Activity recognition is done periodically with a period of 30 frames, and the same process continues until the end of the video. We then consolidate the activities recognized by our model at the end of 30 frames, 60 frames, 90 frames, etc. The final activity recognized by our model is composed of the activities recognized at the end of 30, 60 and 90 frames while processing the video; the final output depends on the majority of the activities recognized by our model while processing the video.
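A minimal sketch of this consolidation step is shown below; it assumes one activity label is emitted at the end of each 30-frame period, and the helper name is illustrative.

```python
# Majority vote over the activity labels emitted at the 30-, 60-, 90-frame
# marks, etc., giving the final activity reported for the whole video.
from collections import Counter

WINDOW = 30  # frames per recognition period, as used in our experiments

def final_activity(window_labels):
    """window_labels: activity recognized at the end of each 30-frame window."""
    return Counter(window_labels).most_common(1)[0][0]

# Example: final_activity(["running", "running", "jogging"]) -> "running"
```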
Under similar experimental conditions, we ran the two architectures and captured the human activities recognized by the two proposed approaches.

Figure 5 shows a video frame labeled running being passed to our architectures; the objects detected in the frame are shown with their corresponding bounding boxes. The human object, along with its bounding box, is further processed by our model to recognize the human activities in the video (see Fig. 6).

Results and Analysis

In this section, we present the results obtained by our architectures. The activities recognized by our architectures, viz., Faster RCNN Inception-v2 and YOLOv3, for walking, jogging and running are compared against the labeled activities given in the UCF-ARG dataset [27]. We validated the proposed architectures using 48 test videos with labeled human activities, viz., walking, jogging and running, taken from the UCF-ARG dataset [27]. The results obtained by the two architectures are listed in Table 2.

The Faster RCNN Inception-v2-based architecture obtained an accuracy of 62.5% for walking, 35.5% for jogging and 43.7% for running, whereas the YOLOv3-based architecture correctly recognized all three human activities, i.e., its accuracy is 100% for all three activities. In Fig. 7, we plot the HAR accuracies of the Inception-v2-based vs. the YOLOv3-based architecture.

Analysis: On analysis, we found that the results produced by the YOLOv3-based architecture match the labeled activities of the videos. We attribute this highly accurate activity recognition capability of our architecture to the object detection capability of YOLOv3 and the centroid approach which we proposed (see Fig. 8).

Conclusion

In this work, we proposed two architectures, viz., Faster RCNN Inception-v2-based and YOLOv3-based architectures, for HAR in videos. We also developed a threshold of
8. Raptis M, Sigal L. Poselet key-framing: a model for human activity recognition. In: Proceedings of the IEEE conference on computer vision and pattern recognition; 2013. pp. 2650–2657.
9. Zeng M, Nguyen LT, Yu B, Mengshoel OJ, Zhu J, Wu P, Zhang J. Convolutional neural networks for human activity recognition using mobile sensors. In: 6th International conference on mobile computing, applications and services. IEEE; 2014. pp. 197–205.
10. Caba Heilbron F, Carlos Niebles J, Ghanem B. Fast temporal activity proposals for efficient detection of human actions in untrimmed videos. In: Proceedings of the IEEE conference on computer vision and pattern recognition; 2016. pp. 1914–1923.
11. Niu W, Long J, Han D, Wang YF. Human activity detection and recognition for video surveillance. In: 2004 IEEE international conference on multimedia and expo (ICME), Vol. 1. IEEE; 2004. pp. 719–722.
12. Moya Rueda F, Grzeszick R, Fink G, Feldhorst S, ten Hompel M. Convolutional neural networks for human activity recognition using body-worn sensors. Informatics. 2018;5(2):26.
13. Bayat A, Pomplun M, Tran DA. A study on human activity recognition using accelerometer data from smartphones. Procedia Comput Sci. 2014;34:450–7.
14. Vishwakarma S, Agrawal A. A survey on activity recognition and behavior understanding in video surveillance. Vis Comput. 2013;29(10):983–1009.
15. Kastrinaki V, Zervakis M, Kalaitzakis K. A survey of video processing techniques for traffic applications. Image Vis Comput. 2003;21(4):359–81.
16. Radu V, Lane ND, Bhattacharya S, Mascolo C, Marina MK, Kawsar F. Towards multimodal deep learning for activity recognition on mobile devices. In: Proceedings of the 2016 ACM international joint conference on pervasive and ubiquitous computing: adjunct. ACM; 2016. pp. 185–188.
17. Shinde S, Kothari A, Gupta V. YOLO based human action recognition and localization. Procedia Comput Sci. 2018;133:831–8.
18. Bulling A, Blanke U, Schiele B. A tutorial on human activity recognition using body-worn inertial sensors. ACM Comput Surv. 2014;46(3):33.
19. Redmon J, Farhadi A. YOLO9000: better, faster, stronger. In: Proceedings of the IEEE conference on computer vision and pattern recognition; 2017. pp. 7263–7271.
20. Yang CM, Liu HW. U.S. Patent No. 9,214,987. Washington, DC: U.S. Patent and Trademark Office; 2015.
21. Ronao CA, Cho SB. Deep convolutional neural networks for human activity recognition with smartphone sensors. In: Arik S, Huang T, Lai W, Liu Q, editors. International conference on neural information processing. Cham: Springer; 2015. pp. 46–53.
22. Rad NM, Bizzego A, Kia SM, Jurman G, Venuti P, Furlanello C. Convolutional neural network for stereotypical motor movement detection in autism. arXiv preprint. 2015; arXiv:1511.01865.
23. Albukhary N, Mustafah YM. Real-time human activity recognition. In: IOP conference series: materials science and engineering, Vol. 260, No. 1. IOP Publishing; 2017. p. 012017.
24. Ann OC, Theng LB. Human activity recognition: a review. In: 2014 IEEE international conference on control system, computing and engineering (ICCSCE 2014). IEEE; 2014. pp. 389–393.
25. Canny J. A computational approach to edge detection. In: Readings in computer vision. Morgan Kaufmann; 1987. pp. 184–203.
26. Duda RO, Hart PE. Use of the Hough transformation to detect lines and curves in pictures. Technical Note SRI-TN-36. Menlo Park, CA: SRI International Artificial Intelligence Center; 1971.
27. UCF-ARG dataset. https://www.crcv.ucf.edu/data/UCF-ARG.php. Accessed 10 Jun 2019.
28. Goodfellow I, Bengio Y, Courville A. Deep learning. MIT Press; 2016.
29. Burghouts GJ, Schutte K. Spatio-temporal layout of human actions for improved bag-of-words action detection. Pattern Recogn Lett. 2013;34(15):1861–9.
30. Huynh T, Schiele B. Analyzing features for activity recognition. In: Proceedings of the 2005 joint conference on smart objects and ambient intelligence: innovative context-aware services: usages and technologies. ACM; 2005. pp. 159–163.

Publisher's Note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.