
2019 29th International Telecommunication Networks and Applications Conference (ITNAC)

Leveraging CNN and Transfer Learning for
Vision-based Human Activity Recognition

Samundra Deep and Xi Zheng
Department of Computing, Macquarie University, Sydney, Australia
Email: [email protected]

Abstract—With the advent of the Internet of Things (IoT), there have been significant advancements in the area of human activity recognition (HAR) in recent years. HAR is applicable to a wide range of applications such as elderly care, anomalous behaviour detection and surveillance systems. Several machine learning algorithms have been employed to predict the activities performed by a human in an environment. However, traditional machine learning approaches have been outperformed by feature engineering methods which can select an optimal set of features. In contrast, it is known that deep learning models such as Convolutional Neural Networks (CNN) can extract features and reduce the computational cost automatically. In this paper, we use CNN models to predict human activities from the Weizmann Dataset. Specifically, we employ transfer learning to get deep image features and train machine learning classifiers. Our experimental results showed an accuracy of 96.95% using VGG-16. Our experimental results also confirmed the high performance of VGG-16 as compared to the rest of the applied CNN models.

Index Terms—Activity recognition, deep learning, convolutional neural network.

I. INTRODUCTION

Human activity recognition (HAR) is an active research area because of its applications in elderly care, automated homes and surveillance systems. Several studies have been done on human activity recognition in the past. The existing works are either wearable based [1] or non-wearable based [2] [3]. Wearable-based HAR systems make use of wearable sensors that are attached to the human body, and are intrusive in nature. Non-wearable-based HAR systems do not require any sensors to be attached to the human or any device to be carried for activity recognition. The non-wearable approach can be further categorised into sensor-based [2] and vision-based HAR systems [3]. Sensor-based technology uses RF signals from sensors, such as RFID, PIR sensors and Wi-Fi signals, to detect human activities. Vision-based technology uses videos and image frames from depth cameras or IR cameras to classify human activities. Sensor-based HAR systems are non-intrusive in nature but may not provide high accuracy. Therefore, vision-based human activity recognition systems have gained significant interest in recent times. Recognising human activities from streaming video is challenging.

Video-based human activity recognition can be categorised as marker-based and vision-based according to motion features [4]. The marker-based method makes use of an optical wearable marker-based motion capture (MoCap) framework. It can accurately capture complex human motions, but this approach has some disadvantages: it requires optical sensors to be attached to the human and also demands multiple camera settings. In contrast, the vision-based method makes use of RGB or depth images. It does not require the user to carry any devices or to attach any sensors to the body. Therefore, this methodology is getting more consideration nowadays, consequently making the HAR framework simple and easy to deploy in many applications.

Most of the vision-based HAR systems proposed in the literature used traditional machine learning algorithms for activity recognition. However, traditional machine learning methods have been outperformed by deep learning methods in recent times [5]. The most common type of deep learning method is the Convolutional Neural Network (CNN). CNNs are widely applied in areas related to computer vision; a CNN consists of a series of convolution layers through which images are passed for processing. In this paper, we use CNNs to recognise human activities from the Weizmann Dataset. We first extracted the frames for each activity from the videos. Specifically, we use transfer learning to get deep image features and train machine learning classifiers. We applied 3 different CNN models to classify activities and compared our results with the existing works on the same dataset. In summary, the main contributions of our work are as follows:

1) We applied three different CNN models to classify human activities and we showed an accuracy of 96.95% using VGG-16.

2) We used transfer learning to leverage the knowledge gained from a large-scale dataset such as ImageNet [6] for the human activity recognition dataset.

The rest of the paper is as follows: Section II provides an overview of the related work in video-based HAR systems. We provide an overview of transfer learning in Section III. Section IV outlines the research methodology, sources of data and research approach, and discusses the experimental results. Conclusion and future work are drawn in Section V.
978-1-7281-3673-8/19/$31.00 ©2019 IEEE


II. RELATED WORK

There has been a lot of research on vision-based human activity recognition in recent years. Most of the studied methods have depended on handcrafted feature extraction from the videos/images and employed traditional classifiers for activity recognition. The traditional approaches often achieved good results and exhibited high performance. However, traditional methods are not feasible to deploy in real life because handcrafted features are highly dependent on the data and are not robust to environment change.

Hidden Markov Model (HMM) methods have been widely used as recognition techniques in the past because of their capability for temporal pattern decoding [7]. However, researchers are now more interested in deep learning techniques because of their ability to automatically extract features and learn deep pattern structures [5] [7]. Deep learning methods have clearly ruled out traditional classification methods in the domain of computer vision [5] [8], have been widely employed recently, and have achieved tremendous results. Therefore, video-based human activity recognition using deep learning models has gained a lot of interest in recent years [5].

Zhu et al. [4] proposed an action classification method by adding a mixed-norm regularization function to a deep LSTM network. One of the most popular deep learning methods in frame/image processing is the Convolutional Neural Network (CNN). Several works have utilized 2D-CNNs that take advantage of the spatial correlation between video frames and combine the outputs employing different strategies [9]. Many have also fed additional input such as optical flow to 2D-CNNs to obtain temporal correlation information [10]. Subsequently, 3D-CNNs [11] were introduced and demonstrated exceptional results in the classification of videos and frames. Wang et al. [12] applied CNNs to RGB and depth frames to automatically extract features; the obtained features were passed through a fully connected neural network and achieved improved accuracy. Ji et al. [13] proposed a 3D CNN model which performs 3D convolutions and extracts spatial and temporal features by capturing motion information for activity recognition. Simonyan et al. [8] introduced a two-stream convolutional network architecture that could achieve good results despite limited training data. Khaire et al. [14] proposed a model that trains convnets on an RGB-D dataset and combines the softmax scores from depth, motion and skeleton images at the classification level to identify activities. Karpathy et al. [15] proposed extending the CNN architecture in the first convolutional layers over a 40-frame video chunk. Similarly, Tran et al. [16] used a deep 3D CNN architecture (quite similar to VGGNet [17]) that utilises spatiotemporal convolutions and pooling in all layers to improve the accuracy of the model.

In comparison, we are more interested in exploring how transfer learning can be leveraged with CNN models on a benchmark dataset to improve classification accuracy.

III. TRANSFER LEARNING

Transfer learning [18] is a method of transferring knowledge that a model has learned from earlier extensive training to the current model. Deep network models can be trained with significantly less data using transfer learning, and it has been used to reduce training time and improve the accuracy of the model. In this work, we use transfer learning to leverage the knowledge gained from a large-scale dataset such as ImageNet. We first extract the frames for each activity from the videos. We then use transfer learning to get deep image features and train machine learning classifiers. For all CNN models, weights pre-trained on ImageNet are used as the starting point for transfer learning. ImageNet [6] is a dataset containing more than 20,000 categories of images. The knowledge is transferred from the weights pre-trained on ImageNet to the Weizmann dataset, since the set of activities recognised in this work falls within the domain of ImageNet. The features are extracted from the penultimate layer of the CNNs. The basic idea of transfer learning is shown in Figure 1.

[Fig. 1. Schematic diagram to demonstrate transfer learning: a CNN trained from scratch versus a pre-trained CNN fine-tuned on a new task.]

The main approaches in transfer learning are: (1) preserve the original neural model pre-trained on the large-scale dataset and update its weights on the target dataset, and (2) use the pre-trained neural model for feature extraction and representation, followed by a generic classifier such as a Support Vector Machine or Logistic Regression.
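To make the second approach concrete, below is a minimal sketch of extracting penultimate-layer features from an ImageNet-pre-trained VGG-16 and feeding them to a generic classifier. This is illustrative only, not the authors' released code; it assumes TensorFlow/Keras and scikit-learn, and the layer name "fc1" and the 224 x 224 input size follow the Keras VGG-16 implementation. X_train and y_train are hypothetical names for features and labels assembled by the caller.

# Sketch of approach (2): pre-trained CNN as a fixed feature extractor
# followed by a generic classifier (assumes TensorFlow/Keras, scikit-learn).
import numpy as np
from tensorflow.keras.applications.vgg16 import VGG16, preprocess_input
from tensorflow.keras.models import Model
from tensorflow.keras.preprocessing import image
from sklearn.svm import SVC

# Load VGG-16 with ImageNet weights and expose the 4096-dim 'fc1' layer.
base = VGG16(weights="imagenet")
extractor = Model(inputs=base.input, outputs=base.get_layer("fc1").output)

def extract_features(img_path):
    """Return a 4096-dimensional feature vector for one 224x224 frame."""
    img = image.load_img(img_path, target_size=(224, 224))
    x = preprocess_input(np.expand_dims(image.img_to_array(img), axis=0))
    return extractor.predict(x, verbose=0).flatten()

# Train a generic classifier on the extracted deep features
# (X_train, y_train are built by the caller; hypothetical names):
# clf = SVC(kernel="linear").fit(X_train, y_train)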
IV. IMPLEMENTATION

A. Dataset

In order to evaluate the effectiveness of the models, we perform experiments on a benchmark activity recognition dataset, namely the Weizmann dataset. It consists of 90 low-resolution video sequences showing 9 different people performing 10 activities: bend, jack (jumping-jack), jump (jump-forward-on-two-legs), pjump (jump-in-place-on-two-legs), run, side (gallop-sideways), skip, walk, wave1 (wave-one-hand), and wave2 (wave-two-hands). We used nine actions (excluding pjump, jump-in-place-on-two-legs) for our experiment. We first convert all videos into individual frames based on their activity.
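As an illustration of this preprocessing step, a minimal sketch using OpenCV is given below. The directory layout (one sub-directory per activity) and file names are assumptions for the example, not taken from the paper.

# Sketch: dump every frame of each video as an individual JPEG image.
import cv2
from pathlib import Path

def extract_frames(video_path, out_dir):
    """Write every frame of one video to out_dir as a JPEG."""
    out_dir.mkdir(parents=True, exist_ok=True)
    cap = cv2.VideoCapture(str(video_path))
    idx = 0
    while True:
        ok, frame = cap.read()
        if not ok:  # end of the video stream
            break
        cv2.imwrite(str(out_dir / f"{video_path.stem}_{idx:04d}.jpg"), frame)
        idx += 1
    cap.release()

# Hypothetical layout: weizmann_videos/<activity>/<clip>.avi
for video in Path("weizmann_videos").glob("*/*.avi"):
    extract_frames(video, Path("frames") / video.parent.name)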


Table I shows the total number of frames per activity based on the extracted frames for all 9 people. The entire dataset is divided into Training (70%), Validation (10%), and Testing (20%).

                             TABLE I
   DATASET STATISTICS IN TERMS OF NUMBER OF FRAMES PER ACTIVITY

        Activity    Number of Frames
        Bend        639
        Jack        729
        Jump        538
        Run         346
        Side        444
        Skip        378
        Walk        566
        Wave1       653
        Wave2       624
        Total       4917

                             TABLE II
  RESULTS ON ACTIVITY RECOGNITION BASED ON DIFFERENT CNN MODELS IN
  TERMS OF ACCURACY SCORE, PRECISION, RECALL, AND F1-SCORE (IN %)

        Model          Accuracy   Precision   Recall   F1-score
        VGG-16         96.95      97.00       97.00    97.00
        VGG-19         96.54      97.00       97.00    96.00
        Inception-v3   95.63      96.00       96.00    96.00

                             TABLE III
           PERFORMANCE COMPARISON USING WEIZMANN DATASET

        Model               Accuracy (in %)
        VGG-16              96.95
        Cai et al. [19]     95.70
        Kumar et al. [20]   95.69
        Feng et al. [21]    94.10
        Han et al. [22]     90.00
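The 70/10/20 split described in the dataset paragraph above can be reproduced, for example, with two calls to scikit-learn's train_test_split. This is a sketch under the assumption that frames live in per-activity directories as in the earlier extraction sketch; the second split takes 12.5% of the remaining 80%, which is 10% overall.

# Sketch: stratified 70/10/20 train/validation/test split of the frames.
from pathlib import Path
from sklearn.model_selection import train_test_split

# Collect (frame path, activity label) pairs; directory names are the labels.
paths = sorted(Path("frames").glob("*/*.jpg"))
labels = [p.parent.name for p in paths]

train_val_p, test_p, train_val_y, test_y = train_test_split(
    paths, labels, test_size=0.20, stratify=labels, random_state=0)
train_p, val_p, train_y, val_y = train_test_split(
    train_val_p, train_val_y, test_size=0.125,  # 0.125 * 0.80 = 0.10 overall
    stratify=train_val_y, random_state=0)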

B. Discussion and Results

In order to classify activities, we experiment with 3 different Convolutional Neural Networks (CNNs) for activity recognition, namely VGG-16, VGG-19 and Google's InceptionNet-v3. We used transfer learning to leverage the knowledge gained from a large-scale dataset such as ImageNet; the transfer learning technique transfers knowledge from a pre-trained model to train a neural network on a new domain. We performed experiments on the Weizmann dataset using the knowledge learned from weights pre-trained on ImageNet. The features are extracted from the penultimate layers of the CNNs. We applied transfer learning to the VGG-16 CNN model and achieved an accuracy of 96.95%. For VGG-16, an image of dimensions 224 x 224 is given as input and features are extracted from the fc1 layer, which gives a 4096-dimensional vector for each image.

We also applied transfer learning to the other CNN models, VGG-19 and Google's InceptionNet-v3, to examine the performance of the different CNN models. VGG-19 and Google's InceptionNet-v3 achieved 96.54% and 95.63% respectively. Experimental results showed that VGG-16 performs better than the rest of the CNN models after transfer learning has been applied to all the models. Table II reports the accuracy score, precision, recall, and F1-score of the applied CNN models. The confusion matrices of the 3 different CNN models are shown in Figures 2, 3 and 4.

[Fig. 2. Confusion Matrix for recognising 9 activities on Weizmann Dataset using VGG-16 Convolutional Neural Network]

[Fig. 3. Confusion Matrix for recognising 9 activities on Weizmann Dataset using VGG-19 Convolutional Neural Network]

We compared our approach with the results achieved by some other approaches that do not employ transfer learning on the Weizmann dataset. The experimental results showed that applying transfer learning to the same dataset achieved better recognition scores; the recognition accuracy is improved by 1-6% by applying transfer learning. The comparison of results utilising transfer learning on the VGG-16 model against the other approaches is presented in Table III. The comparison with state-of-the-art approaches is done to explore how effective transfer learning is when leveraged with CNN models for improving recognition scores.
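The scores reported in Table II can be computed with scikit-learn's standard metrics. A minimal sketch follows, assuming the classifier clf and the held-out test features X_test and labels y_test built from the earlier sketches (hypothetical names, not the authors' code).

# Sketch: accuracy plus per-activity precision, recall and F1-score.
from sklearn.metrics import accuracy_score, classification_report

y_pred = clf.predict(X_test)
print(f"Accuracy: {100 * accuracy_score(y_test, y_pred):.2f}%")
# Per-activity and averaged precision, recall and F1-score:
print(classification_report(y_test, y_pred, digits=2))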


[Fig. 4. Confusion Matrix for recognising 9 activities on Weizmann Dataset using Inception-v3 Convolutional Neural Network]

Figures 2, 3 and 4 show the confusion matrices of the 3 different Convolutional Neural Networks (CNNs) after applying transfer learning, which were used to classify frames of different activities using VGG-16, VGG-19 and Google's InceptionNet-v3 respectively. It is evident from Figures 2, 3 and 4 that VGG-16 mis-classifies the run activity as skip; VGG-19 mis-classifies run as skip, and skip as walk; and Google's InceptionNet-v3 mis-classifies run as skip. These activities are very similar in terms of their visual appearance. Employing transfer learning on the CNN models has increased the accuracy of activity recognition. However, the transfer learning technique used in our work, with knowledge transferred from weights pre-trained on ImageNet, may be compromised, since ImageNet contains images of several different categories.
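Confusion matrices like those in Figures 2-4 can be produced directly from the predictions. A sketch assuming scikit-learn (with matplotlib installed) and the y_test/y_pred arrays from the metrics sketch above:

# Sketch: plot and save a confusion matrix for one model's predictions.
from sklearn.metrics import ConfusionMatrixDisplay

disp = ConfusionMatrixDisplay.from_predictions(y_test, y_pred)
disp.ax_.set_title("VGG-16 on Weizmann (9 activities)")
disp.figure_.savefig("confusion_vgg16.png")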
V. CONCLUSION

We used CNN models to predict human activities from the Weizmann Dataset. We experimented with 3 different Convolutional Neural Networks (CNNs) for activity recognition. We employed transfer learning to get deep image features and trained machine learning classifiers. Our experimental results showed an accuracy of 96.95% using VGG-16 with the implementation of transfer learning, and that VGG-16 outperformed the other CNN models in terms of feature extraction. Our experimental results with the transfer learning technique also showed the high performance of VGG-16 as compared to state-of-the-art methods.

In future, we aim to extend this study by developing a context-aware recognition system to classify human activities. We will also extend our work to recognise complex human activities such as cooking, reading books, and watching TV.

REFERENCES

[1] B. Bhandari, J. Lu, X. Zheng, S. Rajasegarar, and C. Karmakar, "Non-invasive sensor based automated smoking activity detection," in Proceedings of the Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC). IEEE, 2017, pp. 845-848.
[2] L. Yao, Q. Z. Sheng, X. Li, T. Gu, M. Tan, X. Wang, S. Wang, and W. Ruan, "Compressive representation for device-free activity recognition with passive RFID signal strength," IEEE Transactions on Mobile Computing, vol. 17, no. 2, pp. 293-306, 2018.
[3] I. Lillo, J. C. Niebles, and A. Soto, "Sparse composition of body poses and atomic actions for human activity recognition in RGB-D videos," Image and Vision Computing, vol. 59, pp. 63-75, 2017.
[4] W. Zhu, C. Lan, J. Xing, W. Zeng, Y. Li, L. Shen, and X. Xie, "Co-occurrence feature learning for skeleton based action recognition using regularized deep LSTM networks," in Thirtieth AAAI Conference on Artificial Intelligence, 2016.
[5] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich, "Going deeper with convolutions," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 1-9.
[6] J. Deng, W. Dong, R. Socher, L. Li, K. Li, and L. Fei-Fei, "ImageNet: A large-scale hierarchical image database," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, June 2009, pp. 248-255.
[7] A. Jalal, N. Sarif, J. T. Kim, and T.-S. Kim, "Human activity recognition via recognized body parts of human depth silhouettes for residents monitoring services at smart home," Indoor and Built Environment, vol. 22, no. 1, pp. 271-279, 2013.
[8] K. Simonyan and A. Zisserman, "Two-stream convolutional networks for action recognition in videos," in Advances in Neural Information Processing Systems, 2014, pp. 568-576.
[9] G. Gkioxari, R. Girshick, and J. Malik, "Contextual action recognition with R*CNN," in Proceedings of the IEEE International Conference on Computer Vision, 2015, pp. 1080-1088.
[10] L. Wang, Y. Xiong, Z. Wang, and Y. Qiao, "Towards good practices for very deep two-stream convnets," arXiv preprint arXiv:1507.02159, 2015.
[11] D. Tran, L. Bourdev, R. Fergus, L. Torresani, and M. Paluri, "Deep end2end voxel2voxel prediction," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, 2016, pp. 17-24.
[12] P. Wang, W. Li, J. Wan, P. Ogunbona, and X. Liu, "Cooperative training of deep aggregation networks for RGB-D action recognition," in Thirty-Second AAAI Conference on Artificial Intelligence, 2018.
[13] S. Ji, W. Xu, M. Yang, and K. Yu, "3D convolutional neural networks for human action recognition," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 35, no. 1, pp. 221-231, 2013.
[14] P. Khaire, P. Kumar, and J. Imran, "Combining CNN streams of RGB-D and skeletal data for human activity recognition," Pattern Recognition Letters, vol. 115, pp. 107-116, 2018.
[15] A. Karpathy, G. Toderici, S. Shetty, T. Leung, R. Sukthankar, and L. Fei-Fei, "Large-scale video classification with convolutional neural networks," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2014, pp. 1725-1732.
[16] D. Tran, L. Bourdev, R. Fergus, L. Torresani, and M. Paluri, "Learning spatiotemporal features with 3D convolutional networks," in Proceedings of the IEEE International Conference on Computer Vision, 2015, pp. 4489-4497.
[17] K. Simonyan and A. Zisserman, "Very deep convolutional networks for large-scale image recognition," arXiv preprint arXiv:1409.1556, 2014.
[18] Z. Wharton, E. Thomas, B. Debnath, and A. Behera, "A vision-based transfer learning approach for recognizing behavioral symptoms in people with dementia," in 2018 15th IEEE International Conference on Advanced Video and Signal Based Surveillance (AVSS). IEEE, 2018, pp. 1-6.
[19] J. Cai, X. Tang, and R. Zhong, "Silhouettes based human action recognition by procrustes analysis and fisher vector encoding," in International Conference on Image and Video Processing, and Artificial Intelligence, vol. 10836. International Society for Optics and Photonics, 2018, p. 1083612.
[20] S. S. Kumar and M. John, "Human activity recognition using optical flow based feature set," in Proceedings of the IEEE International Carnahan Conference on Security Technology (ICCST). IEEE, 2016, pp. 1-5.
[21] W. Feng, H. Tian, and Y. Xiao, "Research on temporal structure for action recognition," in Chinese Conference on Biometric Recognition. Springer, 2017, pp. 625-632.
[22] P. Y. Han, K. E. Yee, and O. S. Yin, "Localized temporal representation in human action recognition," in Proceedings of the International Conference on Network, Communication and Computing. ACM, 2018, pp. 261-266.

