Gradient Local Auto-Correlation Features For Depth Human Action Recognition
Abstract
Human action classification is a dynamic research topic in computer vision and has applications in video surveillance, human–computer interaction, and sign-language recognition. This paper presents an approach for the categorization of human actions in depth video. In the approach, enhanced motion and static history images are computed, and a set of 2D auto-correlation gradient feature vectors is obtained from them to describe an action. A kernel-based Extreme Learning Machine is then used with the extracted features to distinguish the diverse action types. The proposed approach is thoroughly assessed on the MSRAction3D, DHA, and UTD-MHAD action datasets, achieving accuracies of 97.44% on MSRAction3D, 99.13% on DHA, and 88.37% on UTD-MHAD. The experimental results and analysis demonstrate that the classification performance of the proposed method is considerable and surpasses state-of-the-art human action classification methods. Moreover, the complexity analysis shows that the method is suitable for real-time operation with low computational complexity.
Article Highlights
The work proposes to process depth action videos through the 3D Motion Trail Model (3DMTM) to represent each video as a set of 2D motion and motionless images.
This work improves the above action representation by encoding all the 2D motion and motionless images as binary-coded images with the help of the Local Binary Pattern (LBP) algorithm.
Introduction
A large number of researchers have been attracted to the human action classification problem due to its wide range of real-world applications. Notable implementations cover visual surveillance [1], smart homes [2], sports [3], entertainment [4], healthcare monitoring [5], patient monitoring [6], elderly care [7], virtual reality [8], human–computer interaction [9], and so on.
Human actions refer to distinctive sorts of activities such as walking, jumping, waving, etc. However, the vivid variations in human body sizes, appearances, postures, motions, clothing, camera motions, viewing angles, and illumination changes make the action recognition task very challenging. Over the past few years, a large number of researchers have introduced action or activity recognition models using data sensors such as RGB video cameras [10], depth video cameras [2], and wearable sensors [11]. Between the two video data sources, action recognition research based on conventional RGB cameras (e.g., [12]) has achieved great progress in the last few decades. However, utilizing RGB cameras for action recognition raises significant impediments such as lighting variations and cluttered backgrounds [13]. On the contrary, depth cameras generate depth images, which are insensitive to lighting variations and make background subtraction and segmentation easier. In addition, body shape and structure characteristics as well as human skeleton information can be obtained from depth images.
Many previous attempts at efficient recognition systems can be listed, such as DMM [14], HON4D [15], Super Normal Vector [16], and Skeletons Lie group [17]. However, those existing methods still face crucial challenges such as depth video processing, appropriate feature extraction, and reliable classification performance. Considering the aforementioned challenges, this study focuses on building an effective and efficient human action recognition framework for depth action video sequences. The main objective of this work is to enhance classification accuracy by proposing an efficient recognition framework which can overcome the above challenges more effectively. More specifically, an action video is represented through three 2D motion-segment images and three 2D static-segment images of the action. The dynamic and motionless maps are derived by applying the 3DMTM [18] to a video. The obtained representations are then enhanced with the LBP [19] tool, which enriches the action illustration by encoding the motion and motionless maps into binary patterns. Eventually, the outputs of the LBP are treated as inputs of GLAC [20] to generate the auto-correlation gradient vectors. There are three feature vectors for the action motion-segment images and another three feature vectors for the action static-segment images. The first three vectors are concatenated to construct a motion-information-based GLAC vector. Similarly, another single GLAC vector is obtained by concatenating the latter three vectors. To further boost the proposed method, the aforementioned two action representation vectors are concatenated to build the final action description. Finally, the action is recognized by passing this vector to a supervised learning algorithm, the Kernel-based Extreme Learning Machine (KELM) [21].
The proposed method is extensively compared with handcrafted and deep learning methods on three publicly available datasets: MSRAction3D [22], DHA [23], and UTD-MHAD [24]. The computational efficiency assessment indicates that the proposed approach is feasible for real-time implementation. The working flow of the system is illustrated in Fig. 1.
Fig. 1
This paper is organized as follows: Sect. 2 presents related literature. Section 3 describes the research methodology. The experimental results and discussion are presented in Sect. 4. Finally, Sect. 5 concludes the work.
Related work
Feature extraction is a key step in computer vision research problems like object localization, human gait recognition, face recognition, action recognition, text recognition, etc. As a result, researchers have given much attention to extracting features effectively. For example, for object recognition, Ahmed et al. [25] introduced a saliency map on RGB-D indoor data, which has numerous applications such as vehicle monitoring, violence detection, driverless driving, etc.; Hough voting and distinct features were used to measure the efficiency of that work. To explore silhouettes of humans against noisy backgrounds, Jalal et al. [26] applied an embedded HMM for activity classification, where spatiotemporal body joints and depth silhouettes were fused to improve accuracy. In another work, to recognize online human actions and activities, Jalal et al. [27] performed multi-feature fusion along with skeleton joints and shape features of humans. For feature extraction in activity recognition, Tahir et al. [28] applied the 1-D LBP and the 1-D Hadamard wavelet transform along with a Random Forest. On depth video sequences, Kamal et al. [29] utilized a modified HMM to complete another fusion process of temporal joint features and spatial depth shape features. On the other hand, to recognize facial expressions, Rizwan et al. [30] implemented local transform features, where HOG and LBP were used for feature extraction. Again, skin-joint features based on skin color and self-organizing maps were used for activity recognition [31]. In another work, Kamal et al. [32] employed distance-parameter features and motion features. Yaacob et al. [33] introduced a discrete cosine transform, particularly for gait action recognition.
In developing vision-based handcrafted action recognition, researchers have also put considerable effort into feature extraction for optimal action representation. The motion features of an action were extracted through simplified depth motion maps in DMM [14], DMM-CT-HOG [34], and DLE [35]. Texture features extracted by LBP were utilized in [36]. Recently, Dhiman et al. [37] introduced Zernike moments and the R-transform to create a powerful feature vector for abnormal action detection. A genetic-algorithm-based system was proposed by Chaaroui et al. [38] to improve the efficiency of skeleton-joint-based recognition by optimizing the skeleton-joint subset. Vemulapalli et al. [17] represented human actions as curves that contain skeletal action sequences. Gao et al. [39] proposed a model to recognize 3D actions in which they constructed a difference motion history image for RGB and depth sequences, captured motions through multi-perspective projections, extracted the pyramid histogram of oriented gradients, and finally identified human actions by combining multi-perspective and multi-modality discriminated and joint representations. In the work by Rahmani et al. [40], features obtained from depth images were combined with skeleton movements encoded by difference histograms, and finally a Random Decision Forest (RDF) was applied to obtain discriminant features for action classification. On the other hand, Luo et al. [41] represented features by a sparse coding-based temporal pyramid matching approach (ScTPM). They also proposed a technique for capturing spatio-temporal features from RGB videos called the Center-Symmetric Motion Local Ternary Pattern (CS-Mltp). Finally, they explored feature-level fusion and classifier-level fusion of the above features to improve recognition accuracy. Again, decisive pose features were used by imposing two distinct transformations, the Ridgelet and Gabor Wavelet Transforms, to detect human actions [42]. Moreover, Wang et al. [43] studied ten Kinect-based methods for cross-view and cross-subject action recognition on six dissimilar datasets and concluded that skeleton-based recognition is superior to other approaches for the cross-view setting.
Deep learning models usually learn features automatically from raw depth sequences, which are then useful for computing high-level semantic representations. For example, 2D-CNN and 3D-CNN were employed by Yang and Yang to address deep learning based depth action classification [44]. To improve the action representation, unlike DMM, Wang et al. [45] proposed Weighted Hierarchical Depth Motion Maps (WHDMM), which were fed into three CNN streams to recognize actions. In another concept, before passing to a CNN, the depth video was described by Dynamic Depth Images (DDI), Dynamic Depth Normal Images (DDNI), and Dynamic Depth Motion Normal Images (DDMNI) [46]. In [47], a novel notion in action classification was introduced by using RGB-domain features as depth-domain features through domain adaptation. Motion History Images (MHI) from RGB videos and DMMs of depth videos were utilized together to generate a four-stream CNN architecture [48]. Using inertial sensor data and depth data, Ahmad et al. [11] expressed a multimodal $M^2$ fusion process with the help of a CNN and a multi-class SVM. Very recently, Dhiman et al. [49] merged shape and motion temporal dynamics by proposing a deep view-invariant human action system. To detect human gestures and 3D actions, Weng et al. [50] proposed a pose traversal convolution network which applied joint pattern features from the human body; they also represented human gestures and actions as sequences of 3D poses. A self-supervised alignment method was used for unsupervised domain adaptation (UDA) [51] to recognize human actions. Busto et al. [52] expressed another concept for action recognition and image classification, called open set domain adaptation, which works for unsupervised and semi-supervised domain adaptation.
Proposed system
Our proposed framework consists of feature extraction, action representation, and action classification. In this section, we discuss these three parts in turn. Figure 2 shows the pipeline of the system.
Fig. 2
Feature extraction
For each action video, three motion and three static information images are first computed by applying the 3DMTM [18] to the video. The 3DMTM yields the set $\{MHI_{XOY}, MHI_{YOZ}, MHI_{XOZ}\}$ of motion history images and the set $\{SHI_{XOY}, SHI_{YOZ}, SHI_{XOZ}\}$ of static history images by simultaneously stacking all the moving and stationary body parts (along the front, side, and top projection views) of an actor in a depth map sequence.
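To make this step concrete, the following minimal NumPy sketch illustrates the general idea of accumulating a motion history image and a static history image from a sequence of 2D projection maps. It is only an approximation under simple assumptions (a classical recursive motion-history update with illustrative parameters tau and decay); the exact 3DMTM formulation is given in [18].

```python
import numpy as np

def mhi_shi(frames, tau=255, decay=16):
    """Accumulate a motion history image (MHI) and a static history image (SHI)
    from a list of 2D projection maps (e.g., the front-view XOY projections).
    Simplified sketch only; see [18] for the exact 3DMTM construction."""
    mhi = np.zeros_like(frames[0], dtype=np.float32)
    shi = np.zeros_like(frames[0], dtype=np.float32)
    for prev, cur in zip(frames[:-1], frames[1:]):
        moving = np.abs(cur.astype(np.float32) - prev.astype(np.float32)) > 0
        static = (~moving) & (cur > 0)                      # body region that is not moving
        mhi = np.where(moving, tau, np.maximum(mhi - decay, 0))
        shi = np.where(static, np.minimum(shi + decay, tau), shi)
    return mhi, shi

# Applying the same update to the YOZ (side) and XOZ (top) projections yields the
# sets {MHI_XOY, MHI_YOZ, MHI_XOZ} and {SHI_XOY, SHI_YOZ, SHI_XOZ}.
```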
Now, the MHIs and SHIs are converted to binary-coded form by the LBP [19]. The binary-coded versions are more enhanced than the original images. Figure 3 shows an MHI, and the corresponding BC-MHI is shown in Fig. 4. It is clear that the motion information of the action is improved in the BC-MHI.
Fig. 3
Fig. 4
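For reference, a minimal sketch of a standard 3×3 LBP operator is shown below; the specific LBP variant and parameters adopted in the paper are described in [19] and are not restated here, so treat this as an illustration of the binary-coding idea rather than the exact configuration.

```python
import numpy as np

def lbp_encode(img):
    """Basic 3x3 local binary pattern: each pixel is replaced by the 8-bit code
    obtained by thresholding its 8 neighbours against the centre value."""
    img = img.astype(np.float32)
    padded = np.pad(img, 1, mode="edge")
    # the 8 neighbours, enumerated clockwise from the top-left corner
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1), (1, 1), (1, 0), (1, -1), (0, -1)]
    code = np.zeros(img.shape, dtype=np.uint8)
    for bit, (dy, dx) in enumerate(offsets):
        nb = padded[1 + dy:1 + dy + img.shape[0], 1 + dx:1 + dx + img.shape[1]]
        code |= (nb >= img).astype(np.uint8) << bit
    return code

# bc_mhi = lbp_encode(mhi)   # binary-coded MHI (BC-MHI); the SHIs are encoded likewise
```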
For each binary-coded image I, the GLAC descriptor first computes the gradient magnitude m and the gradient orientation $\theta$:

$$
m = \begin{cases}
\sqrt{\left(\frac{\partial I}{\partial x}\right)^{2} + \left(\frac{\partial I}{\partial y}\right)^{2}}, & \text{if } I = I(x, y)\\[1.5ex]
\sqrt{\left(\frac{\partial I}{\partial y}\right)^{2} + \left(\frac{\partial I}{\partial z}\right)^{2}}, & \text{if } I = I(y, z)\\[1.5ex]
\sqrt{\left(\frac{\partial I}{\partial x}\right)^{2} + \left(\frac{\partial I}{\partial z}\right)^{2}}, & \text{if } I = I(x, z)
\end{cases} \tag{1}
$$

$$
\theta = \begin{cases}
\arctan\!\left(\frac{\partial I}{\partial y}, \frac{\partial I}{\partial x}\right), & \text{if } I = I(x, y)\\[1.5ex]
\arctan\!\left(\frac{\partial I}{\partial z}, \frac{\partial I}{\partial y}\right), & \text{if } I = I(y, z)\\[1.5ex]
\arctan\!\left(\frac{\partial I}{\partial z}, \frac{\partial I}{\partial x}\right), & \text{if } I = I(x, z)
\end{cases} \tag{2}
$$
The above orientation $\theta$ is coded into D orientation bins by voting weights to the nearest bins, forming a sparse gradient orientation vector $g \in \mathbb{R}^{D}$.
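A minimal NumPy sketch of this step for the I = I(x, y) case is given below. The helper name and the soft two-bin voting scheme are illustrative assumptions; the paper only states that orientations are coded into D bins by voting weights to the nearest bins.

```python
import numpy as np

def gradient_orientation_vectors(img, n_bins=8):
    """Per-pixel gradient magnitude m (Eq. 1) and sparse D-bin orientation
    vectors g (from Eq. 2), with each gradient split between its two nearest bins."""
    img = img.astype(np.float32)
    gy, gx = np.gradient(img)                        # dI/dy, dI/dx
    m = np.sqrt(gx ** 2 + gy ** 2)                   # Eq. (1) for I = I(x, y)
    theta = np.arctan2(gy, gx) % (2 * np.pi)         # Eq. (2), mapped to [0, 2*pi)

    bin_width = 2 * np.pi / n_bins
    pos = theta / bin_width
    lo = np.floor(pos).astype(int) % n_bins          # nearest lower bin
    hi = (lo + 1) % n_bins                           # nearest upper bin
    w_hi = pos - np.floor(pos)                       # voting weights
    g = np.zeros(img.shape + (n_bins,), np.float32)  # sparse orientation vectors
    rows, cols = np.indices(img.shape)
    g[rows, cols, lo] = 1.0 - w_hi
    g[rows, cols, hi] = w_hi
    return m, g
```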
Through the gradient orientation vector g and the gradient magnitude m, the Kth-order auto-correlation function of local gradients can be written as

$$
F\!\left(d_0, \ldots, d_K, \mathbf{b}_1, \ldots, \mathbf{b}_K\right) = \int w\!\left[m(\mathbf{r}), m(\mathbf{r}+\mathbf{b}_1), \ldots, m(\mathbf{r}+\mathbf{b}_K)\right] g_{d_0}(\mathbf{r})\, g_{d_1}(\mathbf{r}+\mathbf{b}_1) \cdots g_{d_K}(\mathbf{r}+\mathbf{b}_K)\, d\mathbf{r} \tag{3}
$$
where $\mathbf{b}_1, \mathbf{b}_2, \ldots, \mathbf{b}_K$ are the shifting vectors from the position vector $\mathbf{r}$ (which indicates the position of each pixel in image I), $g_d$ denotes the dth element of g, and $w(\cdot)$ is a weighting function over the magnitudes m; the function $w(\cdot)$ serves as the auto-correlation weights. All the shifting vectors are restricted to local neighbourhoods, since local neighbouring gradients are likely to be highly correlated. Two types of correlations among gradients are obtained from Eq. (3): spatial gradient correlations, gained through the vectors $\mathbf{b}_i$, and orientational gradient correlations, attained through the multiplication of the values $g_{d_i}$. By changing the values of K, $\mathbf{b}_i$, and the weight w, Eq. (3) may take various forms. Lower values of K capture lower-order auto-correlation features, which carry rich geometric characteristics together with the shifting vectors $\mathbf{b}_i$. Because of the isotropic characteristic of images, the shifting intervals are kept identical along the horizontal and vertical directions. For $w(\cdot)$, the min function is adopted to suppress the impact of isolated noise on the auto-correlations.
In particular, the 0th- and 1st-order auto-correlation features are computed as

$$
F_0: \; F_{K=0}(d_0) = \sum_{\mathbf{r} \in I} m(\mathbf{r})\, g_{d_0}(\mathbf{r}) \tag{4}
$$

$$
F_1: \; F_{K=1}(d_0, d_1, \mathbf{b}_1) = \sum_{\mathbf{r} \in I} \min\!\left[m(\mathbf{r}), m(\mathbf{r}+\mathbf{b}_1)\right] g_{d_0}(\mathbf{r})\, g_{d_1}(\mathbf{r}+\mathbf{b}_1) \tag{5}
$$
A single mask pattern is used for Eq. (4), and there are four independent mask patterns for Eq. (5) when computing the auto-correlations. The mask/spatial auto-correlation patterns of $(\mathbf{r}, \mathbf{r}+\mathbf{b}_1)$ are depicted in Fig. 5. Since there is a single mask pattern for $F_0$ and four mask patterns for $F_1$, the dimensionality of the above GLAC features ($F_0$ and $F_1$) is $D + 4D^{2}$. Although the dimensionality of the GLAC features is high, the computational cost is low due to the sparseness of g. It is worth noting that the computational cost is invariant to the number of bins, D, since the sparseness of g does not depend on D.
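Under the assumptions above (w(·) = min and four shift patterns for the first order), the 0th- and 1st-order features of Eqs. (4)–(5) can be sketched as follows; the concrete shift set is an illustrative choice consistent with Fig. 5, not necessarily the exact patterns of [20].

```python
import numpy as np

def glac(m, g, shift=1):
    """0th- and 1st-order GLAC features (Eqs. (4)-(5)).
    m : (H, W) gradient magnitudes; g : (H, W, D) sparse orientation vectors.
    Four shift patterns are used for the 1st order, so the output has D + 4*D**2 values."""
    H, W, D = g.shape
    f0 = np.einsum("ij,ijd->d", m, g)                       # Eq. (4)

    shifts = [(0, shift), (shift, shift), (shift, 0), (shift, -shift)]
    f1 = []
    for dy, dx in shifts:
        y0, y1 = max(0, -dy), min(H, H - dy)
        x0, x1 = max(0, -dx), min(W, W - dx)
        m0, g0 = m[y0:y1, x0:x1], g[y0:y1, x0:x1]           # m(r), g(r)
        m1 = m[y0 + dy:y1 + dy, x0 + dx:x1 + dx]            # m(r + b1)
        g1 = g[y0 + dy:y1 + dy, x0 + dx:x1 + dx]            # g(r + b1)
        w = np.minimum(m0, m1)                              # w(.) = min, Eq. (5)
        f1.append(np.einsum("ij,ijd,ije->de", w, g0, g1).ravel())
    return np.concatenate([f0] + f1)                        # length D + 4*D**2
```

For D = 8 orientation bins this sketch yields D + 4D² = 264 values per image; the 4752-dimensional fused descriptor reported below reflects the paper's own choice of D and shift configuration.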
Figure 7 shows an example of 0th- and 1st-order GLAC features with 8 orientation bins (the bins are shown in Fig. 6). Based on these texture features, the motion images of an action can be described as a vector $EAMF = [EAMF_{XOY}, EAMF_{YOZ}, EAMF_{XOZ}]$, where $EAMF_{XOY}$, $EAMF_{YOZ}$, and $EAMF_{XOZ}$ are vectors obtained by passing the set of binary-coded motion information images to the 2D GLAC. To represent the static-image part of the action based on texture features, the vector $EASF = [EASF_{XOY}, EASF_{YOZ}, EASF_{XOZ}]$ is obtained by concatenating the enhanced auto-correlation feature vectors extracted from the multi-view static images. The EAMF is complementary to the EASF; therefore, we fuse these two vectors into a single vector to obtain an optimal representation of an action. In our work (for all experiments), the dimension of this single feature vector is 4752. The feature vector is efficient to compute due to the sparsity of g. The work in [20] provides more detail on GLAC (Fig. 7).
Fig. 5
Fig. 6
Fig. 7
Example of 0th- and 1st-order GLAC features
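Putting the pieces together, a compact sketch of how the EAMF, EASF, and the final fused descriptor could be assembled is shown below; it reuses the lbp_encode, gradient_orientation_vectors, and glac helpers sketched earlier and is therefore only as faithful as those approximations.

```python
import numpy as np

def action_descriptor(mhis, shis):
    """Fuse the enhanced auto-correlation features of the three MHIs (EAMF) and
    the three SHIs (EASF) into a single action descriptor."""
    def enhanced_glac(img):                      # LBP enhancement followed by GLAC
        m, g = gradient_orientation_vectors(lbp_encode(img))
        return glac(m, g)

    eamf = np.concatenate([enhanced_glac(img) for img in mhis])   # motion-based vector
    easf = np.concatenate([enhanced_glac(img) for img in shis])   # static-based vector
    return np.concatenate([eamf, easf])                           # final fused descriptor
```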
Action classification
To obtain a promising classification outcome, the fused version of EAMF and EASF is passed to the Kernel-based Extreme Learning Machine (KELM), which is discussed in detail in this section. The KELM [21] is an enhancement of the Extreme Learning Machine (ELM) classifier [53]: by associating a suitable kernel with the ELM, the KELM improves the discriminatory power of the classification algorithm. The Radial Basis Function (RBF) kernel is employed in our work. For an intuitive illustration, the classifier is described as a single algorithm in Algorithm 1.
Fig. 8
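As a point of reference, a minimal RBF-kernel ELM in the spirit of [21] is sketched below. The regularization constant C and the kernel width gamma are illustrative placeholders rather than the values tuned in the paper, and the paper's Algorithm 1 should be consulted for the exact procedure.

```python
import numpy as np

def rbf_kernel(A, B, gamma=1e-4):
    """RBF kernel matrix K[i, j] = exp(-gamma * ||A_i - B_j||^2)."""
    sq = (A ** 2).sum(1)[:, None] + (B ** 2).sum(1)[None, :] - 2.0 * A @ B.T
    return np.exp(-gamma * np.maximum(sq, 0.0))

def kelm_train(X, y, n_classes, C=100.0, gamma=1e-4):
    """Kernel ELM training: solve (I/C + Omega) * beta = T, where Omega is the
    kernel matrix over the training descriptors and T holds one-hot labels."""
    omega = rbf_kernel(X, X, gamma)
    T = np.eye(n_classes)[y]                              # one-hot target matrix
    beta = np.linalg.solve(np.eye(len(X)) / C + omega, T)
    return {"X": X, "beta": beta, "gamma": gamma}

def kelm_predict(model, X_test):
    """Class scores f(x) = K(x, X_train) @ beta; the predicted label is the argmax."""
    scores = rbf_kernel(X_test, model["X"], model["gamma"]) @ model["beta"]
    return scores.argmax(axis=1)
```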
Experimental results and discussion
Table 1 reports a notable accuracy of 97.44% for our approach on the MSRAction3D dataset. The table indicates that the proposed approach achieves considerably better classification accuracy than other existing methods. It can be seen that our method outperforms the deep learning systems described in [44] by 6.34% and 11.34% (see Table 1). To clarify the effectiveness of the feature enhancement, a system based on only the auto-correlation features is also evaluated on this dataset. The enhanced auto-correlation feature based system improves the recognition accuracy by 5.5% over the system without feature enhancement under the same setup and parameters. Figure 9 shows the confusion matrix corresponding to the correct and incorrect classification rates. Table 2 lists the failure cases of the approach: “horizontal wave” is confused with “hammer” by 8.3%; “draw x” is confused with “high wave” by 7.14% and with “draw circle” by 21.43%; and “draw tick” is confused with “draw x” by 16.67%. Overall, among the 20 actions, 17 actions are classified correctly (i.e., with 100% classification accuracy) and the remaining 3 actions are partially misclassified, being confused with other actions.
Fig. 9
Our approach attains a significant classification accuracy of 99.13% on the DHA dataset. As can be seen from Table 3, our approach outperforms [23] by 12.33%, [55] by 7.83%, [71] by 3.69%, [58] by 2.44%, [39] by 6.73%, and [39] by 4.13%. For this dataset, the enhanced auto-correlation method exceeds the plain auto-correlation method by 2.17% in accuracy. The confusion matrix for the dataset is shown in Fig. 10. Furthermore, a table is provided to show the class-based confusion information. Table 4 clarifies that “skip” and “side-clap” are misclassified with low confusion rates, while the other 21 actions are classified with 100% accuracy. The misclassification occurs when “skip” is confused with “jump” by 9.09% and “side-clap” is confused with “side-box” by 9.09%.
Fig. 10
Fig. 11
System competency
The computational time and the complexity of the key factors are considered to examine the system's efficiency.
Computational time
The system is evaluated on a desktop with an Intel i5-7500 quad-core processor at 3.41 GHz and 16 GB of RAM. There are six major components in the system, i.e., MHI/SHI construction, binary-coded MHI generation, binary-coded SHI generation, EAMF generation, EASF generation, and KELM classification. The operation times of these components are measured to assess the time efficiency of the recognition system. The computational time (in milliseconds) per action sample (with 40 depth frames on average) for all the components is reported in Table 7. Note that the system needs less than one second (i.e., 731.43 ± 48.83 ms) to process 40 depth frames. Consequently, it can be claimed that our recognition method can be utilized as a real-time recognition system.
Table 7 Computational time (mean ± std) of the key factors of the system
Computational complexity
In fact, the PCA and the KELM are the key components in the computational complexity calculation of the introduced system. The PCA has a complexity of $O(m^{3} + m^{2}r)$ [14] and the KELM has a complexity of $O(r^{3})$ [73]. As a result, the complexity of the system can be expressed as $O(m^{3} + m^{2}r) + O(r^{3})$. Table 8 reports the calculated complexity and compares it with the complexities of other existing methods. It can be seen that our method has lower computational complexity than the other methods listed in the table. Our method is also superior to them from the recognition perspective. Thus, our approach is superior in terms of recognition accuracy as well as computational efficiency.
Conclusion
This paper has introduced an efficacious and efficient human action recognition framework based on enhanced auto-correlation features. The system uses the 3DMTM to derive three motion information images and three motionless information images from an action video. These motion and static information-oriented maps are improved by applying the LBP algorithm to them. The outputs of the LBP are fed into the GLAC descriptor to obtain an action description vector. With the feature vectors obtained from GLAC, the action classes are distinguished through the KELM classification model. The approach is extensively assessed on the MSRAction3D, DHA, and UTD-MHAD datasets. Because of our action representation strategy, the proposed algorithm exhibits considerable performance gains over existing handcrafted as well as deep learning methods. It is also evident that the enhanced auto-correlation features based method clearly outperforms the simple auto-correlation features based method. Thus, the improvement of the features is significant to enhancing the system. Furthermore, the computational efficiency of the method indicates its suitability for real-time operation.
It is worth mentioning that some misclassifications are observed in our method. Note that the proposed method did not apply noise removal to improve performance; the system only employed the LBP as a preprocessing method for edge enhancement. Besides the LBP, a noise-removal algorithm could be utilized to address the misclassification issues of the proposed approach and thus further improve the overall recognition accuracy. The descriptor could also be improved further to increase the discriminatory power of the approach.
In our future work, we aim to build a deep model using the obtained 2D motion and static images. Besides, the current approach has not been evaluated on large and complex RGB datasets like UCF101 and HMDB51; with proper modification, the approach will be tested on these datasets in the future. Furthermore, we plan to build a new recognition framework using the GLAC descriptor on RGB and depth datasets jointly.
References
2. Kim K, Jalal A, Mahmood M (2019) Vision-based human activity recognition system using
depth silhouettes: a smart home system for monitoring the residents. J Electr Eng Technol
14(6):2567–2573
3. Zhuang Z, Xue Y (2019) Sport-related human activity detection and recognition using a
smartwatch. Sensors 19(22):5001
6. Gul MA, Yousaf MH, Nawaz S, Ur Rehman Z, Kim H (2020) Patient monitoring by
abnormal human activity recognition based on CNN architecture. Electronics 9(12):1993
7. Sebestyen G, Stoica I, Hangan A (2016) Human activity recognition and monitoring for
elderly people. In: 2016 IEEE 12th international conference on intelligent computer
communication and processing (ICCP). IEEE, pp 341–347
8. Sagayam KM, Hemanth DJ (2017) Hand posture and gesture recognition techniques for
virtual reality applications: a survey. Virtual Real 21(2):91–107
11. Ahmad Z, Khan N (2019) Human action recognition using deep multilevel multimodal $M^2$ fusion of depth and inertial sensors. IEEE Sens J 20(3):1445–1455
12. Wang H, Schmid C (2013) Action recognition with improved trajectories. In: Proceedings
of the IEEE international conference on computer vision, pp 3551–3558
13. Chen C, Zhang B, Hou Z, Jiang J, Liu M, Yang Y (2017) Action recognition from depth
sequences using weighted fusion of 2D and 3D auto-correlation of gradients features.
Multimed Tools Appl 76(3):4651–4669
14. Chen C, Liu K, Kehtarnavaz N (2016) Real-time human action recognition based on depth
motion maps. J Real-Time Image Process 12(1):155–163
15. Oreifej O, Liu Z (2013) Hon4d: histogram of oriented 4D normals for activity recognition
from depth sequences. In: Proceedings of the IEEE conference on computer vision and
pattern recognition, pp 716–723
16. Yang X, Tian Y (2014) Super normal vector for activity recognition using depth
sequences. In: Proceedings of the IEEE conference on computer vision and pattern
recognition, pp 804–811
18. Liang B, Zheng L (2013) Three dimensional motion trail model for gesture recognition.
In: Proceedings of the IEEE international conference on computer vision workshops, pp
684–691
20. Kobayashi T, Otsu N (2008) Image feature extraction using gradient local auto-
correlations. In: European conference on computer vision. Springer, pp 346–358
21. Huang G-B, Zhou H, Ding X, Zhang R (2012) Extreme learning machine for regression
and multiclass classification. IEEE Trans Syst Man Cybern Part B (Cybern) 42(2):513–
529
22. Li W, Zhang Z, Liu Z (2010) Action recognition based on a bag of 3D points. In: 2010
IEEE computer society conference on computer vision and pattern recognition-
workshops. IEEE, pp 9–14
23. Lin Y-C, Hu M-C, Cheng W-H, Hsieh Y-H, Chen H-M (2012) Human action recognition
and retrieval using sole depth information. In: Proceedings of the 20th ACM
international conference on multimedia. ACM, pp 1053–1056
24. Chen C, Jafari R, Kehtarnavaz N (2015) Utd-mhad: a multimodal dataset for human
action recognition utilizing a depth camera and a wearable inertial sensor. In: IEEE
international conference on image processing (ICIP). IEEE, pp 168–172
25. Ahmed A, Jalal A, Kim K (2020) RGB-D images for object segmentation, localization and
recognition in indoor scenes using feature descriptor and Hough voting. In: 2020 17th
international Bhurban conference on applied sciences and technology (IBCAST). IEEE,
pp 290–295
26. Jalal A, Kamal S, Kim D (2015) Depth silhouettes context: a new robust feature for
human tracking and activity recognition based on embedded HMMs. In: 2015 12th
international conference on ubiquitous robots and ambient intelligence (URAI). IEEE, pp
294–299
27. Jalal A, Kim Y-H, Kim Y-J, Kamal S, Kim D (2017) Robust human activity recognition
from depth video using spatiotemporal multi-fused features. Pattern Recognit 61:295–
308
28. ud din Tahir SB, Jalal A, Batool M (2020) Wearable sensors for activity analysis using
SMO-based random forest over smart home and sports datasets. In: 2020 3rd
international conference on advancements in computational sciences (ICACS). IEEE, pp
1–6
29. Kamal S, Jalal A, Kim D (2016) Depth images-based human detection, tracking and
activity recognition using spatiotemporal features and modified HMM. J Electr Eng
Technol 11(6):1857–1862
30. Rizwan SA, Jalal A, Kim K (2020) An accurate facial expression detector using multi-
landmarks selection and local transform features. In: 2020 3rd international conference
on advancements in computational sciences (ICACS). IEEE, pp 1–6
31. Farooq A, Jalal A, Kamal S (2015) Dense RGB-D map-based human tracking and activity
recognition using skin joints features and self-organizing map. KSII Trans Internet Inf
Syst 9(5):1856–1869
32. Kamal S, Jalal A (2016) A hybrid feature extraction approach for human detection,
tracking and activity recognition using depth sensors. Arab J Sci Eng 41(3):1043–1051
33. Yaacob NI, Tahir NM (2012) Feature selection for gait recognition. In: 2012 IEEE
symposium on humanities, science and engineering research. IEEE, pp 379–383
34. Bulbul MF, Jiang Y, Ma J (2015) Human action recognition based on DMMs, hogs and
contourlet transform. In: 2015 IEEE international conference on multimedia big data.
IEEE, pp 389–394
35. Bulbul MF, Jiang Y, Ma J (2015) Real-time human action recognition using DMMs-based
LBP and EOH features. In: International conference on intelligent computing. Springer,
pp 271–282
36. Bulbul MF, Islam S, Zhou Y, Ali H (2019) Improving human action recognition using
hierarchical features and multiple classifier ensembles. Comput J bxz123.
https://doi.org/10.1093/comjnl/bxz123
37. Dhiman C, Vishwakarma DK (2019) A robust framework for abnormal human action
recognition using R-transform and Zernike moments in depth videos. IEEE Sens J
19(13):5195–5203
40. Rahmani H, Mahmood A, Huynh DQ, Mian A (2014) Real time action recognition using
histograms of depth gradients and random decision forests. In: IEEE winter conference
on applications of computer vision. IEEE, pp 626–633
41. Luo J, Wang W, Qi H (2014) Spatio-temporal feature extraction and representation for
RGB-D human action recognition. Pattern Recognit Lett 50:139–148
42. Vishwakarma DK (2020) A two-fold transformation model for human action recognition
using decisive pose. Cogn Syst Res 61:1–13
43. Wang L, Huynh DQ, Koniusz P (2019) A comparative review of recent kinect-based
action recognition algorithms. IEEE Trans Image Process 29:15–28
44. Yang R, Yang R (2014) DMM-pyramid based deep architectures for action recognition
with depth cameras. In: Asian conference on computer vision. Springer, pp 37–49
45. Wang P, Li W, Gao Z, Zhang J, Tang C, Ogunbona PO (2015) Action recognition from
depth maps using deep convolutional neural networks. IEEE Trans Hum Mach Syst
46(4):498–509
46. Wang P, Li W, Gao Z, Tang C, Ogunbona PO (2018) Depth pooling based large-scale 3-D
action recognition with convolutional neural networks. IEEE Trans Multimed
20(5):1051–1061
47. Chen J, Xiao Y, Cao Z, Fang Z (2018) Action recognition in depth video from RGB
perspective: a knowledge transfer manner. In: MIPPR 2017: pattern recognition and
computer vision, vol 10609. International Society for Optics and Photonics, p 1060916
48. Imran J, Kumar P (2016) Human action recognition using RGB-D sensor and deep
convolutional neural networks. In: 2016 international conference on advances in
computing, communications and informatics (ICACCI). IEEE, pp 144–148
49. Dhiman C, Vishwakarma DK (2020) View-invariant deep architecture for human action
recognition using two-stream motion and shape temporal dynamics. IEEE Trans Image
Process 29:3835–3844
50. Weng J, Liu M, Jiang X, Yuan J (2018) Deformable pose traversal convolution for 3D
action and gesture recognition. In: Proceedings of the European conference on computer
vision (ECCV), pp 136–152
51. Munro J, Damen D (2020) Multi-modal domain adaptation for fine-grained action
recognition. In: Proceedings of the IEEE conference on computer vision and pattern
recognition, pp 122–132
52. Busto PP, Iqbal A, Gall J (2018) Open set domain adaptation for image and action
recognition. IEEE Trans Pattern Anal Mach Intell 42(2):413–429
53. Huang G-B, Zhu Q-Y, Siew C-K (2006) Extreme learning machine: theory and
applications. Neurocomputing 70(1–3):489–501
54. Xia L, Aggarwal J (2013) Spatio-temporal depth cuboid similarity feature for activity
recognition using depth camera. In: Proceedings of the IEEE conference on computer
vision and pattern recognition, pp 2834–2841
55. Chen C, Jafari R, Kehtarnavaz N (2015) Action recognition from depth sequences using
depth motion maps-based local binary patterns. In: 2015 IEEE winter conference on
applications of computer vision (WACV). IEEE, pp 1092–1099
56. Rahmani H, Huynh DQ, Mahmood A, Mian A (2016) Discriminative human action
classification using locality-constrained linear coding. Pattern Recognit Lett 72:62–71
58. Zhang B, Yang Y, Chen C, Yang L, Han J, Shao L (2017) Action recognition using 3D
histograms of texture and a multi-class boosting classifier. IEEE Trans Image Process
26(10):4648–4660
59. Liang C, Chen E, Qi L, Guan L (2016) 3D action recognition using depth-based feature
and locality-constrained affine subspace coding. In: 2016 IEEE international symposium
on multimedia (ISM). IEEE, pp 261–266
60. Liu J, Shahroudy A, Xu D, Wang G (2016) Spatio-temporal LSTM with trust gates for 3D
human action recognition. In: European conference on computer vision. Springer, pp
816–833
61. Yang X, Tian Y (2017) Super normal vector for human activity recognition with depth
cameras. IEEE Trans Pattern Anal Mach Intell 39(5):1028–1039
62. Liu J, Shahroudy A, Xu D, Kot AC, Wang G (2018) Skeleton-based action recognition
using spatio-temporal LSTM network with trust gates. IEEE Trans Pattern Anal Mach
Intell 40(12):3007–3021
65. Keçeli AS, Kaya A, Can AB (2018) Combining 2D and 3D deep models for action
recognition with depth information. Signal Image Video Process 12(6):1197–1205
67. Zhang C, Tian Y, Guo X, Liu J (2018) Daal: deep activation-based attribute learning for
action recognition in depth videos. Comput Vis Image Underst 167:37–49
68. Nguyen XS, Mouaddib A-I, Nguyen TP, Jeanpierre L (2018) Action recognition in depth
videos using hierarchical Gaussian descriptor. Multimed Tools Appl 77(16):21617–21652
69. Bulbul MF, Islam S, Ali H (2019) Human action recognition using MHI and SHI based
GLAC features and collaborative representation classifier. Multimed Tools Appl
78(15):21085–21111
70. Jalal A, Kamal S, Kim D (2016) Human depth sensors-based activity recognition using
spatiotemporal features and hidden Markov model for smart environments. J Comput
Netw Commun 2016(17):1–11
71. Chen C, Liu M, Zhang B, Han J, Jiang J, Liu H (2016) 3D action recognition using multi-
temporal depth motion maps and Fisher vector. In: IJCAI, pp 3331–3337
72. Wang P, Li W, Li C, Hou Y (2018) Action recognition based on joint trajectory maps with
convolutional neural networks. Knowl-Based Syst 158:43–53
73. Iosifidis A, Tefas A, Pitas I (2015) On the kernel extreme learning machine classifier.
Pattern Recognit Lett 54:11–17
Acknowledgements
This work is partially supported by University Grants Commission (UGC), Bangladesh.
Author information
Affiliations
Department of Mathematics, Jashore University of Science and Technology, Jashore,
7408, Bangladesh
Mohammad Farhad Bulbul
Corresponding author
Correspondence to Mohammad Farhad Bulbul.
Ethics declarations
Conflict of interest
The authors declare that they have no conflict of interest.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and
institutional affiliations.
DOI
https://doi.org/10.1007/s42452-021-04528-1
Keywords
3D action classification · Depth action sequences