Article
I3D-Shufflenet Based Human Action Recognition
Guocheng Liu 1 , Caixia Zhang 2 , Qingyang Xu 1, *, Ruoshi Cheng 1 , Yong Song 1 , Xianfeng Yuan 1
and Jie Sun 1
1 School of Mechanical, Electrical & Information Engineering, Shandong University, Weihai 264209, China;
[email protected] (G.L.); [email protected] (R.C.); [email protected] (Y.S.);
[email protected] (X.Y.); [email protected] (J.S.)
2 Mechanical & Electrical Engineering Department, Weihai Vocational College, Weihai 264210, China;
[email protected]
* Correspondence: [email protected]
Received: 7 October 2020; Accepted: 13 November 2020; Published: 18 November 2020
Abstract: In view of the difficulty of applying optical-flow-based human action recognition due to its large amount of calculation, a human action recognition algorithm, the I3D-shufflenet model, is proposed by combining the advantages of the I3D neural network and the lightweight shufflenet model. The 5 × 5 convolution kernel of I3D is replaced by two 3 × 3 convolution kernels, which reduces the amount of calculation. A shuffle layer is adopted to achieve feature exchange. The recognition and classification of human action is performed by the trained I3D-shufflenet model. The experimental results show that the shuffle layer improves the composition of features in each channel, which promotes the utilization of useful information. The Histogram of Oriented Gradients (HOG) spatial-temporal features of the object are extracted for training, which significantly improves the ability to express human action and reduces the calculation required for feature extraction. The I3D-shufflenet is evaluated on the UCF101 dataset and compared with other models. The final result shows that I3D-shufflenet achieves higher accuracy than the original I3D, with an accuracy of 96.4%.
1. Introduction
With the development of artificial intelligence, the progress of computer vision has received
special attention. At present, the world’s top scientific research teams and major scientific research
institutions are achieving rapid progress in the field of human action recognition. In the 1970s, a human
body description model was proposed by Professor Johansson [1], which had a great impact on human
body recognition. Video-based human action recognition methods traditionally make use of hand-crafted motion features. The traditional algorithms for human action recognition include the Histogram of Oriented Gradients (HOG) [2], the Histogram of Optical Flow (HOF) [3], Dense Trajectories (DT) [4], etc. The DT algorithm performs multi-scale division of each frame of the video; after division, the features of each region are obtained by dense sampling based on the grid division method. The time domain feature is extracted to generate the trajectory feature, and the next trajectory position is predicted once the features of the entire picture are obtained. In recent years, deep learning has developed rapidly and has been widely used in image recognition. It has also been widely applied to human action recognition, which has greatly improved recognition accuracy. Deep learning is a data processing and feature learning method: low-level image features are extracted first, high-level image features or attributes are formed from them, and the human action and movement features can then be extracted. At present, deep learning algorithms significantly outperform traditional algorithms on big data and have achieved good performance in computer vision and speech recognition. Convolutional Neural Networks (CNNs) rose to prominence in 2012 because they are well suited to image processing. However, temporal feature extraction is limited by the loss of time domain information during feature extraction from video; time domain features cannot be effectively extracted by a 2D-CNN. Against the background of the great success of convolutional networks in computer vision, human action recognition based on graph convolutional networks (GCNs) has become a recent research topic [5]. Bruna et al. (2013) carried out important early research on graph convolutional networks, and spatial-based graph convolutional networks have developed rapidly since. These methods perform convolution directly on the graph structure by gathering information from neighboring nodes. The concept of graph neural networks was first proposed by Gori et al. in 2005 [6] and further clarified by Scarselli et al. in 2009 [7]. Early studies propagated neighboring information iteratively through recurrent neural networks until a stable point was reached, at which point the representation of the target node was obtained; this process requires massive calculation, and many recent studies have been devoted to solving the problem. In recent years, 3D neural networks have also developed rapidly. Ji et al. [8] proposed a three-dimensional convolutional neural network for spatial-temporal feature extraction, showing that a 2D-CNN can be extended to a 3D-CNN. A 3D convolutional neural network extracts image features comprehensively: the first two dimensions still extract the spatial features of static images, while the third dimension extracts time domain features, convolving across multiple frames and capturing information before and after in the time series. However, as the number of parameters grows, training the model becomes difficult. A residual network structure was proposed by He et al. [9], which can not only handle the above-mentioned problems caused by increasing the network depth, but also optimize and improve network performance. The two-stream network [10] processes RGB images and optical flow maps in separate streams, and the results of the two streams are then fused. The motion stream (ResNet-50) [11] extracts spatial-temporal information by introducing connections between the spatial and temporal flows of a two-stream network with fusion blocks. S3D-G (separable 3D CNN) [12] builds an effective video classification system that seeks a balance between speed and accuracy and replaces many 3D convolutions with low-cost 2D convolutions. MFNet (Multi-Fiber Networks) [13] proposes a multi-fiber architecture, which slices a complex neural network into an ensemble of lightweight networks, or fibers, running through the network. ARTNet (Appearance-and-Relation Networks) [14] is constructed by stacking multiple generic building blocks called SMART to model the appearance and relation of the RGB input in a separate and explicit manner. FASTER [15] aggregates features over spatial-temporal redundancy; its framework can integrate high-quality representations from expensive models, which capture subtle motion information, with lightweight representations from cheap models, which cover scene changes in videos. The existing C3D (3D Convolutional Networks) neural network is widely applied. The I3D (Inflated 3D ConvNet) neural network improves on the C3D neural network [16], and its recognition speed and accuracy are greatly improved. However, both the C3D and I3D neural networks convolve in the traditional way, so some channels may carry little useful information while still consuming computational power.
This article is inspired by the ideas of GoogleNet and Shufflenet. The original convolution kernel in I3D is replaced by a two-layer convolution kernel, which has the same effect but a lower calculation amount. A channel fusion method is incorporated into the I3D network: the features extracted by the proposed model are exchanged across different channels through a shuffle operation [17], so that more useful information is exploited to improve the performance of human action recognition. This article includes four further parts: the 3D convolutional network is introduced in Section 2, the proposed model is introduced in Section 3, Section 4 presents the experiments, and the conclusion follows.
2. 3D Convolutional Network
The 3D convolutional neural network is a popular architecture in the field of human action recognition. A 3D network can convolve not only two-dimensional images but also the time sequence; it has one more dimension than a 2D network and can therefore better extract visual human action characteristics through three-dimensional convolution kernels.
The convolution process of a 2D convolutional neural network can be expressed as:

$$v_{ij}^{xy} = \mathrm{ReLU}\left(b_{ij} + \sum_{m}\sum_{p=0}^{P_i-1}\sum_{q=0}^{Q_i-1} W_{ijm}^{pq}\, v_{(i-1)m}^{(x+p)(y+q)}\right) \qquad (1)$$

where $v_{ij}^{xy}$ is the convolution result at position (x, y) of the j-th feature map in layer i; ReLU(·) is the activation function; $b_{ij}$ is the bias of the feature map; m is the index of the feature map in layer i−1; $W_{ijm}^{pq}$ is the kernel weight at position (p, q); and $P_i$, $Q_i$ are the width and height of the convolution kernel.
Traditional 2D convolution is suitable for spatial feature extraction but has difficulty with the continuous-frame processing of video data. Compared with 2D convolution, 3D convolution adds a convolution operation over adjacent time-dimension information, which can deal with the action information of continuous video frames. The 3D convolution is expressed as follows:
$$v_{ij}^{xyz} = \mathrm{ReLU}\left(b_{ij} + \sum_{m}\sum_{p=0}^{P_i-1}\sum_{q=0}^{Q_i-1}\sum_{t=0}^{T_i-1} W_{ijm}^{pqt}\, v_{(i-1)m}^{(x+p)(y+q)(z+t)}\right) \qquad (2)$$

where $v_{ij}^{xyz}$ is the convolution result at position (x, y, z) of the j-th feature map in layer i; ReLU(·) is the activation function; $b_{ij}$ is the bias of the feature map; m is the index of the feature map in layer i−1; $W_{ijm}^{pqt}$ is the kernel weight at position (p, q, t), with t being the time dimension unique to 3D convolution; and $P_i$, $Q_i$, $T_i$ are the width, height and depth of the convolution kernel.
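To make Equation (2) concrete, the following minimal PyTorch sketch (an illustration assumed for this edit, not the authors' code) applies one 3 × 3 × 3 convolution followed by ReLU to a short clip; the kernel slides jointly over the temporal and the two spatial dimensions, exactly as the summations over p, q and t describe.

import torch
import torch.nn as nn

# Hypothetical input: a batch of 2 clips, 3 channels (RGB), 16 frames of
# 112 x 112 pixels, laid out as (N, C, T, H, W).
clip = torch.randn(2, 3, 16, 112, 112)

# One 3D convolution with a 3 x 3 x 3 kernel, as in Equation (2): every output
# value sums kernel weights times input values over all offsets (p, q, t) and
# all input feature maps m, adds a bias, and is passed through ReLU.
conv3d = nn.Conv3d(in_channels=3, out_channels=64,
                   kernel_size=(3, 3, 3), stride=1, padding=1)
features = torch.relu(conv3d(clip))

print(features.shape)  # torch.Size([2, 64, 16, 112, 112]) because padding=1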
A traditional deep learning network generally uses a single-size convolution kernel: the input data are processed by that kernel and a feature set is generated. In the Inception module, convolution kernels with different sizes are adopted to compute features separately, and the results are then spliced together. The final feature set no longer has a uniform distribution; instead, correlated features are gathered together, generating multiple densely distributed feature subsets. Therefore, for the input data, the corresponding features of the distinguishing regions are clustered together after the different convolution processing, while irrelevant information is weakened, resulting in a better feature set. The Inception structure of I3D used in this article is shown in Figure 1, which contains two conv-pool layers and three Inception layers.
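To make the branch-and-splice idea concrete, the sketch below (a simplified illustration assumed for this edit, not the exact I3D Inception block) feeds the same input through parallel 3D branches with different kernel sizes and concatenates their outputs along the channel axis.

import torch
import torch.nn as nn

class SimpleInception3D(nn.Module):
    """Simplified Inception-style 3D block: parallel branches with different
    kernel sizes whose outputs are concatenated along the channel dimension.
    The branch widths below are illustrative, not the exact I3D settings."""
    def __init__(self, in_ch):
        super().__init__()
        self.branch1 = nn.Sequential(nn.Conv3d(in_ch, 32, kernel_size=1),
                                     nn.ReLU(inplace=True))
        self.branch3 = nn.Sequential(nn.Conv3d(in_ch, 32, kernel_size=1),
                                     nn.ReLU(inplace=True),
                                     nn.Conv3d(32, 64, kernel_size=3, padding=1),
                                     nn.ReLU(inplace=True))
        self.branch_pool = nn.Sequential(nn.MaxPool3d(kernel_size=3, stride=1, padding=1),
                                         nn.Conv3d(in_ch, 32, kernel_size=1),
                                         nn.ReLU(inplace=True))

    def forward(self, x):
        # Each branch sees the same input; features from different receptive
        # fields are gathered and spliced together along the channel axis.
        return torch.cat([self.branch1(x), self.branch3(x), self.branch_pool(x)], dim=1)

block = SimpleInception3D(in_ch=64)
out = block(torch.randn(1, 64, 8, 28, 28))
print(out.shape)  # torch.Size([1, 128, 8, 28, 28])  (32 + 64 + 32 channels)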
The I3D network inherits the Inception module of GoogleNet and uses convolution kernels of different sizes for feature extraction. Following the idea of GoogleNet, one convolution is performed on the output of the previous convolution layer, and an activation function is added after each convolution layer. By concatenating the two three-dimensional convolutions, more nonlinear features are combined. A branch consists of one or more convolution and pooling operations. In each Inception module there are four different branches for the input data; convolution kernels with different sizes are adopted in each branch, and the outputs are finally spliced together. The I3D neural network adds a convolution operation over adjacent temporal information, which enables action recognition over continuous frames. In order to speed up training of the deep network, a batch regularization module is added; the network is then not sensitive to initialization, so a larger learning rate can be employed. I3D increases the depth of the network: eight convolutional layers and four pooling layers are used. The convolution kernel of each convolutional layer is 3 × 3 × 3 with a stride of 1 × 1 × 1, and the numbers of filters are 64, 128, 256, 256, 512 and 512. Each convolutional layer is followed by a batch regularization layer, a ReLU layer and a pooling layer, except for conv3a, conv4a and conv5a. The kernel size of the first pooling layer is 1 × 2 × 2 with a stride of 1 × 2 × 2. The kernel size and stride of the remaining pooling layers are 2 × 2 × 2; spatial pooling only works after the first convolutional layer, and spatial-temporal pooling works after the second, fourth and sixth convolutional layers. Owing to the pooling layers, the output size of the convolution layers is reduced by 1/4 and 1/2 in the spatial and temporal domains, respectively. Therefore, I3D is suitable for short-term spatial-temporal feature learning.
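The convolution/pooling stack described above can be approximated as follows (a sketch under the stated filter counts 64, 128, 256, 256, 512, 512, with spatial pooling after the first layer and spatial-temporal pooling after the second, fourth and sixth layers; the exact placement of batch normalization and pooling in the authors' network may differ).

import torch
import torch.nn as nn

def conv_bn_relu(in_ch, out_ch):
    # 3 x 3 x 3 convolution with stride 1, followed by batch normalization and ReLU.
    return nn.Sequential(
        nn.Conv3d(in_ch, out_ch, kernel_size=3, stride=1, padding=1),
        nn.BatchNorm3d(out_ch),
        nn.ReLU(inplace=True),
    )

# Approximation of the described backbone: spatial pooling (1 x 2 x 2) after the
# first layer, spatial-temporal pooling (2 x 2 x 2) after the second, fourth and
# sixth layers.
backbone = nn.Sequential(
    conv_bn_relu(3, 64),    nn.MaxPool3d(kernel_size=(1, 2, 2), stride=(1, 2, 2)),
    conv_bn_relu(64, 128),  nn.MaxPool3d(kernel_size=2, stride=2),
    conv_bn_relu(128, 256),
    conv_bn_relu(256, 256), nn.MaxPool3d(kernel_size=2, stride=2),
    conv_bn_relu(256, 512),
    conv_bn_relu(512, 512), nn.MaxPool3d(kernel_size=2, stride=2),
)

x = torch.randn(1, 3, 16, 112, 112)   # (N, C, T, H, W)
print(backbone(x).shape)              # torch.Size([1, 512, 2, 7, 7])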
Figure 3. Module before and after the replacement. (a) Inception module diagram of I3D; (b) Convolution kernel replacement.
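As a rough, assumed illustration of the replacement in Figure 3b (not the authors' implementation), stacking two 3 × 3 convolutions covers the same receptive field as one 5 × 5 convolution with fewer weights and one extra nonlinearity; the channel width of 64 below is arbitrary and biases are ignored.

import torch.nn as nn

def n_params(module):
    return sum(p.numel() for p in module.parameters())

c = 64  # illustrative channel width

# Spatial (2D) case shown for simplicity; the same counting argument applies
# slice by slice to a 3D kernel.
single_5x5 = nn.Conv2d(c, c, kernel_size=5, padding=2, bias=False)
double_3x3 = nn.Sequential(
    nn.Conv2d(c, c, kernel_size=3, padding=1, bias=False),
    nn.ReLU(inplace=True),
    nn.Conv2d(c, c, kernel_size=3, padding=1, bias=False),
)

print(n_params(single_5x5))   # 64 * 64 * 25 = 102400
print(n_params(double_3x3))   # 2 * 64 * 64 * 9 = 73728 (about 28% fewer weights)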
3.2. I3D-Shufflenet Network Framework
Figure 4. Channel shuffle.
Through the shuffling operation on the 3D convolution layers and by combining the Inception-V1 module, the channel fusion network is merged behind the 6th Inception block of the 3 × 3 × 3 3D convolution start module in the I3D network. For the I3D model, the input of the network consists of five consecutive RGB frames sampled 10 frames apart, together with the corresponding optical flow segments. Through the 3 × 3 × 3 3D convolutional layer with 512 output channels, the 3 × 3 × 3 3D max-pooling layer and the fully connected layer, the spatial and kinematic characteristics (a 5 × 7 × 7 feature grid, corresponding to the time, X and Y dimensions) before the last average pooling layer of Inception can be obtained. Therefore, the I3D-shufflenet network can better extract image features.
Figure 5. I3D-shufflenet.
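Before turning to the experiments, the channel shuffle operation [17] used in the model above can be sketched as follows for 3D feature maps (a minimal sketch assuming three groups, matching the channel split described in this paper; the authors' exact implementation may differ).

import torch

def channel_shuffle_3d(x, groups):
    """Shuffle the channels of a 5D tensor (N, C, T, H, W) across `groups`
    groups, following the ShuffleNet reshape-transpose-flatten trick."""
    n, c, t, h, w = x.size()
    assert c % groups == 0, "channel count must be divisible by groups"
    x = x.view(n, groups, c // groups, t, h, w)   # split channels into groups
    x = x.transpose(1, 2).contiguous()            # exchange group and sub-channel axes
    return x.view(n, c, t, h, w)                  # flatten back to (N, C, T, H, W)

# Example: 6 channels in 3 groups; the channel order becomes 0, 2, 4, 1, 3, 5,
# so every group of the next layer receives features from all previous groups.
x = torch.arange(6).float().view(1, 6, 1, 1, 1)
print(channel_shuffle_3d(x, groups=3).flatten().tolist())  # [0.0, 2.0, 4.0, 1.0, 3.0, 5.0]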
4. Experiment
4.1. Data Set for Behavior Recognition
This experiment mainly used the UCF101 data set [18], which is currently the most mainstream data set for human action recognition. The resolution of the UCF101 data set is 320 × 240, and there are 101 types of actions. Each type of action is composed of about six videos taken by each of 25 people. The 101 human behaviors in the UCF101 data set (27 h in total) were divided into 13,320 clips and split into two groups (training samples and testing samples) at a ratio of 3:1. Part of the action samples of UCF101 are shown in Figure 6.
(e) Horse Riding; (f) Ice Dancing; (g) Basketball; (h) Throw
Figure 6. Action categories of UCF101.
Figure 7. Channel Fusion.
$$H(p, q) = -\sum_{x} p(x)\,\log q(x) \qquad (4)$$

Among them, p is the correct answer and q is the predicted value. The smaller the cross entropy is, the closer the two probability distributions are. On this basis, the Softmax function is adopted to calculate the probability of each class; the formula is as follows:

$$S_x = \frac{e^{c_x}}{\sum_{y} e^{c_y}}, \quad \forall x \in \{1, 2, \ldots, N\} \qquad (5)$$
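As a small worked example of Equations (4) and (5) (illustrative only, not the paper's training code), the softmax of the raw class scores is computed first, and the cross entropy then compares the predicted distribution with the one-hot ground truth.

import numpy as np

def softmax(c):
    # Equation (5): S_x = exp(c_x) / sum_y exp(c_y), shifted for numerical stability.
    e = np.exp(c - np.max(c))
    return e / e.sum()

def cross_entropy(p, q, eps=1e-12):
    # Equation (4): H(p, q) = -sum_x p(x) log q(x), with p the ground truth
    # distribution and q the predicted distribution.
    return -np.sum(p * np.log(q + eps))

scores = np.array([2.0, 1.0, 0.1])   # raw network outputs for 3 classes
p_true = np.array([1.0, 0.0, 0.0])   # one-hot ground truth
q_pred = softmax(scores)

print(q_pred.round(3))                          # [0.659 0.242 0.099]
print(round(cross_entropy(p_true, q_pred), 3))  # 0.417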
Figure 8. Diagram of loss.
(6)
where epoch_num is the current number of iterations and α0 is the initial learning rate.
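The exact form of Equation (6) is not reproduced above. Purely as an assumed illustration of an epoch-based schedule driven by epoch_num and α0, a common exponential decay could look as follows; the decay factor, the initial rate and the functional form are placeholders, not the authors' values.

# Hypothetical epoch-based learning-rate decay, only to illustrate how a
# schedule depending on epoch_num and alpha_0 might look; the actual
# Equation (6) of the paper may use a different form and constants.
alpha_0 = 0.001   # assumed initial learning rate
gamma = 0.95      # assumed decay factor per epoch

def learning_rate(epoch_num):
    return alpha_0 * (gamma ** epoch_num)

for epoch_num in (0, 10, 50):
    print(epoch_num, round(learning_rate(epoch_num), 6))
# 0 0.001, 10 0.000599, 50 7.7e-05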
The accuracy of the first 50 iterations of the original I3D and the I3D-shufflenet is shown in Figure 9. It can be seen that the I3D-shufflenet had higher accuracy than the I3D after the 15th iteration.
Figure 9. Diagram of accuracy.
Figure 10 presents the confusion matrix for I3D and the I3D-shufflenet.
Figure 12. Feature map of models.
Figure 13 shows the Grad-CAM (Gradient-weighted Class Activation Mapping) [20] results obtained from boxing and Tai Chi videos. The figures show the important features for action recognition. The distinguishing area is the action part, which helps the I3D network make its decision. Different cases can be found in [13].
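For readers unfamiliar with Grad-CAM, the sketch below (assumed here, not the authors' visualization code) shows the basic recipe for a 3D CNN: the gradient of the target class score with respect to a convolutional feature map is pooled into per-channel weights, which then weight that feature map to produce a coarse spatial-temporal localization map.

import torch.nn.functional as F

def grad_cam_3d(model, feature_layer, clip, class_idx):
    """Minimal Grad-CAM for a 3D CNN: returns a (T', H', W') heat map for
    `class_idx`. `feature_layer` is the convolutional layer to visualize."""
    feats, grads = [], []
    h1 = feature_layer.register_forward_hook(lambda m, i, o: feats.append(o))
    h2 = feature_layer.register_full_backward_hook(lambda m, gi, go: grads.append(go[0]))

    score = model(clip)[0, class_idx]   # forward pass, pick the target class score
    model.zero_grad()
    score.backward()                    # gradients w.r.t. the chosen feature map
    h1.remove(); h2.remove()

    fmap, grad = feats[0], grads[0]                     # (1, C, T', H', W')
    weights = grad.mean(dim=(2, 3, 4), keepdim=True)    # per-channel importance
    cam = F.relu((weights * fmap).sum(dim=1))           # weighted sum over channels
    return cam[0] / (cam.max() + 1e-8)                  # normalized (T', H', W') map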
Figure 13. Typical CAM output of boxing and Tai Chi.
4.8. Comparisons
Compared with the I3D, the training time of I3D-shufflenet on the UCF101 dataset was reduced by 15.3% under the same settings. The running times of typical human action recognition models are shown in Table 2. The current accuracies of the neural networks on the UCF101 dataset are shown in Table 3.
Table 2. Training time comparison of I3D-shufflenet and other networks (h).

Model            UCF101
C3D              16.5
P3D              29.2
R3D              30.7
I3D              26.1
I3D-shufflenet   22.3
Table 3. Accuracy of different algorithms on the UCF101 dataset (%).

Algorithm                  Accuracy
Two-stream (SI+OF) [23]    93.9
C3D [16]                   82.3
IDT [24]                   85.9
TSN [25]                   94.9
R(2+1)D BERT [26]          98.7
I3D                        95.6
I3D-shufflenet             96.4
5. Conclusions
This article mainly studies the improvement of channel fusion for the original I3D neural network. A channel shuffle module is added to the Inception module of the I3D network: the original channels are divided into three groups, and the shuffle operation splits and reorganizes the channels to better extract image information, improving the I3D network's recognition accuracy. In addition, this paper improves the convolution kernel of the Inception module in the I3D neural network. The training speed of I3D is improved without degrading performance, and the performance of the proposed model is better than that of I3D. Further improving the model and making use of information fusion will be our future work.
Author Contributions: Methodology, G.L.; formal analysis, C.Z., X.Y. and J.S.; supervision, Q.X. and R.C.;
project administration, Y.S. All authors have read and agreed to the published version of the manuscript.
Funding: This research was funded by the National Science and Technology Major Project (NO. 2018ZX01031201,
2018ZX09201011).
Conflicts of Interest: The authors declare no conflict of interest.
References
1. Johansson, G. Visual motion perception. Sci. Am. 1975, 232, 76–89. [CrossRef] [PubMed]
2. Žemgulys, J.; Raudonis, V.; Maskeliūnas, R.; Damaševičius, R. Recognition of basketball referee signals
from videos using Histogram of Oriented Gradients (HOG) and Support Vector Machine (SVM).
Procedia Comput. Sci. 2018, 130, 953–960. [CrossRef]
3. Li, T.; Chang, H.; Wang, M.; Ni, B.; Hong, R.; Yan, S. Crowded Scene Analysis: A Survey. IEEE Trans. Circ.
Syst. Vid. 2015, 25, 367–386. [CrossRef]
4. Wang, H.; Klaser, A.; Schmid, C.; Liu, C. Action Recognition by Dense Trajectories. In Proceedings of
the IEEE Conference on Computer Vision and Pattern Recognition, Providence, RI, USA, 20–25 June 2011;
pp. 3169–3176.
5. Kipf, T.N.; Welling, M. Semi-Supervised Classification with Graph Convolutional Networks. arXiv 2016,
arXiv:1609.02907.
6. Gori, M.; Monfardini, G.; Scarselli, F. A New Model for Learning in Graph Domains. In Proceedings of the
2005 IEEE International Joint Conference on Neural Networks, Montreal, QC, Canada, 31 July–4 August 2005;
pp. 729–734.
7. Scarselli, F.; Gori, M.; Tsoi, A.C.; Hagenbuchner, M.; Monfardini, G. The Graph Neural Network Model.
IEEE Trans. Neural Netw. 2009, 20, 61–80. [CrossRef] [PubMed]
8. Ji, S.; Xu, W.; Yang, M.; Yu, K. 3D Convolutional Neural Networks for Human Action Recognition. IEEE Trans.
Pattern Anal. 2013, 35, 221–231. [CrossRef] [PubMed]
9. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. In Proceedings of the IEEE
Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778.
10. Simonyan, K.; Zisserman, A. Two-Stream Convolutional Networks for Action Recognition in Videos.
In Proceedings of the 27th International Conference on Neural Information Processing Systems, Montreal,
QC, Canada, 8–13 December 2014; pp. 568–576.
11. Feichtenhofer, C.; Pinz, A.; Wildes, R.P.B.I. Spatiotemporal Multiplier Networks for Video Action Recognition.
In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA,
21–26 July 2017; pp. 7445–7454.
12. Xie, S.; Sun, C.; Huang, J.; Tu, Z.; Murphy, K. Rethinking spatiotemporal feature learning: Speed-accuracy
trade-offs in video classification. In Proceedings of the 15th European Conference. Proceedings: Lecture
Notes in Computer Science (LNCS 11219), Tokyo, Japan, 29 October–2 November 2018; pp. 318–335.
13. Chen, Y.; Kalantidis, Y.; Li, J.; Yan, S.; Feng, J. Multi-Fiber Networks for Video Recognition; Lecture Notes in
Computer Science; Springer: Cham, Switzerland, 2018; pp. 364–380.
14. Bilen, H.; Fernando, B.; Gavves, E.; Vedaldi, A. Action Recognition with Dynamic Image Networks. IEEE Trans.
Pattern Anal. 2018, 40, 2799–2813. [CrossRef] [PubMed]
15. Zhu, L.; Tran, D.; Sevilla-Lara, L.; Yang, Y.; Feiszli, M.; Wang, H. FASTER Recurrent Networks for Efficient
Video Classification. In Proceedings of the Thirty-Fourth AAAI Conference on Artificial Intelligence
(AAAI-20), New York, NY, USA, 7–12 February 2020; Volume 34, pp. 13098–13105.
16. Tran, D.; Bourdev, L.; Fergus, R.; Torresani, L.; Paluri, M. Learning Spatiotemporal Features with 3D
Convolutional Networks. In Proceedings of the IEEE International Conference on Computer Vision, Santiago,
Chile, 7–13 December 2015; pp. 4489–4497.
17. Zhang, X.; Zhou, X.; Lin, M.; Sun, J. ShuffleNet: An Extremely Efficient Convolutional Neural Network
for Mobile Devices. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition,
Salt Lake City, UT, USA, 18–22 June 2018; pp. 6848–6856.
18. Soomro, K.; Zamir, A.R.; Shah, M. UCF101: A Dataset of 101 Human Actions Classes from Videos in the
Wild. arXiv 2012, arXiv:1212.0402.
19. Narayanan, B.N.; Beigh, K.; Loughnane, G.; Powar, N. Support Vector Machine and Convolutional Neural
Network Based Approaches for Defect Detection in Fused Filament Fabrication. Int. Soc. Opt. Photonic 2019,
11139, 1113913.
20. Narayanan, B.N.; Ali, R.; Hardie, R.C. Performance Analysis of Machine Learning and Deep Learning
Architectures for Malaria Detection on Cell Images. Int. Soc. Opt. Photonic 2019, 11139, 111390W.
21. Narayanan, B.N.; De Silva, M.S.; Hardie, R.C.; Kueterman, N.K.; Ali, R. Understanding Deep Neural Network
Predictions for Medical Imaging Applications. arXiv 2019, arXiv:1912.09621.
22. Feichtenhofer, C.; Pinz, A.; Zisserman, A. Convolutional Two-Stream Network Fusion for Video Action
Recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas,
NV, USA, 27–30 June 2016; pp. 1933–1941.
23. Wang, L.; Li, W.; Van Gool, L. Appearance-and-Relation Networks for Video Classification. In Proceedings
of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT,
USA, 18–22 June 2018; pp. 1430–1439.
24. Bonanomi, C.; Balletti, S.; Lecca, M.; Anisetti, M.; Rizzi, A.; Damiani, E. I3D: A new dataset for testing
denoising and demosaicing algorithms. Multimed. Tools Appl. 2018, 79, 8599–8626. [CrossRef]
25. Wang, L.; Xiong, Y.; Wang, Z.; Qiao, Y.; Lin, D.; Tang, X.; van Gool, L. Temporal segment networks: Towards
good practices for deep action recognition. In Proceedings of the European Conference on Computer Vision,
Amsterdam, The Netherlands, 8–16 October 2016; pp. 20–36.
26. Kalfaoglu, M.E.; Alkan, S.; Alatan, A.A. Late Temporal Modeling in 3D CNN Architectures with BERT for
Action Recognition. arXiv 2020, arXiv:2008.01232v3.
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional
affiliations.
© 2020 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access
article distributed under the terms and conditions of the Creative Commons Attribution
(CC BY) license (http://creativecommons.org/licenses/by/4.0/).