Action Recognition in Australian Rules Football Through Deep Learning
1 Introduction
Action recognition has been explored by many researchers over the past decade.
The typical objective is to detect and recognize human actions in a range of
environments and scenarios. Action recognition, unlike object detection, needs to
consider both spatial and temporal information in order to make classifications.
In this paper we focus on using 3-dimensional Convolutional Neural Networks
(3D CNNs) to achieve action recognition for players in Australian rules football.
Australian rules football, commonly referred to as “footy” in Australia, is a
popular contact sport played between two 18-player teams on a large oval. The
premier league is the Australian Football League (AFL). The ultimate aim is to
kick the ball between 4 goal posts for a score (6 points if the ball goes through
the middle two posts) or a minor score (1 point if the ball goes through one of
the inner/outer posts). This is achieved by players performing a range of actions
to move the ball across the pitch. These include kicking, passing (punching the
ball), catching, running (up to 15 m whilst carrying the ball) and tackling.
Understanding player actions and movements in sport is crucial to analysing
player and team performance. Counting the number of effective actions that
take place during a match is key to this. This paper focuses on the devel-
opment of a machine learning application that is able to detect and recognize
player actions through the use of deep artificial neural networks.
2 Literature Review
Prior to deep learning, approaches based on hand-engineered features for com-
puter vision tasks were the primary method used for action recognition. Improved
Dense Trajectories (IDT) [26] is representative of such approaches. It achieved
good accuracy and robustness; however, hand-engineered features are inherently limited.
Deep learning architectures based on CNNs have achieved unparalleled perfor-
mance in the field of computer vision. Deep Video developed by Karpathy et al.
[17] was one of the first approaches to apply 2D CNNs for action recognition
tasks. This used pre-trained 2D CNNs applied to every frame of the video and
fusion techniques to learn spatio-temporal relationships. However, its perfor-
mance on the UCF-101 data set [20] was worse than IDT, indicating that 2D
CNNs alone are sub-optimal for action recognition tasks since they do not ade-
quately capture spatio-temporal information.
Two-stream networks such as [19] add a stream of optical flow information
[11] as a representation of motion besides the conventional RGB stream. The
approach used two parallel streams that were combined with fusion based tech-
niques. This approach was based on 2D CNNs and achieved similar results to
IDT. This approach sparked a series of research efforts to improve two-stream
networks. This included work on improved fusion strategies [6], and the
use of recurrent neural networks including Long Short-Term Memory (LSTM)
[4,15]. Other methods include Temporal Segment Networks (TSN) [27] capable
of understanding long range video content by splitting a video into consecutive
temporal segments, and multi-stream networks that consider other contextual
information such as human poses, objects and audio in video. The framework of
two-stream networks was widely adopted by many researchers, however, a major
limitation of two-stream networks was that optical flow requires pre-processing
and hence considerable hand-engineering of features. Generating optical flow for
videos can be both computationally expensive and storage demanding, which also
affected the scale of the training data sets that could be used.
3D CNNs can be thought of as a natural way to understand video content.
Since video is a series of consecutive frames of images, a 3-dimensional convo-
lutional filter can be applied to both the spatial and temporal domain. Initial
research was explored by [13] in 2012, then in 2015 by Tran et al. [22] who
proposed a 3D neural network architecture called C3D using 3 × 3 × 3 con-
volutional kernels. They demonstrated that 3D CNNs were better at learning
spatio-temporal features than 2D CNNs.
A well-defined and high-quality data set is crucial for action recognition tasks.
This should contain enough samples for deep neural networks to extract motion
patterns, and offer enough variance for different scenarios and camera positions
for performance analysis. No such data set exists for AFL, hence we construct
our own action recognition data set for AFL games. In this process, we referred to
some well-known data sets for video content understanding including Youtube-
8M [1], UCF 101 [20], Kinetics-400 [18], SoccerNet [8] and others. All the training
and testing videos used here were retrieved from YouTube.
As AFL games are popular in Australia, there are more than enough videos on
YouTube, including real match recordings, training session recordings, tutorial
guides etc. However, manually creating and labelling data from video content
(individual frames) is a challenging and time-consuming task. In order to feed
enough frames and information for temporal feature extraction into deep learning
models, we set the standard that each video clip should be at least 16 frames in
length and should not be a long-distance shot in which the action appears at
low resolution.
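As a rough illustration of the frame-count criterion, the sketch below (the function name, threshold parameter and use of OpenCV are our own illustrative assumptions, not part of the published pipeline) checks whether a candidate clip is long enough to be included:

```python
import cv2


def clip_is_usable(path: str, min_frames: int = 16) -> bool:
    """Return True if the clip contains at least min_frames frames.

    Only the frame-count criterion is automated here; shot distance and
    action resolution were judged manually during labelling.
    """
    cap = cv2.VideoCapture(path)
    if not cap.isOpened():
        return False
    n_frames = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    cap.release()
    return n_frames >= min_frames
```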
Players in an AFL match are highly mobile, hence actions last only a very
limited amount of time and are often interfered with by other players through
tackles. As a result, actions may sometimes end in failure. This brings
significant challenges to the construction process of the data set, e.g. judging the
actual completeness of actions. This work focuses on recognizing the patterns
and features of attempted actions, and pays less attention to whether the action
has been completed or not. All action clips within the data set have a high level
of observable features, where the actual completeness of those actions was less
of a concern.
In AFL games, some actions like marks (catching the ball kicked by a player
on the same team) have a specific condition that needs to be met. According
to AFL rules, a mark is only valid when a player takes control of the ball for
a sufficient amount of time, in which the ball has been kicked from at least
15 m away and does not touch the ground and has not been touched by another
player. We aim to identify specific action patterns based only on the camera
images and as such we do not consider the precision of whether the kicker was
15 m away. Marks can be separated into marks and contested marks, where the
latter is when multiple players attempt to catch (or knock the ball away) at the
same time.
The videos from YouTube contain many frames with no relevant action. We
extract shorter clips from these longer videos and label them into five different classes:
(1) Kick: This class refers to the action whereby a player kicks the ball,
either by dropping the ball from their hands onto their boot or by kicking it
directly off the ground.
(2) Mark: A player catches a kicked ball for sufficient time to be judged to
be in control of the ball and without the ball being touched/interfered with by
another player.
(3) Contested mark: A contested mark is a special form of mark, in which
one player is trying to catch the ball while one or more opponents are either
also trying to catch it at the same time or are trying to punch it away.
(4) Pass: A player passes (punches) the ball to another player in the same
team.
(5) Non-Action: This class includes players running, crowds cheering etc.
It acts as a background class, since many frames during a match contain no
key action; without this class, the model would always try to classify video
content into the previous four classes.
The details of each class in the data set are shown in Table 1, and an example of
each action class is shown in Fig. 1. Compared to the other classes, the non-action
class has a relatively low number of instances in the data set. The reason is that
this class spans many different scenes, and too many instances in this class would
drive the attention of the model away from key features of the four key action
classes.
There are several challenges when using a data set for action recognition.
Some actions share very similar visual representations. One example is marking
and passing the ball. In a video clip of a relatively long-distance pass, if the
camera does not capture the whole passing process, e.g. it starts from somewhere
in the middle, the visible features of this action might be similar to those of a
mark, i.e. someone catches the ball. The data set could also be modified by
combining the two classes of mark and contested mark, as it is sometimes hard
to distinguish a mark from a contested mark. If a player is trying to catch the
ball and an opponent in the background is also trying to catch it, but the two
never make physical contact, the action may be considered a mark from one
camera angle; from a different angle, where there appears to be some degree of
physical contact, it might seem more like a contested mark.

Table 1. Number of instances of each class in the data set.

Class            Training   Testing   Total
Kick                  158        20     178
Contested mark         94        20     114
Mark                   61        20      81
Pass                   83        21     104
Non-action             66        21      87
Total                 462       102     564
All model architectures are 3D. The I3D and I3D SlowFast models were based
on 2D ResNets inflated to 3D and pre-trained on ImageNet. irCSN and R2+1D
ResNet-152 were pre-trained on IG-65M, and all other models were trained from
scratch. All models were then trained on the Kinetics-400 data set [9].
The final training data set was randomly split into training and validation
sets in a ratio of 70% to 30%. A sub-clip of 16 frames was evenly sampled from
each video clip at a regular interval depending on the clip's length; this number
of input frames was selected because most actions happen in a short time period.
If fewer than 16 frames could be sampled, the remainder were randomly drawn
from the clip's other frames. The sampled frames were then processed by standard
data augmentation techniques: they were first resized to a resolution of 340 × 256
(171 × 128 for R2+1D), then subjected to a random resize with bi-linear
interpolation and a random crop of size 224 × 224 (112 × 112 for R2+1D).
Following this, the frames were randomly flipped along the horizontal axis with
a probability of 0.5, and normalized with means of (0.485, 0.456, 0.406) and
standard deviations of (0.229, 0.224, 0.225) with respect to each channel.
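A minimal sketch of this sampling and augmentation pipeline is given below, assuming clips are loaded as (T, C, H, W) tensors in [0, 1]; the helper names are our own, and the intermediate random-resize step is omitted for brevity:

```python
import random

import numpy as np
import torch
import torchvision.transforms.functional as TF

MEAN = (0.485, 0.456, 0.406)   # ImageNet channel means
STD = (0.229, 0.224, 0.225)    # ImageNet channel standard deviations


def sample_indices(n_frames: int, n_out: int = 16) -> np.ndarray:
    """Evenly sample n_out frame indices from a clip of n_frames frames.

    If the clip is too short, pad by randomly re-drawing from the frames
    that are available.
    """
    if n_frames >= n_out:
        return np.linspace(0, n_frames - 1, n_out).astype(int)
    idx = list(range(n_frames))
    idx += random.choices(range(n_frames), k=n_out - n_frames)
    return np.array(sorted(idx))


def augment_clip(clip: torch.Tensor, base=(256, 340), crop=224) -> torch.Tensor:
    """clip: (T, C, H, W) float tensor in [0, 1]."""
    # Resize every frame to the base resolution (H, W).
    clip = torch.stack([TF.resize(frame, list(base)) for frame in clip])
    # One random crop position shared by all frames of the clip.
    top = random.randint(0, base[0] - crop)
    left = random.randint(0, base[1] - crop)
    clip = clip[..., top:top + crop, left:left + crop]
    # Random horizontal flip with probability 0.5, applied clip-wide.
    if random.random() < 0.5:
        clip = torch.flip(clip, dims=[-1])
    # Per-channel normalisation with ImageNet statistics.
    return torch.stack([TF.normalize(frame, MEAN, STD) for frame in clip])
```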
The training process used stochastic gradient descent (SGD) as the optimizer,
with values of learning rate, momentum and weight decay specific to each model.
The learning rate plays a very important role in training: an appropriate learning
rate allows the optimizer to converge, whereas a poorly chosen one results in a
model that does not generalize at all. Since we fine-tune pre-trained models, the
initial learning rate was set much lower than in the original training of each model.
The common learning rate values were 0.01 and 0.001, with a momentum of 0.9,
a weight decay of 1e-5, and a learning rate policy of either step or cosine,
depending on each model's architecture and level of complexity. Cross-entropy
loss was used as the model criterion, with class weights taken into consideration
since the training data set was imbalanced between the different classes. The
number of epochs was set at 30, with early stopping used to prevent over-fitting;
the epoch with the lowest validation loss was saved as the best weight.
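The sketch below shows one way to set up this training configuration in PyTorch; the inverse-frequency class weights, the step-scheduler settings and the early-stopping patience are illustrative assumptions, since the paper only states that class weights and a step or cosine policy were used:

```python
import copy

import torch
import torch.nn as nn

# Per-class training counts from Table 1:
# [kick, contested mark, mark, pass, non-action]
CLASS_COUNTS = [158, 94, 61, 83, 66]


def make_training_objects(model, lr=0.001, step_size=10):
    counts = torch.tensor(CLASS_COUNTS, dtype=torch.float)
    # Inverse-frequency class weights to counter the class imbalance.
    weights = counts.sum() / (len(counts) * counts)
    criterion = nn.CrossEntropyLoss(weight=weights)
    optimizer = torch.optim.SGD(model.parameters(), lr=lr,
                                momentum=0.9, weight_decay=1e-5)
    scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=step_size)
    return criterion, optimizer, scheduler


def train(model, train_loader, val_loader, device, max_epochs=30, patience=5):
    criterion, optimizer, scheduler = make_training_objects(model)
    model, criterion = model.to(device), criterion.to(device)
    best_loss, best_state, bad_epochs = float("inf"), None, 0
    for epoch in range(max_epochs):
        model.train()
        for clips, labels in train_loader:
            clips, labels = clips.to(device), labels.to(device)
            optimizer.zero_grad()
            loss = criterion(model(clips), labels)
            loss.backward()
            optimizer.step()
        scheduler.step()
        # Validate and apply early stopping on the validation loss.
        model.eval()
        val_loss, n = 0.0, 0
        with torch.no_grad():
            for clips, labels in val_loader:
                clips, labels = clips.to(device), labels.to(device)
                val_loss += criterion(model(clips), labels).item() * labels.size(0)
                n += labels.size(0)
        val_loss /= n
        if val_loss < best_loss:
            # Keep the weights of the epoch with the lowest validation loss.
            best_loss, bad_epochs = val_loss, 0
            best_state = copy.deepcopy(model.state_dict())
        else:
            bad_epochs += 1
            if bad_epochs >= patience:
                break
    return best_state
```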
The top-1 accuracy on the testing data set for the fine-tuned models is shown
in Table 3.
As seen, the best performing model was the R2+1D ResNet-152 model pre-
trained on the (very large) IG-65M data set. This achieved a top-1 accuracy of
77.45%. The final per-class classification results are shown in Table 4. The
classification for marks had the lowest recall of 0.55, while contested marks had
a recall of 0.85. This is possibly because marks and contested marks are difficult
to distinguish in some circumstances due to the presence of other players in the
background. The classification for non-action had the lowest precision of 0.57
and the lowest F1-score of 0.65. The reason for this is that the non-action class
is very broad and contains many sub-classes, such as scenes of audiences and
players running and cheering. Splitting the class into multiple distinct classes in
the future may improve the non-action accuracy. Among all classes, the
classification of kicks had the highest F1-score of 0.89, since a kick has arguably
the most distinct and recognizable features.
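For reference, per-class precision, recall and F1-scores of the kind reported in Table 4 can be produced with scikit-learn; the function, variable names and class ordering below are our own assumptions:

```python
import torch
from sklearn.metrics import classification_report

CLASSES = ["kick", "mark", "contested mark", "pass", "non-action"]


def evaluate(model, test_loader, device):
    """Collect predictions on the testing set and print per-class
    precision, recall and F1-score."""
    model.eval()
    y_true, y_pred = [], []
    with torch.no_grad():
        for clips, labels in test_loader:
            logits = model(clips.to(device))
            y_pred.extend(logits.argmax(dim=1).cpu().tolist())
            y_true.extend(labels.tolist())
    print(classification_report(y_true, y_pred,
                                target_names=CLASSES, digits=2))
```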
The results for the top-1 accuracy on the AFL testing data set are generally
consistent with the models' performance on the Kinetics-400 data set; however,
the R2+1D ResNet-50 model achieved some noteworthy improvements. The I3D
ResNet-50 model performed poorly with a top-1 accuracy of 56.86%, whilst the
I3D ResNet-101 Non-Local model only achieved an accuracy of 61.77%. It might
be inferred that the inflated 2D ResNets (I3D) are limited in their ability to
capture spatio-temporal features, while R2+1D is more capable in this regard as
it factorizes each 3D convolution into separate spatial and temporal convolutions.
It was also found that non-local blocks may not be suitable for Australian rules
football, as they are designed to capture long-range temporal features, whereas
actions in AFL are relatively fast and diverse, which results in these models
under-performing.
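To illustrate the factorization mentioned above, a (2+1)D convolution replaces a full 3D convolution with a 2D spatial convolution followed by a 1D temporal convolution. The sketch below captures the idea only; it is not the exact block used in the pre-trained R2+1D ResNet-152, and the class name and mid_channels parameter are our own:

```python
import torch.nn as nn


class R2Plus1dConv(nn.Module):
    """Factorized (2+1)D convolution: a spatial (1x3x3) convolution
    followed by a temporal (3x1x1) convolution, with batch norm and
    ReLU in between. mid_channels sets the intermediate width."""

    def __init__(self, in_channels, out_channels, mid_channels):
        super().__init__()
        self.spatial = nn.Conv3d(in_channels, mid_channels,
                                 kernel_size=(1, 3, 3),
                                 padding=(0, 1, 1), bias=False)
        self.bn = nn.BatchNorm3d(mid_channels)
        self.relu = nn.ReLU(inplace=True)
        self.temporal = nn.Conv3d(mid_channels, out_channels,
                                  kernel_size=(3, 1, 1),
                                  padding=(1, 0, 0), bias=False)

    def forward(self, x):  # x: (N, C, T, H, W)
        return self.temporal(self.relu(self.bn(self.spatial(x))))
```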
It was found that the performance of models generally depends on their back-
bone architecture. The complexity of the ResNet architecture is closely related
to the prediction accuracy, hence it could be argued that the more complex the
architecture is, the more likely the model will generalize and make the right
predictions. Comparing ResNet-50 with ResNet-152, there is a significant differ-
ence in complexity and number of parameters, which could be one reason for
the relatively large performance difference. Another major factor to consider is
that both R2+1D ResNet-152 and irCSN used IG-65M for model pre-training
and hence benefit from this very large-scale data set. It is also interesting to
note that R2+1D uses a 112 × 112 resolution input after data augmentation,
whilst the rest of the models use a 224 × 224 input. Despite this, R2+1D is still
able to produce some of the best results overall.
SlowFast and TPN networks both model visual tempos in video clips. When
incorporating I3D into the SlowFast framework, the I3D SlowFast ResNet-101
model performed evidently better than the other I3D models, indicating that
SlowFast networks are better able to extract spatio-temporal features and that
modelling visual tempos improves overall model performance. However, SlowFast
is a stricter framework that limits the number of frames in its different streams,
whilst TPN is more flexible due to its pyramid structure. As a result,
TPN ResNet-101 performed slightly better than SlowFast ResNet-101.
There are several important limitations to the presented models. Firstly,
incomplete actions will likely be classified as actions. As shown in Fig. 2(a), an
incomplete contested mark has been classified as a contested mark. This is
because the incomplete action shares many features with a completed action;
the model does not always possess the ability to recognise whether the ball has
been cleanly caught or not. Secondly, the model tends to perform poorly in
complex scenes and environments. In Fig. 2(b), there are many players present
in the background and a player is tackling another player who has the ball. In
this case, the model mis-classifies the scenario as a pass because it is similar to
the pass scenarios in the training data set.
This paper explored the feasibility of action recognition for Australian rules foot-
ball using 3D CNN architectures. Various action recognition models including
state-of-the-art models pre-trained on large-scale data sets were utilised. We
fine-tuned these models on a newly developed AFL data set, and reported a 77.45%
top-1 accuracy for the best performing model, R2+1D ResNet-152. A smoothing
strategy allowed the algorithm to localize the frame range for actions in long
video segments. We also developed a team identification solution and an action
recognition application that showed both the potential and viability of applying
real time end-to-end action recognition to AFL matches.
There are many possible future extensions to this work. The team identification
framework opens up further improvements to action recognition in AFL matches
for specific teams. Actions such as pass and contested mark require additional
References
1. Abu-El-Haija, S., et al.: YouTube-8M: a large-scale video classification benchmark.
CoRR abs/1609.08675. arXiv: 1609.08675 (2016)
2. Carreira, J., Zisserman, A.: Quo Vadis, action recognition? A new model and the
kinetics dataset. In: 2017 IEEE Conference on Computer Vision and Pattern Recog-
nition (CVPR), pp. 4724–4733. IEEE, Honolulu (July 2017). https://doi.org/10.1109/CVPR.2017.502, http://ieeexplore.ieee.org/document/8099985/
3. Deng, J., Dong, W., Socher, R., Li, L.J., Li, K., Fei-Fei, L.: ImageNet: a large-scale
hierarchical image database. In: 2009 IEEE Conference on Computer Vision and
Pattern Recognition, pp. 248–255. IEEE, Miami (June 2009). https://doi.org/10.1109/CVPR.2009.5206848, https://ieeexplore.ieee.org/document/5206848/
4. Donahue, J., et al.: Long-term recurrent convolutional networks for visual recogni-
tion and description. In: 2015 IEEE Conference on Computer Vision and Pattern
Recognition (CVPR), pp. 2625–2634. IEEE, Boston (June 2015). https://doi.org/10.1109/CVPR.2015.7298878, http://ieeexplore.ieee.org/document/7298878/
5. Feichtenhofer, C., Fan, H., Malik, J., He, K.: SlowFast networks for video recogni-
tion. In: 2019 IEEE/CVF International Conference on Computer Vision (ICCV),
pp. 6201–6210. IEEE, Seoul (October 2019). https://doi.org/10.1109/ICCV.2019.00630, https://ieeexplore.ieee.org/document/9008780/
6. Feichtenhofer, C., Pinz, A., Zisserman, A.: Convolutional two-stream network
fusion for video action recognition. In: 2016 IEEE Conference on Computer
Vision and Pattern Recognition (CVPR), pp. 1933–1941. IEEE, Las Vegas
(June 2016). https://doi.org/10.1109/CVPR.2016.213, http://ieeexplore.ieee.org/document/7780582/
7. Ghadiyaram, D., Tran, D., Mahajan, D.: Large-scale weakly-supervised pre-
training for video action recognition. In: 2019 IEEE/CVF Conference on Computer
Vision and Pattern Recognition (CVPR), pp. 12038–12047. IEEE, Long Beach
(June 2019). https://doi.org/10.1109/CVPR.2019.01232, https://ieeexplore.ieee.org/document/8953267/
8. Giancola, S., Amine, M., Dghaily, T., Ghanem, B.: SoccerNet: a scalable dataset for
action spotting in soccer videos. CoRR abs/1804.04527. arXiv: 1804.04527 (2018)
9. Guo, J., et al.: GluonCV and GluonNLP: deep learning in computer vision and
natural language processing. arXiv:1907.04433 (February 2020)
10. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition.
In: 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR),
pp. 770–778. IEEE, Las Vegas (June 2016). https://doi.org/10.1109/CVPR.2016.90, http://ieeexplore.ieee.org/document/7780459/
11. Horn, B.K., Schunck, B.G.: Determining optical flow. Artif. Intell. 17(1–3), 185–
203 (1981). https://doi.org/10.1016/0004-3702(81)90024-2, https://linkinghub.elsevier.com/retrieve/pii/0004370281900242
12. Ioffe, S., Szegedy, C.: Batch normalization: accelerating deep network training by
reducing internal covariate shift. In: Proceedings of the 32nd International Con-
ference on International Conference on Machine Learning, ICML 2015, JMLR.org,
Lille, France, vol. 37, pp. 448–456 (July 2015)
13. Ji, S., Xu, W., Yang, M., Yu, K.: 3D convolutional neural networks for human
action recognition. IEEE Trans. Pattern Anal. Mach. Intell. 35(1), 221–231 (2013).
https://doi.org/10.1109/TPAMI.2012.59
14. Jocher, G., Stoken, A., Chaurasia, A., Borovec, J., et al.: ultralytics/yolov5: v6.0 - YOLOv5n ‘Nano’ models, Roboflow
integration, TensorFlow export, OpenCV DNN support (October 2021). https://doi.org/10.5281/zenodo.5563715
15. Yue-Hei Ng, J., Hausknecht, M., Vijayanarasimhan, S., Vinyals, O., Monga, R.,
Toderici, G.: Beyond short snippets: deep networks for video classification. In:
2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR),
pp. 4694–4702. IEEE, Boston (June 2015). https://doi.org/10.1109/CVPR.2015.7299101, http://ieeexplore.ieee.org/document/7299101/
16. Kanungo, T., Mount, D., Netanyahu, N., Piatko, C., Silverman, R., Wu, A.: An effi-
cient k-means clustering algorithm: analysis and implementation. IEEE Trans. Pat-
tern Anal. Mach. Intell. 24(7), 881–892 (2002). https://doi.org/10.1109/TPAMI.2002.1017616
17. Karpathy, A., Toderici, G., Shetty, S., Leung, T., Sukthankar, R., Fei-Fei, L.: Large-
scale video classification with convolutional neural networks. In: 2014 IEEE Confer-
ence on Computer Vision and Pattern Recognition, pp. 1725–1732. IEEE, Colum-
bus (June 2014). https://doi.org/10.1109/CVPR.2014.223, https://ieeexplore.ieee.org/document/6909619
18. Kay, W., et al.: The Kinetics Human Action Video Dataset. arXiv:1705.06950 (May
2017)
19. Simonyan, K., Zisserman, A.: Two-stream convolutional networks for action recog-
nition in videos. In: Proceedings of the 27th International Conference on Neural
Information Processing Systems, NIPS 2014, vol. 1, pp. 568–576. MIT Press, Mon-
treal (December 2014)
20. Soomro, K., Zamir, A.R., Shah, M.: UCF101: a dataset of 101 human actions
classes from videos in the wild. arXiv:1212.0402 (December 2012)
21. Su, W., et al.: VL-BERT: pre-training of generic visual-linguistic representations.
CoRR abs/1908.08530. arXiv: 1908.08530 (2019)
22. Tran, D., Bourdev, L., Fergus, R., Torresani, L., Paluri, M.: Learning spa-
tiotemporal features with 3D convolutional networks. In: 2015 IEEE Interna-
tional Conference on Computer Vision (ICCV), pp. 4489–4497. IEEE, Santiago
(December 2015). https://doi.org/10.1109/ICCV.2015.510, http://ieeexplore.ieee.org/document/7410867/
23. Tran, D., Wang, H., Feiszli, M., Torresani, L.: Video classification with channel-
separated convolutional networks. In: 2019 IEEE/CVF International Conference on
Computer Vision (ICCV), pp. 5551–5560. IEEE, Seoul (October 2019). https://doi.org/10.1109/ICCV.2019.00565, https://ieeexplore.ieee.org/document/9008828/
24. Tran, D., Wang, H., Torresani, L., Ray, J., LeCun, Y., Paluri, M.: A closer look
at spatiotemporal convolutions for action recognition. In: 2018 IEEE/CVF Con-
ference on Computer Vision and Pattern Recognition, pp. 6450–6459 (June 2018).
https://doi.org/10.1109/CVPR.2018.00675. ISSN: 2575-7075
25. Vaswani, A., et al.: Attention is all you need. In: Proceedings of the 31st Inter-
national Conference on Neural Information Processing Systems, NIPS 2017, pp.
6000–6010. Curran Associates Inc., Long Beach (December 2017)
26. Wang, H., Schmid, C.: Action recognition with improved trajectories. In: 2013
IEEE International Conference on Computer Vision, pp. 3551–3558 (December
2013). https://doi.org/10.1109/ICCV.2013.441. ISSN: 2380-7504
27. Wang, L., et al.: Temporal segment networks: towards good practices for deep
action recognition. In: Leibe, B., Matas, J., Sebe, N., Welling, M. (eds.) ECCV
2016, Part VIII. LNCS, vol. 9912, pp. 20–36. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-46484-8_2
28. Wang, X., Girshick, R., Gupta, A., He, K.: Non-local neural networks. In: 2018
IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 7794–
7803. IEEE, Salt Lake City (June 2018). https://doi.org/10.1109/CVPR.2018.00813, https://ieeexplore.ieee.org/document/8578911/
29. Wojke, N., Bewley, A., Paulus, D.: Simple online and realtime tracking with a deep
association metric. In: 2017 IEEE International Conference on Image Processing
(ICIP), pp. 3645–3649. IEEE, Beijing (September 2017). https://doi.org/10.1109/ICIP.2017.8296962, http://ieeexplore.ieee.org/document/8296962/
30. Yang, C., Xu, Y., Shi, J., Dai, B., Zhou, B.: Temporal pyramid network for action
recognition. In: 2020 IEEE/CVF Conference on Computer Vision and Pattern
Recognition (CVPR), pp. 588–597. IEEE, Seattle (June 2020). https://doi.org/10.1109/CVPR42600.2020.00067, https://ieeexplore.ieee.org/document/9157586/
31. Zhu, Y., et al.: A comprehensive study of deep video action recognition.
arXiv:2012.06567 (December 2020)