Tennis Strokes Recognition From Generated Stick Figure Video Overlays
Boris Bačić¹ᵃ and Ishara Bandara²ᵇ
¹Auckland University of Technology, Auckland, New Zealand
²Robert Gordon University, Aberdeen, U.K.
Keywords: Computer Vision, Deep Learning, Spatiotemporal Data Classification, Human Motion Modelling and
Analysis (HMMA), Sport Science, Augmented Broadcasting.
Abstract: In this paper, we contribute to the existing body of knowledge of video indexing technology by presenting a
novel approach for recognition of tennis strokes from consumer-grade video cameras. To classify four
categories covering three strokes of interest (forehand, backhand, serve) and no-stroke, we extract features as a time
series from stick figure overlays generated using the OpenPose library. To process the spatiotemporal feature space,
we experimented with three variations of LSTM-based classifier models. On a selection of publicly
available videos, the trained models achieved average accuracies between 97% and 100%. To demonstrate
the transferability of our approach, future work will include other individual and team sports, while maintaining
a focus on feature extraction techniques with minimal reliance on domain expertise.
ᵃ https://orcid.org/0000-0003-0305-4322
ᵇ https://orcid.org/0000-0002-7346-248X
Bačić, B. and Bandara, I.
Tennis Strokes Recognition from Generated Stick Figure Video Overlays.
DOI: 10.5220/0010827300003124
In Proceedings of the 17th International Joint Conference on Computer Vision, Imaging and Computer Graphics Theory and Applications (VISIGRAPP 2022) - Volume 5: VISAPP, pages
397-404
ISBN: 978-989-758-555-5; ISSN: 2184-4321
Copyright © 2022 by SCITEPRESS – Science and Technology Publications, Lda. All rights reserved
VISAPP 2022 - 17th International Conference on Computer Vision Theory and Applications
feature extraction techniques relying on common-sense visual observation?
3. If so, can we develop a multi-stage video processing and modelling framework that is transferable to other sports?

1.2 Background and Prior Work

Advancements in motion pattern indexing can be evaluated not only by improved classification performance for a specific task, low-cost real-time computing and an extended number of labelled events of interest, but also by their universal applicability to various sources such as 3D motion data (Bačić & Hume, 2018), video (Bloom & Bradley, 2003; D. Connaghan, Conaire, Kelly, & Connor, 2010; Martin, Benois-Pineau, Peteri, & Morlier, 2018; Ramasinghe, Chathuramali, & Rodrigo, 2014; Shah, Chockalingam, Paluri, Pradeep, & Raman, 2007), and sensor signal processing (Anand, Sharma, Srivastava, Kaligounder, & Prakash, 2017; Damien Connaghan et al., 2011; Kos, Ženko, Vlaj, & Kramberger, 2016; Taghavi, Davari, Tabatabaee Malazi, & Abin, 2019; Xia et al., 2020).

To our knowledge, recognition of tennis shots or strokes relying on computer vision started in 2001, by combining computer vision and hidden Markov model (HMM) approaches, before HD TV-broadcast resolution became available (Petkovic, Jonker, & Zivkovic, 2001). Since Sepp Hochreiter and Jürgen Schmidhuber introduced Long Short-Term Memory (LSTM) in 1997, LSTMs have been used in action recognition (Cai & Tang, 2018; Liu, Shahroudy, Xu, Kot, & Wang, 2018; Zhao, Yang, Chevalier, Xu, & Zhang, 2018). In 2017, inertial sensors with Convolutional Neural Networks (CNN) and bi-directional LSTM networks were used to recognise actions in multiple sports (Anand et al., 2017). In 2018, an LSTM with Inception v3 was used to recognise actions in tennis videos, achieving 74% classification accuracy (Cai & Tang, 2018).

For prototyping explainable AI in next-generation augmented coaching software, which is expected to capture an expert's assessment and continue to provide comprehensive coaching diagnostic feedback (Bačić & Hume, 2018), we can rely on multiple data sources including those operating beyond human vision. Prior work on 3D motion data is categorised as: (1) traditional feature-based swing indexing using sliding window and thresholding (Bacic, 2004) and an expert-driven algorithmic approach in tennis shots and stance classification (Bačić, 2016c); (2) a featureless approach for accurate swing indexing using Echo State Network (ESN) (Bačić, 2016b); and (3) further sub-event processing, i.e., phasing analysis via a produced ensemble of ESN (Bačić, 2016a).

Prior work on video analysis applied Histograms of Oriented Gradients (HOG), Local Binary Pattern (LBP) and Scale Invariant Local Ternary Pattern (SILTP) for human activity recognition (HAR) in surveillance (Lu, Shen, Yan, & Bačić, 2018). A pilot case study on cricket batting balance (Bandara & Bačić, 2020) used recurrent neural networks (RNN) and pose estimation to classify batting balance (from rear or front foot). This prior work on privacy-preservation filtering is aligned with privacy-preserving elderly care monitoring systems and with extracting diagnostic information for silhouette-based augmented coaching (Bačić, Meng, & Chan, 2017; Chan & Bačić, 2018). It is also generally applicable to usability and safety of spaces where human activity occurs, such as smart cities, future environments and traffic safety (Bačić, Rathee, & Pears, 2021).

2 METHODOLOGY

Considering past research, our objective is to produce a relatively simple and generalised initial solution and a human motion modelling and analysis (HMMA) framework for video indexing applicable to tennis. The tennis dataset was created from both amateur and professional players' videos. It is also expected that the produced framework may be easily transferable to other sport disciplines and related contexts such as rehabilitation and improving safety and usability of spaces where human movement occurs. As part of movement pattern analysis, we focused on expressing features as spatiotemporal human movement patterns from faster moving segments (e.g., dominant hand holding a racquet) relative to the more static trunk segment.

2.1 Stick Figure Overlays as Initial Data Preprocessing

To retrieve a player's motion-based data from video, we generated stick figure overlays using OpenPose (https://github.com/CMU-Perceptual-Computing-Lab/openpose) and a 25 key point estimator based on COCO+Foot (Figure 1 and Figure 2).

Figure 2 shows an example of the data format representing the key point coordinates of a player in each video frame, recorded as multiple time series. As video overlays, animated stick figure topology of generated key points (Figure 3) represents a way of extracting information from video to facilitate human
3. Dominant hand to dominant side foot | d(P4,P11) | To improve separation of serves and strokes starting from dominant hand side
4. Non-dominant hand to non-dominant hand side foot | d(P7,P14) | To identify strokes starting from the non-dominant side
6. Dominant hand to non-dominant side hip | d(P4,P12) | To identify the circular motion around the hip in ground strokes and to identify strokes starting from the dominant side
8. Body to dominant hand x-axis distance | P4(x) - P8(x) | To identify strokes starting from the dominant side over body's vertical (symmetrical) axis
9. Body to non-dominant hand x-axis distance | P7(x) - P8(x) | To identify strokes starting from the non-dominant side over body's vertical axis
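The feature rows above can be illustrated with a minimal Python sketch for a single video frame. It rests on assumptions that the surviving rows do not state: that the P-indices follow the OpenPose BODY_25 ordering for a right-handed player (4 = right wrist, 7 = left wrist, 8 = mid hip, 11 = right ankle, 12 = left hip, 14 = left ankle), and that d(.,.) denotes Euclidean distance. The function and key names are hypothetical.

```python
import math

# Assumption: P-indices follow OpenPose BODY_25 numbering, right-handed player.
DOM_HAND, NONDOM_HAND, MID_HIP = 4, 7, 8
DOM_FOOT, NONDOM_HIP, NONDOM_FOOT = 11, 12, 14


def d(p, q):
    """Euclidean distance between two (x, y) points (assumed meaning of d)."""
    return math.hypot(p[0] - q[0], p[1] - q[1])


def listed_features(kp):
    """Compute the five features listed above for one frame.

    kp is a sequence of 25 (x, y) key points; dictionary keys are hypothetical.
    """
    return {
        "f3_dom_hand_to_dom_foot": d(kp[DOM_HAND], kp[DOM_FOOT]),
        "f4_nondom_hand_to_nondom_foot": d(kp[NONDOM_HAND], kp[NONDOM_FOOT]),
        "f6_dom_hand_to_nondom_hip": d(kp[DOM_HAND], kp[NONDOM_HIP]),
        # Signed x-axis offsets of each hand from the body's vertical axis.
        "f8_body_to_dom_hand_x": kp[DOM_HAND][0] - kp[MID_HIP][0],
        "f9_body_to_nondom_hand_x": kp[NONDOM_HAND][0] - kp[MID_HIP][0],
    }
```

Applying such a function to every frame would yield one time series per feature. For a left-handed player, the dominant and non-dominant indices would presumably need to be swapped.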
The above-expected experimental results suggest that the improved solution would include modelling and analysis of additional output classes (e.g., volleys, drop shots, serve variations) requiring (sub)phasing movement analysis. Similar to prior work on 3D kinematic data (Bačić, 2016a), the ensemble orchestration control would rely not only on a weighted probabilistic equation but also on expert knowledge captured in a state automaton. Such an approach allows ensemble modelling on small and large datasets, where parameter optimisation and human-labelling efforts can be further reduced by transfer learning and adaptive system design.

4 IMPLEMENTATION FOR VIDEO STREAMING

A trained model can be used to classify strokes and display overlaid text for video streaming. The model input is a spatiotemporal dataset of nine features (Table 1). A subsample of the spatiotemporal dataset is input to the classifier as a block of an experimentally determined size of 27 frames (Figure 7). Buffering 27 frames (approx. 1 second) represents the rolling-window concept in time-series analysis, in which key points from the 2D pose estimation skeleton overlay are converted into the nine distance-based features, generating a 9x27 buffered data block.

Therefore, after 27 frames of data are buffered, the trained model (i.e., classifier) is used to detect and classify a stroke. If a stroke is not detected, the next rolling window of 27 frames is buffered and supplied to the classifier. If a stroke is detected, the overlay with the identified stroke is displayed over the next 27 frames (which are skipped from feature processing, considering minimum times between shots, e.g., for the opposing player's stroke).

5 DISCUSSION

For a prototype, the classification performance results exceeded expectations for the collected dataset (with an approx. 80:20 split used for model training and testing). We expect that expanding the dataset may reduce classification performance, justifying a follow-up investigation into achieving an improved solution that will generalise well on future data. Another limitation is that OpenPose occasionally fails to generate the correct stick figure, warranting further investigations to improve overall robustness and accuracy. Further improvement is intended by using additional videos taken from other vantage points, e.g., in front of the player. Considering the computational performance of pose estimation, we will look at implementation on lower cost platforms, including tablets and mobiles. In coaching scenarios, the intended platform would also process video feeds from fixed camera positions akin to the dataset used in this paper. Unlike carrying and managing inertial sensors, video is considered: (1) an unobtrusive data source not interfering with the player's feel; and (2) a means of minimising the possibility of motion data interpretation being contested.

During match situations, players may move closer to the net. When players are close to the net, they exchange strokes at a higher frequency than when producing strokes from behind the baseline. Therefore, the time between strokes may sometimes be less than a second. For the scope of this research and the proof-of-concept, splits of one second (or longer) between the strokes in video have been considered sufficient for stroke identification and classification. Future work will involve modelling of an increased number of output classes including faster stroke exchanges (e.g., containing further information such as: direction, depth, drop shot and lob volleys).

6 CONCLUSION

This paper contributes to video indexing and human activity recognition by applying a multidisciplinary
combination of computer vision, pose estimation and

Figure 7: Strokes classification and overlay annotation as video processing workflow concept.
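The buffering workflow described in Section 4 and outlined in Figure 7 can be sketched in Python as follows. This is an illustrative sketch rather than the authors' implementation. The names `annotate_stream` and `classify` are hypothetical, the classifier is any callable returning a label, and the window is assumed to advance by one frame when no stroke is detected (the text does not specify the step size). The buffer is stored as 27 rows of 9 features, so it would need transposing for a model expecting a 9x27 block.

```python
from collections import deque

WINDOW = 27        # frames per block (approx. 1 second, per Section 4)
N_FEATURES = 9     # distance-based features per frame


def annotate_stream(feature_rows, classify, window=WINDOW):
    """Yield one overlay label (or None) per incoming frame.

    feature_rows: iterable of per-frame feature lists (length N_FEATURES).
    classify: hypothetical callable taking a list of `window` rows and
              returning a label such as "forehand" or "no-stroke".
    """
    buffer = deque(maxlen=window)  # oldest frame drops out automatically
    overlay_left = 0               # frames still showing the last label
    label = None
    for row in feature_rows:
        if overlay_left > 0:
            # Stroke recently detected: show overlay, skip feature processing.
            overlay_left -= 1
            yield label
            continue
        buffer.append(row)
        if len(buffer) == window:
            predicted = classify(list(buffer))
            if predicted != "no-stroke":
                label = predicted
                overlay_left = window
                buffer.clear()     # start buffering afresh after the overlay
        yield None
```

With a classifier that always reports a stroke, the first 27 frames produce no overlay and the following 27 frames carry the label, after which buffering starts again.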
Chan, K. Y., & Bačić, B. (2018). Pseudo-3D binary silhouette for augmented golf coaching. In ISBS, XXXVI International Symposium on Biomechanics in Sports. Auckland, New Zealand.
Connaghan, D., Conaire, C. Ó., Kelly, P., & Connor, N. E. O. (2010). Recognition of tennis strokes using key postures. In ISSC, 21st Irish Signals and Systems Conference. Dublin, Ireland.
Connaghan, D., Kelly, P., O'Connor, N., Gaffney, M., Walsh, M., & O'Mathuna, C. (2011). Multisensor classification of tennis strokes. In IEEE Sensors, Limerick, Ireland.
Hochreiter, S., & Schmidhuber, J. (1997). Long short-term memory. Neural computation, 9(8). doi:https://doi.org/10.1162/neco.1997.9.8.1735
Kos, M., Ženko, J., Vlaj, D., & Kramberger, I. (2016). Tennis stroke detection and classification using miniature wearable IMU device. In IWSSIP, International Conference on Systems, Signals and Image Processing. Bratislava, Slovakia.
Liu, J., Shahroudy, A., Xu, D., Kot, A., & Wang, G. (2018). Skeleton-based action recognition using spatio-temporal LSTM network with trust gates. IEEE Transactions on Pattern Analysis and Machine Intelligence, 40(12). doi:https://doi.org/10.1109/TPAMI.2017.2771306
Lu, J., Shen, J., Yan, W. Q., & Bačić, B. (2018). An empirical study for human behavior analysis. International Journal of Digital Crime and Forensics. IGI Global. http://doi.org/10.4018/IJDCF.2017070102
Martin, P.-E., Benois-Pineau, J., Peteri, R., & Morlier, J. (2018). Sport action recognition with Siamese spatio-temporal CNNs: Application to table tennis. In CBMI, International Conference on Content-Based Multimedia Indexing. La Rochelle, France.
Page, S. (2020). Tennis practice match points - NTRP 4.5 vs 5.0. Retrieved from https://www.youtube.com/watch?v=dfrec4pjnI0
Petkovic, M., Jonker, W., & Zivkovic, Z. (2001). Recognizing strokes in tennis videos using hidden Markov models. In IASTED, International Conference on Visualization, Imaging and Image Processing. Marbella, Spain.
Ramasinghe, S., Chathuramali, K. G. M., & Rodrigo, R. (2014). Recognition of badminton strokes using dense trajectories. In 7th International Conference on Information and Automation for Sustainability. Colombo, Sri Lanka.
Shah, H., Chockalingam, P., Paluri, B., Pradeep, S., & Raman, B. (2007). Automated stroke classification in tennis. In ICIAR, 4th International Conference on Image Analysis and Recognition (Vol. 4633). Springer.
Taghavi, S., Davari, F., Tabatabaee Malazi, H., & Abin, A. A. (2019). Tennis stroke detection using inertial data of a smartwatch. In ICCKE, 9th International Conference on Computer and Knowledge Engineering. Mashhad, Iran.
Tenfitmen Tennis Impulse. (2020). Tennis match - player vs coach (Tenfitmen - episode 123). Retrieved from https://www.youtube.com/watch?v=uSoD2yyzRgY
Tennis Legend TV. (2019, 28 Aug. 2020). Roland-Garros 2019: Federer - Schwartzman practice points (court level view). Retrieved from https://www.youtube.com/watch?v=vkGwyke5jDU
Top Tennis Training - Pro Tennis Lessons. (2014, 28 Aug. 2020). Tsonga vs Anderson training match 2014-court level view. Retrieved from https://www.youtube.com/watch?v=RHokxoEsFsc
TV Tennis Pro. (2020, 28 Aug. 2020). Alexander Zverev practice match vs Andrey Rublev court level view tennis. Retrieved from https://www.youtube.com/watch?v=mcR3d9jnWaI
Xia, K., Wang, H., Xu, M., Li, Z., He, S., & Tang, Y. (2020). Racquet sports recognition using a hybrid clustering model learned from integrated wearable sensor. Sensors, 20(6). doi:https://doi.org/10.3390/s20061638
Zhao, Y., Yang, R., Chevalier, G., Xu, X., & Zhang, Z. (2018). Deep residual Bidir-LSTM for human activity recognition using wearable sensors. Mathematical Problems in Engineering, 2018. doi:https://doi.org/10.1155/2018/7316954