Skeleton-Based Keyframe Detection Framework for Sports Action Analysis: Badminton Smash Case
ABSTRACT The analysis of badminton player actions from videos plays a crucial role in improving athletes' performance and generating statistical insights. The complexity and speed of badminton movements pose unique challenges compared to everyday activities. To analyze badminton player actions, we propose a skeleton-based keyframe detection framework for action analysis. Keyframe detection is widely used in video summarization and visual localization because it is computationally and memory efficient compared to analyzing every frame of a video. The framework segments a complex macro-level activity into micro-level segments and analyzes each micro-level activity individually. First, it extracts skeleton data from a motion-sequence video using VIBE 3D pose estimation. Then, the keyframe detection module scans the sequence of activity frames and identifies the keyframes of each micro-level activity: start, ready, strike, and end. Finally, the posture and movement detection modules analyze the posture and movement data to identify specific activities. The framework is implemented in a device called CoachBox. The proposed framework is evaluated on a dataset using the mean absolute error. The average mean absolute error of the keyframe detection module is less than 0.168 seconds, and striking-moment detection has an error of only 0.033 seconds. Additionally, a coordinate transform method is provided to convert body coordinates to real-world coordinates for visualization purposes.

INDEX TERMS Keyframe detection, action analysis, skeleton detection, coordinate transform, action analysis framework.
II. RELATED METHODS
In this section, we explain the related methods employed to recognize badminton shot actions. Stereo vision cameras capture video of the players, which assists in reconstructing a 3D representation of each player and in calculating the court size. Subsequently, OpenPose and VIBE are used to extract the 2D and 3D skeletons from the video. Finally, a keyframe detection module extracts keyframes for sub-action content analysis. Details of the badminton action recognition and keyframe extraction methods are also provided.

A. STEREO VISION CAMERA
We utilized two stereo vision cameras with different viewing angles to accurately capture the depth information of badminton actions by employing the principle of triangulation [23]. With depth information from stereo vision, it becomes possible to reconstruct a three-dimensional (3D) representation of the scene. Prior to calculating the 3D positions, the intrinsic and extrinsic matrices of each camera must be determined.

To obtain the intrinsic matrix and distortion parameters of each camera, a checkerboard pattern was used to assist in calibration. The extrinsic matrix of each camera on the court was calculated via a homography mapping from the white field lines of the court to the known court size. Once the camera parameters were obtained, 3D points were triangulated from corresponding points in the two perspective images.
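A minimal sketch of this calibration-and-triangulation pipeline with OpenCV [36] is given below. The checkerboard size, image file names, and pixel correspondences are illustrative assumptions, and solvePnP over the court-corner correspondences stands in for the paper's homography-based extrinsic estimation; both recover the camera pose from the known planar court geometry.

```python
import cv2
import numpy as np

# --- Intrinsics: checkerboard calibration (run once per camera) ---
pattern = (9, 6)  # assumed inner-corner count of the board
objp = np.zeros((pattern[0] * pattern[1], 3), np.float32)
objp[:, :2] = np.mgrid[0:pattern[0], 0:pattern[1]].T.reshape(-1, 2)

obj_pts, img_pts = [], []
for fname in ["cb_01.png", "cb_02.png"]:  # hypothetical calibration frames
    gray = cv2.cvtColor(cv2.imread(fname), cv2.COLOR_BGR2GRAY)
    ok, corners = cv2.findChessboardCorners(gray, pattern)
    if ok:
        obj_pts.append(objp)
        img_pts.append(corners)
_, K, dist, _, _ = cv2.calibrateCamera(obj_pts, img_pts, gray.shape[::-1], None, None)

# --- Extrinsics: court-line corners (metres, Z=0 plane) vs. their pixels ---
court_m = np.array([[0, 0, 0], [6.1, 0, 0], [6.1, 13.4, 0], [0, 13.4, 0]], np.float32)
pixels = np.array([[312, 890], [1610, 902], [1230, 260], [700, 255]], np.float32)  # illustrative
_, rvec, tvec = cv2.solvePnP(court_m, pixels, K, dist)
R, _ = cv2.Rodrigues(rvec)
P = K @ np.hstack([R, tvec])  # 3x4 projection matrix of this camera

# --- Triangulation of one point seen by two calibrated cameras ---
def triangulate(P1, P2, pt1, pt2):
    X = cv2.triangulatePoints(P1, P2, pt1.reshape(2, 1), pt2.reshape(2, 1))
    return (X[:3] / X[3]).ravel()  # homogeneous -> 3D court coordinates
```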
B. HUMAN SKELETON DETECTION
Human skeleton detection can be broadly categorized into two types: 2D skeleton prediction models and 3D skeleton prediction models.

1) 2D HUMAN SKELETON DETECTION
First, we utilize OpenPose [20], an open-source state-of-the-art method based on Part Affinity Fields (PAFs), to track human poses on the badminton court. A PAF provides vectors that connect one joint to the next, capturing the relationships between different body parts. OpenPose is a highly capable framework that detects and tracks the poses of multiple people simultaneously. This multi-person tracking capability is particularly important, as it allows us to account for human interactions and analyze them in natural settings. By leveraging OpenPose, we can accurately track and analyze the poses of badminton players and understand their interactions, which aids in visualizing the real-world court in the virtual world.
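OpenPose can write its detections to disk as one JSON file per frame (its --write_json option); a small reader such as the sketch below, with a hypothetical file name, turns the flat keypoint arrays into per-person (x, y, confidence) arrays for downstream analysis.

```python
import json
import numpy as np

def load_openpose_frame(path):
    """Read one OpenPose --write_json frame file.

    Returns a list with one (J, 3) array per detected person, where the
    columns are x, y, confidence (J = 25 for the default BODY_25 model).
    """
    with open(path) as f:
        frame = json.load(f)
    people = []
    for person in frame["people"]:
        kp = np.asarray(person["pose_keypoints_2d"], dtype=np.float32)
        people.append(kp.reshape(-1, 3))  # flat [x0, y0, c0, ...] -> (J, 3)
    return people

# Hypothetical usage: keep only the confident joints of the first player.
people = load_openpose_frame("frame_000000_keypoints.json")
if people:
    joints = people[0]
    visible = joints[joints[:, 2] > 0.3]  # confidence-thresholded joints
```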
2) 3D HUMAN SKELETON DETECTION
We utilized the VIBE (Video Inference for Human Body Pose and Shape Estimation) network [19] to extract the 3D human skeleton from monocular RGB video [24]. The primary objective of VIBE is to accurately estimate the 3D body pose and shape of badminton players from such videos. VIBE adopts an end-to-end architecture, transforming 2D input images into 3D skeleton coordinates through a generative adversarial network (GAN) [25]. To capture the temporal relationships between video frames and enhance action coherence, VIBE incorporates a gated recurrent unit (GRU) [26]. VIBE is trained on a mixed dataset comprising 2D and 3D data from MPI-INF-3DHP [27], Human3.6M [28], and 3DPW [29]; this diverse dataset ensures robust training and generalization of the network. VIBE's performance is evaluated using the Percentage of Correct Keypoints (PCK) metric, on which it achieves 89.3%. VIBE detects 49 skeleton keypoints, providing a detailed representation of the actions captured in the badminton video; this comprehensive set of keypoints enables a thorough analysis of the player's action content.
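For reference, PCK counts a predicted keypoint as correct when it lies within a threshold distance of the ground truth. A minimal implementation is sketched below; the thresholding convention (here a fraction of a per-sample reference length) is an assumption, since PCK variants differ in how they set it.

```python
import numpy as np

def pck(pred, gt, ref_len, alpha=0.5):
    """Percentage of Correct Keypoints.

    pred, gt : (N, J, D) arrays of predicted / ground-truth joints (D = 2 or 3).
    ref_len  : (N,) per-sample reference length (e.g. head or torso size).
    alpha    : fraction of ref_len used as the correctness threshold.
    """
    dist = np.linalg.norm(pred - gt, axis=-1)  # (N, J) per-joint errors
    thresh = alpha * ref_len[:, None]          # per-sample threshold
    return 100.0 * np.mean(dist <= thresh)     # percent of correct joints

# Toy usage with random 3D joints (49 keypoints, matching VIBE's output).
rng = np.random.default_rng(0)
gt = rng.normal(size=(8, 49, 3))
pred = gt + rng.normal(scale=0.05, size=gt.shape)
print(f"PCK@0.5: {pck(pred, gt, ref_len=np.full(8, 0.3)):.1f}%")
```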
C. TRAJECTORY DETECTION
We employed the TrackNetV2 network [30] to track shuttlecocks and visualize their positions on the virtual-world court. TrackNetV2 is specifically designed to detect small, fast-moving objects such as shuttlecocks in video footage. It operates frame by frame, accurately determining the shuttlecock's position in each frame.

The architecture of TrackNetV2 follows an encoder-decoder structure. The encoder acts as a feature extractor, using convolutional kernels to capture image cues and condensing the features through max-pooling operations. Conversely, the decoder expands the feature maps to generate the prediction, enabling accurate shuttlecock tracking.
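The PyTorch sketch below mirrors this encoder-decoder idea in miniature. It is only an illustration of the structure described here, not TrackNetV2 itself: the layer counts, channel widths, and input size are made up, and the output is a per-pixel heatmap whose argmax is taken as the shuttlecock position.

```python
import torch
import torch.nn as nn

class TinyTrackNet(nn.Module):
    """Illustrative encoder-decoder heatmap predictor (not TrackNetV2 itself)."""

    def __init__(self, in_ch=3):
        super().__init__()
        # Encoder: convolutions extract features, max-pooling condenses them.
        self.enc = nn.Sequential(
            nn.Conv2d(in_ch, 16, 3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
        )
        # Decoder: upsampling expands the maps back to input resolution.
        self.dec = nn.Sequential(
            nn.Upsample(scale_factor=2), nn.Conv2d(32, 16, 3, padding=1), nn.ReLU(),
            nn.Upsample(scale_factor=2), nn.Conv2d(16, 1, 3, padding=1),
        )

    def forward(self, x):
        return torch.sigmoid(self.dec(self.enc(x)))  # (B, 1, H, W) heatmap

# The predicted shuttlecock pixel is the heatmap argmax.
hm = TinyTrackNet()(torch.randn(1, 3, 288, 512))
y, x = divmod(int(hm.flatten(2).argmax()), hm.shape[-1])
```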
TrackNetV2 is trained on a dataset of 55,563 frames drawn from 15 broadcast videos of professional games and 3 amateur games. To prevent overfitting, we collected an additional 125 rally videos with diverse backgrounds and filming angles; approximately 2,500 to 3,000 frames were included from each video. TrackNetV2's accuracy reaches 98.7% in the training phase and 85.4% in a test on a new match. Moreover, TrackNetV2 runs at a processing speed of 31.84 frames per second (FPS), which greatly facilitates shuttlecock tracking in our approach.

D. KEYFRAME EXTRACTION
Several researchers have proposed keyframe extraction methods based on different strategies. Phan et al. [31] introduced an efficient framework named KFSENet for action recognition in videos, incorporating keyframe extraction based on skeleton deep learning architectures. Kim et al. [32] proposed a bidirectional consecutively connected two-pathway network (BCCN) for efficient gesture recognition using a skeleton-based keyframe selection module. Lv et al. [33] developed a sports action classification system for accurately classifying athletes' actions based on keyframe extraction.
FIGURE 6. Movement detection: smash swing period.
FIGURE 7. OpenPose keypoints based gravity center.
by applying rotation and taking the transpose of the skeleton pose keypoints. After the rotation and transpose of the skeleton pose, a dot product is performed with the body coordinates kp_pose to obtain the body keypoints image kp_img, which visualizes the player's body keypoints in the virtual court. Similarly, R_court and T_court denote the rotation and transpose of the court coordinates; these, together with the dot product of kp_court, likewise yield the body keypoints image kp_img that visualizes the player's body keypoints in the virtual court. The notation kp_court is obtained using the formula defined in step 2. After the body and court keypoints are found, they are visualized in the virtual court along with the shuttle trajectory, as illustrated in Fig. 9, where white points indicate the badminton trajectory and red points represent the body keypoints.

FIGURE 9. Visualization of coordinate points with the ball trajectory in the virtual court.
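As a sketch of the mapping described above, and under the assumption that each R is a rotation matrix and each T the corresponding offset (the text's "transpose" plausibly denoting the translation component), the two-stage transform of the keypoints could look as follows; the exact formula is the one defined in the paper's step 2.

```python
import numpy as np

def to_virtual_court(kp_pose, R_pose, T_pose, R_court, T_court):
    """Map skeleton keypoints into the virtual-court frame (a sketch).

    kp_pose          : (J, 3) body keypoints in the body (pose) frame.
    R_pose, R_court  : (3, 3) rotation matrices of the body / court frames.
    T_pose, T_court  : (3,) offsets of the body / court frames (assumed
                       to be the "transpose" terms in the text).
    """
    kp_court = kp_pose @ R_pose.T + T_pose   # body frame -> court frame
    kp_img = kp_court @ R_court.T + T_court  # court frame -> virtual court
    return kp_img
```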
IV. PERFORMANCE EVALUATION AND IMPLEMENTATION ENVIRONMENT

A. DATASET
The badminton games dataset was captured independently on multiple subjects and used for performance evaluation. The dataset has synchronized multi-view videos and labeled keyframes defined by each evaluation algorithm. The collection process includes (i) fixing the position and angle of the two cameras, (ii) having the testing player stand in the red rectangle of the two pictures in CoachBox, as in Fig. 10, (iii) placing a ball machine in a fixed position so that the tester can hit the ball well, (iv) selecting the court-line corners to calculate the extrinsic matrix, and (v) starting the test.

The dataset contains six types of shots for each of ten people of different genders, ages, and skill levels, as stated in Table 2. The shot types are the forehand and backhand smash, the forehand and backhand high ball, and the forehand and backhand cut ball, i.e., six actions for each of the 10 people. Each action is performed 10 times, meaning the ball machine serves 10 consecutive balls per collection; in total, 600 short, pre-edited rally videos are stored in the dataset, along with the intrinsic and extrinsic matrices. In the collection process, the player arrives at the designated position on the court, and another person presses the start-test button. The ball machine first serves two balls for initial testing, and these two balls are not included in the evaluation. Then the official test starts, serving 10 consecutive balls for data collection before changing to the next action.
B. PERFORMANCE EVALUATION
1) KEYFRAME DETECTION EVALUATION
The keyframe detection module is evaluated on the above-mentioned dataset using the mean absolute error, which compares the ground-truth and predicted frame labels. The module is evaluated separately for different categories of players. The first category is the professional player, of whom there is only one in our dataset; the rest belong to the beginner category. First, we evaluated the module on the professional player and calculated the error. The module used 60 actions of different types to detect the four keyframe postures: start, ready, strike, and end. The average mean absolute errors of the four keyframe postures are shown in Table 3.

The performance of the keyframe detection module is also evaluated on 5 randomly selected beginner-level players, whose action frames total 300 actions. The average mean absolute errors of the four keyframe postures are shown in Table 4. The results show that the mean absolute error is larger for the beginner players than for the professional player.
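The evaluation itself is straightforward to reproduce; the sketch below (with made-up timestamps in place of the dataset's labels) computes the per-posture mean absolute error between ground-truth and predicted keyframe times.

```python
import numpy as np

POSTURES = ["start", "ready", "strike", "end"]

def keyframe_mae(gt, pred):
    """Mean absolute error per keyframe posture over all actions.

    gt, pred: dicts mapping posture name -> array of keyframe times (s).
    """
    return {p: float(np.mean(np.abs(np.asarray(gt[p]) - np.asarray(pred[p]))))
            for p in POSTURES}

# Toy usage standing in for the 60 professional-player actions.
rng = np.random.default_rng(1)
gt = {p: rng.uniform(0, 3, 60) for p in POSTURES}
pred = {p: gt[p] + rng.normal(0, 0.05, 60) for p in POSTURES}
print(keyframe_mae(gt, pred))
```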
TABLE 3. Keyframe evaluation on the pro-level player.
TABLE 5. Comparative analysis with related methods.
2) POSTURE EVALUATION
Each posture position is evaluated based on the pre-defined rules in Table 2 and Fig. 5. The athlete with the most accurate posture position gets the highest score, as illustrated in Fig. 11. The figure shows that the average score of the beginner-level athletes is lower than that of the professional player, because their actions are inaccurate and incomplete compared to the professional player's.

The proposed method is compared with similar methods, as shown in Table 5. These methods are able to recognize various player actions, such as forehand, backhand, serve, and volley, by analyzing video frames. The performance of each method was evaluated on its own dataset, and the results showed that our method outperformed the other state-of-the-art methods for player action recognition in badminton videos. The proposed method is also evaluated on each class to check its performance; Table 6 shows each class result with the respective shot data.
C. CoachBox: SYSTEM TECHNIQUE
The CoachBox's entire system technology is illustrated in Fig. 12. It is divided into four main parts: video capture, shuttlecock trajectory tracking, skeleton detection, and action analysis. The video capture part covers how to synchronize and record the multi-view video; camera calibration, which calculates the intrinsic and extrinsic matrices; and the use of MQTT to transport the data from the two cameras. Trajectory tracking includes using TrackNetV2 to detect shuttlecock tracks, 3D positioning, trajectory smoothing, etc. Skeleton detection includes using VIBE to detect 3D human skeletons, parsing the output of VIBE, and transforming the 3D skeleton points into the court coordinate system. The last part is the action analysis system, which uses the framework proposed in this study to systematically analyze the actions, including data extraction, keyframe detection, posture evaluation, and movement evaluation.
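The paper does not specify the MQTT topics or payload format; the sketch below, using the paho-mqtt client (1.x API assumed) with made-up broker and topic names, shows the general pattern of shipping per-camera frames to the analysis host.

```python
import paho.mqtt.client as mqtt  # paho-mqtt 1.x client API assumed

BROKER = "coachbox.local"  # hypothetical broker on the CoachBox host
TOPIC = "coachbox/camera/{cam_id}/frame"  # made-up topic layout

client = mqtt.Client()
client.connect(BROKER, 1883)
client.loop_start()  # background network loop

def publish_frame(cam_id, frame_id, jpeg_bytes):
    """Ship one encoded camera frame; QoS 1 so frames survive brief drops."""
    payload = frame_id.to_bytes(8, "big") + jpeg_bytes  # simple framing
    client.publish(TOPIC.format(cam_id=cam_id), payload, qos=1)
```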
V. CONCLUSION
This paper presents a badminton action analysis framework that offers a solution for analyzing and evaluating complex shots, such as the smash, in badminton videos using the keyframe detection module. The framework can handle real-time inputs from badminton games and provides a comprehensive analysis of badminton activity, from the macro level to the micro level, allowing insights into each attribute of micro-level badminton activity. Furthermore, the framework is implemented on CoachBox, enabling the mapping of player actions and shuttle trajectories onto real-world courts for visualization. This system assists coaches and players in generating analysis reports that provide insights into their games, helping them correct their action poses and reduce the risk of sports injuries. Future work will focus on developing an action description language to translate the coach's defined feature judgments, thus enhancing the algorithm's efficiency and facilitating the systematic integration of all action features.
REFERENCES
[1] K. Host and M. Ivašić-Kos, ‘‘An overview of human action recognition in sports based on computer vision,’’ Heliyon, vol. 8, no. 6, Jun. 2022, Art. no. e09633.
[2] B. Li and M. Tian, ‘‘Volleyball movement standardization recognition model based on convolutional neural network,’’ Comput. Intell. Neurosci., vol. 2023, pp. 1–9, Jan. 2023.
[3] Y. Li, Y. Liu, R. Yu, H. Zong, and W. Xie, ‘‘Dual attention based spatial–temporal inference network for volleyball group activity recognition,’’ Multimedia Tools Appl., vol. 82, no. 10, pp. 15515–15533, Apr. 2023.
[4] M. Ibh, S. Grasshof, D. Witzner, and P. Madeleine, ‘‘TemPose: A new skeleton-based transformer model designed for fine-grained motion recognition in badminton,’’ in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. Workshops (CVPRW), Jun. 2023, pp. 5198–5207.
[5] K. Davids, S. Bennett, G. J. Savelsbergh, and J. Van der Kamp, Interceptive Actions in Sport: Information and Movement, 2002.
[6] Ş. Maftei, ‘‘Study regarding the specific of badminton footwork, on different levels of performance,’’ in Proc. eLearning Softw. Educ. (eLSE), vol. 13, no. 1. Carol I National Defence Univ. Publishing House, 2017, pp. 161–166.
[7] S.-H. Cheng, M. A. Sarwar, Y.-A. Daraghmi, T.-U. Ik, and Y.-L. Li, ‘‘Periodic physical activity information segmentation, counting and recognition from video,’’ IEEE Access, vol. 11, pp. 23019–23031, 2023.
[8] K. Soomro and A. R. Zamir, ‘‘Action recognition in realistic sports videos,’’ in Computer Vision in Sports. Cham, Switzerland: Springer, 2015, pp. 181–208.
[9] S. Zhou, ‘‘A survey of pet action recognition with action recommendation based on HAR,’’ in Proc. IEEE/WIC/ACM Int. Joint Conf. Web Intell. Intell. Agent Technol. (WI-IAT), Nov. 2022, pp. 765–770.
[10] H. Wang and C. Schmid, ‘‘Action recognition with improved trajectories,’’ in Proc. IEEE Int. Conf. Comput. Vis., Dec. 2013, pp. 3551–3558.
[11] S. Ji, W. Xu, M. Yang, and K. Yu, ‘‘3D convolutional neural networks for human action recognition,’’ IEEE Trans. Pattern Anal. Mach. Intell., vol. 35, no. 1, pp. 221–231, Jan. 2013.
[12] W. Liu and J. Ke, ‘‘A brief analysis of multi-ball training in badminton,’’ Educ. Res. Frontier, vol. 10, no. 4, 2020.
[13] T. Huang, Y. Li, and W. Zhu, ‘‘An auxiliary training method for single-player badminton,’’ in Proc. 16th Int. Conf. Comput. Sci. Educ. (ICCSE), Aug. 2021, pp. 441–446.
[14] S. Ramasinghe, K. G. M. Chathuramali, and R. Rodrigo, ‘‘Recognition of badminton strokes using dense trajectories,’’ in Proc. 7th Int. Conf. Inf. Autom. Sustainability, Dec. 2014, pp. 1–6.
[15] Y. Wang, W. Fang, J. Ma, X. Li, and A. Zhong, ‘‘Automatic badminton action recognition using CNN with adaptive feature extraction on sensor data,’’ in Proc. 15th Int. Conf. Intell. Comput. Theories Appl. (ICIC), Nanchang, China. Cham, Switzerland: Springer, Aug. 2019, pp. 131–143.
[16] Proplayai. Pitchai. [Online]. Available: https://proplayai.com/pitchai/
[17] P.-Y. Kuo. Badminton Smash Visualization System. [Online]. Available: https://etd.lib.nctu.edu.tw/cgi-bin/gs32/tugsweb.cgi?o=dnctucdr&s=id=%22GT0706568120%22.&searchmode=basic
[18] C. Feichtenhofer, A. Pinz, and R. P. Wildes, ‘‘Spatiotemporal multiplier networks for video action recognition,’’ in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jul. 2017, pp. 7445–7454.
[19] M. Kocabas, N. Athanasiou, and M. J. Black, ‘‘VIBE: Video inference for human body pose and shape estimation,’’ in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2020, pp. 5252–5262.
[20] Z. Cao, T. Simon, S.-E. Wei, and Y. Sheikh, ‘‘Realtime multi-person 2D pose estimation using part affinity fields,’’ in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jul. 2017, pp. 1302–1310.
[21] B. Xiao, H. Wu, and Y. Wei, ‘‘Simple baselines for human pose estimation and tracking,’’ in Proc. Eur. Conf. Comput. Vis. (ECCV), 2018, pp. 466–481.
[22] D. Mehta, S. Sridhar, O. Sotnychenko, H. Rhodin, M. Shafiei, H.-P. Seidel, W. Xu, D. Casas, and C. Theobalt, ‘‘VNect: Real-time 3D human pose estimation with a single RGB camera,’’ ACM Trans. Graph., vol. 36, no. 4, pp. 1–14, Aug. 2017.
[23] W. Luo, Y. Qin, Q. Li, D. Zhang, and L. Li, ‘‘Automatic mileage positioning for road inspection using binocular stereo vision system and global navigation satellite system,’’ Autom. Construction, vol. 146, Feb. 2023, Art. no. 104705.
[24] W. Liu, Q. Bao, Y. Sun, and T. Mei, ‘‘Recent advances of monocular 2D and 3D human pose estimation: A deep learning perspective,’’ ACM Comput. Surv., vol. 55, no. 4, pp. 1–41, Apr. 2023.
[25] P. Bhattacharjee and S. Das, ‘‘Temporal coherency based criteria for predicting video frames using deep multi-stage generative adversarial networks,’’ in Proc. Adv. Neural Inf. Process. Syst., vol. 30, 2017.
[26] A. Sen and K. Deb, ‘‘Categorization of actions in soccer videos using a combination of transfer learning and gated recurrent unit,’’ ICT Exp., vol. 8, no. 1, pp. 65–71, Mar. 2022.
[27] D. Mehta, H. Rhodin, D. Casas, P. Fua, O. Sotnychenko, W. Xu, and C. Theobalt, ‘‘Monocular 3D human pose estimation in the wild using improved CNN supervision,’’ in Proc. Int. Conf. 3D Vis. (3DV), Oct. 2017, pp. 506–516.
[28] C. Ionescu, D. Papava, V. Olaru, and C. Sminchisescu, ‘‘Human3.6M: Large scale datasets and predictive methods for 3D human sensing in natural environments,’’ IEEE Trans. Pattern Anal. Mach. Intell., vol. 36, no. 7, pp. 1325–1339, Jul. 2014.
[29] T. Von Marcard, R. Henschel, M. J. Black, B. Rosenhahn, and G. Pons-Moll, ‘‘Recovering accurate 3D human pose in the wild using IMUs and a moving camera,’’ in Proc. Eur. Conf. Comput. Vis. (ECCV), 2018, pp. 601–617.
[30] N.-E. Sun, Y.-C. Lin, S.-P. Chuang, T.-H. Hsu, D.-R. Yu, H.-Y. Chung, and T.-U. Ik, ‘‘TrackNetV2: Efficient shuttlecock tracking network,’’ in Proc. Int. Conf. Pervasive Artif. Intell. (ICPAI), Dec. 2020, pp. 86–91.
[31] H.-H. Phan, T. T. Nguyen, N. H. Phuc, N. H. Nhan, D. M. Hieu, C. T. Tran, and B. N. Vi, ‘‘Key frame and skeleton extraction for deep learning-based human action recognition,’’ in Proc. RIVF Int. Conf. Comput. Commun. Technol. (RIVF), Aug. 2021, pp. 1–6.
[32] Y. Kim and H. Myung, ‘‘Gesture recognition with a skeleton-based keyframe selection module,’’ 2021, arXiv:2112.01736.
[33] C. Lv, J. Li, and J. Tian, ‘‘Key frame extraction for sports training based on improved deep learning,’’ Sci. Program., vol. 2021, pp. 1–8, Sep. 2021.
[34] R. A. Teimoor and A. M. Darwesh, ‘‘Node detection and tracking in smart cities based on Internet of Things and machine learning,’’ UHD J. Sci. Technol., vol. 3, no. 1, pp. 30–38, May 2019.
[35] E. Ws, ‘‘Center of mass of the human body helps in analysis of balance and movement,’’ MOJ Appl. Bionics Biomech., vol. 2, no. 2, Apr. 2018.
[36] G. Bradski, ‘‘The OpenCV library,’’ Dr. Dobb's J. Softw. Tools, 2000.
[37] H. Miyamori and S.-I. Iisaku, ‘‘Video annotation for content-based retrieval using human behavior analysis and domain knowledge,’’ in Proc. 4th IEEE Int. Conf. Autom. Face Gesture Recognit. (FG), Mar. 2000, pp. 320–325.
[38] G. Zhu, C. Xu, Q. Huang, W. Gao, and L. Xing, ‘‘Player action recognition in broadcast tennis video with applications to semantic analysis of sports game,’’ in Proc. 14th ACM Int. Conf. Multimedia, Oct. 2006, pp. 431–440.
MUHAMMAD ATIF SARWAR received the B.S. and M.S. degrees in computer science from COMSATS University Islamabad, Sahiwal Campus, Pakistan, in 2015 and 2017, respectively. He is currently pursuing the Ph.D. degree with the EECS International Graduate Program, National Yang Ming Chiao Tung University, Taiwan. His research interests include artificial intelligence, deep learning, and computer vision. His current research focuses on detecting activities and actions in retail stores, sports, and exercise.

TSÌ-UÍ İK (Member, IEEE) received the B.S. degree in mathematics and the M.S. degree in computer science and information engineering from the National Taiwan University, in 1991 and 1993, respectively, and the Ph.D. degree in computer science from the Illinois Institute of Technology, in 2005. He is currently a Professor with the Department of Computer Science and the Director of the Institute of Computer Science and Engineering, National Yang Ming Chiao Tung University. His research focuses on intelligent applications, such as intelligent sports learning and intelligent transportation systems, mobile sensing, machine learning, deep learning, and wireless sensor and ad hoc networks. He has been a Senior Research Fellow with the Department of Computer Science, City University of Hong Kong. He was bestowed the Outstanding Young Engineer Award by the Chinese Institute of Engineers, in 2009, and the Young Scholar Best Paper Award by the IEEE IT/COMSOC Taipei/Tainan Chapter, in 2010. He received the Best Paper Award at ITST 2012. He received a three-year Outstanding Young Researcher Grant from the National Science Council, Taiwan, in 2012. In 2020, he received the Sports Science Research and Development Award, MoE, Taiwan. In 2020 and 2021, his research works received the MOST Future Tech Award.