
Received 15 June 2023, accepted 16 August 2023, date of publication 22 August 2023, date of current version 30 August 2023.

Digital Object Identifier 10.1109/ACCESS.2023.3307620

Skeleton Based Keyframe Detection Framework for Sports Action Analysis: Badminton Smash Case

MUHAMMAD ATIF SARWAR1, YU-CHEN LIN1, YOUSEF-AWWAD DARAGHMI2, TSÌ-UÍ İK1 (Member, IEEE), AND YIH-LANG LI1 (Member, IEEE)

1 Department of Computer Science, College of Computer Science, National Yang Ming Chiao Tung University, Hsinchu 30010, Taiwan
2 Department of Computer Systems Engineering, Palestine Technical University–Kadoorie, Tulkarm 310, Palestine
Corresponding author: Tsì-Uí İk ([email protected])
This work of Tsì-Uí İk was supported in part by the Ministry of Science and Technology, Taiwan under grants MOST
110-2627-H-A49-001, MOST 110-2221-E-A49-063-MY3, and MOST 111-2622-E-A49-009. This work was financially supported
by the Center for Open Intelligent Connectivity from The Featured Areas Research Center Program within the Higher Education
Sprout Project framework by the Ministry of Education (MOE) in Taiwan.

ABSTRACT The analysis of badminton player actions from videos plays a crucial role in improving athletes’
performance and generating statistical insights. The complexity and speed of badminton movements pose
unique challenges compared to everyday activities. To analyze badminton player actions, we propose a
skeleton-based keyframe detection framework for action analysis. Keyframe detection is widely used in
video summarization and visual localization due to its computational efficiency and memory optimization
compared to analyzing all frames of a video. This framework segments the complex macro-level activity
into micro-level segments and analyzes each micro-level activity individually. Firstly, it extracts skeleton
data from a motion sequence video using 3D:VIBE pose estimation. Then, the keyframe detection module
explores the sequence of activity frames and identifies keyframes for each micro-level activity, including
start, ready, strike, and end. Finally, the posture and movement detection modules analyze the posture and
movement data to identify specific activities. This framework is implemented in the device called CoachBox.
The proposed framework is evaluated using the mean absolute error on a dataset. The average mean absolute
error for the keyframe detection module is less than 0.168 seconds, and the striking moment detection has
an error of only 0.033 seconds. Additionally, a coordinate transform method is provided to convert body
coordinates to real-world coordinates for visualization purposes.

INDEX TERMS Keyframe detection, action analysis, skeleton detection, coordinate transform, action
analysis framework.

I. INTRODUCTION
Sports Action Recognition (SAR) is a challenging task used for various sports, including soccer, volleyball, basketball, tennis, and badminton [1], [2], [3], [4]. SAR detects and recognizes actions during competitions, matches, warm-ups, and training sessions [5], [6]. SAR-based applications have been extensively utilized by sports analysts and coaches to enhance athletes' performance. The main objective of SAR applications is to identify the athlete's actions from an unknown video sequence, determine the action's duration and type, monitor a player's performance, track their movements, recognize the performed action, compare various actions, compare different kinds and skills of performances, or perform automatic statistical analysis [7], [8], [9].

Badminton is a highly technical sport that can greatly benefit from SAR-based applications for analyzing player actions. Recently, research on badminton actions [10], [11] has made rapid progress in monitoring athletes' performance. It involves comparing various actions performed

The associate editor coordinating the review of this manuscript and approving it for publication was Joewono Widjaja.
This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 License.
For more information, see https://creativecommons.org/licenses/by-nc-nd/4.0/
VOLUME 11, 2023 90891
M. A. Sarwar et al.: Skeleton Based Keyframe Detection Framework for Sports Action Analysis

by different players or multiple executions by a single player, such as smash, clear, and backhand. These advancements assist in practicing techniques and improving playing styles [12], [13]. Ramasinghe et al. [14] proposed an accurate approach for badminton stroke recognition using RGB-D data. They utilized dense trajectories and trajectory-aligned HOG features to classify four stroke classes, including smash, forehand, backhand, and break, using an SVM classifier. Wang et al. [15] inserted a chip into the badminton racket to collect data, which was then analyzed by a deep convolutional neural network (CNN) for action recognition. Another example is PitchAI [16], a mobile app that analyzes pitching movements using neural networks and 3D skeleton data to calculate movement features and evaluate the kinetic chain. CoachBox [17] is a computer vision-based system designed to monitor badminton strike actions and improve player performance.

FIGURE 1. Skeleton based keyframe detection and action recognition framework.

Intelligent badminton action recognition techniques [10], [11] have been developed to provide objective analysis and evaluation of badminton, leading to improved accuracy in performance analysis and more efficient training programs [18]. However, assessing and objectively measuring badminton activity from the macro-level to the micro-level is more challenging compared to daily activities. These techniques do not provide detailed analysis of activities at the micro-level, such as specific actions, movements, or behaviors of individuals or objects. They also lack the ability to capture fine-grained aspects of an activity and may overlook important seconds-level details or specific interactions. For instance, badminton strokes, e.g., the smash, can be further segmented into micro-level activities to enhance the analysis of the player's complete action. This level of granularity is necessary to capture and understand the nuances of each attribute of micro-level badminton activity. Therefore, there is a need for a system that can comprehensively analyze badminton activity from the macro-level to the micro-level, providing insights into each attribute of micro-level badminton activity.

In order to perform badminton video analysis and process activity from the macro-level to the micro-level, it is crucial to extract relevant and essential information about the badminton player. One important step in video analysis is the extraction of keyframes, as they contain significant information that provides a summary of the entire video sequence. In this paper, we propose a novel framework that utilizes badminton videos to extract keyframes from motion sequences. This framework combines pose estimation techniques and a keyframe detection algorithm to classify activities from the macro-level to the micro-level, as illustrated in Fig. 1. The recognition of skeleton-based activity is achieved by analyzing the sequence of skeleton keypoints over time using 3D:VIBE [19] from RGB data [20], [21], [22]. The keyframe detection module extracts key-pose frames from a series of activity frames, while the pose and movement modules detect and recognize key poses based on predefined rules. By integrating these components, our framework enables the analysis of badminton videos at various levels, providing insights into both the macro-level and micro-level activities performed by the players.

The proposed framework is implemented in CoachBox, a stereo vision device equipped with two cameras that automatically captures badminton actions for learning purposes. The framework utilizes a database collected from multiple athletes' action data, including badminton smashes, and consists of videos of a total of 600 badminton rallies. To evaluate the proposed methodology, the mean absolute error is used as the performance metric, calculated for each player. The average mean absolute error for the keyframe detection module is less than 0.168 seconds, indicating high accuracy in detecting keyframes. Furthermore, the striking moment is measured with an average mean absolute error of only 0.033 seconds, demonstrating precise detection of the moment of impact. The small magnitude of changes in activity within a short time frame suggests that the proposed framework performs well in capturing and analyzing badminton actions, ensuring that the detected keyframes and striking moments are within an acceptable range.

The primary contributions of this work include the development of the framework, the utilization of a comprehensive dataset, and the evaluation of the methodology using the mean absolute error. The contributions, highlighting the effectiveness and accuracy of the proposed approach, are:
• Through our experiments, we demonstrate how incorporating the skeleton-based keyframe selection module assists in achieving effective keyframe features for the badminton activity framework.
• The developed framework extracts skeleton information from the video and identifies keyframe postures for action recognition, reducing the computational resources required compared to using all frames of a video.
• The proposed network pathways are designed to provide real-time feedback for coaches and athletes, enabling them to improve their actions in a timely manner.

The rest of the paper is organized as follows: Section II discusses the related work on badminton activity recognition. Section III presents the methodology and prototype design. Section IV describes the performance evaluation and implementation environment, and Section V concludes this research.

II. RELATED METHODS
In this section, we explain the related methods that are employed to recognize badminton shot actions. The stereo vision cameras capture the players' video, which assists in reconstructing the 3D representation of the player and calculating the court size. Subsequently, OpenPose and VIBE are utilized to extract the 2D and 3D skeletons from the video. Finally, a keyframe detection module is employed to extract key frames for sub-action content analysis. Additionally, the details of the badminton action recognition and keyframe extraction methods are provided.

A. STEREO VISION CAMERA
We utilized two stereo vision cameras with different viewing angles to accurately capture the depth information of badminton actions by employing the principle of triangulation [23]. By obtaining depth information from stereo vision, it becomes possible to reconstruct a three-dimensional (3D) representation of the scene. Prior to calculating the 3D positions, it is necessary to determine the intrinsic and extrinsic matrices for each camera.

To obtain the intrinsic matrix and distortion parameters for each camera, a checkerboard pattern was used to assist in calibration. Additionally, the extrinsic matrix for each camera on the court was calculated by employing a homography mapping from the white field lines of the court to the known court size. Once the camera parameters were obtained, the 3D points were triangulated from a set of points calculated using two different perspective images.

B. HUMAN SKELETON DETECTION
Human skeleton detection can be broadly categorized into two types: 2D skeleton prediction models and 3D skeleton prediction models.

1) 2D HUMAN SKELETON DETECTION
Firstly, we utilize OpenPose [20], an open-source state-of-the-art method based on Part Affinity Fields (PAF), to track human pose on the badminton court. PAF provides vectors that connect one joint point to the next, capturing the relationships between different body parts. OpenPose is a highly capable framework that enables the detection and tracking of multiple people's poses simultaneously. This multi-person tracking capability is particularly important, as it allows us to account for human interactions and analyze them in natural settings. By leveraging OpenPose, we can accurately track and analyze the poses of badminton players and understand their interactions with each other, which aids in visualizing the real-world court in the virtual world.

2) 3D HUMAN SKELETON DETECTION
We utilized the VIBE (Video Inference for Human Body Pose and Shape Estimation) network [19] to extract the 3D human skeleton from a monocular RGB video [24]. The primary objective of VIBE is to accurately estimate the 3D body pose and shape of badminton players from such videos. VIBE adopts an end-to-end architecture, transforming 2D input images into 3D skeleton coordinates through a generative adversarial network (GAN) [25]. To capture the temporal relationship of the video frames and enhance action coherence, VIBE incorporates a gated recurrent unit (GRU) [26]. To train VIBE, a mixed dataset comprising 2D and 3D data from MPI-INF-3DHP [27], Human3.6M [28], and 3DPW [29] is employed. This diverse dataset ensures robust training and generalization of the network. VIBE's performance is evaluated using the Percentage of Correct Keypoints (PCK) metric, achieving a correctness score of 89.3%. VIBE detects 49 key points of the skeleton, providing a detailed understanding of the actions captured in the badminton video. This comprehensive set of key points enables a thorough analysis of the player action content, facilitating further interpretation.

C. TRAJECTORY DETECTION
We employed the TrackNetV2 network, as described in [30], to track shuttlecocks and visualize their positions on the virtual world court. TrackNetV2 is specifically designed to excel in detecting small, fast-moving objects such as shuttlecocks in video footage. It operates on a frame-by-frame basis, accurately determining the shuttlecock's position in each frame.

The architecture of TrackNetV2 follows an encoder-decoder structure. The encoder acts as a feature extractor, utilizing convolutional kernels to capture image clues and condensing the features through max-pooling operations. Conversely, the decoder expands the feature maps to generate the prediction, enabling accurate shuttlecock tracking.

TrackNetV2 is trained on a dataset that contains 55,563 frames from 15 broadcast videos of professional games and 3 amateur games. In order to prevent overfitting, we collected an additional 125 rally videos with diverse backgrounds and filming angles. Approximately 2,500 to 3,000 frames were included from each video. TrackNetV2's accuracy reaches 98.7% in the training phase and 85.4% in a test on a new match. Moreover, TrackNetV2 exhibits a processing speed of 31.84 frames per second (FPS), which greatly facilitates shuttlecock tracking in our approach.

D. KEYFRAME EXTRACTION
Several researchers have proposed various keyframe extraction methods using different strategies. Phan et al. [31] introduced an efficient framework named KFSENet for action recognition in videos, incorporating keyframe extraction based on skeleton deep learning architectures. Kim et al. [32] proposed a bidirectional consecutively connected two-pathway network (BCCN) for efficient gesture recognition using a skeleton-based keyframe selection module. Lv et al. [33] developed a sports action classification system for accurately classifying athletes' actions based on keyframe extraction.
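As a concrete illustration of the triangulation principle described in Section II-A, the sketch below recovers a 3D point from two calibrated views using linear (DLT) triangulation. The camera parameters and the test point are synthetic illustrative values, not the paper's setup, and the function names are assumptions.

```python
import numpy as np

def triangulate_point(P1, P2, x1, x2):
    """Linear (DLT) triangulation of one point seen by two cameras.

    P1, P2: 3x4 projection matrices (intrinsics @ [R|t]).
    x1, x2: pixel coordinates (u, v) of the same point in each view.
    Returns the 3D point in world coordinates.
    """
    A = np.array([
        x1[0] * P1[2] - P1[0],
        x1[1] * P1[2] - P1[1],
        x2[0] * P2[2] - P2[0],
        x2[1] * P2[2] - P2[1],
    ])
    # The homogeneous 3D point is the right singular vector of A with
    # the smallest singular value.
    _, _, vt = np.linalg.svd(A)
    X = vt[-1]
    return X[:3] / X[3]

# Two synthetic cameras observing a known point (1 m baseline):
K = np.array([[800.0, 0, 320], [0, 800.0, 240], [0, 0, 1]])
P1 = K @ np.hstack([np.eye(3), np.zeros((3, 1))])
P2 = K @ np.hstack([np.eye(3), np.array([[-1.0], [0.0], [0.0]])])
X_true = np.array([0.5, 0.2, 5.0])
x1 = P1 @ np.append(X_true, 1); x1 = x1[:2] / x1[2]
x2 = P2 @ np.append(X_true, 1); x2 = x2[:2] / x2[2]
print(triangulate_point(P1, P2, x1, x2))  # close to [0.5, 0.2, 5.0]
```

In practice a library routine such as OpenCV's `cv2.triangulatePoints` performs the same computation once the intrinsic and extrinsic matrices have been estimated from the checkerboard and the court-line homography.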


FIGURE 2. Keyframe based activity recognition methodology.

III. KEYFRAME BASED ACTION ANALYSIS METHODOLOGY
The keyframe based action analysis framework contains four main modules: data extraction, keyframe detection, posture detection, and movement detection, as illustrated in Fig. 2.

A. DATA EXTRACTION MODULE
The first module of the system is data extraction, which consists of five steps: multi-view video, camera parameters, skeleton detection, trajectory detection, and the coordinate system. The multi-view video system is equipped with two stereo-vision cameras to capture the game video of badminton players. The camera parameters, including the intrinsic and extrinsic matrices and distortion parameters, are calculated using the methodology defined in subsection II-A of Section II. The skeleton detection utilizes the 3D:VIBE [19] method to identify the keypoints of the human body from the video. Similarly, the TrackNetV2 [30] network tracks the 3D position of the badminton shuttle from the video.

After the calculation of the above steps, the coordinate system incorporates real-world court coordinates synchronized with timestamps and camera parameters. The court coordinates are defined with the court as the origin, the short side as the X-axis, the long side as the Y-axis, and the Z-axis pointing upward along the ground's normal vector. The corrected camera parameters are utilized to establish the relative relationship between the two cameras and the court origin, enabling the visualization of the players' actions and the ball. Lastly, the extracted body and shuttle coordinates are mapped onto the world coordinate system to be visualized in the virtual court.

B. KEYFRAME DETECTION MODULE
Keyframe detection is utilized to extract key-pose frames from a series of action frames. In a sequence of action frames, there are several key poses that specifically represent certain actions. Our proposed method for action recognition extracts key pose frames from videos instead of analyzing the entire video sequence frame-by-frame. This approach significantly reduces the amount of data that needs to be processed, as keyframes only capture the most significant parts of the video where noticeable changes occur.

FIGURE 3. Keyframe detection module.

To identify these keyframes, we focus on studying the badminton smash action, which can be roughly divided into four key poses. These poses are determined based on the skeleton keypoints and ball trajectory, as illustrated in Figure 3. By analyzing the skeleton and trajectory information, each smashing video is segmented into several one-shot videos using the trajectory turning point as a reference. The one-shot videos are synchronized with timestamps from two different perspectives, and the entire action can be segmented based on four key posture positions: the start, ready, strike, and end poses.

First, the ball trajectory turning point is calculated to determine the moment of the strike. The turning point can be calculated by analyzing the variation product of the Y vector, considering that badminton is played along the long axis. If no trajectory turning point is found, it indicates that the ball was not hit, and the stroke is considered invalid. This process is illustrated in the flow chart shown in Figure 4.

FIGURE 4. Working flow of keyframe detection module.
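The turning-point test above can be sketched as follows. This is an illustrative reading of the "variation product of the Y vector" as the product of consecutive Y-displacements turning negative; the function name and sample trajectory are assumptions, not the authors' implementation.

```python
import numpy as np

def find_turning_point(y):
    """Return the frame index where the shuttle's Y direction reverses.

    y: per-frame Y coordinates of the tracked shuttle along the court's
    long axis. A strike flips the sign of the frame-to-frame Y
    displacement, so the product of consecutive displacements becomes
    negative at the turning point. Returns None if no reversal is found,
    i.e. the ball was never hit.
    """
    dy = np.diff(y)
    for f in range(len(dy) - 1):
        if dy[f] * dy[f + 1] < 0:  # direction reversal between frames
            return f + 1
    return None

# Shuttle approaches along +Y, then is struck back toward -Y:
traj = [0.0, 1.0, 2.0, 3.0, 2.5, 1.5, 0.5]
print(find_turning_point(traj))  # -> 3
```

A trajectory with no sign change (e.g. a missed ball flying past the player) yields `None`, which corresponds to the invalid-stroke branch of the flow chart.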


Second, the ready posture keyframe is determined based on the position of the elbow as the back-most point on the court from the beginning of the video up to the keyframe of the strike. The elbow keypoint plays a crucial role in identifying the ready posture in the badminton smash action.

\sum_{f=f_0}^{f_0 + k \cdot fps} \sum_{i=0}^{w} \left| j_i^{f+1} - j_i^{f} \right| < \mathrm{const} \qquad (1)
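A minimal sketch of the stillness test in Equation (1), assuming the 3D joint coordinates are stacked per frame; the array shapes, function name, and threshold value here are illustrative assumptions rather than the paper's implementation.

```python
import numpy as np

def is_start_posture(joints, f0, k, fps, threshold):
    """Equation (1): total joint displacement over a k-second window.

    joints: array of shape (frames, W, 3) holding the 3D court
    coordinates of the W important skeleton points (shoulder, elbow,
    neck, wrist, ...). Returns True when the summed frame-to-frame
    movement of all W points within k seconds stays below the threshold.
    """
    window = joints[f0 : f0 + int(k * fps) + 1]
    # |j_i^{f+1} - j_i^f| summed over frames f and joints i
    motion = np.linalg.norm(np.diff(window, axis=0), axis=2).sum()
    return motion < threshold

# A nearly still two-joint skeleton over 1 second at 30 fps:
rng = np.random.default_rng(0)
still = np.cumsum(rng.normal(0, 1e-4, size=(31, 2, 3)), axis=0)
print(is_start_posture(still, f0=0, k=1, fps=30, threshold=0.25))  # True
```

The threshold would follow the quarter-of-total-movement rule described in the text (e.g., 25 cm when the keypoints move 100 cm in total).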

Third, the start posture keyframe is identified as the frame with the least skeleton variation. The start posture is determined based on the relaxed position of the body, considering either the entire body or specific joints in a relaxed position. The relaxed position is calculated using the Euclidean distance equation [34], as shown in Equation 1. In Equation 1, f0 represents the initial frame, k is the time in seconds, fps is the camera frame rate per second, w denotes a set of important skeleton points (such as the shoulder, elbow, neck, and wrist), and ji represents the 3D coordinates of skeleton point i on the court. The equation implies that the sum of the moving distances of all skeleton points w within k seconds should be less than a threshold. The threshold can be determined by calculating the sum of the first quarter of the movement of all skeleton keypoints (e.g., if the keypoints' movement is 100 cm, a threshold value of 25 cm is used). If the Euclidean distance variation for the w skeleton points within k seconds is less than the threshold, the frame is considered a start posture keyframe for the smash action.

Finally, the end posture keyframe is determined by considering the position of the wrist as the lowest point on the court from the keyframe of the strike to the end of the video. The timestamp can also be utilized to assist in identifying the keyframe, especially when the keyframe is located very close to the end of the video. The overall analytical framework for the badminton smash action, which includes the action description, keyframe detection, posture analysis, and movement analysis, is defined in Table 2.

Furthermore, this framework can also be applied to analyze other ball actions in badminton, such as the forehand/backhand high ball and the forehand/backhand cut ball. The process involves specifying keyframe restrictions and extracting the features of each posture and movement for analysis. The only differences lie in the number of keyframes and the specific action judgments. For instance, the analysis of the hit angle is required for the high ball, smash, and cut ball, but the landing location of the ball and the ball speed requirements may vary. Therefore, the judgments for different types of actions need to be tailored accordingly. In this case, the number of keyframes remains the same for these three actions, but it will differ from the analysis of a flat ball.

C. POSTURE AND MOVEMENT DETECTION MODULE
The posture and movement evaluation modules analyze both the static and dynamic movements of the action features. Once the keyframes are detected, the entire action is segmented into several segments. The static pose is defined by the stillness captured in the keyframe, while the movement occurring near the keyframes or between two keyframes represents the dynamic pose. For instance, in the case of a badminton smash, there are four key postures: start, ready, hit, and end. Additionally, there are three key movements between adjacent keyframes: the preparation period, the swing period, and the closing period. The extracted features from these key postures and movements are valuable, as they can be evaluated based on expert feedback. This evaluation process aids both beginners and professional players in assessing their actions.

FIGURE 5. Posture detection mechanism for badminton smash start posture.

The posture detection module measures the position of each keyframe posture, including start, ready, strike, and end, based on rules defined by coaches and experts [16]. For example, the start posture of a badminton smash is defined by five skeleton keypoint positions. First, the lean angle, which is the angle between the spine vector J1J8 and the ground normal vector n, should be between 10° and 20°. Second, the hands should be placed naturally in front of the body without overlapping, ensuring that the right hand J4 and left hand J7 are on their respective sides and facing the plane defined by keypoints J2J8J5. Third, the knee keypoints J10, J13 should be in a squat position, and the angle between the thigh keypoints J9, J11 and the calf keypoints J12, J14 should be between 150° and 170°, as shown in the orange rectangle in Fig. 5. Fourth, the absolute distance between the two heel keypoints J21 and J24 should be the same as the width between the shoulder keypoints (J2 and J5), with an acceptable error of about 20 cm. Finally, the center of gravity should fall between the feet, with an acceptable error of about 10 cm, and the gravity vector G should be projected onto the midpoint keypoint (between ankle joint keypoints J11 and J14) with an absolute vector error of 10%. Similar patterns are used to measure the remaining keyframe postures, including ready, strike, and end.

The movement detection module detects the player movement duration between the adjacent keyframes, including the


preparation period, the swing period, and the closing period. The movement detection mainly focuses on changes and accumulation in joint keypoints, including rotation, action sequence, kinetic chain, ball speed, etc. A player movement is detected based on movement features, which are roughly divided into four common categories: (i) angle; (ii) spatial comparison; (iii) rotation; (iv) relaxed positions. All these angles, lines, and planes are composed of human joint keypoints, ball positions, coordinate points, or gravity points of the human body. For example, the swing period is detected between the ready and strike postures, as illustrated in Fig. 6. The swing period is measured based on the movement of the upper arm with the rotation of the waist and the forearm to hit the ball. Similarly, the remaining movement periods are determined based on the rules shown in the movement analysis column of Table 2. The posture and movement detection modules assist in determining the player's action, and an action is determined based on the body movement position of either the whole body joints or some joints.

TABLE 1. Characteristics of the study participants.

FIGURE 6. Movement detection: smash swing period.

FIGURE 7. OpenPose keypoints based gravity center.

1) GRAVITY POINT
It is worth mentioning that the center of gravity plays a crucial role in ensuring the correctness of the player's action. The change in the center of gravity helps beginner players gain a clearer understanding of their actions and improve them, as illustrated in Fig. 7. The center of gravity is a significant aspect of biomechanics and locomotion, aiding in the modeling of the human body and its activities. It is instrumental in assessing static positions and various types of movement techniques. The center of gravity [35] of the whole body is calculated using the weighted average of the body keypoints, as shown in Equation 2, where wi represents the weight of keypoint i and ji denotes keypoint i of the body.

G = \frac{\sum_i w_i \, j_i}{\sum_i w_i} \qquad (2)

2) COORDINATE TRANSFORM
The coordinate transform method is used to map the human body coordinates (kp_pose), generated by VIBE, onto the virtual court coordinate system (kp_court), as depicted in Fig. 8. In the figure, the red dotted rectangle represents the camera coordinates, and the blue notations represent unknown parameters. The intrinsic and extrinsic matrices are represented by Ks, which is obtained using the perspective projection formula of computer vision and perspective-n-point pose computation [36]. Rpose and Tpose are calculated


by applying the rotation and taking the transpose of the skeleton pose keypoints. After the rotation and transpose of the skeleton pose, a dot product is performed with the body coordinates kp_pose to obtain the body keypoint image kp_img, which visualizes the player's body keypoints in the virtual court. Similarly, Rcourt and Tcourt represent the court coordinate rotation and transpose notations. These notations, along with the dot product of kp_court, also help in obtaining the body keypoint image kp_img to visualize the player's body keypoints in the virtual court. The notation kp_court is obtained using the formula defined in step 2. After finding the body and court keypoints, they are visualized in the virtual court along with the shuttle trajectory, as illustrated in Fig. 9, where white points indicate the badminton trajectory and red points represent the body keypoints.

FIGURE 8. Coordinate transform method.

FIGURE 9. Visualization of coordinate points with ball trajectory in the virtual court.

IV. PERFORMANCE EVALUATION AND IMPLEMENTATION ENVIRONMENT
A. DATASET
The badminton games dataset was captured independently on multiple subjects and used for performance evaluation. This dataset has synchronized multi-view videos and labeled keyframes that are defined by each evaluation algorithm. The dataset collection process includes (i) fixing the position and angle of the two cameras, (ii) the testing player standing in the red rectangle of the two pictures in CoachBox, as in Fig. 10, (iii) placing a ball machine in a fixed position so that the tester can hit the ball better, (iv) selecting the court line corners to calculate the extrinsic matrix, and (v) starting the test.

FIGURE 10. Camera view angle and tester standing position.

The dataset contains six types of ball for each of ten people of different genders, ages, and levels, as stated in Table 2. The ball types include the forehand and backhand smash, the forehand and backhand high ball, and the forehand and backhand cut ball, a total of six actions for 10 people. Each action is performed 10 times, meaning the ball machine serves 10 consecutive balls for data collection. As a result, short videos of a total of 600 rallies are pre-edited and stored in the dataset. The intrinsic matrix and extrinsic matrix are also saved in the dataset. The collection process is as follows: the player arrives at the designated position on the court, and then another person presses the start test button. The ball machine first serves two balls for initial testing, and these two balls are not included in the evaluation. Then, the official test starts by serving 10 consecutive balls for data collection before changing to the next action.

B. PERFORMANCE EVALUATION
1) KEYFRAME DETECTION EVALUATION
The keyframe detection module is evaluated on the above-mentioned dataset using the mean absolute error, which compares the ground truth and predicted frame labels. The keyframe module is evaluated according to different categories of players. The first category represents the professional player, of which there is only one in our dataset, and the rest belong to the beginner category. First, we evaluated the module for the professional player and calculated the error. The module used 60 different types of action to detect the four different keyframe postures, including the start, ready, strike, and end. The average mean absolute errors of the four keyframe postures are shown in Table 3.

The performance of the keyframe detection module is also evaluated on 5 randomly selected beginner-level players, whose actions total 300. The average mean absolute errors of the four keyframe postures are shown in Table 4. The results show that the mean absolute error is larger for the beginner players than for the professional

TABLE 2. Characteristics of the study participants.

TABLE 3. Keyframe evaluation on the pro-level player.

TABLE 4. Keyframe evaluation on beginner-level players.

TABLE 5. Comparative analysis with related methods.

TABLE 6. Number of evaluated strokes of each class.

FIGURE 11. Visualization of athletes' performance.

player due to their non-standard actions. The reason for the error is generally that the entire action is not completed or the player does not return to the original position.
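The per-posture mean absolute error used above can be sketched as follows. This is a minimal illustration only: the function name, the per-action dictionaries of labeled ground-truth and predicted keyframe indices, and the sample values are assumptions for exposition, not taken from the original implementation.

```python
# Mean absolute error (in frames) between ground-truth and predicted keyframes.
# Posture names follow the four keyframe postures defined in the paper.
POSTURES = ("start", "ready", "strike", "end")

def keyframe_mae(ground_truth, predicted):
    """ground_truth/predicted: lists of dicts mapping posture -> frame index.
    Returns a dict mapping each posture to its mean absolute error."""
    errors = {p: [] for p in POSTURES}
    for gt, pred in zip(ground_truth, predicted):
        for p in POSTURES:
            errors[p].append(abs(gt[p] - pred[p]))
    return {p: sum(v) / len(v) for p, v in errors.items()}

# Hypothetical example with two labeled actions:
gt = [{"start": 10, "ready": 25, "strike": 40, "end": 55},
      {"start": 12, "ready": 28, "strike": 44, "end": 60}]
pred = [{"start": 11, "ready": 24, "strike": 42, "end": 55},
        {"start": 12, "ready": 30, "strike": 43, "end": 58}]
print(keyframe_mae(gt, pred))  # {'start': 0.5, 'ready': 1.5, 'strike': 1.5, 'end': 1.0}
```

Averaging these per-posture errors over all actions of a player category yields the values reported in Tables 3 and 4.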
FIGURE 12. CoachBox technology overview.

2) POSTURE EVALUATION
Each posture position is evaluated based on the pre-defined rules in Table 2 and Fig. 5. The athlete with the most accurate posture positions gets the highest score, as illustrated in Fig. 11. The figure shows that the average score of the beginner-level athletes is lower than that of the professional player. The reason is that their actions are inaccurate and incomplete compared to those of the professional player.
The proposed method is also compared with similar methods, as shown in Table 5. These methods recognize various player actions, such as forehand, backhand, serve, and volley, by analyzing video frames. The performance of these methods was evaluated on their own datasets, and the results showed that our method outperformed the other state-of-the-art methods for player action recognition in badminton videos. Similarly, the proposed method is evaluated on each class to check its per-class performance; Table 6 shows the result for each class with its respective shot data.

C. CoachBox: SYSTEM TECHNIQUE
The CoachBox's entire system technology is illustrated in Fig. 12. It is divided into four main parts: video capture, shuttlecock trajectory tracking, skeleton detection, and action analysis. The video capture part includes how to synchronize

and record the multi-view video; camera calibration, which calculates the intrinsic and extrinsic matrices; and the use of MQTT to transport the data from the two cameras. Trajectory tracking includes using TrackNetV2 to detect shuttlecock tracks, 3D positioning, and trajectory smoothing. Skeleton detection includes using VIBE to detect 3D human skeletons, parsing the output of VIBE, and transforming the 3D skeleton points into the court coordinate system. The last part is the action analysis system, which uses the framework proposed in this study to systematically analyze the actions, including data extraction, keyframe detection, posture evaluation, and movement evaluation.
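The camera-to-court transform mentioned above can be sketched as follows. This is a simplified illustration under assumed conventions: the function name, the extrinsic convention x_cam = R·x_court + t, and the sample values are hypothetical and not taken from the CoachBox implementation.

```python
import numpy as np

def camera_to_court(points_cam, R, t):
    """Map 3D skeleton points from the camera frame to the court frame.

    points_cam: (N, 3) array of joints in camera coordinates.
    R (3x3), t (3,): extrinsics, assumed here to follow x_cam = R @ x_court + t.
    """
    # Invert the extrinsic transform: x_court = R.T @ (x_cam - t).
    # Row-vector form: (x - t) @ R applies R.T to each row.
    return (points_cam - t) @ R

# Illustrative values: identity rotation, camera origin 3 m above the court origin.
R = np.eye(3)
t = np.array([0.0, 0.0, 3.0])
joints_cam = np.array([[0.5, 1.0, 3.0]])          # one joint in the camera frame
joints_court = camera_to_court(joints_cam, R, t)  # joint at (0.5, 1.0, 0.0) in court coordinates
```

Inverting the extrinsics this way assumes R comes out of calibration as a proper rotation matrix, so its inverse is simply its transpose.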
V. CONCLUSION
This paper presents a badminton action analysis framework that offers a solution for analyzing and evaluating complex shots, such as the smash, in badminton videos using the keyframe detection module. The framework can handle real-time inputs from badminton games and provides a comprehensive analysis of badminton activity, ranging from macro-level to micro-level analysis, allowing for insights into each attribute of micro-level badminton activity. Furthermore, the framework is implemented on CoachBox, enabling the mapping of player actions and shuttle trajectories onto real-world courts for visualization. This system assists coaches and players in generating analysis reports that provide insights into their games, helping them correct their action poses and reduce the risk of sports injuries. Future work will focus on developing an action description language to translate the coach's defined feature judgments, thus enhancing the algorithm's efficiency and facilitating the systematic integration of all action features.

MUHAMMAD ATIF SARWAR received the B.S. and M.S. degrees in computer science from COMSATS University Islamabad, Sahiwal Campus, Pakistan, in 2015 and 2017, respectively. He is currently pursuing the Ph.D. degree with the EECS International Graduate Program, National Yang Ming Chiao Tung University, Taiwan. His research interests include artificial intelligence, deep learning, and computer vision. His current research focuses on detecting activities and actions in retail stores, sports, and exercise.

YU-CHEN LIN is currently pursuing the master's degree with the National Yang Ming Chiao Tung University, Taiwan. His research interests include artificial intelligence, deep learning, and computer vision.

YOUSEF-AWWAD DARAGHMI received the B.E. degree in electrical and computer engineering from An-Najah National University, in 2002, and the master's and Ph.D. degrees in computer science and engineering from the National Chiao Tung University, Taiwan, in 2007 and 2014, respectively. He is currently an Associate Professor with the Computer Systems Engineering Department, Palestine Technical University–Kadoorie. His research focuses on intelligent transportation systems, vehicular ad hoc networks, and blockchain. He received the Best Paper Award from the International Conference on Intelligent Transportation Systems Telecommunications, in 2012. He served as a Technical Program Committee Member for the International Conference on Connected Vehicles and Expo (ICCVE 2012–2016), the International Conference on Intelligent Transportation Systems Telecommunications (ITST 2012–2018), the International Conference on Signal Processing (ICOSP 2015 and 2016), and the Asia–Pacific Network Operation and Management Symposium (APNOMS 2015 and 2016). He is a reviewer for several distinguished journals, including IEEE TRANSACTIONS ON INTELLIGENT TRANSPORTATION SYSTEMS, IEEE Communications Magazine, and IEEE Network Magazine.

TSÌ-UÍ İK (Member, IEEE) received the B.S. degree in mathematics and the M.S. degree in computer science and information engineering from the National Taiwan University, in 1991 and 1993, respectively, and the Ph.D. degree in computer science from the Illinois Institute of Technology, in 2005. He is currently a Professor with the Department of Computer Science and the Director of the Institute of Computer Science and Engineering, National Yang Ming Chiao Tung University. His research focuses on intelligent applications, such as intelligent sports learning and intelligent transportation systems, mobile sensing, machine learning, deep learning, and wireless sensor and ad hoc networks. He has been a Senior Research Fellow with the Department of Computer Science, City University of Hong Kong. He was bestowed the Outstanding Young Engineer Award by the Chinese Institute of Engineers, in 2009, and the Young Scholar Best Paper Award by the IEEE IT/COMSOC Taipei/Tainan Chapter, in 2010. He received the Best Paper Award at ITST 2012. He received a three-year Outstanding Young Researcher Grant from the National Science Council, Taiwan, in 2012. In 2020, he received the Sports Science Research and Development Award, MoE, Taiwan. In 2020 and 2021, his research works received the MOST Future Tech Award.

YIH-LANG LI (Member, IEEE) received the B.S. degree in nuclear engineering and the M.S. and Ph.D. degrees in computer science, majoring in designing and implementing a highly parallel cellular automata machine for fault simulation, from the National Tsing Hua University, Hsinchu, Taiwan. In 2003, he joined the faculty of the Department of Computer Science, National Chiao Tung University (NCTU), Hsinchu, where he is currently a Professor. From 1995 to 1996 and from 1998 to 2003, he was a Software Engineer and an Associate Manager with Springsoft Corporation, Hsinchu, where he first completed the development of a design rule checking (DRC) tool for custom-based layout design and then established and led a routing team developing a block-level shape-based router for custom-based layout design. His current research interests include physical synthesis, parallel architecture, vehicle navigation, and deep learning. He joined the technical committee of the first CAD contest in Taiwan and served as a committee member for ten years. He has been serving as a Compensation Committee Member and an Independent Director of the Board of Directors for AMICCOM Electronics Corporation, since 2012. He was a recipient of the Japan Society for the Promotion of Science Faculty Invitation Fellowship. He was the Contest Chair of the first CAD Contest at ICCAD, in 2012, and a Technical Program Committee Member of ASPDAC and DAC.
