Enhancing Boxing Techniques Through Explainable AI
Jack Laing
School of Computing Science, Newcastle University, UK
Abstract
This project aims to use a combination of sensor data collected from the IMU of an Arduino and body
landmark positions detected using MediaPipe’s pose estimation to successfully train a neural network capa-
ble of predicting punch acceleration. Using SHAP analysis, we identify which aspects of a boxer’s position
whilst punching most significantly affect its speed/acceleration. This information can then be used to give
insights for a boxer’s training, with the aim of enhancing their technique.
Keywords: Pose Estimation, Arduino, Machine Learning, SHAP Analysis, Punch Acceleration
1 Introduction
1.1 Project Statement
The main aim of this project is to identify an aspect of a boxer’s punching technique
that either increases or decreases their punch speed in order to serve as a proof of
concept in using sensors for enhanced athletic training. This is achieved through
applying SHAP analysis to a model capable of predicting punch acceleration based
on the changes in body position over time. Once a successful model is trained,
SHAP analysis determines the input that most affects the predicted acceleration,
therefore indicating which parts of a boxer’s position are most important when
throwing a punch.
We aim to pinpoint areas that are important to the kinetic chain [5]. This
includes examining hip rotation and the swapping of shoulder locations (retracting
one to throw the other forward), with the aim of showing, through various data
analytics, the impact these have on a punch's acceleration, thereby permitting a
tailored training experience built around these findings [30].
1.2 Motivation
Demonstrating the benefits of sensor-based training when combined with AI can
help pave the way for future developments and help guide future research. By
proving the validity of data-driven insights in sensor-based training, we can help
encourage investments into such tech, allowing for improved training regimes in
many athletic fields. This project aims not only to improve the efficiency and
effectiveness of a punch in boxing, but also to demonstrate how we can revolutionize
athletic training methods in all sports and even improve our base understanding of
biomechanics [22][3].
• Forward Propagation - In this phase, data is transmitted from the input layer
through the network. Each layer applies weights, biases, and activation functions
(which introduce non-linearity) until the output layer produces the predicted
value based on the current weights in the system.
• Back Propagation - This process derives the gradient of the loss function
with respect to each weight in the network, via the chain rule, so that the
weights can be adjusted to reduce the prediction error. The update step that
applies these gradients is known as gradient descent.
• Gradient Descent - The optimisation algorithm used alongside backpropagation
to minimise the value produced by the loss function. It adjusts the weights
iteratively to achieve a network capable of predicting the ground truth data
based on raw input data [9].
• Activation Functions - These functions are applied to the output of a given
neuron to introduce non-linearity. This allows the network to learn more complex
relationships between input data and ground truth data.
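To make these phases concrete, the interplay between the forward pass, the loss gradient, and the descent update can be sketched with a toy one-weight model. This is illustrative Python only, entirely separate from our actual network:

```python
# Toy one-weight network: predict y = w * x, trained by gradient descent
# on a mean-squared loss. Forward propagation computes the predictions;
# the gradient below is the hand-derived result of backpropagation.

def loss(w, xs, ys):
    return sum((w * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

def gradient(w, xs, ys):
    # d/dw of mean((w*x - y)^2) = mean(2 * (w*x - y) * x)
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)

xs, ys = [1.0, 2.0, 3.0], [2.0, 4.0, 6.0]   # ground truth: y = 2x
w, learning_rate = 0.0, 0.05
for _ in range(200):                         # repeated descent steps
    w -= learning_rate * gradient(w, xs, ys)

print(round(w, 3))   # 2.0 -- the weight converges to the true slope
```

Each iteration nudges the weight downhill along the loss surface; real networks do exactly this, only over millions of weights with gradients computed automatically.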
Non-linearity is an essential part of various deep learning models as real-life
scenarios and data are often too complex to be modelled using only linear operations.
Without non-linearity, models would only be able to solve problems where the data
points are linearly separable, meaning they can be divided by a straight line. An
example is the ReLU function, which passes only positive values and zeroes out
negative ones.
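As a concrete illustration, ReLU can be written in a single line of Python:

```python
def relu(x):
    """Rectified Linear Unit: passes positive values, zeroes out the rest."""
    return max(0.0, x)

print([relu(v) for v in [-2.0, -0.5, 0.0, 1.5]])   # [0.0, 0.0, 0.0, 1.5]
```

Despite its simplicity, stacking layers of such piecewise-linear units lets a network approximate highly non-linear functions.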
Recurrent neural networks (RNNs) excel at processing time series data because
they can consider previous inputs when making their predictions [37]. This makes
them an essential part of our project, which requires tracking the positions of
body landmarks [18] over a period of time.
However, RNNs are not without issues. A primary one is the vanishing gradient
[38], which occurs when a gradient shrinks exponentially as it is propagated
through the RNN's connections through time. Similarly, the opposite can occur
with exploding gradients for the same reason [10].
Due to the prominence of these issues, the Long Short-Term Memory (LSTM)
model was created to combat these challenges faced by the standard RNN.
Shown above is the subtle yet important difference between the standard RNN
and its more advanced successor, the LSTM. The LSTM combats the exploding and
vanishing gradient problems by implementing gates that regulate the flow of
information. These gates are known as:
• Input Gate - Determines which values from the input data are important and
discards those that are not.
• Forget Gate - Allows the model to discard previously saved input data if deemed
no longer useful.
• Output Gate - Controls the flow of information from cell state to hidden state
all the way to the output.
[37]
This improved maintenance of the gradient allows the LSTM to consider long
durations of data, making it well suited to time-sequence tasks without suffering
from the exploding or vanishing gradient [10][38].
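The gate mechanics can be illustrated with a single-unit LSTM step written in plain Python; the weights below are made-up numbers purely for illustration, not learned values:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def lstm_step(x, h_prev, c_prev, W):
    """One step of a single-unit LSTM; W maps gate name -> (w_x, w_h, bias)."""
    def gate(name, squash):
        w_x, w_h, b = W[name]
        return squash(w_x * x + w_h * h_prev + b)

    i = gate("input", sigmoid)               # admit new information?
    f = gate("forget", sigmoid)              # keep the old cell state?
    o = gate("output", sigmoid)              # expose the cell state?
    c_tilde = gate("candidate", math.tanh)   # proposed new content
    c = f * c_prev + i * c_tilde             # updated cell state
    h = o * math.tanh(c)                     # new hidden state
    return h, c

# Made-up weights purely for illustration (not learned values).
W = {"input": (0.5, 0.1, 0.0), "forget": (0.0, 0.0, 2.0),
     "output": (0.3, 0.2, 0.0), "candidate": (1.0, 0.0, 0.0)}
h, c = 0.0, 0.0
for x in [1.0, 0.5, -0.2]:                   # a tiny input sequence
    h, c = lstm_step(x, h, c, W)
```

Note how the large forget-gate bias keeps f close to 1, so the cell state, and the gradient flowing through it, is largely preserved across time steps; this is the mechanism that mitigates the vanishing gradient.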
1.3.4 CNN
A convolutional neural network is a type of neural network that specialises in
computer vision. The reason for its success is that it can extract features in
real time by applying kernels to the raw pixel data. This extraction preserves
the spatial relationships of the original image, which helps the network interpret
the image accurately and allows the CNN to recognise patterns and objects within
images. Effectively, it analyses smaller portions of the image in order to make
final predictions without losing the context of the overall image.
Shown below is the architecture of a typical CNN:
As illustrated in the diagram, a CNN contains more than just convolution layers;
its components are explained below:
• Convolution Layer - The core building block of the CNN. It applies kernels
(small matrices of values) to the pixel data, computing a dot product at each
position; the resulting values form an activation map that represents the
localised features of the input image.
• Activation Function - An activation function is applied to the output of a
convolution layer or feature map and the aim of this function is to introduce
non-linearity to help guide the system in learning more complex relationships.
• Pooling Layer - The pooling layer, also known as subsampling or downsampling,
reduces the spatial size of the feature map and therefore the computational cost.
It can also help make the features invariant to changes in scale and orientation.
• Fully Connected Layer - The fully connected layers form a final feedforward
neural network into which the extracted feature maps are flattened and fed, so
that relationships between the features and the ground truth data can be formed.
[32]
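A toy convolution-and-pooling pass can be sketched in plain Python; the 4x4 "image" and the kernel below are hand-picked for illustration, not learned:

```python
def conv2d_valid(image, kernel):
    """'Valid' 2D convolution (strictly, cross-correlation, as in CNN libraries)."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [[sum(image[r + i][c + j] * kernel[i][j]
                 for i in range(kh) for j in range(kw))
             for c in range(out_w)]
            for r in range(out_h)]

def max_pool(fmap, size=2):
    """Non-overlapping max pooling to shrink the feature map."""
    return [[max(fmap[r + i][c + j] for i in range(size) for j in range(size))
             for c in range(0, len(fmap[0]) - size + 1, size)]
            for r in range(0, len(fmap) - size + 1, size)]

image = [[0, 0, 1, 1],
         [0, 0, 1, 1],
         [0, 0, 1, 1],
         [0, 0, 1, 1]]
kernel = [[-1, 0, 1],    # responds to a left-to-right intensity change
          [-1, 0, 1],
          [-1, 0, 1]]
fmap = conv2d_valid(image, kernel)   # [[3, 3], [3, 3]]
pooled = max_pool(fmap)              # [[3]]
```

The kernel fires wherever the image brightens from left to right, so the activation map localises that edge; pooling then keeps only the strongest responses, shrinking the map while preserving the feature.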
The CNN's powerful ability to perform feature extraction during training makes
pose estimation technologies possible, which are a fundamental part of our project,
allowing us to capture the inferred coordinates of various body landmarks [13][6].
Pose estimation [18] is a branch of computer vision [31] that aims to locate and
track the positions of body landmarks (shoulders, elbows, hips etc.). It identifies
the coordinates of these landmarks which can be utilised to differentiate various
poses in a given person.
Pose estimation is achieved in 5 key steps:
• Preprocessing - Methods such as resizing, noise reduction, and normalisation
can be applied to the input image to enhance its quality and make it more suitable
for pose estimation analysis.
• Feature Extraction - A CNN is used to extract key features from the image to
help in identifying the spatial arrangement of different body landmarks.
• Keypoint Detection - The pose estimation model uses these extracted features
to identify joints such as the elbow or knee.
• Pose Inference - Once the key parts of the body have been identified, the
keypoints are connected, forming a skeletal structure of the human body.
• Postprocessing - A final stage refines the model's output. This typically
includes filtering noise or applying constraints to prevent an impossible skeletal
structure.
[18]
Pose estimation accepts multiple forms of input, ranging from standard RGB to
depth images [8], of either a static (image) or dynamic (video) nature. In this
project we focus on dynamic RGB 3D pose estimation, which allows us to capture
the inferred depth of body landmarks from a recorded video [39].
While there are plenty of pose estimation architectures available, such as
OpenPose [6], we will be using Google's MediaPipe Pose due to its ease of use and
real-time pose estimation [13]. During this project, this allows us to record the
position of the boxer's body, without directly using sensors we do not have access
to, in order to train the model.
Pose estimation boasts a wide variety of applications. These range from motor
development tracking and clinical use in paediatrics for the early detection of
neurodevelopmental disorders like cerebral palsy [34], allowing doctors to begin
helping patients earlier than ever, to injury risk assessment through evaluating
abnormal gait patterns and sports-related injuries [34][20][18].
More relevant to our project, however, is the use of pose estimation to correct
form in activities like yoga [11] and to improve athletic performance in sports
[22]. Our project aims to apply these strategies to the world of boxing whilst
also helping to gain a better understanding of biomechanics [22], which we hope
will encourage future investment and development, with our research serving as a
key example for future training regimes.
• Hybrid Model Integration - Both models are then integrated to form a hy-
brid model, thus leveraging both the sensor and the pose data to predict the
acceleration magnitude.
[11]
A rather simple yet effective strategy: the Arduino is attached to a standard
boxing glove using black electrician's tape. Its 3-metre cable provides enough
slack to prevent the punch from being affected by tension in the cable, and also
permits a higher rate of data transmission than its Bluetooth counterpart [22].
The Python script that uses MediaPipe's pose estimation also stores the body
landmark positions in a JSON file, similar to the Arduino data, alongside the
elapsed time, which increases in intervals of 20 milliseconds [8].
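The resulting file format can be sketched as follows; the landmark names and coordinate values here are hypothetical stand-ins for MediaPipe's full landmark set:

```python
import json

# Hypothetical example of the logging format: two landmarks per frame,
# coordinates made up for illustration (the real script stores the
# full MediaPipe landmark set).
frames = []
for frame_index in range(3):
    frames.append({
        "elapsed_ms": frame_index * 20,   # fixed 20 ms interval
        "landmarks": {
            "left_shoulder": {"x": 0.41, "y": 0.32, "visibility": 0.98},
            "left_elbow":    {"x": 0.47, "y": 0.45, "visibility": 0.95},
        },
    })

payload = json.dumps(frames, indent=2)   # written to a .json file in practice
```

Keeping the Arduino log and the pose log in the same timestamped shape makes the later merge into model-ready sequences straightforward.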
3.4.1 Synchronisation
The 20 millisecond interval was chosen for ease of synchronisation. The Arduino
can only measure to the nearest millisecond, which becomes a problem when syncing
with a 60 fps video in which each frame occurs every 16.66 ms (1/60 s): the best
the Arduino can do is 17 ms. Although this is a small difference on a small time
frame, the error becomes more prominent as time goes on, eventually leaving the
data completely out of sync.
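This drift is easy to quantify with a small illustrative calculation (here assuming a 10-second clip):

```python
# Frame k of a 60 fps video occurs at k * 1000/60 ms, i.e. every 16.666... ms,
# but the Arduino can only timestamp to whole milliseconds.
fps = 60
true_times = [k * 1000 / fps for k in range(601)]      # ~10 s of video
rounded = [round(t) for t in true_times]
per_frame_error = max(abs(r - t) for r, t in zip(rounded, true_times))
print(round(per_frame_error, 3))   # 0.333 -- harmless for any single frame

# Stepping in fixed 17 ms increments (the nearest whole-ms interval) instead
# accumulates: after 10 s the offset has already grown to 200 ms.
stepped = [k * 17 for k in range(601)]
print(stepped[600] - true_times[600])   # 200.0
```

A 20 ms interval avoids this entirely, since it is an exact whole-millisecond period that both data streams can share.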
This is not acceptable, as synchronisation is essential to ensure that the Arduino
data corresponds directly to the pose data. Without it, the model is likely, if
not certain, to learn false trends, ruining any chance of training a precise and
effective model. It is therefore imperative that both datasets are synchronised
to provide a reliable basis for training, so that useful insights may be drawn [7].
These measures were taken in order to help ensure synchronisation throughout:
• Arduino - On startup, the Arduino runs a three-second countdown, indicated by
a blinking orange light; the light then turns solid orange to show that data
capture has begun. This provides a very clear and simple indication of exactly
when the data capture procedure begins [15].
• Video - Utilising the countdown measure in the Arduino sketch, we can record
this countdown in the video and trim the footage using software such as Clipchamp
so that data capture occurs at the very beginning of the video. This ensures that
the pose and Arduino data correlate directly with one another.
[1]
in the passband, which makes it ideal for this purpose. This filter retains the
important low-frequency components while removing the high-frequency noise.
• Normalisation - Each punch's data is normalised to a specific scale, ensuring
uniformity across samples. This involves scaling the sensor values to a range of
[0, 1], which reduces the variability due to different scales; such variability
can negatively impact the model's training and give unnecessary weight to
essentially random inputs [29].
• Purpose - They identify how acceleration changes throughout the punch, reveal-
ing key details that are important for understanding a punch’s dynamics [35].
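The min-max scaling described above amounts to a short function (the sensor values here are made up for illustration):

```python
def min_max_scale(values):
    """Scale one punch's readings to the [0, 1] range."""
    lo, hi = min(values), max(values)
    if hi == lo:                  # guard against flat (constant) signals
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

accel = [9.8, 12.4, 31.0, 18.2, 9.8]   # made-up acceleration magnitudes
scaled = min_max_scale(accel)           # min maps to 0.0, max maps to 1.0
```

Scaling each punch independently means the model sees the shape of each movement rather than the absolute magnitude of any one sensor's output.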
3.6.1 LSTM
This model processes the sequential data captured by the Arduino, including the
acceleration and gyroscopic readings. It captures the temporal dependencies in
the sensor data, which is a crucial part of understanding punch dynamics.
It consists of these layers:
• Input Layer - The input shape is (None, 16, 7), where 16 is the number of
time steps and 7 is the number of features (acceleration magnitude, etc.).
• LSTM Layer - This contains 64 units and is responsible for capturing temporal
dependencies [17].
• Dense Layer - A dense layer with 32 units follows the LSTM layer; this
processes the extracted features further and helps build relationships between
the inputs and the ground truth data [16].
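As a sanity check on these sizes, the standard parameter-count formulas for such a stack give the following (assuming a conventional LSTM with bias terms, as in Keras-style implementations):

```python
# Input: 16 time steps x 7 features -> LSTM(64) -> Dense(32).
features, units, dense_units = 7, 64, 32

# An LSTM has four gates; each has input weights, recurrent weights and biases.
lstm_params = 4 * ((features + units) * units + units)
dense_params = units * dense_units + dense_units

print(lstm_params)    # 18432
print(dense_params)   # 2080
```

Comparing these figures against a framework's model summary is a quick way to confirm the architecture was built as intended.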
3.6.2 CNN
The CNN is applied to the body landmark positions generated by MediaPipe's pose
estimation; it recognises patterns within the spatial arrangement of the body
landmark positions.
It consists of these layers:
• Input Layer - The input shape is (None, 16, 101), where 16 is the number of
time steps and 101 is the number of features (all body landmark positions and
their visibility).
• Convolutional Layer - The Conv1D layer with 64 filters and a kernel size of 3
is used to extract the local spatial features from the pose data [31].
• MaxPooling Layer - This layer is used to capture the most important features
as well as reduce the spatial dimensions of the feature map produced by the CNN
layer.
• Flatten Layer - This layer flattens the MaxPooling layer's output so that it
is suitable for the final dense layer.
• Dense Layer - A dense layer with 32 units follows the MaxPooling layer; this
processes the extracted features further and helps build relationships between
the inputs and the ground truth data [31].
The model used both the sensor data and the pose data during training, with its
performance evaluated on a separate validation set. The data was split 80/20,
with 80% used for training and 20% for validation [31].
The model possessed the following hyperparameters:
• Learning Rate - We aimed to balance the need for quick convergence and stable
training, using a learning rate of 0.001.
• Epochs - The model was trained for 50 epochs. This allowed for sufficient iter-
ations and opportunities for the model to form relationships between inputs and
the ground truth data [42].
Evaluation is key to enabling the model to predict the acceleration magnitude
successfully, as it provides the loss value that the network aims to reduce via
backpropagation, as explained in 1.3.2.
The model utilised the following metrics:
• Mean Squared Error (MSE) - Squares the errors between predicted and actual
values and averages them, penalising larger errors more heavily than small ones.
• Mean Absolute Error (MAE) - Averages the absolute errors between predicted
and actual values, providing a clear measure of the error's magnitude [23].
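Both metrics are simple to state in code (the predictions and ground truth values below are made up for illustration):

```python
def mse(predicted, actual):
    """Mean of squared errors; penalises large errors heavily."""
    return sum((p - a) ** 2 for p, a in zip(predicted, actual)) / len(actual)

def mae(predicted, actual):
    """Mean of absolute errors; a direct measure of average error size."""
    return sum(abs(p - a) for p, a in zip(predicted, actual)) / len(actual)

predicted = [2.0, 4.0]   # made-up model predictions
actual    = [1.0, 6.0]   # made-up ground truth
print(mse(predicted, actual))   # ((1)^2 + (-2)^2) / 2 = 2.5
print(mae(predicted, actual))   # (1 + 2) / 2 = 1.5
```

The squaring in MSE is why the 2-unit error dominates its score while MAE weighs both errors in proportion to their size.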
The graph shown above represents the validation and training loss of the model
over its 50-epoch training period from which several main observations can be made:
• Initial Training Phase - Both the training and validation loss see a steep decline
within the first few epochs indicating not only that the model is learning from
the data but also is generalising well to unseen data during the early stages of
training.
• Stabilization - After the steep decline of the initial training phase, both
losses stabilise around epoch 10, where the validation loss reaches a low point
and remains stable while the training loss continues to fluctuate slightly.
• Convergence - At the end of training we see clear convergence: the validation
loss has remained low and stable, indicating that the model has converged and is
not overfitting. A small gap between the training and validation losses further
suggests good generalisation [24].
These are extremely positive results and indicate that the model has effectively
learned the underlying patterns of the data, including the biomechanics of the
punches, which is what we aimed to achieve. They validate the model's architecture
and training process and indicate that this model is a valid solution for
predicting punch acceleration magnitude from body landmark positions [26].
4 Explainable AI (XAI)
Explainable AI (XAI) is a critical aspect of this project: it provides insight
into the predictive model's decision process, giving us a meaningful interpretation
of the model and an understanding of the influence each input has on the predicted
punch acceleration magnitude. This information can then be used within training
regimes to target specific aspects of a boxer's technique [14].
Shown above is a summary plot visualising the SHAP values for each feature. This
indicates not only how much each feature contributes to the prediction but also
which features are more important than others. The plot focuses on landmarks
directly related to the kinetic chain and has provided several key insights into
the biomechanics of a punch, such as [28]:
affecting the arm’s leverage and therefore its ability to generate punching power
[25].
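The idea underlying a SHAP value can be demonstrated exactly on a toy two-feature "model": a feature's Shapley value is its average marginal contribution over all orderings in which features can be added. The coalition values below are made-up numbers, not outputs of our trained network:

```python
from itertools import permutations

# Toy worth of each feature coalition (made-up numbers, not model outputs).
v = {(): 0.0, ("hip",): 4.0, ("shoulder",): 2.0, ("hip", "shoulder"): 10.0}

def coalition_value(members):
    return v[tuple(sorted(members))]

def shapley(feature, features):
    """Average marginal contribution of `feature` over all orderings."""
    orders = list(permutations(features))
    total = 0.0
    for order in orders:
        before = list(order[: order.index(feature)])  # features added earlier
        total += coalition_value(before + [feature]) - coalition_value(before)
    return total / len(orders)

features = ["hip", "shoulder"]
print(shapley("hip", features))        # (4 + 8) / 2 = 6.0
print(shapley("shoulder", features))   # (2 + 6) / 2 = 4.0
```

The two values sum to v(all) minus v(empty) = 10.0, SHAP's efficiency property: the attributions exactly account for the prediction's deviation from the baseline. Libraries such as SHAP approximate this computation efficiently for real networks with many features.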
References
[1] L. Alzubaidi, J. Zhang, A.J. Humaidi, and et al. Review of deep learning: concepts, CNN architectures,
challenges, applications, future directions. Journal of Big Data, 8:53, 2021.
[3] J.S. Arlotti, W.O. Carroll, Y. Afifi, P. Talegaonkar, L. Albuquerque, R.F.B. V, J.E. Ball, H. Chander,
and A. Petway. Benefits of imu-based wearables in sports medicine: Narrative review. International
Journal of Kinesiology and Sports Science, 10(1):36–43, 2022. [Online; accessed 19-May-2023].
[4] Dillon Bowen and Lyle Ungar. Generalized shap: Generating multiple types of explanations in machine
learning, 2020.
[5] Boxing Science. Punch force - the science behind the punch, 2014. [Online; accessed 14-May-2024].
[6] Z. Cao, G. Hidalgo, T. Simon, S. Wei, and Y. Sheikh. OpenPose: Realtime multi-person 2D
pose estimation using part affinity fields. IEEE Transactions on Pattern Analysis & Machine
Intelligence, 43(01):172–186, Jan 2021.
[7] Lucas Cinelli, Gabriel Chaves, and Markus Lima. Vessel classification through convolutional neural
networks using passive sonar spectrogram images. 05 2018.
[9] DataRobot AI Platform. Introduction to loss functions, n.d. [Online; accessed 14-May-2024].
[11] R. Gajbhiye, S. Jarag, P. Gaikwad, and S. Koparde. Ai human pose estimation: Yoga pose detection and
correction. International Journal of Innovative Science and Research Technology, 7(5), 2022. [Online;
accessed 14-May-2024].
[12] GeeksforGeeks. Introduction to recurrent neural network - geeksforgeeks, 2018. [Online; accessed 14-
May-2024].
[13] Google for Developers. Pose landmark detection guide — mediapipe. https://developers.
google.com/mediapipe/solutions/vision/pose_landmarker#pose_landmarker_model. Accessed: 14
May 2024.
[14] Robert I. Hamilton and Panagiotis N. Papadopoulos. Using shap values and machine learning to
understand trends in the transient stability limit. IEEE Transactions on Power Systems, 39(1):1384–
1397, 2024.
[16] IBM. What are neural networks?, 2023. [Online; accessed 14-May-2024].
[17] Rohit Josyula and Sarah Ostadabbas. A review on human pose estimation, 2021.
[18] Amrutha K, Prabu P, and Joy Paulose. Human body pose estimation and applications. In 2021
Innovations in Power and Advanced Computing Technologies (i-PACT), pages 1–6, 2021.
[19] W Ben Kibler and Timothy J Chandler. Sport-specific conditioning. The American Journal of Sports
Medicine, 23(3):472–479, 1995.
[20] W. Kim, J. Sung, D. Saakes, C. Huang, and S. Xiong. Ergonomic postural assessment using a new open-
source human pose estimation technology (openpose). International Journal of Industrial Ergonomics,
84:103164, 2021.
[21] Abhinav Lalwani, Aman Saraiya, Apoorv Singh, Aditya Jain, and Tirtharaj Dash. Machine learning in
sports: A case study on using explainable models for predicting outcomes of volleyball matches, 2022.
[22] Michael Lapinski, Carolina Brum Medeiros, Donna Moxley Scarborough, Eric Berkson, Thomas J.
Gill, Thomas Kepple, and Joseph A. Paradiso. A wide-range, wireless wearable inertial motion sensing
system for capturing fast athletic biomechanics in overhead pitching. Sensors, 19(17), 2019.
[23] Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. Deep learning. Nature, 521(7553):436–444, 2015.
[24] Jung B Lee, Rory B Mellifont, and Brendan J Burkett. The use of a single inertial sensor to identify
stride, step, and stance durations of running gait. Journal of Science and Medicine in Sport, 13(2):270–
273, 2010.
[25] Seth Lenetsky, Matt Brughelli, Roy J. Nates, J.G. Neville, Matt R. Cross, and Anna V. Lormier.
Defining the phases of boxing punches: A mixed-method approach. Journal of Strength and
Conditioning Research, 34(4):1040–1051, April 2020.
[26] Seth Lenetsky, Nigel K Harris, and Matt Brughelli. Assessment and contributors of punching forces in
combat sports athletes: Implications for strength and conditioning. Strength and Conditioning Journal,
35(2):1–7, 2013.
[27] P. Linardatos, V. Papastefanopoulos, and S. Kotsiantis. Explainable ai: A review of machine learning
interpretability methods. Entropy, 23(1):18, 2020. [Online; accessed 14-May-2024].
[28] Pantelis Linardatos, Vasilis Papastefanopoulos, and Sotiris Kotsiantis. Explainable ai: A review of
machine learning interpretability methods. Entropy, 23(1), 2021.
[29] M. Mishra. Convolutional neural networks, explained, 2020. [Online; accessed 14-May-2024].
[30] Carol A Putnam. Sequential motions of body segments in striking and throwing skills: descriptions
and explanations. Journal of Biomechanics, 26(Suppl 1):125–135, 1993.
[31] Jin Qiu, Jian Liu, and Yunyi Shen. Computer vision technology based on deep learning. In 2021
IEEE 2nd International Conference on Information Technology, Big Data and Artificial Intelligence
(ICIBA), volume 2, pages 1126–1130, 2021.
[32] L. Sigal. Human pose estimation. In Springer eBooks, pages 573–592. 2021.
[33] J. Silver and T. Huffman. Baseball predictions and strategies using explainable ai, n.d. [Online; accessed
14-May-2024].
[34] Jan Stenum, Kendra M. Cherry-Allen, Connor O. Pyles, Rachel D. Reetzke, Michael F. Vignos, and
Ryan T. Roemmich. Applications of pose estimation in human health and performance across the
lifespan. Sensors, 21(21), 2021.
[35] Jacopo Tosi, Fabrizio Taffoni, Marco Santacatterina, Roberto Sannino, and Domenico Formica.
Performance evaluation of bluetooth low energy: A systematic review. Sensors, 17(12), 2017.
[36] William C Whiting and Ronald F Zernicke. Biomechanics of Musculoskeletal Injury. Human Kinetics,
2008.
[37] www.ibm.com. What are recurrent neural networks? — ibm, 2024. [Online; accessed 15-May-2024].
[38] www.superdatascience.com. Recurrent neural networks (rnn): The vanishing gradient problem, n.d.
[Online; accessed 14-May-2024].
[39] Y. Yang, Y. Yuan, Z. Han, and G. Liu. Interpretability analysis for thermal sensation machine learning
models: An exploration based on the shap approach. Indoor Air, 2022.
[40] Robail Yasrab and Michael Pound. Phenomnet: Bridging phenotype-genotype gap: A cnn-lstm based
automatic plant root anatomization system, 05 2020.
[41] Yong Yu, Xiaosheng Si, Changhua Hu, and Jianxun Zhang. A Review of Recurrent Neural Networks:
LSTM Cells and Network Architectures. Neural Computation, 31(7):1235–1270, 07 2019.
[42] C. Zheng, W. Wu, C. Chen, T. Yang, S. Zhu, J. Shen, Nasser Kehtarnavaz, and M. Shah. Deep
learning-based human pose estimation: A survey. ACM Computing Surveys, 2023.