
CSC8498

Enhancing Boxing Techniques Through


Explainable AI: A SHAP Analysis of Sensor
Data and Pose Estimation

Jack Laing
School of Computing Science, Newcastle University, UK

Abstract
This project aims to use a combination of sensor data collected from the IMU of an Arduino and body
landmark positions detected using MediaPipe’s pose estimation to train a neural network capable of
predicting punch acceleration. Using SHAP analysis, we identify which aspects of a boxer’s position
whilst punching most significantly affect the punch’s speed/acceleration. This information can then be
used to give insights for a boxer’s training, with the aim of enhancing their technique.

Keywords: Pose Estimation, Arduino, Machine Learning, SHAP Analysis, Punch Acceleration

1 Introduction
1.1 Project Statement
The main aim of this project is to identify an aspect of a boxer’s punching technique
that either increases or decreases their punch speed in order to serve as a proof of
concept in using sensors for enhanced athletic training. This is achieved through
applying SHAP analysis to a model capable of predicting punch acceleration based
on the changes in body position over time. Once a successful model is trained,
SHAP analysis determines the input that most affects the predicted acceleration,
therefore indicating which parts of a boxer’s position are most important when
throwing a punch.
We aim to pinpoint areas that are important to the kinetic chain [5]. This
includes taking a look at hip rotation and the swapping of shoulder locations (re-
tracting one to throw the other forward) with the aim of showing through various
data analytics the impact these have on a punch’s acceleration, thereby permitting
a tailored training experience surrounding these findings [30].

1.2 Motivation
Demonstrating the benefits of sensor-based training when combined with AI can
help pave the way for future developments and help guide future research. By
proving the validity of data-driven insights in sensor-based training, we can help
encourage investments into such tech, allowing for improved training regimes in
many athletic fields. This project does not only aim to improve the efficiency and
effectiveness of a punch in boxing, but aims to demonstrate how we can revolutionize
athletic training methods in all sports and even better our base understanding of
biomechanics [22] [3].

1.3 Overview of Key Concepts


This section provides a brief overview of concepts discussed and used throughout
this project.

1.3.1 Arduino Nano 33 BLE Sense


This project requires a sensor to be attached to a boxing glove in order to record
acceleration and gyroscopic values. The Arduino Nano 33 BLE Sense is a microcon-
troller board that is equipped with the tech capable of recording this data, boasting
a wide range of features:
• Bluetooth Low Energy - 2.4GHz Bluetooth5 Module enabling wireless data
transfer.
• Inertial Measurement Unit - LSM9DS1 IMU possessing a 3D accelerometer,
gyroscope, and magnetometer [15].
• Temperature & Humidity Sensor - The HTS221 Module can measure tem-
perature and humidity to a high degree of accuracy.
• Barometric Pressure Sensor - The LPS22HB allows for altitude estimation
through the detection of barometric pressure.

• Proximity & Gesture Detection - The APDS9960 provides gesture detection and proximity sensing.
• Digital Microphone - The MP34DT05 is an omnidirectional digital microphone
capable of capturing and analysing sound.
Although not all features are utilised in this project, it is clear that the Arduino
is very robust and can be used in many applications. In this instance, it is used to
capture the accelerations and angles of the boxer’s hand throughout a punch to be
used later in training a machine learning model [2].

1.3.2 Deep Learning


Deep learning is an increasingly popular form of machine learning due to its ability to
process large quantities of data using multiple layers of interconnected neurons (nodes).
It allows for advanced data predictions through successive rounds of optimisation. This
approach enables complex tasks like speech recognition, computer vision, and, most
relevantly to our project, pose estimation [18][13].
Below a typical deep learning model or "Neural Network" can be seen (sometimes
referred to as a Deep NN or Feedforward NN).

Fig. 1. NN architecture [7]

Neural networks are composed of three core sections:
• Input Layer - As the name suggests, this is the area of the NN that receives the
raw data to be processed.
• Hidden Layers - The hidden layers are the intermediary stage between the
initial input data and the final prediction/output. The data progresses through
to these layers where patterns and ’features’ can be extracted [16] by neurons
with various weightings that get adjusted throughout training.
• Output Layer - This is the final layer of the NN where the final prediction is
generated and outputted.
Not only are neural networks defined by their structure but also by the processes
that allow them to learn from raw data. These include:

• Forward Propagation - As the name suggests, this phase involves the transmission
of data from the input layer through the network. The network processes the data by
applying weights, biases, and activation functions (to introduce non-linearity) until it
reaches the output layer, which yields the predicted value based on the current weights
within the system.
• Back Propagation - This process derives the gradient of the loss function with
respect to each weight in the network, working backwards from the output layer using
the chain rule, so that the weights can be adjusted to reduce the prediction error.
• Gradient Descent - This is the optimisation algorithm that uses the gradients
produced by backpropagation to minimise the loss value, adjusting each weight a small
step in the direction that reduces the loss, so that the network becomes capable of
predicting the ground truth data from raw input data [9].
• Activation Functions - These functions are applied to the output of a given
neuron to introduce non-linearity. This allows the network to learn more complex
relationships between input data and ground truth data.
Non-linearity is an essential part of various deep learning models as real-life
scenarios and data are often too complex to be modelled using only linear operations.
Without non-linearity, models would only be able to solve problems where the data
points are linearly separable, meaning they can be divided by a straight line. A common
example of such a function is ReLU, which passes through only positive values and
outputs zero for the rest.
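To make this concrete, below is a minimal Python sketch of a single forward pass through one hidden layer with ReLU. All weights and inputs here are illustrative values, not taken from our model:

```python
# Minimal sketch of one forward pass through a single hidden layer.
# Weights and inputs are illustrative values, not from the project.

def relu(x):
    # ReLU passes only positive values, introducing non-linearity
    return [max(0.0, v) for v in x]

def dense(inputs, weights, biases):
    # Each neuron computes a weighted sum of the inputs plus a bias
    return [sum(w * x for w, x in zip(row, inputs)) + b
            for row, b in zip(weights, biases)]

inputs = [0.5, -1.0, 2.0]                          # raw input features
W_hidden = [[0.2, -0.4, 0.1], [0.7, 0.3, -0.5]]    # one row per neuron
b_hidden = [0.1, -0.2]

hidden = relu(dense(inputs, W_hidden, b_hidden))
print(hidden)
```

The second neuron's weighted sum is negative, so ReLU zeroes it out, illustrating how the non-linearity gates information flow.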

1.3.3 RNNs & LSTM


Recurrent Neural Networks (RNNs) are a variation of the standard Feed-Forward
NNs discussed beforehand with one primary difference: the ability to store an internal
state (memory) [37]. This allows them to process sequences of data by incorporating
loops in their architecture. The difference is highlighted in the below
diagram:

Fig. 2. RNN vs FNN [12]

This allows them to excel at processing time series data as they can consider

previous inputs when making their predictions [37]. This makes them an essential part
of our project, given our requirement to consider the positions of body landmarks [18]
over a time period.
However, RNNs are not without issues. A primary one is the vanishing gradient
[38], which occurs when a gradient shrinks exponentially as it is propagated back
through the RNN’s connections through time. Similarly, the opposite can occur with
exploding gradients for the same reason [10].
Due to the prominence of these issues, the Long Short-Term Memory (LSTM)
model was created to combat these challenges faced by the standard RNN.

Fig. 3. RNN vs LSTM [40]

Shown above is the subtle yet important difference between the standard RNN
and its more advanced successor the LSTM. The LSTM combats the exploding and
vanishing gradient challenges by implementing gates that can regulate the flow of
information. These gates are known as:
• Input Gate - Determines which values from the input data are important enough to
be written into the cell state, discarding those that are not.
• Forget Gate - Allows the model to discard previously saved input data if deemed
no longer useful.
• Output Gate - Controls the flow of information from cell state to hidden state
all the way to the output.
[37]
This improved maintenance of the gradient allows the LSTM to consider long spans
of data, making it well suited to time-sequence tasks without suffering from the
exploding or vanishing gradient [10][38].
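The gate mechanics described above can be sketched for a single scalar LSTM cell as follows (the weights are illustrative values, not trained parameters):

```python
import math

# Sketch of a single LSTM cell step with scalar state, showing the three
# gates described above. Weights are illustrative, not trained values.

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def lstm_step(x, h_prev, c_prev, w):
    # Each gate is a sigmoid of a weighted combination of the current
    # input and the previous hidden state.
    i = sigmoid(w["wi"] * x + w["ui"] * h_prev)    # input gate
    f = sigmoid(w["wf"] * x + w["uf"] * h_prev)    # forget gate
    o = sigmoid(w["wo"] * x + w["uo"] * h_prev)    # output gate
    g = math.tanh(w["wg"] * x + w["ug"] * h_prev)  # candidate cell value
    c = f * c_prev + i * g                         # gated cell state update
    h = o * math.tanh(c)                           # hidden state / output
    return h, c

weights = {"wi": 0.5, "ui": 0.1, "wf": 0.4, "uf": 0.2,
           "wo": 0.6, "uo": 0.1, "wg": 0.9, "ug": 0.3}

h, c = 0.0, 0.0
for x in [0.2, 0.8, -0.1]:         # a short input sequence
    h, c = lstm_step(x, h, c, weights)
print(h, c)
```

Because the cell state c is updated additively through the forget and input gates, gradients flow through it far more stably than through the repeated matrix multiplications of a plain RNN.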

1.3.4 CNN
A convolutional neural network is a type of neural network that specialises in com-
puter vision. The reason for its success in computer vision is that it learns to extract
features directly from raw pixel data by applying kernels. This extraction preserves
the spatial relationships of the original image, which helps the network interpret the
image accurately. This ability is what allows the CNN to

recognise things like patterns and objects within images. Effectively, it analyses
smaller portions of the image in order to make final predictions without losing the
context of the overall image.
Shown below is the architecture of a typical CNN:

Fig. 4. CNN architecture [29]

As illustrated in the diagram there are more than just convolution layers within
a CNN which are explained below:
• Convolution Layer - This is the core building block of the CNN. It operates by
sliding kernels (small matrices of learned values) across the pixel data and computing
the dot product at each position; together these results generate the activation map
that represents the localised features of the input image.
• Activation Function - An activation function is applied to the output of a
convolution layer or feature map and the aim of this function is to introduce
non-linearity to help guide the system in learning more complex relationships.
• Pooling Layer - The pooling layer, also known as subsampling or downsampling,
helps to reduce the spatial size of the feature map and therefore the computational
power. It also can help make the features invariant to alterations in their scale
and orientation.
• Fully Connected Layer - The fully connected layers are essentially a final
feedforward neural network in which the extracted feature maps are flattened
and fed into it so that the relationships between features and the ground truth
data can be formed.
[32]
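The convolution and pooling operations above can be illustrated on a 1-D signal (kernel values here are hand-picked for illustration; in a real CNN they are learned during training):

```python
# Sketch of convolution and pooling on a 1-D signal (the project's model
# uses Conv1D over pose features). Kernel values are illustrative.

def conv1d(signal, kernel):
    k = len(kernel)
    # Slide the kernel over the signal and take dot products ("valid" mode)
    return [sum(kernel[j] * signal[i + j] for j in range(k))
            for i in range(len(signal) - k + 1)]

def max_pool(feature_map, size=2):
    # Downsample by keeping the maximum in each non-overlapping window
    return [max(feature_map[i:i + size])
            for i in range(0, len(feature_map) - size + 1, size)]

signal = [1, 3, 2, 5, 4, 1, 0, 2]
kernel = [1, 0, -1]                # a simple edge-detecting kernel

features = conv1d(signal, kernel)  # localised features
pooled = max_pool(features)        # spatially reduced feature map
print(features, pooled)            # [-1, -2, -2, 4, 4, -1] and [-1, 4, 4]
```

Note how pooling halves the feature map while keeping the strongest local responses, which is exactly the spatial-size reduction described for the pooling layer.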
The CNN’s powerful ability to perform feature extraction during training makes pose
estimation technologies possible, and these are a fundamental part of our project,
allowing us to capture the inferred coordinates of various body landmarks [13][6].


2 Background and Related Work


2.1 Pose Estimation Technologies

Pose estimation [18] is a branch of computer vision [31] that aims to locate and
track the positions of body landmarks (shoulders, elbows, hips etc.). It identifies
the coordinates of these landmarks which can be utilised to differentiate various
poses in a given person.
Pose estimation is achieved in 5 key steps:
• Preprocessing - Various methods such as resizing, noise reduction, and normal-
isation can be applied to further enhance the quality and make it more suitable
for pose estimation analysis.
• Feature Extraction - A CNN is used to extract key features from the image to
help in identifying the spatial arrangement of different body landmarks.
• Keypoint Detection - The pose estimation model uses these extracted features
to identify joints such as the elbow or knee.
• Pose Inference - Once identifying key parts of the body, these keypoints are
connected, forming a skeletal structure of the human body.
• Postprocessing - A final stage refines the model’s output. This typically includes
filtering noise or applying constraints to prevent an impossible skeletal structure.
[18]
Pose estimation can accept multiple forms of input, ranging from standard RGB
to depth images [8], and these inputs can be static (images) or dynamic (videos).
For this project, we focus on dynamic RGB 3D pose estimation, allowing us to capture
an inferred depth of body landmarks from a recorded video [39].
While there are plenty of available pose estimation architectures, such as [6], we
will be using Google’s MediaPipe PoseNet due to its ease of use and real-time pose
estimation [13]. During this project, this allows us to record the position of the
boxer’s body for model training without the body-mounted sensors we do not have
access to.

2.2 Pose Estimation Applications

Pose estimation boasts a wide variety of applications. These range from motor
development tracking and clinical use in paediatrics, where the early detection of
neurodevelopmental disorders like cerebral palsy [34] allows doctors to begin helping
patients earlier than ever, to injury risk assessment through the evaluation of abnormal
gait patterns and sports-related injuries [34][20][18].
More relevant to our project, however, is the use of pose estimation to correct
form in activities like yoga [11] and to improve athletic performance in sports [22].
Our project aims to apply these strategies to the world of boxing whilst also helping
to gain a better understanding of biomechanics [22] which we hope will allow for
future investments and development by using our research as a key example in the

benefits and importance of pose estimation and sensor-based training.

2.3 Explainability in Machine Learning Models


Due to the recent growth in popularity of AI and machine learning, coupled with
the lack of clarity around the operations an AI takes to reach its decisions, the
necessity for a method to explain these steps has become apparent [27]. It is this
issue that fuelled the innovation of explainable AI (XAI), which aims to provide
clarity and transparency to users so that confidence can be ensured when using AI
in critical situations [28].
One method that has shown great promise is SHapley Additive exPlanations
(SHAP) [14]. Originating from game theory, Shapley values consider every possible
combination of a range of inputs with the aim of determining which combination has
the largest impact on the output. An example relating to our project may be that
hip rotation has a larger impact on punch acceleration than wrist rotation. SHAP
would be able to identify this fact by evaluating the contributions of each input
through its brute force method, therefore gaining a comprehensive understanding
of which feature most significantly influences the result [4].
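The brute-force computation behind Shapley values can be sketched for a toy linear "punch model" (the model and its coefficients are purely illustrative; in practice the SHAP library approximates this calculation for real networks):

```python
from itertools import combinations
from math import factorial

# Brute-force Shapley values for a toy model with three features.
# The model and its coefficients are illustrative, not the project's model.

def model(features):
    # Toy "punch acceleration" score: hip rotation matters most here
    hip, shoulder, wrist = features
    return 3.0 * hip + 1.5 * shoulder + 0.5 * wrist

def shapley(values, baseline, names):
    n = len(values)
    out = {}
    for i, name in enumerate(names):
        contrib = 0.0
        others = [j for j in range(n) if j != i]
        # Average feature i's marginal contribution over every coalition
        for size in range(n):
            for subset in combinations(others, size):
                weight = factorial(size) * factorial(n - size - 1) / factorial(n)
                with_i = [values[j] if j in subset or j == i else baseline[j]
                          for j in range(n)]
                without = [values[j] if j in subset else baseline[j]
                           for j in range(n)]
                contrib += weight * (model(with_i) - model(without))
        out[name] = contrib
    return out

phi = shapley(values=[1.0, 1.0, 1.0], baseline=[0.0, 0.0, 0.0],
              names=["hip", "shoulder", "wrist"])
print(phi)
```

For this linear toy model the Shapley values recover the coefficients exactly, correctly attributing the largest contribution to hip rotation.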
Explainable AI enhances sports analytics by increasing the trustworthiness of
machine learning models as their predictions are more human-readable and under-
standable. They provide clear insights into which features most influence outcomes.
This means coaches and analysts are able to make better-informed decisions and
optimise strategies leading to an improvement in overall performance. As noted
by Lalwani et al., "Explainable models are necessary for understanding the reason
behind a particular prediction, which helps in building trust and facilitating the
adoption of AI systems in sports analytics" [21]. A similar opinion is shared by
Silver and Huffman: "making the rationale for predictions understandable to human
beings is crucial for confidence in AI-driven decisions in baseball" [33].

2.4 Punch Dynamics and the Kinetic Chain


In order to optimise performance and prevent injuries within combat sports, under-
standing the biomechanics of punching is crucial. Countless studies have shown the
importance of forces and correct movements in executing a punch. A comprehen-
sive overview of the biomechanics relating to musculoskeletal injuries was published
by Whiting and Zernicke [36]. Whilst other studies expose specific contributors
to punching force and offer insights into how to enhance this through the use of
strength and conditioning as explained by Lenetsky et al [26].
Studies performed by Kibler and Chandler [19] as well as Putnam [30] aid in
highlighting the important roles each part of the body plays within the kinetic chain.
The kinetic chain is an extremely important concept in sports science, underpinning
almost all powerful and efficient movements. The use of tools like inertial sensors
to capture detailed movement data has led to technological advancements that enable
more precise analysis of punching techniques [?][22][15].
Our project aims to contribute to these findings by applying explainable AI to
help analyse sensor data and pose estimation, providing further insight into the
biomechanics of punching and allowing coaches to make more informed decisions in their

training regimes.


3 Design and Implementation


This section aims to thoroughly explain the steps taken in order to achieve our
project goals. Our project has various key components, each as important as the
other, with a high-level overview seen in the Design Overview section.

3.1 Design Overview


3.1.1 Data Collection
• Sensor Data - Using the Arduino we capture acceleration and gyroscopic data
from the boxer’s hand during punches.
• Pose Estimation Data - After recording a video of the boxer throwing punches
whilst capturing data using the Arduino, MediaPipe’s pose estimation is applied,
generating the body landmarks’ positions over time.
[41]

3.1.2 Data Preprocessing


• Sensor Data Processing - We remove noise from the raw sensor data as well
as any excess data/timestamps that hold no valuable information (time before
and after the punch takes place) before normalising it.
• Pose Data Processing - The video used with MediaPipe’s pose estimation is
trimmed to ensure that the body landmarks are captured correctly alongside their
corresponding Arduino data.
[22]

3.1.3 Data Synchronisation


To ensure that the pose landmarks directly correspond to the Arduino data, various
steps, including trimming down the original video are taken. This is an essential
step so that the model does not learn any false trends [20].

3.1.4 Feature Engineering


• Sensor Features - In order to provide more detailed and logical insights, features
such as the acceleration magnitude as well as the acceleration and gyroscopic
derivatives are calculated.
• Pose Features - MediaPipe’s pose estimation records key body landmarks before
a CNN extracts key spatial features from the positions of the landmarks.
[34]

3.1.5 Model Development


• LSTM Model for Sensor Data - An LSTM is used to analyse the sequential
data captured by the Arduino.
• CNN Model for Pose Data - A CNN is used to process the pose data generated
by MediaPipe’s pose estimation, extracting the spatial features from the body
landmarks.

• Hybrid Model Integration - Both models are then integrated to form a hy-
brid model, thus leveraging both the sensor and the pose data to predict the
acceleration magnitude.
[11]

3.1.6 Model Training and Evaluation


After successfully setting up the correct model architecture, the model is trained
using both the sensor and the pose data, and its performance is evaluated using
metrics such as mean squared error and mean absolute error. This helps in optimising
the model’s parameters, making it more accurate in predicting the punch acceleration
magnitude [32].

3.1.7 Explainable AI (XAI)


• SHAP Analysis - The final and arguably most important component is the use of
SHAP analysis. This shows which aspect of the punch most significantly contributes
to its acceleration, taking into account both the pose data and the Arduino data.
[4]

3.2 Hardware Setup


This section briefly explains how the hardware was configured and set up to allow
for data capture.

3.2.1 Pose Estimation Hardware


For the pose estimation, the hardware was straightforward. Using a smartphone, I
recorded the punches being thrown at a rate of 50 fps, which first required some
configuration in an app known as BlackMagicCam. It was essential that the video be
recorded at 50 fps so that the pose estimation records the body landmark positions
every 20 ms, for the synchronisation purposes explained in further detail in
section 3.4 [18].

3.2.2 Arduino Hardware


Although the Arduino Nano 33 BLE Sense does offer Bluetooth Low Energy as a
method of data transmission, attempts at setting this up proved difficult and often
led to missing data or incorrect timestamps due to the transmission delay between
the Arduino and the PC. For this reason, a simple and reliable 3-meter micro USB
cable was used instead. This proved very effective, recording up to five times as
much data as its Bluetooth counterpart [35].


Below the final Arduino setup can be seen:

Fig. 5. Hardware Setup

A rather simple yet effective strategy: the Arduino is attached to a standard boxing
glove using black electrician’s tape, and its 3-meter cable allows enough slack to
prevent the punch from being affected by tension in the cable whilst permitting a
higher rate of data transmission than its Bluetooth counterpart [22].

3.3 Data Collection


This section details the process taken to collect both the Arduino and the pose data.

3.3.1 Sensor Data


We use the Arduino Nano 33 BLE Sense to capture acceleration and gyroscopic
values whilst the boxer is punching. The full data collection process can be outlined
as such:
• Sensor Configuration - The Arduino is programmed via the Arduino IDE using an
Arduino sketch, which defines the board’s initial setup as well as its continuous
loop. The sketch initialises the inertial measurement unit, specifically the LSM9DS1
sensor, so that it is ready to record the 3D accelerometer and gyroscope data [23].
• Sampling Rate - For the purpose of synchronisation, the IMU captures data
at a sampling rate of 50 Hertz or every 20 milliseconds, equivalent to that of the
pose estimation which samples at 50 fps. This provides sufficient data throughout
the duration of each punch which will help the model form better relationships
between the inputs and the ground truth data.
• Data Logging & Transmission - The data is logged and stored in a JSON file
by a Python script running on a laptop, which receives the IMU data via the 3-meter
micro USB cable plugged into the Arduino. The script communicates with the Arduino
on the COM3 port using the PySerial library

and then saves the data that it receives [24].


This strategy ensures that data is reliable and consistent throughout the data
collection process.
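As an illustration, the logging step might be sketched as below. The exact serial line format produced by our Arduino sketch is not reproduced here, so a comma-separated "t,ax,ay,az,gx,gy,gz" layout is assumed; in the real script the lines arrive via PySerial (e.g. serial.Serial("COM3").readline()) rather than a hard-coded list:

```python
import json

# Hypothetical sketch of logging IMU samples to JSON. The serial line
# format "t,ax,ay,az,gx,gy,gz" is an assumption for illustration; the
# real script reads such lines from the COM3 port via PySerial.

def parse_imu_line(line):
    t, ax, ay, az, gx, gy, gz = (float(v) for v in line.strip().split(","))
    return {"t_ms": t, "acc": [ax, ay, az], "gyro": [gx, gy, gz]}

raw_lines = ["0,0.02,-0.01,0.98,1.2,-0.4,0.1",
             "20,0.15,0.03,1.10,3.5,-1.1,0.6"]   # two 20 ms samples

samples = [parse_imu_line(l) for l in raw_lines]
print(json.dumps(samples))    # what gets written to the punch's JSON file
```

Writing one JSON file per punch, as described in section 3.1.1, keeps the later manual cleaning step simple.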

3.3.2 Pose Data


A key aspect of the collection of the pose data relies heavily on the quality of video
and requires various precautions to be taken such as:
• Environment Setup - To provide the best quality data for pose estimation, a
proper environment must be created. This means minimising background clutter and
ensuring good lighting conditions, as clear video frames are essential for precise
readings. We achieve this by recording the video against a plain white wall so
that there is no unnecessary background noise and the boxer is always the focal
point of the video.
• Camera Positioning - The actual positioning of the camera is also just as
important as the environment it is recording. It is set at an optimal position
and angle to ensure that every aspect of the boxer’s movements can be seen
throughout the entire punch movement.
• Recording Specifications - To aid the synchronisation of the Arduino and pose
data, it is essential that they record at equal intervals, which in this case is
50 fps, i.e. one frame every 20 milliseconds [18].
An example frame showcasing the environment and camera position with pose
estimation applied to it can be seen below:

Fig. 6. Pose estimation Camera Positioning

The Python script that applies MediaPipe’s pose estimation also stores the body
landmark positions in a JSON file, similar to the Arduino data, alongside an elapsed
time that increases in intervals of 20 milliseconds [8].

3.4 Data Preprocessing and Synchronisation


This section highlights the key steps taken to preprocess the data making it more
suitable and ready for a deep learning model as well as ensure both the pose and
Arduino data are in sync with one another.

3.4.1 Synchronisation
The reason for using a 20 millisecond interval is simply ease of synchronisation.
The Arduino is only capable of timestamping to the nearest millisecond, so if it
had to sync with a 60 fps video, where each frame occurs every 16.66 ms (1/60 s),
the best it could do would be 17 ms. Although this is a small difference for any
single frame, the error accumulates over time, eventually leaving the data
completely out of sync.
This is not acceptable, as synchronisation is essential to ensure that the Arduino
data corresponds directly to the pose data. Without synchronisation, the model is
likely, if not certain, to learn false trends, ruining any chance of training a
precise and effective model. It is therefore imperative that both datasets are
synchronised to provide a reliable basis for training from which useful insights
may be drawn [7].
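The accumulation of this rounding error can be verified with a quick calculation:

```python
# Why 50 fps: with whole-millisecond timestamps, rounding a 60 fps frame
# interval (16.66 ms -> 17 ms) accumulates drift, while 50 fps (20 ms)
# divides evenly into milliseconds and stays exact.

def drift_after(frames, true_interval_ms, rounded_interval_ms):
    # Total timestamp error after a given number of frames
    return frames * rounded_interval_ms - frames * true_interval_ms

# After 30 seconds of video:
drift_60fps = drift_after(30 * 60, 1000 / 60, 17)   # ~600 ms out of sync
drift_50fps = drift_after(30 * 50, 1000 / 50, 20)   # exactly 0
print(drift_60fps, drift_50fps)
```

At 60 fps the data would be over half a second out of sync after just 30 seconds, which is far longer than a punch and would pair landmarks with entirely the wrong sensor samples.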
These measures were taken in order to help ensure synchronisation throughout:
• Arduino - When the Arduino starts up, it runs a three-second countdown indicated
by a blinking orange light, which then turns solid orange to show that data capturing
has begun. This provides a very clear and simple way of indicating exactly when the
data capture procedure begins [15].
• Video - Utilising the countdown put in place within the Arduino sketch, we simply
record this countdown in the untrimmed video and later trim the footage using
software such as Clipchamp, so that data capture occurs at the very beginning of
the video. This ensures that the pose and Arduino data correlate directly with one
another.
[1]

3.4.2 Data Preprocessing - Sensor Data


Data preprocessing is essential before feeding into a deep learning model. It ensures
the quality and the consistency of the data making for more reliable and consistent
training results. Shown below are the processes applied to the sensor data for
preprocessing:
• Data Cleaning - Once the Arduino data had been recorded and stored in a JSON
file, the excess data before and after the punch, which held no meaningful or
valuable information, had to be removed. This was done manually by deleting lines
within the JSON file where there was little to no movement. Because punches are
recorded in separate JSON files, as explained in section 3.1.1, this was fairly
simple, as unnecessary data appeared as values less than one in almost all cases.
• Noise Reduction - A Butterworth low-pass filter was used to remove high-frequency
noise generated by the sensor. This filter has a flat frequency response in the
passband, which makes it ideal for this purpose: it retains the important
low-frequency components while strongly attenuating the high-frequency noise.
• Normalisation - Each punch’s data is normalised to a common scale, ensuring
uniformity across samples. This involves scaling the sensor values to the range
[0, 1], which reduces the variability due to differing scales that can negatively
impact the model’s training and assign unwarranted weight to essentially arbitrary
inputs [29].
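A minimal sketch of these two numeric steps follows. Note that the project used a Butterworth filter (typically scipy.signal.butter with filtfilt); a first-order exponential low-pass stands in here so the example stays dependency-free:

```python
# Sketch of the two numeric preprocessing steps. A first-order exponential
# low-pass stands in for the Butterworth filter used in the project
# (scipy.signal.butter/filtfilt would be the usual tools).

def low_pass(samples, alpha=0.3):
    # Smaller alpha -> stronger smoothing of high-frequency noise
    out, prev = [], samples[0]
    for s in samples:
        prev = alpha * s + (1 - alpha) * prev
        out.append(prev)
    return out

def min_max_scale(samples):
    # Scale one punch's values into the range [0, 1]
    lo, hi = min(samples), max(samples)
    return [(s - lo) / (hi - lo) for s in samples]

noisy = [0.0, 2.0, 0.5, 3.0, 2.5, 9.0, 8.5, 1.0]  # illustrative readings
scaled = min_max_scale(low_pass(noisy))
print(scaled)
```

Scaling each punch independently means every sample spans the full [0, 1] range, so differences in overall punch strength do not dominate the learned weights.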

3.4.3 Data Preprocessing - Pose Data


Shown below are the processes applied to the pose data for preprocessing:
• Trimming - Once the video that will be used to generate the pose data has been
recorded, we trim the excess footage at the start, before the Arduino’s programmed
countdown (which begins on setup) has finished and data capture has started. This
ensures that the pose data begins at the same time as the Arduino data, so that
the two directly correlate with one another [3].
• Normalisation - Pose data is normalised in a similar way to the Arduino data,
except that the X and Y values are normalised separately: the X coordinate is
divided by the frame width and the Y coordinate by the frame height, scaling both
coordinates into the range [0, 1].
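A sketch of this coordinate normalisation (the frame dimensions and landmark position are illustrative values):

```python
# Normalising pixel landmark coordinates by the frame dimensions, as
# described above. Frame size and landmark values are illustrative.

FRAME_W, FRAME_H = 1920, 1080

def normalise_landmark(x_px, y_px):
    # X divided by frame width, Y by frame height -> both in [0, 1]
    return x_px / FRAME_W, y_px / FRAME_H

x, y = normalise_landmark(960, 270)
print(x, y)    # 0.5, 0.25 for a landmark at pixel (960, 270)
```

This makes the coordinates independent of the recording resolution, so the model sees the same scale regardless of the camera used.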

3.5 Feature Engineering


Feature engineering is a vital part of this project, as it turns the Arduino data into
more valuable information allowing for more complex and logical relationships to
be formed by the deep learning model. These engineered features include:

3.5.1 Acceleration Magnitude


• Definition - This measures the overall acceleration of the punch by combining
the x, y, and z accelerations.
• Purpose - Gives a single, easily perceived value for the punch’s acceleration,
where higher values indicate higher overall speed; this is used as the ground
truth data [25].

3.5.2 Gyroscopic Magnitude


• Definition - This measures the overall rotation of the punch by combining the
x, y, and z rotational speeds.
• Purpose - Gives a single, easily perceived value for the punch’s rotation, where
higher values indicate a greater rotation [42].

3.5.3 Acceleration Derivatives


• Definition - These measurements indicate how the acceleration changes over
time in the x, y, and z directions separately.

• Purpose - They identify how acceleration changes throughout the punch, reveal-
ing key details that are important for understanding a punch’s dynamics [35].

3.5.4 Gyroscopic Derivatives


• Definition - These measurements indicate how the rotational speed changes over
time in the x, y, and z directions separately.
• Purpose - They identify how the rotation of the wrist changes throughout the
punch, revealing key details that are important for understanding a punch’s ro-
tational dynamics [26].
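The magnitude and derivative features above can be sketched as follows (the sample values are illustrative; dt matches the 20 ms sampling interval):

```python
import math

# Sketch of the engineered features: the magnitude of a 3-axis reading
# and the per-axis derivative between consecutive 20 ms samples.

def magnitude(x, y, z):
    # Overall size of the acceleration (or rotation) vector
    return math.sqrt(x * x + y * y + z * z)

def derivatives(series, dt=0.02):
    # Rate of change between consecutive samples (dt = 20 ms)
    return [(b - a) / dt for a, b in zip(series, series[1:])]

acc_x = [0.0, 0.4, 1.2, 2.0]        # one axis over four samples
print(magnitude(3.0, 4.0, 12.0))    # 13.0 for a 3-4-12 vector
print(derivatives(acc_x))
```

The same two functions apply unchanged to the gyroscope readings, yielding the gyroscopic magnitude and gyroscopic derivatives.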

3.5.5 Body Landmarks


• Definition - Body landmark positions, such as the hips and shoulders, are
generated by MediaPipe's pose estimation.
• Purpose - These landmarks provide spatial information about the body's
posture during the punch; analysing these positions helps us understand the
biomechanics of the punch [18].

3.5.6 CNN Extracted Features


• Definition - A CNN processes the pose data, extracting higher-level spatial
features from the body landmark positions.
• Purpose - These features help identify patterns and relationships between the
body landmark inputs and the ground truth data, allowing us to accurately model
the punching motion [18].

3.6 Model Development


Due to the complex nature of this project and the variation in forms of data, a
hybrid model combining long short-term memory (LSTM) networks for the sensor
data and convolutional neural networks (CNN) for the pose data was required in
order to predict punch acceleration based on the biomechanics of the boxer.


Shown below is the model architecture, followed by a detailed overview of the
model's design:

Layer Type     Name              Output Shape       Connected to
InputLayer     pose_input        (None, 16, 101)
Conv1D         conv1d_6          (None, 14, 64)     pose_input
InputLayer     arduino_input     (None, 16, 7)
MaxPooling1D   max_pooling1d_6   (None, 7, 64)      conv1d_6
LSTM           lstm_6            (None, 64)         arduino_input
Flatten        flatten_6         (None, 448)        max_pooling1d_6
Dense          dense_24          (None, 32)         lstm_6
Dense          dense_25          (None, 32)         flatten_6
Concatenate    concatenate_6     (None, 64)         dense_24, dense_25
Dense          dense_26          (None, 64)         concatenate_6
Dropout        dropout_6         (None, 64)         dense_26
Dense          dense_27          (None, 1)          dropout_6

Table 1
Model Architecture [18]
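The architecture in Table 1 could be reproduced with the Keras functional API along these lines (a sketch: the input shapes, kernel size, unit counts, and dropout rate follow the table and the descriptions in this section, while the hidden-layer activation functions are assumptions):

```python
from tensorflow.keras import layers, Model

# Pose (CNN) branch: 16 time steps x 101 landmark features
pose_in = layers.Input(shape=(16, 101), name="pose_input")
x = layers.Conv1D(64, kernel_size=3, activation="relu")(pose_in)  # -> (14, 64)
x = layers.MaxPooling1D(pool_size=2)(x)                           # -> (7, 64)
x = layers.Flatten()(x)                                           # -> (448,)
x = layers.Dense(32, activation="relu")(x)

# Sensor (LSTM) branch: 16 time steps x 7 engineered IMU features
arduino_in = layers.Input(shape=(16, 7), name="arduino_input")
y = layers.LSTM(64)(arduino_in)
y = layers.Dense(32, activation="relu")(y)

# Hybrid head: merge both branches and regress the acceleration magnitude
z = layers.Concatenate()([y, x])           # 32 + 32 -> 64
z = layers.Dense(64, activation="relu")(z)
z = layers.Dropout(0.5)(z)                 # half the units dropped in training
out = layers.Dense(1, activation="linear")(z)

model = Model(inputs=[arduino_in, pose_in], outputs=out)
```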

3.6.1 LSTM
The LSTM branch processes the sequential data captured by the Arduino, namely
the acceleration and gyroscopic readings. It captures the temporal dependencies
within the sensor data, which are crucial for understanding punch dynamics.
It consists of these layers:
• Input Layer - The input shape is (None, 16, 7), where 16 is the number of
time steps and 7 is the number of features (acceleration magnitude etc.).
• LSTM Layer - This contains 64 units and is responsible for capturing temporal
dependencies [17].
• Dense Layer - A dense layer with 32 units follows the LSTM layer; it processes
the extracted features further and helps build relationships between the inputs
and the ground truth data [16].

3.6.2 CNN
The CNN is applied to the body landmark positions generated by MediaPipe's
pose estimation; it recognises patterns within the spatial arrangement of the
body landmark positions.
It consists of these layers:
• Input Layer - The input shape is (None, 16, 101), where 16 is the number of
time steps and 101 is the number of features (all body landmark positions and
their visibility scores).
• Convolutional Layer - A Conv1D layer with 64 filters and a kernel size of 3
is used to extract local spatial features from the pose data [31].
• MaxPooling Layer - This layer retains the most important features while
reducing the spatial dimensions of the feature map produced by the convolutional
layer.
• Flatten Layer - This layer flattens the MaxPooling layer's output so that it
is suitable for the final dense layer.
• Dense Layer - A dense layer with 32 units follows the Flatten layer; it
processes the extracted features further and helps build relationships between
the inputs and the ground truth data [31].

3.6.3 Hybrid Integration


Here the temporal and spatial features produced by the LSTM and the CNN
respectively are combined to predict the punch acceleration magnitude.
This is constructed with these layers:
• Concatenate Layer - This generates a combined feature vector from the outputs
of the dense layers of the LSTM and CNN branches.
• Dense Layer - A dense layer with 64 units further processes the combined
feature vector, helping the network recognise additional relationships and
patterns between the inputs and the ground truth data.
• Dropout Layer - The dropout layer randomly deactivates half of the units
during training, forcing the network to learn more robust representations and
helping to prevent overfitting.
• Output Layer - A single unit with a linear activation function produces the
final prediction for the punch acceleration magnitude [23].

3.7 Model Training and Evaluation

The model used both the sensor data and the pose data during training, with its
performance evaluated on a separate validation set. Training involved splitting
the data into two sets: 80% for training and 20% for validation [31].
The model possessed the following hyperparameters:

• Learning Rate - A learning rate of 0.001 was used to balance quick convergence
with stable training.
• Epochs - The model was trained for 50 epochs, allowing sufficient iterations
for the model to form relationships between the inputs and the ground truth
data [42].
Evaluation is key to the model's ability to predict the acceleration magnitude
successfully, as it provides the loss value that the network aims to reduce via
backpropagation, as explained in Section 1.3.2.
The model utilised the following metrics:
• Mean Squared Error (MSE) - Squares the errors between predicted and actual
values and averages them, penalising larger errors more heavily than small ones.
• Mean Absolute Error (MAE) - Averages the absolute errors between predicted
and actual values, providing a clear measure of the error's magnitude [23].
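As an illustration of this training setup, the sketch below applies the same 80/20 split, learning rate, epoch count, and MSE/MAE metrics to a small stand-in model on synthetic data (the project itself would train the hybrid model on the collected punch data; all data and names here are illustrative):

```python
import numpy as np
from tensorflow.keras import layers, models, optimizers

# Hypothetical stand-in data: in the project this would be the paired
# sensor/pose sequences and the acceleration-magnitude ground truth.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 16, 7)).astype("float32")
y = np.linalg.norm(X[:, -1, :3], axis=1).astype("float32")  # toy target

# 80/20 train/validation split
split = int(0.8 * len(X))
X_train, X_val = X[:split], X[split:]
y_train, y_val = y[:split], y[split:]

model = models.Sequential([
    layers.Input(shape=(16, 7)),
    layers.LSTM(16),
    layers.Dense(1, activation="linear"),
])

# Learning rate 0.001, MSE loss, with MAE tracked as an additional metric
model.compile(optimizer=optimizers.Adam(learning_rate=0.001),
              loss="mse", metrics=["mae"])

history = model.fit(X_train, y_train,
                    validation_data=(X_val, y_val),
                    epochs=50, verbose=0)
```

The `history` object records the per-epoch training and validation losses, which is what the training curve in Fig. 7 is plotted from.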

Fig. 7. Model Training [16]

The graph shown above represents the validation and training loss of the model
over its 50-epoch training period, from which several main observations can be made:
• Initial Training Phase - Both the training and validation loss decline steeply
within the first few epochs, indicating that the model is learning from the
data and generalising well to unseen data during the early stages of training.
• Stabilization - After the steep initial decline, both losses level off around
epoch 10; the validation loss reaches a low point and remains stable, while the
training loss continues to fluctuate slightly.
• Convergence - At the end of training, we can see a clear convergence: the
validation loss has remained low and stable, indicating the model has converged
and is not overfitting. Only a small gap remains between the training and
validation losses, which further suggests good generalisation [24].
These are very positive results, indicating that the model has effectively
learned the underlying patterns in the data, including the biomechanics of the
punches, which is what we aimed to achieve. These insights validate the model's
architecture and training process and indicate that the model is a valid
solution for predicting punch acceleration magnitude from body landmark
positions [26].


4 Explainable AI (XAI)
Explainable AI (XAI) is a critical aspect of this project, as it provides
insight into the predictive model's decision process, allowing us to interpret
the model and understand the influence each input has on the punch acceleration
magnitude. This information can then be used within training regimes to target
specific aspects of a boxer's technique [14].

Fig. 8. SHAP Analysis Summary plot [27]

Shown above is a summary plot visualising the SHAP values for each feature.
This indicates not only how much each feature contributes to the prediction but
also which features are more important than others. The summary plot focuses on
landmarks directly related to the kinetic chain and has provided several key
insights into the biomechanics of a punch [28]:

4.1 Landmark11 x (Left Shoulder X-coordinate)


This feature shows a significant impact on the model's predictions, with high
SHAP values. This indicates that the lateral position of the left shoulder is a
key factor in predicting punch acceleration: a more extended position correlates
with greater rotational force (at the hips), which is likely why it contributes
to a higher punch acceleration [26].

4.2 Landmark24 x (Left Hip X-coordinate)


Like the left shoulder, the X coordinate of the left hip is a vital part of
predicting punch acceleration. The lateral movement of the left hip greatly
impacts the acceleration, suggesting that hip rotation is an integral part of
the kinetic chain and overall punch biomechanics, a finding backed by multiple
scientific papers [30].

4.3 Landmark11 y (Left Shoulder Y-coordinate)


The vertical position of the left shoulder holds considerable influence over
the punch acceleration magnitude. This is likely because variations in the
shoulder's height affect the arm's leverage and therefore its ability to
generate punching power [25].

4.4 Landmark23 x (Right Hip X-coordinate)


The right hip's lateral position also impacts punch acceleration significantly,
indicating that the alignment and movement of both hips are essential for
generating power and speed in a boxing punch. Once again, this is backed by
scientific research, validating the use of explainable AI [20].

4.5 Landmark12 y (Right Shoulder Y-coordinate)


I suspect the same reasons that the left shoulder's Y coordinate, or height,
contributes significantly to punch acceleration also apply to the right
shoulder, indicating that the use of both shoulders is extremely important when
throwing a punch [18].

4.6 Landmark12 x (Right Shoulder X-coordinate)


Once more, the right shoulder's X coordinate demonstrates the importance of the
shoulders' lateral movements and cements the idea that both shoulders are
important throughout the entire duration of the punch [28].

4.7 Insights and Applications


These conclusions show that SHAP analysis can generate actionable insights for
boxing and potentially other sports. Coaches can use these insights to create
tailored training regimes that optimise the positioning and movement of the key
contributors to punch acceleration, such as the hips and shoulders [18].
In summary, SHAP analysis has proven useful in revealing where the power and
speed of a boxing punch come from. I believe this is a good proof of concept
that explainable AI can aid the understanding of biomechanics in sports,
allowing athletes to optimise and improve their technique [27].


5 Conclusions and Further Work


5.1 Conclusions
In conclusion, this study shows how explainable AI can be used to determine the
key aspects of a boxer's or athlete's form that produce a more powerful or
faster punch, serving as a successful proof of concept to guide future
innovation and research, as SHAP analysis has proven useful for athletic
performance [33].
SHAP analysis highlights the critical body landmarks and verifies their
importance in generating fast and powerful punches. Coaches can use this
information to make data-driven recommendations and help optimise an athlete's
performance. Leveraging this form of research ensures that coaches' training
programmes are scientifically grounded and will improve the efficiency and
effectiveness of boxers' performance and movements [18].
This study has also proven useful in verifying the theory of the kinetic chain
and other biomechanical concepts. This solidified understanding of biomechanics
may aid developments in other sports and provides a foundation for more
evidence-based training programmes. Such advancements can enhance overall
athletic performance, not only in boxing but potentially elsewhere, contributing
to a deeper, more comprehensive, and more interpretable understanding of human
biomechanics.

5.2 Further Work


Although I believe this study provides useful insights and serves as a proof of
concept for using explainable AI to aid sports training and improve an athlete's
technique, there are still improvements that could be made:
• Larger Dataset - Due to time restrictions, we were only able to generate a
small dataset of punches. A larger dataset would give the model more
opportunities to form relationships between the inputs and the ground truth
data, potentially leading to clearer insights from the SHAP analysis. A larger,
more diverse dataset would also reduce the chance of overfitting and improve
the model's generalisation to new data.
• Inclusion of Multiple Boxers - Only one boxer was used in this study, simply
due to the lack of available participants. Ideally, a dataset containing
upwards of 10 boxers would be collected, so that the model is not overfitted to
a single boxer. This diversity would also strengthen the SHAP insights, as they
would be drawn from more than one individual.


References
[1] L. Alzubaidi, J. Zhang, A.J. Humaidi, and et al. Review of deep learning: concepts, cnn architectures,
challenges, applications, future directions. Journal of Big Data, 8:53, 2021.

[2] Arduino.cc. Arduino nano 33 ble sense. https://docs.arduino.cc/hardware/nano-33-ble-sense/,
2024. Accessed: 14 May 2024.

[3] J.S. Arlotti, W.O. Carroll, Y. Afifi, P. Talegaonkar, L. Albuquerque, R.F.B. V, J.E. Ball, H. Chander,
and A. Petway. Benefits of imu-based wearables in sports medicine: Narrative review. International
Journal of Kinesiology and Sports Science, 10(1):36–43, 2022. [Online; accessed 19-May-2023].

[4] Dillon Bowen and Lyle Ungar. Generalized shap: Generating multiple types of explanations in machine
learning, 2020.

[5] Boxing Science. Punch force - the science behind the punch, 2014. [Online; accessed 14-May-2024].

[6] Z. Cao, G. Hidalgo, T. Simon, S. Wei, and Y. Sheikh. Openpose: Realtime multi-person 2d
pose estimation using part affinity fields. IEEE Transactions on Pattern Analysis & Machine
Intelligence, 43(01):172–186, jan 2021.

[7] Lucas Cinelli, Gabriel Chaves, and Markus Lima. Vessel classification through convolutional neural
networks using passive sonar spectrogram images. 05 2018.

[8] cvit.iiit.ac.in. Depth-image representations. https://cvit.iiit.ac.in/research/projects/
cvit-projects/depth-image-representations. Accessed: 14 May 2024.

[9] DataRobot AI Platform. Introduction to loss functions, n.d. [Online; accessed 14-May-2024].

[10] DeepAI. Exploding gradient problem, 2019. [Online; accessed 14-May-2024].

[11] R. Gajbhiye, S. Jarag, P. Gaikwad, and S. Koparde. Ai human pose estimation: Yoga pose detection and
correction. International Journal of Innovative Science and Research Technology, 7(5), 2022. [Online;
accessed 14-May-2024].

[12] GeeksforGeeks. Introduction to recurrent neural network - geeksforgeeks, 2018. [Online; accessed 14-
May-2024].

[13] Google for Developers. Pose landmark detection guide — mediapipe. https://developers.
google.com/mediapipe/solutions/vision/pose_landmarker#pose_landmarker_model. Accessed: 14
May 2024.

[14] Robert I. Hamilton and Panagiotis N. Papadopoulos. Using shap values and machine learning to
understand trends in the transient stability limit. IEEE Transactions on Power Systems, 39(1):1384–
1397, 2024.

[15] Sam Harris. Inertial measurement unit (imu) - an introduction. https://www.advancednavigation.
com/tech-articles/inertial-measurement-unit-imu-an-introduction/, 2023. Accessed: 14 May
2024.

[16] IBM. What are neural networks?, 2023. [Online; accessed 14-May-2024].

[17] Rohit Josyula and Sarah Ostadabbas. A review on human pose estimation, 2021.

[18] Amrutha K, Prabu P, and Joy Paulose. Human body pose estimation and applications. In 2021
Innovations in Power and Advanced Computing Technologies (i-PACT), pages 1–6, 2021.

[19] W Ben Kibler and Timothy J Chandler. Sport-specific conditioning. The American Journal of Sports
Medicine, 23(3):472–479, 1995.

[20] W. Kim, J. Sung, D. Saakes, C. Huang, and S. Xiong. Ergonomic postural assessment using a new open-
source human pose estimation technology (openpose). International Journal of Industrial Ergonomics,
84:103164, 2021.

[21] Abhinav Lalwani, Aman Saraiya, Apoorv Singh, Aditya Jain, and Tirtharaj Dash. Machine learning in
sports: A case study on using explainable models for predicting outcomes of volleyball matches, 2022.

[22] Michael Lapinski, Carolina Brum Medeiros, Donna Moxley Scarborough, Eric Berkson, Thomas J.
Gill, Thomas Kepple, and Joseph A. Paradiso. A wide-range, wireless wearable inertial motion sensing
system for capturing fast athletic biomechanics in overhead pitching. Sensors, 19(17), 2019.

[23] Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. Deep learning. Nature, 521(7553):436–444, 2015.

[24] Jung B Lee, Rory B Mellifont, and Brendan J Burkett. The use of a single inertial sensor to identify
stride, step, and stance durations of running gait. Journal of Science and Medicine in Sport, 13(2):270–
273, 2010.


[25] Seth Lenetsky, Matt Brughelli, Roy J. Nates, J.G. Neville, Matt R. Cross, and Anna V. Lormier.
Defining the phases of boxing punches: A mixed-method approach. Journal of Strength and
Conditioning Research, 34(4):1040–1051, April 2020.

[26] Seth Lenetsky, Nigel K Harris, and Matt Brughelli. Assessment and contributors of punching forces in
combat sports athletes: Implications for strength and conditioning. Strength and Conditioning Journal,
35(2):1–7, 2013.

[27] P. Linardatos, V. Papastefanopoulos, and S. Kotsiantis. Explainable ai: A review of machine learning
interpretability methods. Entropy, 23(1):18, 2020. [Online; accessed 14-May-2024].

[28] Pantelis Linardatos, Vasilis Papastefanopoulos, and Sotiris Kotsiantis. Explainable ai: A review of
machine learning interpretability methods. Entropy, 23(1), 2021.

[29] M. Mishra. Convolutional neural networks, explained, 2020. [Online; accessed 14-May-2024].

[30] Carol A Putnam. Sequential motions of body segments in striking and throwing skills: descriptions
and explanations. Journal of Biomechanics, 26(Suppl 1):125–135, 1993.

[31] Jin Qiu, Jian Liu, and Yunyi Shen. Computer vision technology based on deep learning. In 2021
IEEE 2nd International Conference on Information Technology, Big Data and Artificial Intelligence
(ICIBA), volume 2, pages 1126–1130, 2021.

[32] L. Sigal. Human pose estimation. In Springer eBooks, pages 573–592. 2021.

[33] J. Silver and T. Huffman. Baseball predictions and strategies using explainable ai, n.d. [Online; accessed
14-May-2024].

[34] Jan Stenum, Kendra M. Cherry-Allen, Connor O. Pyles, Rachel D. Reetzke, Michael F. Vignos, and
Ryan T. Roemmich. Applications of pose estimation in human health and performance across the
lifespan. Sensors, 21(21), 2021.

[35] Jacopo Tosi, Fabrizio Taffoni, Marco Santacatterina, Roberto Sannino, and Domenico Formica.
Performance evaluation of bluetooth low energy: A systematic review. Sensors, 17(12), 2017.

[36] William C Whiting and Ronald F Zernicke. Biomechanics of Musculoskeletal Injury. Human Kinetics,
2008.

[37] www.ibm.com. What are recurrent neural networks? — ibm, 2024. [Online; accessed 15-May-2024].

[38] www.superdatascience.com. Recurrent neural networks (rnn): The vanishing gradient problem, n.d.
[Online; accessed 14-May-2024].

[39] Y. Yang, Y. Yuan, Z. Han, and G. Liu. Interpretability analysis for thermal sensation machine learning
models: An exploration based on the shap approach. Indoor Air, 2022.

[40] Robail Yasrab and Michael Pound. Phenomnet: Bridging phenotype-genotype gap: A cnn-lstm based
automatic plant root anatomization system, 05 2020.

[41] Yong Yu, Xiaosheng Si, Changhua Hu, and Jianxun Zhang. A Review of Recurrent Neural Networks:
LSTM Cells and Network Architectures. Neural Computation, 31(7):1235–1270, 07 2019.

[42] C. Zheng, W. Wu, C. Chen, T. Yang, S. Zhu, J. Shen, Nasser Kehtarnavaz, and M. Shah. Deep
learning-based human pose estimation: A survey. ACM Computing Surveys, 2023.
