AZURE KINECT BODY TRACKING UNDER REVIEW FOR THE SPECIFIC CASE OF UPPER LIMB EXERCISES

EUGENIO IVORRA, MARIO ORTEGA, MARIANO ALCANIZ
Institute for Research and Innovation in Bioengineering, Universitat Politecnica de Valencia, Valencia, Spain
DOI: 10.17973/MMSJ.2021_6_2021012
e-mail: [email protected]

A tool for human pose estimation and quantification using consumer-level equipment is a long-pursued objective. Many studies have employed the Microsoft Kinect v2 depth camera, but with the recent release of the new Azure Kinect a revision is required. This work investigates the specific case of estimating the range of motion in five upper limb exercises using four different pose estimation methods. The exercises were recorded with the Azure Kinect camera and assessed against the OptiTrack motion tracking system as baseline. The statistical analysis consisted of evaluation of intra-rater reliability with intra-class correlation, the Pearson correlation coefficient and the Bland–Altman procedure. The modified version of the OpenPose algorithm combined with the post-processing algorithm PoseFix showed excellent reliability, with most intra-class correlations above 0.75. The Azure body tracking algorithm had intermediate results. The results obtained justify clinicians employing these methods, as quick, low-cost and simple tools, to assess upper limb angles.

KEYWORDS
Human pose estimation, Microsoft Azure Kinect, Upper limb exercises, OptiTrack system

1 INTRODUCTION

Human balance disturbances are common disorders in different populations. Previous reports have shown the contribution of the upper limb to human postural control; for example, upper limb immobilisation or fatiguing arm exercises have shown that decreases in upper limb function negatively affect postural control [Souza et al., 2016]. Therefore, the assessment of upper limb disorders is crucial for planning proper rehabilitation or for treatment diagnosis.
Upper limb movements have usually been assessed through traditional tests and scales. The Wolf Motor Function Test, the Action Research Arm Test and the Melbourne Assessment are valid tools for measuring the quality of upper limb movement [Randall et al., 2001, van Wegen et al., 2010]. However, the majority of these scales may be biased and could introduce a subjective component due to the therapist's level of experience, and they are frequently time-consuming. In order to overcome the limitations of the traditional scales, instrumented systems have been developed and have appeared on the market. In clinical practice, the goniometer is a widely used instrument to measure range of motion. It is regarded as a simple, versatile and easy-to-use instrument, but reports indicate that its accuracy is highly dependent on the level of assessor experience and on the anatomical joint being measured. It is also limited to measuring joint angles in single planes and static positions [Walmsley et al., 2018].
More objective assessments of human movement are normally performed using optical marker-based motion capture systems [Cai et al., 2019]. These systems use reflective landmarks positioned on the body that are detected by optical cameras located in the tracking area. Optical systems are considered the gold standard due to their high accuracy in detecting human poses and movements [Aurand et al., 2017]. Nonetheless, the high cost (around €20,000), the amount of space required and, in some cases, the substantial amount of time spent preparing the subject for assessment make the introduction of these systems into clinics and daily routines difficult [Oh et al., 2018]. Cameras with the ability to detect depth and colour (RGB-D cameras), one of the earliest being the Microsoft Kinect v1 released in 2011, have previously been used and validated for human pose estimation (HPE) [Cai et al., 2019, Eltoukhy et al., 2017, Oh et al., 2018]. In this study, a Microsoft Azure Kinect camera is employed because it is one of the most modern and well-known RGB-D cameras.
The problem of HPE, i.e. the localisation of human joints in images, has recently seen significant progress thanks to convolutional neural networks (CNN). The state-of-the-art method OpenPose [Cao et al., 2017] is able to perform HPE of 19 persons in an image at 8.8 FPS on an Nvidia GTX 1080 GPU. This method has already been validated for motion analysis, but only for the bilateral squat exercise [Ota et al., 2020]. Two subsequent methods, Mask RCNN [He et al., 2017] and AlphaPose [Fang et al., 2017], made small improvements to the mean average precision (mAP) metric, but at the cost of slower runtime. Osokin modified the CNN of the OpenPose algorithm to make it more computationally efficient [Osokin, 2018]. This modified OpenPose algorithm is employed in this paper and is referred to as OpenPoseMod.
Obtaining the 3D skeletal pose from a monocular RGB image is a much harder challenge than the 2D case and has been attempted by fewer methods [Bogo et al., 2016, Martinez et al., 2017]. Unfortunately, these methods are typically offline or do not provide predictions in real-world units [Mehta et al., 2017a]. This can be addressed with the additional depth channel provided by RGB-D sensors, which resolves the forward–backward ambiguities of monocular pose estimation. The best-known studies are those based on the Microsoft Kinect software development kit (SDK) [Wang et al., 2015], which exploits temporal and kinematic constraints to produce a smooth output skeleton. This system is popular in clinical practice because it gives good results and is easy to use thanks to its free SDK, which is able to track the movement of human joints without markers. It has been employed and validated in numerous clinical applications such as gait and motion analysis, shoulder joint angles and jump-landing kinematics [Asaeda et al., 2018, Eltoukhy et al., 2017, Valevicius et al., 2019].
To the best of our knowledge, this is the first study to analyse the new Microsoft Azure Kinect body tracking technology together with two recent 3D human pose algorithms, one using depth information and one using only colour information (OpenPoseMod and RGB-3DHP), for upper limb movements. For this comparison, a baseline test was performed with the optical marker-based OptiTrack system.

2 METHODS
2.1 Participants

Thirty healthy individuals, 20 men and 10 women, participated in this study. They had no known musculoskeletal or vestibular disease.
The Universitat Politècnica de València granted ethical approval for the study. All participants signed an informed consent form. The mean age of the participants was 31.5 ± 10.3 years. The participants did not present any mobility impairment. Their average height was 1.7 m, their average weight was 70.6 kg, and 22 were right-handed.

2.2 Instrumentation and procedures
2.2.1 Human pose estimation methods

Four different HPE methods were investigated and tested in this study: OptiTrack as the baseline; the Azure Kinect body tracking method as the modern update of the commonly used Kinect v2 employed in numerous rehabilitation studies and clinical trials; OpenPoseMod as a CNN-based method that employs an RGB-D camera to obtain 3D human poses; and an RGB-based human pose estimation method that leverages state-of-the-art algorithms in the field.
The first HPE method employed was the OptiTrack motion capture system. OptiTrack typically generates less than 0.2 mm of measurement error; thus, it is considered the gold standard [Nagymáté and Kiss, 2018]. The setup employed included 28 Prime 13 cameras distributed equally along two levels of height. This setup is able to capture a volume of 12 × 6 × 2 meters and costs around €23,000. The skeleton obtained with the OptiTrack proprietary human motion tracking algorithm corresponds to the Rizzoli marker set protocols [Leardini et al., 2007].
The second HPE method was the Microsoft Azure Kinect body tracking method (Azure). Azure is able to obtain a 3D human skeleton composed of 32 joints with their 3D coordinates. It presumably provides better tracking results than Kinect v1 and v2 and can record more joints (32 instead of 20–25). This technology employs an unpublished deep-learning algorithm to estimate the 3D joints for each RGB-D image [Shotton et al., 2013] using only depth information. It works in real time on consumer equipment with a dedicated graphics processing unit.
The third HPE method was a modified OpenPose-based algorithm (OpenPoseMod) that first estimates a 2D human pose using a CNN called Lightweight OpenPose [Osokin, 2018] and then uses depth information to obtain the corresponding 3D human pose for each person. In order to enhance the precision and accuracy of OpenPoseMod, after estimation of the 2D human pose, the next step is a post-processing filter using the state-of-the-art method PoseFix [Moon et al., 2019]. PoseFix reported an increase from the 64.2% average precision of the original OpenPose up to a 76.7% mean average precision on the public Microsoft COCO human pose dataset [Lin et al., 2014]. Once the 2D image points of each joint have been extracted and refined, the 3D points can be easily and efficiently recovered with the information from the depth image. The complete procedure for obtaining the 3D human pose with OpenPoseMod is synthesised in Fig. 1.
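To make this depth-based lifting step concrete, the following minimal sketch (our illustration, not the authors' implementation) back-projects a single 2D keypoint to a 3D point under a pinhole camera model, assuming the depth map has already been registered to the colour image; the small median window used to cope with missing depth pixels is an assumed detail.

import numpy as np

def backproject_keypoint(u, v, depth_mm, fx, fy, cx, cy, win=3):
    """Lift a 2D keypoint (u, v) to a 3D point in camera coordinates.

    depth_mm: depth image (millimetres) registered to the colour image.
    fx, fy, cx, cy: pinhole intrinsics of the colour camera.
    A median over a small window around (u, v) makes the lookup robust
    to missing or noisy depth pixels (0 = no measurement).
    """
    h, w = depth_mm.shape
    u0, u1 = max(0, int(u) - win), min(w, int(u) + win + 1)
    v0, v1 = max(0, int(v) - win), min(h, int(v) + win + 1)
    patch = depth_mm[v0:v1, u0:u1].astype(float)
    valid = patch[patch > 0]
    if valid.size == 0:
        return None                      # no usable depth at this joint
    z = np.median(valid) / 1000.0        # depth in metres
    x = (u - cx) * z / fx                # pinhole back-projection
    y = (v - cy) * z / fy
    return np.array([x, y, z])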
The fourth HPE method was RGB 3D human pose estimation (RGB-3DHP). As shown in Fig. 1, it is composed of two parts. The first infers the 2D joint locations and joint detection confidences (heatmaps) using the Lightweight OpenPose CNN, and the second estimates occlusion-robust pose maps and infers the 3D pose of the joints. This method, defined in [Mehta et al., 2018], is a further development of the VNect [Mehta et al., 2017b] algorithm.
These 3D human pose estimation techniques were run offline on the recorded media. However, the processing time was evaluated in order to determine whether each technology could be employed in real time. A personal computer with an Intel Xeon W3225 CPU, an Nvidia Titan RTX GPU and 128 GB of RAM was employed for testing. The results shown in Table 1 were obtained by averaging the processing time over all subjects and exercises. According to these results, and under the assumption that real time requires more than 20 FPS, we can affirm that the only method that does not work in real time is OpenPoseMod with PoseFix. However, OpenPoseMod without PoseFix can work at up to 35 FPS versus the 8.8 FPS of the original OpenPose implementation.

Figure 1. Schematic diagram of the procedure for estimating the 3D human skeleton with three methods

A summary of the comparison of these four 3D HPE methods is shown in Tab. 1.

Item | OptiTrack | Azure Kinect body tracking | OpenPoseMod | OpenPoseMod + PoseFix | RGB-3DHP
Algorithm | Infrared marker tracking | CNN using depth image | CNN using RGB and depth image | CNN using RGB and depth image | CNN using RGB image
2D keypoints detection | No | Yes, projecting the 3D to 2D | Yes | Yes | Yes
3D keypoints | Yes | Yes | Yes, infers 3D using the depth image | Yes, infers 3D using the depth image | Yes, infers 3D using a CNN
Calibration required | Yes | No | No | No | No
Cost (€) | 20,000 | 399 | 180 | 180 | 50
Hardware | NIR cameras | Microsoft Azure Kinect | RGB-D camera | RGB-D camera | RGB camera
FPS | 120 ±1 | 26 ±2 | 35 ±4 | 8 ±2 | 21 ±3

Table 1. A summary of the comparison of 3D human pose estimation methods
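The FPS values reported in Tab. 1 correspond to averaged per-frame processing times. A sketch of such a benchmark is given below; estimate_pose is a hypothetical placeholder for whichever HPE method is being timed and does not correspond to any specific SDK call.

import time
import numpy as np

def benchmark_fps(estimate_pose, frames, realtime_threshold=20.0):
    """Average per-frame processing time and derived FPS for one method."""
    times = []
    for frame in frames:
        t0 = time.perf_counter()
        estimate_pose(frame)              # run the HPE method on one frame
        times.append(time.perf_counter() - t0)
    fps = 1.0 / float(np.mean(times))
    return fps, fps > realtime_threshold  # real time assumed as > 20 FPS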
2.2.2 Procedure

Participants were asked to wear a black suit on which optical markers were positioned in order to obtain the 3D position of these landmarks in a virtual world through the OptiTrack technology. These subjects performed an exercise set to build an average model of the exercises that could be used for checking the results of the other technologies. Later, participants were asked to perform the same exercise set in their own clothes. The participants were assessed in four different areas without obstacles and repeated the exercises three times. The exercises were recorded using the Microsoft Azure Kinect RGB-D camera and the recorded media were processed with the three different HPE methods (OpenPoseMod, Azure and RGB-3DHP). This way, the coordinate system, timeline and conditions were equal for all of them. The participants did not perform the exercises simultaneously with OptiTrack because the other three technologies lost accuracy significantly due to the black suit with the optical markers. A manual temporal and spatial synchronisation was performed to align the OptiTrack results with the other three HPE methods. The Microsoft Azure Kinect camera was located 2.5 m from the participant and 0.9 m above the floor during the experimental tests, and was configured to record colour at 1280 × 720 resolution, depth in narrow field-of-view unbinned mode at 640 × 576, and at a 30 FPS frame rate.
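For reference, a recording configuration equivalent to the one described above could look as follows. This is only an assumed sketch using the third-party pyk4a Python bindings for the Azure Kinect; it is not the acquisition code used in the study.

# Hypothetical acquisition sketch with the third-party pyk4a bindings,
# mirroring the reported configuration: 1280x720 colour, NFOV unbinned
# depth (640x576), 30 FPS.
from pyk4a import PyK4A, Config, ColorResolution, DepthMode, FPS

config = Config(
    color_resolution=ColorResolution.RES_720P,
    depth_mode=DepthMode.NFOV_UNBINNED,
    camera_fps=FPS.FPS_30,
    synchronized_images_only=True,
)
k4a = PyK4A(config)
k4a.start()
capture = k4a.get_capture()
color = capture.color              # 1280 x 720 colour image
depth = capture.transformed_depth  # depth registered to the colour image
k4a.stop()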
The exercise set was composed of five exercises designed for the assessment of numerous rehabilitation parameters, especially for the joints of the upper body. Exercises lasted between 20 and 40 seconds each, with 30 seconds of break between them. Please check the supplementary materials for a graphical description. These exercises were:
1. Shoulder abduction in the frontal plane (shABfp).
2. Flexion of the shoulder in the sagittal plane (shFLsg).
3. Flexion of the elbow in the sagittal plane (elFLsg).
4. External rotation of the shoulder in the zenith plane (shROTzp).
5. Horizontal flexion of the shoulder in the zenith plane (shFLzp).
2.3 Statistical and data analysis

The raw 3D joint positions were smoothed for all HPE methods with a rolling-window median of two seconds. Then, kinematic parameters were calculated as the average performance of all the subjects for each technology. The kinematic parameters defined the range of motion (minimum, maximum and difference) of the angles between the joints involved in the movement. Shoulder and elbow angles were calculated following the international standards defined by the International Society of Biomechanics [Wu et al., 2005].
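The following sketch illustrates this processing step (our simplification, not the published pipeline): a two-second rolling median at 30 FPS and a three-point joint angle reduced to its minimum, maximum and range. The full ISB conventions [Wu et al., 2005] define anatomical coordinate systems that are more involved than this simple angle.

import numpy as np

def rolling_median(traj, fps=30, seconds=2.0):
    """Median-smooth a (T, 3) joint trajectory with a rolling window."""
    half = int(fps * seconds) // 2
    out = np.empty_like(traj, dtype=float)
    for t in range(len(traj)):
        lo, hi = max(0, t - half), min(len(traj), t + half + 1)
        out[t] = np.median(traj[lo:hi], axis=0)
    return out

def joint_angle(p_prox, p_joint, p_dist):
    """Angle (degrees) at p_joint formed by the two adjoining segments."""
    a = p_prox - p_joint
    b = p_dist - p_joint
    cosang = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
    return np.degrees(np.arccos(np.clip(cosang, -1.0, 1.0)))

def range_of_motion(shoulder, elbow, wrist):
    """Minimum, maximum and range of the elbow flexion angle over time."""
    angles = np.array([joint_angle(s, e, w)
                       for s, e, w in zip(shoulder, elbow, wrist)])
    return angles.min(), angles.max(), angles.max() - angles.min()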
Pearson correlation coefficients were calculated to determine the concurrent validity of the three technologies against the OptiTrack baseline at an alpha value of 0.05. In addition, the intra-class correlation (ICC) for model (2,k) was calculated to assess the consistency of the within-subject agreement between systems, taking into account possible systematic errors. ICC was considered poor if it was < 0.4, fair if 0.4–0.6, good if 0.6–0.75 and excellent if ≥ 0.75 [McGinley et al., 2009].
Finally, Bland–Altman analysis [Bland and Altman, 2010] was also performed. It is commonly employed to compare two methods of measurement and to interpret the findings in order to determine whether a new method of measurement could replace an existing accepted 'gold-standard' method [Ota et al., 2020]. The statistical analyses were performed using the Matlab R2019B computational environment and Microsoft Excel 2016.
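These three statistical procedures can be reproduced with standard formulas; the sketch below (our illustration, the study itself used Matlab) computes the Shrout–Fleiss ICC(2,k) from a subjects-by-systems matrix, the Pearson coefficient and the Bland–Altman bias with 95% limits of agreement. The numeric values in the example are hypothetical.

import numpy as np
from scipy import stats

def icc_2k(Y):
    """Shrout & Fleiss ICC(2,k) for an (n subjects x k systems) matrix."""
    n, k = Y.shape
    grand = Y.mean()
    row_means = Y.mean(axis=1)
    col_means = Y.mean(axis=0)
    msr = k * np.sum((row_means - grand) ** 2) / (n - 1)   # between subjects
    msc = n * np.sum((col_means - grand) ** 2) / (k - 1)   # between systems
    resid = Y - row_means[:, None] - col_means[None, :] + grand
    mse = np.sum(resid ** 2) / ((n - 1) * (k - 1))          # residual error
    return (msr - mse) / (msr + (msc - mse) / n)

def bland_altman(a, b):
    """Bias and 95% limits of agreement between two measurement methods."""
    diff = np.asarray(a) - np.asarray(b)
    bias = diff.mean()
    sd = diff.std(ddof=1)
    return bias, (bias - 1.96 * sd, bias + 1.96 * sd)

# Example with hypothetical per-subject angle ranges (degrees):
optitrack = np.array([118.0, 125.4, 110.9, 130.2])
azure = np.array([115.2, 128.1, 108.4, 133.0])
r, p = stats.pearsonr(optitrack, azure)             # concurrent validity
icc = icc_2k(np.column_stack([optitrack, azure]))   # agreement between systems
bias, loa = bland_altman(azure, optitrack)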
3 RESULTS

The ICC obtained for each HPE method and for each exercise are shown in Tab. 2. To determine which exercises were estimated better and which worse by each HPE method, the percentage difference from OptiTrack was also determined (Fig. 2).

Figure 2. Percentage difference in the upper limb angle between OptiTrack and the other technologies. The lower the percentage, the better the technology.

Ex | Kinematic parameter | Azure (L / I / H) | OpenPoseMod (L / I / H) | RGB-3DHP (L / I / H)
1 | min angle shABfp | 0.63 / 0.80 / 0.83 | 0.71 / 0.84 / 0.89 | 0.49 / 0.79 / 0.91
1 | max angle shABfp | 0.74 / 0.90 / 0.97 | 0.76 / 0.90 / 0.95 | 0.47 / 0.75 / 0.86
1 | angle range shABfp | 0.72 / 0.89 / 0.94 | 0.77 / 0.92 / 0.97 | 0.46 / 0.74 / 0.85
2 | min angle shFLsg | 0.55 / 0.72 / 0.72 | 0.61 / 0.72 / 0.76 | 0.46 / 0.74 / 0.85
2 | max angle shFLsg | 0.74 / 0.91 / 0.97 | 0.78 / 0.92 / 0.97 | 0.50 / 0.81 / 0.93
2 | angle range shFLsg | 0.70 / 0.86 / 0.91 | 0.74 / 0.88 / 0.93 | 0.49 / 0.80 / 0.91
3 | min angle elFLsg | 0.63 / 0.79 / 0.82 | 0.46 / 0.54 / 0.57 | 0.44 / 0.70 / 0.81
3 | max angle elFLsg | 0.71 / 0.88 / 0.93 | 0.77 / 0.91 / 0.96 | 0.44 / 0.71 / 0.81
3 | angle range elFLsg | 0.73 / 0.89 / 0.95 | 0.65 / 0.77 / 0.81 | 0.38 / 0.61 / 0.70
4 | min angle shROTzp | 0.64 / 0.81 / 0.84 | 0.76 / 0.90 / 0.95 | 0.46 / 0.74 / 0.84
4 | max angle shROTzp | 0.60 / 0.77 / 0.79 | 0.69 / 0.82 / 0.87 | 0.38 / 0.61 / 0.70
4 | angle range shROTzp | 0.48 / 0.63 / 0.65 | 0.64 / 0.76 / 0.80 | 0.29 / 0.47 / 0.54
5 | min angle shFLzp | 0.45 / 0.60 / 0.63 | 0.68 / 0.81 / 0.86 | 0.33 / 0.52 / 0.60
5 | max angle shFLzp | 0.69 / 0.86 / 0.90 | 0.73 / 0.87 / 0.92 | 0.49 / 0.79 / 0.91
5 | angle range shFLzp | 0.62 / 0.79 / 0.81 | 0.63 / 0.75 / 0.79 | 0.35 / 0.57 / 0.65

Table 2. Intra-class correlation (ICC) for kinematic parameters calculated for upper limb joints (L = lower bound, I = ICC, H = higher bound). In each row, the highest ICC marks the best method.

The five-task average of the percentage difference for Azure, OpenPoseMod and RGB-3DHP was 10.7%, 7.6% and 18.2%, respectively. These results are consistent with the root mean square deviation (RMSD) of 10 degrees for Azure, 8 degrees for OpenPoseMod and 22 degrees for RGB-3DHP, and with Pearson correlation coefficients of 0.979, 0.988 and 0.941 (all p < 0.05), respectively. These values show that there was a strong correlation, but still some difference in accuracy, between OptiTrack and the RGB-D camera-based methods.
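For clarity, the two error measures quoted above can be computed as follows; this sketch assumes the percentage difference is taken relative to the OptiTrack value of each kinematic parameter, which is one plausible reading of the text.

import numpy as np

def rmsd(estimate, reference):
    """Root mean square deviation between two angle series (degrees)."""
    e = np.asarray(estimate, dtype=float)
    r = np.asarray(reference, dtype=float)
    return float(np.sqrt(np.mean((e - r) ** 2)))

def pct_difference(estimate, reference):
    """Mean absolute difference expressed as % of the reference values."""
    e = np.asarray(estimate, dtype=float)
    r = np.asarray(reference, dtype=float)
    return float(np.mean(np.abs(e - r) / np.abs(r)) * 100.0)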
Finally, the results of the Bland–Altman analysis are shown in Fig. 3. For the three camera-based methods, 95% of the measures were inside the limits of agreement (LOA). A perfect match between methods would give a mean of 0, and a smaller LOA means better agreement.

Figure 3. Bland–Altman plot for upper-limb exercises for each technology.

4 DISCUSSION
4.1 Discussion of results

The purpose of this study was to evaluate and compare the performance of the Azure body tracking algorithm with two alternative human pose estimation algorithms, using the OptiTrack system as benchmark. From the analysis shown in Fig. 2, it can be concluded that the results greatly depend on the type of exercise. These measures are used, for example, in rehabilitation exercises to assess the degree of recovery from injuries that limit the range of motion. Currently, these angles are clinically measured using goniometers; therefore, a fast and accurate camera-based method would be a significant improvement. Exercises with smaller differences from OptiTrack are those in the frontal plane, parallel to the camera sensor, and those with wide movements. The algorithms produced worse pose estimations when the subjects' own clothes happened to be black, when the arms were held tightly against the body, or when they were aligned with the viewing direction of the camera. Exercise 4 suffered from two of these problems, which could be rectified by changing the camera position (or using another camera) so that the movement lies in the frontal plane. Moreover, previous studies such as [Bonnechere et al., 2014] have reported poor results when using the Kinect for measuring the elbow angle in the sagittal plane.
Azure had an excellent ICC (Tab. 2) except for measuring shoulder rotation in the zenith plane, which was rated good (range 0.63–0.81). These results are lower than those obtained by [Cai et al., 2019], with ratings of 0.59–0.96 for shoulder motions measured by Kinect v2. The absolute mean error for Azure was 10.7 degrees, higher than the 6 degrees reported by [Shotton et al., 2013] for different upper limb movements measured with Kinect v2 or the 7.6 degrees of error of [Wiedemann et al., 2015], also with Kinect v2. These results were recently corroborated by Albert et al. [Albert et al., 2020], who reported during a gait analysis that the Kinect v2 performed better than the Azure in the mid and upper body region, especially in the upper extremities. However, a specific study should be performed with the same conditions and exercises in order to conclude whether it is meaningful to upgrade from Kinect v2 to Azure Kinect. Moreover, the overall Azure results are lower than those obtained for OpenPoseMod with PoseFix, although the latter comes at the cost of offline processing. With the overall results reviewed, it can be concluded that the best RGB-D camera method of the three analysed in this study is OpenPoseMod. This method can be employed as an effective alternative to the traditional goniometer or the expensive OptiTrack.
Based on the results of this study, several recommendations can be made. It is important to be aware of the advantages and disadvantages of each technology. For example, if the illumination cannot be controlled or the scenario is cluttered, it is a good idea to employ the Azure algorithm because it is based only on depth information calculated using the infrared spectrum. On the other hand, OpenPoseMod showed better performance, mainly due to the employment of PoseFix, when the acquisition scenery was controlled. RGB-3DHP could also be employed, at a very low cost, for some exercises because it only needs a common RGB camera, although its accuracy is lower than that of the other two methods. This method is recommended for ludic applications or motivational games in rehabilitation with animated avatars, as in [Tannous et al., 2016]. Fig. 4 summarises the recommendations depending on the application requirements. It is important to remark that this diagram shows our recommended method for each context, but it does not imply that the other methods are not valid.

Figure 4. Decision tree for recommendations from the RGB-D methods for upper limb joint estimation
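The decision logic of Fig. 4, as far as it can be read from the discussion above, can be paraphrased roughly as in the sketch below; this is our own summary of the recommendations, not a reproduction of the published decision tree.

def recommend_method(controlled_illumination, offline_processing_ok,
                     depth_camera_available):
    """Rough paraphrase of the recommendations discussed in Section 4.1."""
    if not depth_camera_available:
        return "RGB-3DHP"               # lowest cost, RGB only, lowest accuracy
    if not controlled_illumination:
        return "Azure body tracking"    # depth/IR only, robust to lighting and clutter
    if offline_processing_ok:
        return "OpenPoseMod + PoseFix"  # best accuracy, but not real time
    return "OpenPoseMod"                # real-time compromise (about 35 FPS)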
6 CONFLICTS OF INTEREST

The authors declare that there are no conflicts of interest.

ACKNOWLEDGMENTS

7 SUPPLEMENTARY MATERIALS

Figure S1. OptiTrack setup. (b) Subject in motion capture suit
Figure S3. Acquisition setup with the RGB-D camera
Table S1. Graphical explanation of the exercise set for the upper limb
Table S2. Kinematic parameters calculated for upper limb joints (in degrees)