THÈSE
présentée par : Jindong JIANG
soutenue le : 28 Août 2023
Jury
M. Philippe PUDLO, Professeur des Universités, UPHF Rapporteur
M. Mohsen ZARE, Maître de conférences HDR, UTBM Rapporteur
Mme Adriana SAVESCU, Docteur, Ingénieur de recherche, INRS Examinatrice
Mme Laurence CHEZE, Professeur des Universités, UCBL Examinatrice
M. Ali SIADAT, Professeur des Universités, ENSAM-Metz Directeur
Mme Wafa SKALLI, Professeur des Universités, ENSAM-Paris Co-encadrante
M. Laurent GAJNY, Maître de conférences, ENSAM-Paris Co-encadrant
Acknowledgement
First and foremost, I would like to thank my supervisors Prof. Ali Siadat (Ph.D.), Prof. Wafa
Skalli (Ph.D.) and Assoc. Prof. Laurent Gajny (Ph.D.) for having accompanied me, guided me
and helped me every time I needed it during my doctoral study.
I would like to express my sincere thanks to all the members of the jury. I am sincerely thankful
to Prof. Philippe Pudlo (Ph.D.) and Assoc. Prof. Mohsen Zare (Ph.D.) for kindly agreeing to
review this manuscript. Furthermore, I express my appreciation to Dr. Adriana Savescu (Ph.D.)
and Prof. Laurence Chèze (Ph.D.) for graciously accepting the invitation to be examiners in the
jury.
I would also like to thank the whole research teams in Laboratoire de Conception Fabrication
Commande (LCFC) and Institut de Biomécanique Humaine Georges Charpak (IBHGC), for
everything they brought to me, both on a cultural and technical level.
I would like to thank Prof. Helene Pillet (Ph.D.) and Dr. Xavier Bonnet (Ph.D.) for providing
important information and resources. I am also deeply grateful to Mr. Sylvain Persohn (Eng.)
for his technical assistance, to Assoc. Prof. Laurent Gajny (Ph.D.) and Ms. Laura Valdes (Eng.)
for their help with the motion capture and medical imaging systems. Additionally, I want to
acknowledge the invaluable help of Dr. Saman Vafadar (Ph.D.) in understanding the
theory and equipment related to motion capture. Finally, I would like to express my appreciation
to Dr. Marc Khalifé (MD.) and Mr. Guillaume Rebeyrat (Ph.D. very soon!) for their cooperation
in the data acquisitions. I would like to extend my sincere appreciation to Mme Marine Souq
for consistently ensuring timely completion of everything and for her exceptional proficiency
in coordinating meetings. Also, I would like to express my sincere gratitude to Mr. Christophe
Muth-Seng (Ph.D.) and Mr. Rémi Valentin (Ph.D. very soon!) for their continuous support and generosity as colleagues. I hope that we will continue to maintain this friendship, which goes beyond our professional relations.
I would also like to thank AMVALOR and École nationale supérieure d'Arts et Métiers
(ENSAM) for having given me the chance, through this Ph.D. project, to discover the research
world in such good conditions.
Finally, I thank all the people with whom I had the opportunity to collaborate and exchange
during my Ph.D. study. This study has been a wonderful experience, and it is thanks to each of you.
Abstract
Abstract in English In an industrial environment, workers are often required to repeat specific gestures at the workstation, which can cause musculoskeletal disorders (MSDs).
However, due to the complexity of the factory environment, an automatic system that can
perform quantitative biomechanical analyses of workers without interfering with their daily
work at the workstation is not currently available. The objective of this manuscript is to explore
the possibility of conducting biomechanical and ergonomic analysis using computer vision. A
framework was developed based on 4 digital cameras to estimate the 3D human pose of workers.
With the multi-view computer vision approach, the worker’s motion during lifting/lowering
tasks can be accurately captured in a complex simulated factory environment. The load at the L5-S1 joint as well as the ergonomic evaluation were subsequently calculated. The results of the
study show that stereo-vision-based systems have the potential for automated quantitative
biomechanical and ergonomic analysis in industrial environments.
Résumé en Français Dans un environnement industriel, les opérateurs sont souvent amenés
à répéter des gestes spécifiques sur le poste de travail, ce qui peut provoquer des troubles
musculo-squelettiques (TMS). Cependant, en raison de la complexité de l'environnement
industriel, il n'existe pas actuellement de système automatique capable d'effectuer une analyse
biomécanique quantitative des opérateurs sans interférer avec leur travail quotidien au poste de
travail. L'objectif de ce manuscrit est d'explorer la possibilité d'effectuer des analyses
biomécaniques et ergonomiques via la vision par ordinateur. Un système a été développé en
utilisant 4 caméras numériques pour estimer la position humaine en 3D des opérateurs. Grâce
à l'approche de vision par ordinateur multi-vues, les mouvements des opérateurs pendant des
tâches de levage et d'abaissement peuvent être capturés dans un environnement d'usine simulé
complexe. La charge au niveau de l'articulation L5-S1 ainsi que l'évaluation ergonomique sont
calculées par la suite. Les résultats de l'étude montrent que les systèmes basés sur la vision
stéréo ont un fort potentiel pour l'analyse biomécanique et ergonomique quantitative
automatisée dans les environnements industriels.
Figure 2-1. A RULA assessment worksheet used to assess the RULA score (Wijsman et al., 2019)
Figure 2-2. An Xsens MVN is composed of 17 inertial and magnetic sensor modules. The suit is designed for efficient and convenient placement of sensors and cables (Roetenberg et al., 2013).
Figure 2-3. A 3D Vicon motion capture system that uses several cameras to track reflective markers in real time (Kovacic et al., 2018)
Figure 2-4. Microsoft Kinect sensor. The picture on top shows the external appearance of the Kinect sensor for Xbox 360. The image on the bottom shows the infrared projector, RGB camera and infrared camera included in the Kinect sensor (Zhang, 2012).
Figure 2-5. 3D human pose directly from the 2D image (Zheng et al., 2021a).
Figure 2-6. Examples of the selected 3D datasets
Figure 2-7. Thesis structure
Figure 3-1. Flowchart of the study on the effect of face blurring.
Figure 3-2. (a) The human model defined in the ENSAM dataset. (b) The 17 keypoints projected on two camera views.
Figure 3-3. Example images of the ENSAM Pose dataset after face blurring.
Figure 3-4. (a) Schematic diagram of the coordinate systems of the shoulders and trunk. (b) The 17 keypoints and the coordinate systems of the shoulders and trunk projected on two camera views, where the red, green, and blue axes represent the x-, y-, and z-axes, respectively, with a length of 20 cm in the world global coordinate system.
Figure 3-5. Distribution of joint angle differences (RMSEs of the frames for each subject in the test set, scheme: 75th/25th percentile). The whiskers indicate the minimum and maximum values of the data, except for the outliers marked with diamond-shaped black dots.
Figure 4-1. Placement of the Vicon and digital video cameras
Figure 4-2. Difference distribution of the 3D position estimation of each key point on the test set. The middle line and left/right sides of the box represent the median and 75th/25th percentiles. The whiskers indicate the minimum and maximum values of the data.
Figure 4-3. Samples of the human pose estimation results on the test set. Each column represents 2 of the 4 views in a single frame with the 3D positions of the key points projected onto the corresponding image. Green skeleton: estimate from multi-view RGB images. Red skeleton: reference from Vicon
Figure 4-4. Difference distribution of the 3D position estimation of each key point in S01 and the other subjects in the test set. The middle line and left/right sides of the box represent the median and 75th/25th percentiles. The whiskers indicate the minimum and maximum values of the data.
Figure 4-5. Samples of the human pose estimation results on S01. Each column represents 2 of the 4 views in a single frame. Green skeleton: estimate from multi-view RGB images. Red skeleton: reference from Vicon
Figure 5-1. Sample of biplanar radiographs and the corresponding 3D body envelope. (a) Sample images of low-dose biplanar X-ray radiographs; (b) 3D digital envelope of the human body based on radiography. Each segment is distinguished by a different color. The red circle on each segment represents the center of mass, and the axes represent its principal axes of inertia, with the length indicating the relative magnitude of the rotational inertia about the corresponding axis. The red crosses denote the body key points, which are used for the localization of the center of mass position.
Figure 5-2. The local coordinate system of the pelvis
Figure 5-3. Distribution of the RMSE of the forces generated by the top-down calculation strategy for the marker-based and markerless motion capture data
Figure 5-4. Distribution of the RMSE of the moments generated by the top-down calculation strategy for the marker-based and markerless motion capture data
Figure 5-5. Distribution of the RMSE of the forces generated by the bottom-up calculation strategy for the marker-based and markerless motion capture data
Figure 5-6. Distribution of the RMSE of the moments generated by the bottom-up calculation strategy for the marker-based and markerless motion capture data
Figure 5-7. Example result of L5-S1 load versus time during a lifting task (subject ID: S08, task trial: #19). Est.=estimate, Ref.=reference, Ant.-Post.=anteroposterior, Vert.=vertical, Medi.-Lat.=mediolateral, Front.=frontal plane, Trans.=transverse plane, Sag.=sagittal plane, Norm=Euclidean norm
Figure 5-8. Example result of L5-S1 load versus time during a lowering task (subject ID: S12, task trial: #00).
Figure 5-9. 3D digital human body reconstruction of the subjects based on SMPL-X. The images in the first row are the input RGB images, with the human faces blurred. The images in the second row present the reconstructed 3D human models. The images in the third row demonstrate the projection of the 3D body models on the corresponding images.
Figure 6-1. Flowchart of the study on the application in ergonomic analysis.
Figure 6-2. Definition of the human skeleton model and the key planes, adapted from (Li and Xu, 2019)
Figure 6-3. Upper arm adduction was defined as the angle between the projection vectors of 𝑆𝐿𝐻𝐿 and 𝑆𝐿𝐸𝐿 on P1.
Figure 6-4. The lower arm was considered to cross the middle line of the body if the angle between the projection vectors of 𝑁𝑊𝐿 and 𝑁𝑆𝐿 on P1 was greater than 90 degrees.
Figure 6-5. (a) The complement of the angle formed by the projection vectors of the P3 normal and 𝑁H on P2 was defined as the neck flexion. (b) The neck side bending angle was defined as the angle between the projection vectors of 𝐻𝐶𝑁 and 𝑁H on P1.
Figure 6-6. RULA calculation flow
Figure 6-7. Example of RULA score versus load at L5-S1
Figure 6-8. Distribution of Spearman's rank correlation coefficient between the load at L5-S1 and the RULA estimate for each subject. The middle line and top/bottom sides of the box represent the median and 75th/25th percentiles. The whiskers indicate the minimum and maximum values of the data, except for the outliers marked with diamond-shaped black dots.
Figure 6-9. RULA score versus L5-S1 load estimates during a specific lowering task I
Figure 6-10. RULA score versus L5-S1 load estimates during a specific lifting/lowering task II
List of Tables
Table 2-1. Conventional observational methods
Table 2-2. Evaluation of different studies on RGB-based human pose estimation
Table 3-1. Inference results with the models trained on blurred/original images. μ: mean value of the joint position differences with the reference Vicon system (mm). σ: standard deviation (mm).
Table 3-2. Results (p-values) of one-way repeated measures ANOVA tests for the effect of face blurring on kinematics calculation. Numbers with underscores indicate significant effects (p-value < 0.05).
Table 3-3. The RMSE (±SD of all the frames) of joint angles calculated with the joint positions from the models trained on blurred/original images (°).
Table 4-1. Mean value (μ) and standard deviation (σ) of the Euclidean distance between the estimates and the references of the key point positions on the test set, by keypoint, for each subject (anks.=ankles, shouls.=shoulders, elbs.=elbows)
Table 5-1. RMSE between the L5/S1 load calculations with marker-based and markerless motion capture for each combination of BSIP model and calculation strategy. μ and σ denote the mean and standard deviation of the RMSE across all trials, respectively.
Table 5-2. RMSE of the difference between the reference and estimate of the L5-S1 load for each subject across the trials
Table 5-3. Peak value difference between the reference and estimate of the L5-S1 load for each subject across the trials
Table 6-1. Accuracy of the RULA estimation on the test set
Table 6-2. Cohen's kappa of the RULA estimation on the test set
Table 6-3. RMSE of the RULA estimation on the test set
Glossary
The glossary includes the common abbreviations that are utilized within the thesis.
Acc Accuracy
EMG Electromyography
SD Standard Deviation
SI Strain Index
According to the way human motion kinematics are collected, ergonomics evaluation methods include observational and quantitative measurement methods (Joshi and Deshpande, 2019). The former approach involves qualitative or semi-quantitative on-site observation of workers' motion or analysis of recorded videos, while the latter generally uses wearable sensors to directly acquire accurate postures of the workers during a work process (Jasiewicz et al., 2007; Li and Buckle, 1999; Wang et al., 2015). The observational method is simple, but it is subjective and depends on the experience of the observer (Li and Xu, 2019). The quantitative measurement approach, which obtains information by directly gathering data from sensors attached to the subject's body, can obstruct workers' activities, posing an important challenge in industrial applications. Therefore, it is primarily used as a reference method in controlled lab environments to evaluate the accuracy of other techniques.
The limitations of these two approaches have motivated the development of camera-based human pose estimation methods. These methods leverage computer vision and artificial intelligence techniques to estimate human poses from images or videos obtained by ordinary cameras (RGB cameras), providing accurate estimates without depending on human experience. However, several scientific issues arise with RGB-based pose estimation. First of all, training machine vision models requires the acquisition and storage of image datasets of motion, and the storage of such datasets raises privacy issues for the subjects. Although the cameras are installed primarily for the health and safety of workers, they can be abused. In addition, due to privacy concerns, dataset sharing is restricted since the images or videos are normally not fully anonymized, making it difficult to combine datasets to train neural network models with better performance.
Secondly, the quantity and quality of datasets obtained by the cameras are essential in computer
vision algorithms. In particular, camera occlusion is a serious problem during motion capture
for workers in factories. While a multi-view stereo vision system can be utilized, processing
the multi-view image dataset while accounting for occlusion is nontrivial.
Thirdly, accurate estimation of the joint loads for workers during an industrial task from images
or recorded videos is important for enabling further study of the relationship between
biomechanical indicators and the MSDs risk. This biomechanical analysis requires knowledge
of the body segment inertial parameters (BSIPs) of workers. Technically, if the accurate
geometry of each body segment and the exact body mass distribution are known, one can reasonably estimate the inter-segmental loads of the human body. However, obtaining the exact geometry and mass distribution of the human body is not always possible.
The objective of the present manuscript is to explore the possibility of using RGB cameras to provide accurate measurements despite occlusions that may occur during a lifting/lowering task, and to rely on this quantitative motion capture to build a musculoskeletal model for joint load estimation. In the following chapter, a literature review of the above-mentioned scientific issues will be conducted, and our personal contributions will then be presented.
State of the art
In this chapter, we first conduct a literature review of the ergonomics assessment methods currently used in factories to identify their shortcomings. Next, we provide a comprehensive
overview of the methods used for biomechanical motion measurement, and their pros and cons
are discussed. This is followed by an investigation of the biomechanical analysis method. The
scientific issues associated with the adopted techniques are presented in relevant sections.
There are three main methods of ergonomic analysis, namely the self-assessment method, the
observational method, and the quantitative measurement method. The self-assessment method
focuses on analyzing workers' self-perceptions during the industrial task to assess the risk of
MSDs, which is usually conducted with questionnaires or interviews (Burdorf and Laan, 1991;
Jacobs, 1998). The advantages of the self-assessment method lie in the fact that it is relatively
simple to carry out and has a wide range of applicability. It is also suitable for surveying a large
number of subjects at a relatively low cost. However, the self-assessment method has an obvious disadvantage: it is highly subjective and easily influenced by various factors, such as the emotional, mental, and physical state of workers, which significantly affects its reliability in actual applications. Additionally, the worker's comprehension and interpretation of the questions may also affect the results of the self-assessment method.
Quantitative measurement methods usually attach sensors, such as IMUs and EMG electrodes, to the subject's body and collect motion and electromyographic signals to analyze the risk of musculoskeletal disorders in workers (Burdorf and van der Beek, 1999). Nowadays, thanks to
the rapid advancement of technology, sensors are becoming smaller and smaller (Lin et al., 2018), which allows them to capture information about workers with almost negligible impact on their daily work. In (Granzow et al., 2018), researchers used EMG and IMUs to capture muscle activation and posture information of forestation hand planters during their work to assess whether they were exposed to musculoskeletal disease risk. In the work of (Nordander et al., 2016), the authors used inclinometry, EMG, and electrogoniometry to collect information from workers and explore exposure-response relationships for musculoskeletal disorders of the neck and shoulder. In the study of (Schall et al., 2016), the authors estimated the physical demands of nurses during their work using IMUs and a waist-worn physical activity monitor.
However, although direct measurement sensors such as IMUs can already have a minimal impact on the work of employees, a number of barriers remain to their large-scale use in the real workplace. Besides privacy issues, workers need to be trained to work with wearable sensors, and in their daily work, workers may not always wear these sensors in accordance with their training (Schall et al., 2018). In addition, the quantitative measurement method requires a significant amount of time for the analysis and interpretation of the data, which is impractical in an industrial environment. Last but not least, considerable investment is needed to acquire the equipment, in addition to maintenance costs and the highly skilled biomechanics specialists or engineers required to operate it. Therefore, despite being considered highly accurate for ergonomic evaluation, this method is mostly limited to research and laboratory use because of the difficulty of deploying it in real work scenarios (Li and Buckle, 1999; Schall et al., 2018).
The conventional observational methods developed for documenting workplace exposure that
can be assessed by an evaluator or pre-designed form are shown in Table 2-1. Among them, the
most widely used MSDs risk assessment method is the RULA method (Rapid Upper Limb
Assessment) (Dockrell et al., 2012; Kee et al., 2020; McAtamney and Nigel Corlett, 1993).
Figure 2-1 illustrates a RULA worksheet used to assess the RULA score (Wijsman et al., 2019). The principal idea of RULA is to observe the work process of workers, identify the riskiest postures, and analyze and score the joint angles. The body is divided into two groups, named Group A and Group B, containing different body segments. During the analysis stage, a score is assigned to each posture according to the range of movement. Although RULA was developed for ergonomic risk assessment of the upper limb, its applicability for predicting low back pain has been demonstrated in (Labbafinejad et al., 2016; Rachmawati et al., 2022; Rezapur-Shahkolai et al., 2020).
Figure 2-1. A RULA assessment worksheet used to assess the RULA score (Wijsman et al., 2019)
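To make the scoring logic concrete, the following minimal sketch (Python, assuming 3D keypoints are already available) computes a joint angle from three landmarks and maps upper-arm flexion to its RULA sub-score. The band thresholds follow the published worksheet; adjustment terms (shoulder raised, abduction, arm supported) are omitted for brevity, and function names are illustrative, not from the thesis.

```python
import numpy as np

def joint_angle(proximal, joint, distal):
    """Angle (degrees) at `joint` between the two adjacent segments."""
    u = proximal - joint
    v = distal - joint
    cos = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
    return float(np.degrees(np.arccos(np.clip(cos, -1.0, 1.0))))

def upper_arm_score(flexion_deg):
    """RULA sub-score for upper-arm flexion (+) / extension (-)."""
    if -20.0 <= flexion_deg <= 20.0:
        return 1
    if flexion_deg <= 45.0:   # >20 deg extension or 20-45 deg flexion
        return 2
    if flexion_deg <= 90.0:   # 45-90 deg flexion
        return 3
    return 4                  # >90 deg flexion
```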
Another widely employed MSDs risk assessment method is the REBA (Rapid Entire Body
Assessment) method (Hignett and McAtamney, 2000), an expansion of the original RULA
method to include a larger range of postures, which aims to provide a postural analysis system
sensitive to MSD risks in a variety of tasks. REBA method divides the body into segments to
evaluate individually with regard to postures and movement planes. A scoring system for
muscle activity caused by static, dynamic, rapid changing and unstable postures is then
provided by REBA. Additionally, this method considers coupling as a vital variable in the
handling of loads and gives an action level output with an indication of urgency.
RULA and REBA are commonly adopted by ergonomists in real factory environments because of their convenience, providing easy-to-use assessment tools that require little time, effort, and equipment. Nevertheless, several limitations exist (Takala et al., 2010). The methods require separate scoring of the right and left sides, with no technique provided for merging this information. Besides, the duration of the task is not taken into account (Hignett and McAtamney, 2000; McAtamney and Nigel Corlett, 1993).
The NIOSH lifting equation has been widely employed by occupational health practitioners to evaluate the risk of low back pain related to lifting and lowering tasks (Waters et al., 1993). It is based on a mathematical formula that computes a “lifting index” accounting for several risk factors, such as the magnitude of the load being lifted, the distance of the load from the worker’s body, and the frequency of the tasks. The NIOSH equation incorporates the worker's postures into
the evaluation. The coefficients related to body distance, vertical distance from the ground, and
torsion are dependent on the postures performed by the worker. This approach has been
extensively documented and verified through numerous experiments conducted in laboratories,
establishing a strong foundation rooted in scientific research. The results are directly connected
to the potential health risks concerning the back, and online tools for calculation are also
accessible. However, many limitations of the NIOSH lifting equation exist regarding its
practical use. In particular, the need for multiple technical measurements and computations
leads to additional expertise and time required to perform the assessment (Takala et al., 2010).
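As an illustration of the computation involved, a minimal sketch of the revised NIOSH lifting equation in its metric form follows; the frequency and coupling multipliers are table lookups in the published method and are passed in here as plain parameters, and the function name is illustrative.

```python
def lifting_index(load_kg, H, V, D, A, FM=1.0, CM=1.0):
    """Revised NIOSH lifting equation (metric form).

    H: horizontal hand distance from mid-ankles (cm)
    V: vertical hand height at the lift origin (cm)
    D: vertical travel distance of the lift (cm)
    A: asymmetry angle (degrees)
    FM, CM: frequency and coupling multipliers (table lookups)
    """
    LC = 23.0                               # load constant (kg)
    HM = min(1.0, 25.0 / H)                 # horizontal multiplier
    VM = 1.0 - 0.003 * abs(V - 75.0)        # vertical multiplier
    DM = min(1.0, 0.82 + 4.5 / D)           # distance multiplier
    AM = 1.0 - 0.0032 * A                   # asymmetry multiplier
    rwl = LC * HM * VM * DM * AM * FM * CM  # recommended weight limit (kg)
    return load_kg / rwl                    # LI > 1 suggests elevated risk
```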
There are also several other observational methods, such as the European Assembly Worksheet, that enable assessment in a more comprehensive manner, taking into account factors such as repetition, exertion, and lifting/lowering tasks (Schaub et al., 2013). Further methods beyond the above-mentioned techniques are listed in Table 2-1.
Nevertheless, all of the observational methods suffer from many similar limitations (Takala et
al., 2010), which we summarize and analyze as follows. Firstly, since the primary purpose of
these methods is to identify key MSD risk factors expeditiously and economically, most of them
were not designed to take into account the worker’s individual characteristics, such as age,
gender, and physical condition. Secondly, methods such as RULA and REBA focus primarily
on assessing posture of the worker and have strict threshold limits, making them unsuitable for
accurately assessing tasks involving delicate manual object manipulation. Besides, the mono-
task assessment methods like the RULA and the REBA methods do not take into account the
dynamic movements of specific tasks since it only evaluates the posture of a worker in a static
position. Thirdly, one of the important limitations present in all observational techniques is their
dependence on the judgment, expertise, and experience of those conducting the assessments.
Consequently, variations in assessments between different evaluators can frequently arise even
on the same task (Eliasson et al., 2017; Rhén and Forsman, 2020; Widyanti A, 2020). Moreover,
in terms of the reliability and validity of observational methods, they are affected by both
external and internal factors. External factors are linked to the assessment's context,
encompassing the ergonomists who conduct assessments, the environment of the specific
workstation, and the individuals being evaluated, etc. Internal factors, in contrast, consist of elements associated with the approach itself, e.g., the assessment process, the employed threshold limits, and the exact standard applied in the risk evaluation.
Due to the subjectivity of ergonomists when analyzing work processes, the accuracy of the
conventional observational method is lower than that of the quantitative measurement method (Li and Xu, 2019). Nevertheless, it can analyze workers' movements without disrupting their work, a substantial advantage that makes it suitable for use in factories. Meanwhile, numerous
recent studies have developed new observational methods that leverage images or recorded
videos with computer vision algorithms to efficiently obtain the required motion kinematics for
ergonomic analysis (Abobakr et al., 2019; Li and Xu, 2019; Plantard et al., 2017c), which is
pursued later in sections 2.2 and 2.3 of this manuscript. This automated, quantitative, and non-
interfering tool may be of great importance to facilitate the estimation of the risk of
musculoskeletal disorders for workers in factories. While the above studies demonstrate the
feasibility of computer-vision-based approaches for ergonomic analysis, there are several
limitations that need to be addressed for actual applications in the workplace. These limitations
will be discussed in detail in the following sections.
In biomechanical and ergonomic research, human motion measurement systems can be grouped
into two categories, namely, quantitative measurement systems and marker-less motion capture
systems. In quantitative measurement systems, sensors or markers are attached to the human
body to capture directly its motion. In contrast, marker-less motion capture systems leverage
computer vision techniques to estimate human motion based on images or recorded videos from
depth or RGB cameras. In the following, both the quantitative and marker-less motion capture
systems are investigated.
Regarding the potential of this technology, the suitability of the system was tested only in a limited number of brief trial experiments (Caputo et al., 2019). As previously noted, the
drawbacks of IMU systems hinder their practical use in real-world industrial settings. These
drawbacks include the laborious setup process due to sensor placement and the intricate data
analysis required for professionals and specialists in occupational biomechanics (Schall et al.,
2018).
Figure 2-2. An Xsens MVN is composed of 17 inertial and magnetic sensor modules. The suit is designed for efficient and convenient placement of sensors and cables (Roetenberg et al., 2013).
Marker-based optical motion capture systems: Marker-based optical motion capture systems,
such as the Vicon System (Vicon Motion Systems Ltd, Oxford, UK), as shown in Figure 2-3,
are by far the most accurate among all motion measurement tools. During motion capture, LEDs or reflective markers are attached to the subject's body, and multiple synchronized cameras surround the subject to capture the infrared rays, either reflected from the markers or emitted from the LEDs (Kovacic et al., 2018). Subsequently, the accurate 3D positions of the markers
are calculated. In (Kolahi et al., 2007), the authors described a complete design of a high-speed
optical motion analyzer system, where the differential algorithm procedure has been utilized
for the image processing unit. In the work of (Prakash et al., 2015), the researchers developed
a passive marker-based optical motion capture system to obtain gait parameters. The prototype
of the system was shown to provide decent quantitative joint angles. In a recent study presented
in (Siaw et al., 2023), the researchers proposed a relatively new marker-based optical motion
capture system that uses a smartphone camera to capture the motions of red markers as well as
to track their coordinates and calculate the angles.
Figure 2-3. A 3D Vicon motion capture system that uses several cameras to track reflective
markers in real time (Kovacic et al., 2018)
It is noted that marker-based optical motion capture systems have an advantage over other systems due to their high accuracy in measuring human motion kinematics. In biomechanics and ergonomics research, they are often used as a reference to evaluate the accuracy of other systems. Nevertheless, these systems require sophisticated calibration and sufficient illumination, and they are not widely available due to their high cost. As a matter of
fact, the use of such systems is generally limited to the laboratory.
The markerless motion capture systems leverage computer vision algorithms to identify body
segment position and orientation by using images or videos recorded by depth or classical
cameras (RGB cameras). In this section, these two types of cameras are first reviewed. This is
followed by a detailed discussion of the scientific challenges faced by marker-less motion
capture systems.
Depth Cameras: Depth cameras, such as Kinect shown in Figure 2-4, usually contain an
infrared emitting device, which emits infrared light that can be reflected by objects in the scene
and then captured by an infrared camera to produce a depth image of the scene. With the boom in deep learning techniques, the accuracy of depth cameras in 3D human pose estimation has improved continuously. In (Wu et al., 2021), the authors combined deep learning algorithms with RGB-D images to estimate the pose of infants, achieving an average joint accuracy of 13.76 mm. In the field of biomechanics and ergonomics, a large number of studies estimate workers' postures during an industrial task (Abobakr et al., 2019; Bortolini et al., 2018; Faccio et al., 2019; Haggag et al., 2013; Halim and Radin Umar, 2018; Plantard et al., 2017a). Particularly, given the complex factory environment where workers' bodies are often obscured by machines, the study by (Plantard et al., 2017a) performed an experiment involving seven male workers in a car manufacturing factory. Five different workstations were assessed and the work tasks were recorded by a Microsoft Kinect 2 sensor. Based on these results, the Filtered Pose Graph was proposed to resolve the occlusion problem in getting and putting tasks through pose correction, reaching an accuracy of 90±20 mm on the 3D joint locations. The development of these algorithms has
greatly improved the performance of depth cameras for human pose estimation in complex
industrial environments.
Figure 2-4. Microsoft Kinect sensor. The picture on top shows the external appearance of the
Kinect sensor for Xbox 360. The image on the bottom is the infrared projector, RGB camera
and infrared camera included in the Kinect sensor (Zhang, 2012).
RGB Cameras: In the last decade, classical cameras, or RGB cameras, have gained significant
popularity as motion capture systems. Thanks to the development of artificial intelligence, the
accuracy of human pose estimation based on RGB images has been remarkably improved in
recent years, from the initial 2D human pose estimation to the high accuracy 3D human pose
estimation. To date, several studies have proposed various classical detectors for 2D human
pose estimation, such as OpenPose (Cao et al., 2019), CPN (Chen et al., 2018), AlphaPose (Fang et al., 2018), and HRNet (Cheng et al., 2020). The vast majority of these algorithms are based on
convolutional neural networks (CNN). Significant advancements have also been made in recent
years regarding the human pose estimation algorithm in 3D, which is the core of the present
manuscript.
The 3D single human pose estimation algorithms based on single-view RGB images include
the one-stage and two-stage methods. The one-stage method uses neural networks to map RGB
images directly to the 3D coordinates of the joints, as shown in Figure 2-5. This approach allows
end-to-end training of the neural network, but the algorithm architecture is often excessively
sophisticated and has relatively low interpretability. The two-stage approach first detects 2D
joint positions using a 2D human pose estimation algorithm, and subsequently, lifts the joint
positions from 2D to 3D. An increasing number of studies employed the two-stage approach
for 3D human pose estimation (Jiang et al., 2021; Li et al., 2021; Zheng et al., 2021b). The
main challenge of the single-view RGB-based human pose estimation is the occlusion issue
typically encountered during a work process, severely degrading the estimation accuracy. An
additional disadvantage of the single-view 3D human pose estimation is that its accuracy is
limited. State-of-the-art performance was reported for the MotionBERT model, with an average MPJPE of 16.9 mm on the Human3.6M dataset. It should be noted, however, that this best performance was obtained using 2D ground-truth joint points as input; with detected 2D pose sequences, performance degraded to an MPJPE of 37.5 mm (Zhu et al., 2023).
Figure 2-5. 3D human pose directly from the 2D image (Zheng et al., 2021a).
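For reference, MPJPE, the accuracy metric quoted above and throughout this section, is simply the mean Euclidean distance between estimated and ground-truth joint positions; a minimal sketch (the function name is illustrative):

```python
import numpy as np

def mpjpe(pred, gt):
    """Mean Per Joint Position Error for arrays of shape
    (n_frames, n_joints, 3), in the units of the input."""
    return float(np.mean(np.linalg.norm(pred - gt, axis=-1)))
```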
The significant improvement in the accuracy of 3D human pose estimation from RGB images
or videos has boosted the wide usage of RGB cameras for human motion capture in the field
of biomechanics and ergonomics (Li et al., 2020a; Mehrizi et al., 2019; Vafadar et al., 2021).
Particularly, in the study by (Mehrizi et al., 2019), two RGB cameras were utilized to capture
the motion of a worker carrying a heavy load, and inverse dynamics was applied to calculate the
joint forces. In (Vafadar et al., 2021), the researchers used four RGB cameras to capture human
motion on the strength of learnable triangulation methods and then carried out gait analysis
based on the result of human pose estimation, achieving a mean position difference of 13.1 mm across all joints. An evaluation of different studies on RGB-based human pose estimation is presented in Table 2-2. It is observed that while some authors have taken into account the effect of occlusion on the accuracy of human pose estimation, the effect of face blurring has not been well investigated.
Table 2-2. Evaluation of different studies on RGB-based human pose estimation

| Publication | Face-blurring | Occlusion | Applications | MPJPE* | mAP** |
|---|---|---|---|---|---|
| Cao et al. (2019) | No | Yes | 2D real-time multi-person keypoint detection in the presence of occlusion, crowding, contact, viewpoint, and appearance variation | - | 79% |
| Chen et al. (2018) | No | Yes | 2D multi-person pose estimation with occluded keypoints, invisible keypoints, and complex backgrounds | - | 73% |
| Fang et al. (2018) | No | Yes | Multi-person pose estimation in the presence of inaccurate human bounding boxes and redundant detections | - | 76.7% |
| Cheng et al. (2020) | No | No | Bottom-up human pose estimation in the presence of scale variation | - | 70.5% |
| Jiang et al. (2021) | No | Yes | 3D human pose estimation from a single monoscopic video, handling the temporal evolution of the scene and skeleton | - | - |
| Li et al. (2021) | No | No | Video-based 3D human pose estimation | 28.5 mm | - |
| Zheng et al. (2021) | No | No | Video-based 3D human pose estimation | 31.3 mm | - |
| Iskakov et al. (2019) | No | Yes | Multi-view 3D human pose estimation | 13.7 mm | - |
| Qiu et al. (2019) | No | Yes | 3D human poses from multiple calibrated cameras in the presence of occlusion, self-occlusion, and varying camera viewpoints | 26 mm | - |
| He et al. (2020) | No | Yes | 3D human joint estimation in the presence of occlusion and oblique viewing angles | 26.9 mm | - |
| Reddy et al. (2021) | No | Yes | 3D pose estimation and tracking of multiple people | 17 mm | - |

* MPJPE: Mean Per Joint Position Error; ** mAP: Mean Average Precision
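Many of the multi-view methods in Table 2-2 (e.g., the learnable triangulation cited above) build on the classical direct linear transformation (DLT) step. A minimal sketch, assuming calibrated cameras and already-matched 2D detections (the function name is illustrative):

```python
import numpy as np

def triangulate_dlt(projections, points_2d):
    """Linear (DLT) triangulation of one keypoint observed in N views.

    projections: iterable of N 3x4 camera projection matrices
    points_2d:   iterable of N (x, y) pixel detections of the keypoint
    Returns the 3D point in world coordinates.
    """
    rows = []
    for P, (x, y) in zip(projections, points_2d):
        # Each view gives two linear constraints on the homogeneous
        # 3D point X: x*(P[2].X) = P[0].X and y*(P[2].X) = P[1].X
        rows.append(x * P[2] - P[0])
        rows.append(y * P[2] - P[1])
    A = np.stack(rows)
    _, _, vt = np.linalg.svd(A)   # least-squares solution is the last
    X = vt[-1]                    # right singular vector of A
    return X[:3] / X[3]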
While several tools could be considered, markerless motion capture systems appear more advantageous than marker-based motion measurement techniques, since the former do not disrupt the working process in industrial applications. Compared with the color
images of ordinary RGB cameras, depth images are simpler, hence the required image
processing algorithms are relatively easy to implement. Additionally, depth information, often
combined with RGB information to obtain RGB-D images, can achieve a high degree of
accuracy with appropriate algorithms. However, depth cameras are extremely sensitive to light, meaning that their accuracy is reduced in complex lighting environments such as industrial environments. Moreover, the detection range of a depth camera is quite limited, so it fails to compute depth effectively when the object in the scene is far away (Yu et al., 2018). Both types of cameras encounter the issue of object occlusion when
capturing images, therefore computer vision algorithms are particularly useful in the subsequent
image processing. In this work, we will explore the ability to use RGB cameras and deep
learning to provide accurate measurements despite occlusions that may occur during human
pose estimation. However, several challenges in RGB-based deep learning for human pose estimation need to be addressed in this manuscript.
The first challenge is related to privacy protection in datasets for computer vision tasks. Thus
far, a substantial body of work has been found in the literature dealing with the investigation of
various approaches, such as face blurring, to protect subject privacy with minimal impact on
the performance of deep learning models (Dave et al., 2022; Fan, 2019; Frome et al., 2009;
Imran et al., 2020; Nam Bach et al., 2022; Ren et al., 2018; Ribaric et al., 2016; Sazonova et
al., 2011; Tomei et al., 2021; Yang et al., 2022; Zhu et al., 2020). Notably, in (Frome et al., 2009), a large-scale face detection and blurring algorithm was proposed. However, the authors
did not quantify the impact of this anonymization on any computer vision task. In another work
(Ren et al., 2018), the authors explored the feasibility of employing adversarial training methods
to remove privacy-sensitive features from faces while minimizing the impact on action
recognition. Nevertheless, the authors did not analyze the statistical significance of this impact.
In (Tomei et al., 2021), a performance reduction for video action classification due to face
blurring was quantified and a generalized distillation algorithm was developed to mitigate this
effect. Similarly, in the work from (Dave et al., 2022), the researchers developed a self-supervised framework for action recognition to eliminate privacy information from videos with
no need for privacy labels. In (Yang et al., 2022), work has been conducted using a large-scale
dataset to examine more comprehensively the effect of face blurring on different computer
vision tasks. The authors first annotated and blurred human faces in ImageNet (Deng et al.,
2009). Afterward, the authors benchmarked several neural network models using the face-
blurred dataset to examine the effect of face-blurring on the recognition task. Eventually, they
studied the feature transferability of these models on “object recognition, scene recognition,
object detection, and face attribute classification”, with the models pre-trained on the
original/face-blurred dataset. It has been found that, in the vision tasks above, face blurring did
not degrade the performance of recognition results. In close connection with our study, a facial
swapping technique has been applied using videos of patients with Parkinson’s disease, and 2D
human pose estimation was performed in (Zhu et al., 2020). It is concluded that facial swapping
keeps the 2D keypoints almost invariant. This study, however, has a major limitation: it included only two subjects. No reported work has been found dealing with the effect of
face-blurring on the 3D multi-view human pose estimation in the open literature.
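To fix ideas, face blurring itself reduces to detecting a face region and replacing it with a low-pass-filtered version. A minimal sketch in Python/OpenCV, where a stock Haar-cascade detector stands in for the stronger detectors used in the studies cited above:

```python
import cv2

def blur_faces(image, kernel=(51, 51)):
    """Detect frontal faces and Gaussian-blur them in place (BGR image)."""
    cascade = cv2.CascadeClassifier(
        cv2.data.haarcascades + "haarcascade_frontalface_default.xml")
    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
    for (x, y, w, h) in cascade.detectMultiScale(gray, scaleFactor=1.1,
                                                 minNeighbors=5):
        roi = image[y:y + h, x:x + w]
        image[y:y + h, x:x + w] = cv2.GaussianBlur(roi, kernel, 0)
    return image
```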
The deep learning-based human pose estimation necessitates the acquisition of sufficient and
accurate datasets. The latter also provides a fair assessment of different algorithms. Although
high-quality 2D human pose datasets are widely accessible in the literature, obtaining accurate
3D datasets for human pose estimation is challenging. This is because it necessitates the use of
motion capture systems, like wearable IMUs or marker-based optical systems, within controlled
lab environments. In this section, we focus on the review of the most widely used datasets for
deep learning-based 3D human pose estimation, as illustrated in Figure 2-6.
The most widely-used dataset for 3D human pose estimation is the Human3.6M obtained from
monocular images and videos, introduced by (Ionescu et al., 2014). It is acquired by
recording the performance of 5 female and 6 male subjects with a diverse set of motions and
poses encountered as part of typical human activities under 4 different viewpoints. It contains
3.6 million 3D human poses with 3D ground truth annotation, which were captured by an
accurate marker-based MoCap system. However, despite the high number of human poses, the
limitation of the Human3.6M dataset is that it contains motion data from a small number of subjects and includes only limited actions, such as running, walking, and sitting.
The CMU dataset was introduced by (Joo et al., 2015) for 3D human pose estimation from
monocular RGB videos. It was captured by using a large-scale motion capture system consisting
of 480 VGA cameras covering multiple people engaged in social activities such as walking,
running, and dancing.
The MPI-INF-3DHP dataset, proposed by (Mehta et al., 2017), consists of both indoor and
complex outdoor scenes. 8 actors performing 8 activities were recorded with a commercial
marker-less MoCap system. It consists of more than 1.3 million frames captured from the 14
camera views. Later, in the study by (Mehta et al., 2018), researchers introduced a new 8000-
frame dataset called MuPoTS-3D. It contains a multi-person test set and its ground-truth poses
in 5 indoor and 15 outdoor scenes. Challenging samples with occlusions, drastic illumination
changes, and lens flares were introduced in the new dataset.
Despite being very rich with regard to the number of gestures and images, the above open-
access posture datasets available from the computer vision community primarily focus on the
pose and motion of daily activities and are less suited to human pose estimation for specific workplace applications. Furthermore, the human keypoints in the datasets do not conform to the
strict biomechanical definitions. Therefore, the acquisition of a biomechanically meaningful
dataset suitable for the industrial environment is the second challenge of the study. In response
to this limitation, in (Mehrizi et al., 2019), the authors introduced a lifting dataset consisting of
12 subjects performing various types of lifting tasks captured by a marker-based motion capture
system. In the study by (Li et al., 2020b), researchers presented the MOPED25 dataset that
includes full-body kinematics data and the synchronized videos of 11 participants, performing
commonly seen tasks at workplaces. Nevertheless, the images of this dataset were captured from only one camera viewpoint, making it difficult to address the occlusion problem. The ENSAM dataset, presented by (Vafadar et al., 2021), contains twenty-two asymptomatic adults, one adult with scoliosis, one adult with spondylolisthesis, and seven children with bone disease, each performing ten walking trials. It was captured both by a multi-view stereovision system and by a reference system combining a marker-based motion capture system and a medical imaging system (EOS). However, this dataset was intended only for gait studies and is not applicable to lifting/lowering movements.
In this manuscript, we focus on the study of lumbar spine loads, with specific attention to the
intersegmental forces/moments at the L5-S1 joint. The L5/S1 joint load is considered to be one of
the most critical loads to evaluate MSDs risk (Mehrizi et al., 2017). Inter-segmental forces are
the net forces between two adjacent segments of the human body, which can be obtained from
inverse dynamics calculations. In inverse dynamics calculations, the human body is generally
modeled as a kinetic chain of rigid body segments, where the kinematic information can be
obtained based on the 3D position of each segment at every instant. Subsequently, the inter-
segmental force can be calculated once the inertia parameters of each segment are determined.
The estimation of the body segment inertial parameters is fundamental, and the classical method
proposed by (De Leva, 1996) is usually employed to obtain the inertial information of each
segment, such as length, mass, the center of mass, and inertia matrix, leveraging the information
about the joint position, the subject's gender, and the total body mass as inputs. However, this
method does not take into account the personalized anthropometric measurements of the subject's body, making it difficult to obtain accurate parameters for biomechanical analysis.
To address this issue, in (Pillet et al., 2010), the authors proposed to build a 3D personalized
human model of the subject based on RGB images and calculate the inertial parameters of each
segment accordingly.
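For illustration, the proportional approach reduces to table lookups plus linear interpolation between joint centers. A minimal sketch with approximate male mass fractions in the spirit of De Leva (1996); the published tables also provide per-gender COM ratios and radii of gyration, omitted here, and the exact values should be taken from the original paper:

```python
import numpy as np

# Approximate (illustrative) male segment mass fractions.
SEGMENT_MASS_FRACTION = {
    "head": 0.069, "trunk": 0.435, "upper_arm": 0.027, "forearm": 0.016,
    "hand": 0.006, "thigh": 0.142, "shank": 0.043, "foot": 0.014,
}

def segment_mass(total_mass_kg, segment):
    """Segment mass from total body mass via a statistical fraction."""
    return SEGMENT_MASS_FRACTION[segment] * total_mass_kg

def segment_com(proximal, distal, com_ratio):
    """COM located at `com_ratio` of the way from the proximal
    to the distal joint center."""
    proximal, distal = np.asarray(proximal), np.asarray(distal)
    return proximal + com_ratio * (distal - proximal)
```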
Two main paradigms exist to calculate the L5/S1 loads: top-down and bottom-up models. Both paradigms terminate the load calculation at the L5/S1 joint. The difference is that, in the top-down model, the calculation starts from the head, while in the bottom-up model, it starts from the feet. For both paradigms, external forces are necessary to initiate the calculations. In the top-down model, the external force can be estimated from the kinematic information of the box. In contrast, in the bottom-up model, the external forces/moments applied on the feet by the ground need to be measured with force plates (Mehrizi et al., 2019).
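Both strategies iterate the same per-segment Newton-Euler balance until reaching L5/S1. A minimal sketch of one such step, assuming a z-up world frame and segment kinematics already expressed in that frame (all names are illustrative placeholders, not from the thesis):

```python
import numpy as np

G = np.array([0.0, 0.0, -9.81])  # gravity (m/s^2); z-up frame assumed

def newton_euler_step(m, a_com, I, alpha, omega,
                      f_dist, m_dist, r_prox, r_dist):
    """One per-segment step of a recursive inverse dynamics chain.

    m, I:           segment mass and inertia tensor at the COM
    a_com:          linear acceleration of the segment COM
    alpha, omega:   angular acceleration/velocity of the segment
    f_dist, m_dist: force/moment transmitted at the distal joint
                    (from the previously processed segment)
    r_prox, r_dist: vectors from the COM to the proximal/distal joints
    Returns the force and moment at the proximal joint.
    """
    f_prox = m * a_com - m * G - f_dist                # Newton
    m_prox = (I @ alpha + np.cross(omega, I @ omega)   # Euler about COM
              - m_dist
              - np.cross(r_prox, f_prox)
              - np.cross(r_dist, f_dist))
    return f_prox, m_prox
```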
The common solution in the literature is the proportional method, which calculates the BSIPs
for each body segment based on statistics. One of the most widely used methods is the one
proposed in (De Leva, 1996), taking the subject's gender and body mass as inputs and yielding each segment's mass, center-of-mass position, and inertia tensor with respect to the central principal axes of inertia. However, individual differences are not considered
in this type of method. We argue that the inter-segmental joint loads calculated with the inertial
parameters derived from such methods cannot achieve sufficient accuracy to serve as ground
truth values. Therefore, in this thesis, we use the acquired biplanar radiographic images to build
an accurate personalized 3D model of the human body for each subject and use it to calculate
BSIPs. This has the potential to improve the accuracy of the ground truth values of human
intersegmental loads and make a valuable contribution to the evaluation of the joint load
calculation.
The aim of this manuscript is to investigate the potential of utilizing RGB cameras to achieve
precise motion capture of workers in factories and to create a musculoskeletal model utilizing
the resulting quantitative motion data. This musculoskeletal model can then be used to estimate
joint loads at L5S1 and perform ergonomic analysis during lifting/lowering tasks. Based on the
literature review, it is concluded that the following challenges remain for biomechanical
analysis of workers in factories using computer vision,
24
State of the art
1. For almost all computer vision datasets, the images used to train the AI models were not subjected to face blurring, thus failing to protect the subjects' privacy. Due to
privacy concerns, dataset sharing is limited since the images are normally not fully
anonymized, resulting in difficulties to integrate the corresponding datasets to train better
performing neural network models. In addition, the installation of cameras in a factory
environment requires addressing privacy issues in the first place.
2. There is a lack of extensive and accurate multi-view datasets of workers for lifting/lowering
movements in industrial environments. Camera occlusion is a serious problem during the
motion capture for workers in factories. Most of the datasets in the literature are of
movements in daily life; only a few works have collected datasets of workers' movements, and these normally use only 1-2 camera views. Moreover, the
annotation of human keypoints of the datasets does not follow the strict biomechanical
definition.
3. It is difficult to calculate the accurate BSIPs of the worker. The proportional method,
calculating the BSIPs for each body segment based on statistics, does not take into account
individual differences. The inter-segmental joint loads calculated with the inertial
parameters derived from such methods cannot achieve sufficient accuracy to serve as
ground truth values.
Therefore, in response to the challenges summarized above, our specific research objectives are as follows:
1. Face blurring can be an effective solution to protect workers' privacy, allowing, to a certain extent, the installation of cameras in factories with less privacy risk and facilitating the sharing and integration of datasets, which can accelerate the development of ergonomic
applications. In this study, the effect of face blurring on human pose estimation will be
explored first, upon which the subsequent research will be carried out.
2. A stereo vision system consisting of four RGB cameras will be employed to acquire a high-
precision multi-view human pose estimation dataset to address the occlusion problem.
Besides, we will collect biplanar radiographs of each subject in our experiments, and all the markers will be placed on the subjects by a professional surgeon to improve the accuracy of the joint positions and to guarantee the biomechanical significance of the keypoint annotations on the human body; thus, the image dataset will be more suitable for biomechanical calculations and ergonomic analysis.
3. The acquired biplanar radiographic images will be used to build an accurate personalized 3D model of the human body for each subject, which will then be used to calculate BSIPs. This would
significantly improve the accuracy of the ground truth values of human intersegmental loads
and contribute greatly to the evaluation of the joint load calculation.
In this chapter, we presented a literature review that allowed us to identify the scientific challenges that guided our scientific approach, namely, privacy protection associated with images or
recorded videos, accurate motion capture in the presence of occlusion, and building MSK
models with a markerless system. In this document, the next three chapters deal with these issues: the effect of face blurring on human pose estimation is investigated in Chapter 3; in Chapter 4, a multi-view face-blurred image dataset for lifting/lowering tasks is collected and the corresponding human pose estimation is presented; in Chapter 5, the joint load at L5-S1 is estimated through inverse dynamics calculations based on the multi-view markerless motion capture system; finally, we compare the results of the MSK model to RULA in Chapter 6.
Chapter 7 draws pertinent conclusions and presents future work. Figure 2-7 illustrates the
structure of this thesis.
[Figure 2-7. Thesis structure. Motion Capture: Chapter 3 (Effect of face blurring on human pose estimation) and Chapter 4 (Human pose estimation for lifting/lowering). Motion Analysis: Chapter 5 (Estimation of intersegmental load at L5-S1) and Chapter 6 (Application in ergonomic analysis).]
In this manuscript, the effect of face blurring of images on machine-learning human
pose estimation and the subsequent kinematic analysis is studied. This work is motivated by the
fact that protecting privacy through face blurring has become increasingly important,
yet current computer vision techniques are typically trained and validated on raw
(unblurred) datasets. The impact of face blurring on the accuracy of human pose estimation
with neural networks is not well understood in the literature.
Toward this end, we blurred the subjects' faces in the collected image dataset and
then trained our neural network model using both the face-blurred and the raw datasets. We
evaluated the performance of the neural networks in terms of landmark localization and joint
angle estimation on both blurred and unblurred testing datasets. Our study
shows that face blurring does not significantly degrade the prediction performance of the
deep neural networks while preserving the privacy of the subjects in the collected datasets.
Effect of Face Blurring on Human Pose Estimation
Abstract: The face blurring of images plays a key role in protecting privacy. However, in
computer vision, especially for the human pose estimation task, machine-learning models are
currently trained, validated, and tested on original datasets without face blurring. Additionally,
the accuracy of human pose estimation is of great importance for kinematic analysis. This
analysis is relevant in areas such as occupational safety and clinical gait analysis where privacy
is crucial. Therefore, in this study, we explore the impact of face blurring on human pose
estimation and the subsequent kinematic analysis. Firstly, we blurred the subjects' heads in the
image dataset. Then we trained our neural networks using the face-blurred and the original
unblurred dataset. Subsequently, the performances of the different models, in terms of landmark
localization and joint angles, were evaluated on blurred and unblurred testing data. Finally, we
examined the statistical significance of the effect of face blurring on the kinematic analysis
along with the strength of the effect. Our results reveal that the strength of the effect of face
blurring was low and within acceptable limits (<1°). We have thus shown that for human pose
estimation, face blurring guarantees subject privacy while not degrading the prediction
performance of a deep learning model.
3.1 Introduction
Human pose estimation is a highly important task in the field of computer vision. The focus of
human pose estimation is the calculation of human body keypoint coordinates based on images.
Combined with kinematic analysis, it has the potential for many applications in different fields,
e.g., ergonomics (Bortolini et al., 2018; Haggag et al., 2013; Li et al., 2020b; Mehrizi et al.,
2019) or orthopedics (Vafadar et al., 2022, 2021). In vision-based human pose estimation tasks,
the datasets used for training and testing models often consist of images where the human face
is clearly visible. In applications, this fact raises a significant privacy problem. For instance, in
ergonomics, vision-based human pose estimation can help workers prevent musculoskeletal
disorders (Abobakr et al., 2019; Bortolini et al., 2018; Halim and Radin Umar, 2018; Malaise
et al., 2019; Plantard et al., 2017c). However, installing cameras in factories to capture workers'
motion leads to significant privacy concerns and workers might legitimately reject this tool. As
a result, addressing the privacy issues of computer vision datasets is an essential task.
Fortunately, simple face blurring could solve most of these privacy concerns. However, the
effect of face blurring on human pose estimation and subsequent kinematic analysis is unclear.
Preservation of subject privacy in datasets for computer vision tasks is an emerging research
topic. So far, few studies have investigated approaches, such as face blurring, to preserve
subject privacy with a minimal impact on the performance of deep learning models (Dave et
al., 2022; Fan, 2019; Frome et al., 2009; Imran et al., 2020; Nam Bach et al., 2022; Ren et al.,
2018; Ribaric et al., 2016; Sazonova et al., 2011; Tomei et al., 2021; Yang et al., 2022; Zhu et
al., 2020). Specifically, in (Frome et al., 2009), the authors proposed a large-scale face detection
and blurring algorithm but did not quantify the impact of this anonymization on any computer
vision task. In another work (Ren et al., 2018), the possibility of using adversarial training
methods was explored to remove privacy-sensitive features from faces while minimizing the
impact on action recognition. However, the statistical significance of this impact was not
analyzed. In (Tomei et al., 2021), the authors quantified a performance reduction for video
action classification due to face blurring, and also proposed a generalized distillation algorithm
to mitigate this effect. Similarly, a self-supervised framework was proposed in (Dave et al.,
2022) for action recognition to eliminate privacy information from videos with no need for
privacy labels. In (Yang et al., 2022), research has been carried out using a large-scale dataset
to examine more comprehensively the effect of face blurring on different computer vision tasks.
The authors first annotated and blurred human faces in ImageNet (Deng et al., 2009). Afterward,
they benchmarked several neural network models using the face-blurred dataset to examine the
effect of face blurring on the recognition task. Finally, with the models pre-trained on the
original/face-blurred dataset, they studied the feature transferability of these models on “object
recognition, scene recognition, object detection, and face attribute classification”. The results
of the experiment suggested that, in the vision tasks above, face blurring did not cause a
significant loss of accuracy. In close connection with our study, a facial swapping technique
has been applied using videos of patients with Parkinson’s disease and 2D human pose
estimation was performed in (Zhu et al., 2020). The authors concluded that facial swapping
keeps the 2D keypoints almost invariant, but this study was limited to only two subjects.
In summary, previous works have focused on the tasks of classification, action recognition, or 2D
human pose estimation in videos. Thus far, the effect of face blurring on 3D human pose
estimation and, more importantly, subsequent kinematic analysis has not yet been investigated
on a consistent cohort. Considering their importance in biomechanical and ergonomic domains,
we study the statistical significance as well as the strength of the effect of face blurring on 3D
human pose estimation and kinematic analysis in this paper. Our contribution consists of three
main parts. First, to the best of our knowledge, this study is the first one focusing on the effect
of face blurring on multi-view 3D human pose estimation. Second, based on the 3D keypoint
coordinates obtained from the human pose estimation, we calculated joint kinematics and
analyzed the eventual impact of face blurring. Third, both statistical significance and strength
were calculated to more comprehensively evaluate the effect of face blurring on the
performance of a deep learning model.
In this study, we first performed subject face blurring on an image dataset acquired in our
previous gait study (Vafadar et al., 2021). Then, using different training strategies, we obtained
distinct deep learning models and tested the performance of each model on either the face
blurring or the original dataset, thus examining the effect of face blurring. Figure 3-1 outlines
the research flow of the work in this paper.
Figure 3-1. Research flow of the study: the original dataset is processed with Gaussian blur to obtain the face-blurred dataset; models #1, #2, and #3 are trained on these datasets and tested on the joint positions, which are then compared through ANOVA tests and the maximum variation Δ.
3.2.1 Dataset
The dataset used in this research was a multi-view human gait dataset, namely the ENSAM
dataset, collected in a previous study (Vafadar et al., 2021). The dataset contained a total of 43
subjects (19 females and 24 males; age range: 6–44 years; weight: 56.0 ± 20.7 kg; height: 159.2
± 21.5 cm), who were split into a training set of 27 subjects (14 females and 13 males; age
range: 8–41 years; weight: 54.0 ± 20.2 kg; height: 157.7 ± 21.8 cm) and a test set of 16 subjects
(5 females and 11 males; age range: 6–44 years; weight: 59.6 ± 21.8 kg; height: 161.8 ± 21.3
cm). In the training set, 14 subjects were asymptomatic adults (≥18 years), one adult had
scoliosis, and 12 minors (<18 years) suffered from X-linked hypophosphatemia (XLH). In the
test set, 8 adults were asymptomatic, one adult had spondylolisthesis, and 7 children had XLH
(Vafadar et al., 2022, 2021). The dataset comprised a total of 120,293 frames,
each containing four images from four calibrated and synchronized cameras (GoPro Hero 7
Black). Participants in the datasets were instructed to complete several walking trials at their
own self-chosen pace, with both marker-less and marker-based motion capture systems
recording their movements. The 3D positions of 51 markers attached to the body of subjects
were captured by a marker-based motion capture system (VICON system, Oxford Metrics, UK).
The camera parameters of four cameras and biplanar radiographs acquired by the X-ray system
(EOS system, EOS imaging, Paris, France) were also collected. With the help of the markers’
3D positions acquired from the Vicon system along with the 3D reconstructions of lower limbs
from bi-planar radiographs (Chaibi et al., 2012), the 3D coordinates of 17 keypoints were
annotated on the human body. As shown in Figure 3-2, the keypoints were, namely: head (H),
neck (N), shoulders ($S_R$, $S_L$), elbows ($E_R$, $E_L$), wrists ($W_R$, $W_L$), pelvis ($H_C$), hips ($H_R$, $H_L$),
knees ($K_R$, $K_L$), ankles ($A_R$, $A_L$), and feet ($F_R$, $F_L$).
Figure 3-2. (a) The human model defined in the ENSAM dataset. (b) The 17 keypoints projected
on two camera views.
With the camera parameters collected in the ENSAM dataset, the reference annotations of the
head and the neck were projected onto the corresponding images. A circle covering the subject's
face was drawn based on these projections. The pixels inside the circle were blurred using a
Gaussian blur. A Gaussian kernel size of 25 × 25 was carefully selected so that faces of different
sizes could all be blurred properly. The standard deviation of the kernel, determined
automatically from the kernel size by OpenCV ("OpenCV," 2021), was 4.1.
this method, face blurring of the images of all 43 subjects in the dataset was performed. Figure
3-3 shows some example images after face blurring.
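To make this procedure concrete, the following minimal sketch reproduces the blurring step described above with OpenCV and NumPy. The function name, the radius heuristic derived from the head-neck distance, and the radius_scale parameter are illustrative assumptions, not the exact implementation used in this work.

```python
import cv2
import numpy as np

def blur_face(image, head_px, neck_px, ksize=25, radius_scale=1.0):
    """Blur a circular region around the projected head keypoint.

    head_px, neck_px: 2D pixel coordinates of the projected head and neck
    annotations. The circle radius derived from the head-neck distance and
    radius_scale is an illustrative choice, not the thesis's exact rule.
    """
    head = np.asarray(head_px, dtype=float)
    neck = np.asarray(neck_px, dtype=float)
    radius = int(radius_scale * np.linalg.norm(head - neck))
    # Binary mask of the circular face region
    mask = np.zeros(image.shape[:2], dtype=np.uint8)
    cv2.circle(mask, (int(head[0]), int(head[1])), radius, 255, thickness=-1)
    # sigma=0 lets OpenCV derive the standard deviation from the kernel size:
    # 0.3*((ksize-1)*0.5 - 1) + 0.8 = 4.1 for ksize = 25
    blurred = cv2.GaussianBlur(image, (ksize, ksize), 0)
    out = image.copy()
    out[mask == 255] = blurred[mask == 255]
    return out
```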
Figure 3-3. Example images of the ENSAM Pose dataset after face blurring.
As in (Vafadar et al., 2021), the 3D human pose estimation algorithm applied in this paper was
the learnable triangulation algorithm proposed by (Iskakov et al., 2019b). The algorithm
consists of two main parts, namely, 2D and 3D human pose estimation. The 2D human pose
estimation was performed for each camera view, and subsequently, the information from all
views was fused to derive the 3D coordinates of keypoints of the human body using a trainable
triangulation approach. Two approaches were proposed in the article, i.e., the algebraic and the
volumetric triangulation, where the latter required the pelvis position to be estimated by the
former.
Three training experiments were conducted, namely, #1, #2, and #3. The model training setup
is summarized in Figure 3-1. In the experiments, original images or face-blurred images were
utilized as the training set, with the initial weights of the network being those provided in (Iskakov
et al., 2019b) or those acquired from the training of experiment #3. It is worth mentioning that
the weights provided in (Iskakov et al., 2019b) were obtained by training the network on the
Human3.6M dataset (Ionescu et al., 2014). Regarding the number of training epochs, in
experiments #1 and #3, the algebraic and the volumetric modules were trained for 50 and 30
epochs, respectively, whereas in experiment #2, the two modules were finetuned for 30 and 20
epochs, respectively. At the end of each training epoch, we recorded the current model
performance and the network weights. After the model training reached the number of epochs
described above, in each experiment, the epoch with the minimum error on the test set was
selected for the subsequent inference. The number of training epochs for the selected models
is listed in Figure 3-1. Three models were obtained from the experiments, where models
#1 and #2 were the experimental models, and model #3 was the control model. The training
was performed using the Adam optimizer, with learning rates of 10⁻⁵ and 10⁻⁴ for the algebraic
and the volumetric network, respectively. The training and evaluation of the neural network
during the experiments were performed on a Linux server running Ubuntu 20.04.1 LTS 64 bits.
The machine had an AMD Ryzen 9 3900X 12-core processor and 125 GB of RAM. It
was equipped with two Nvidia TITAN RTX GPUs with 24 GB of RAM each, one of which was
employed in this study.
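For illustration, the optimizer configuration described above can be reproduced with the following sketch; the placeholder modules merely stand in for the algebraic and volumetric triangulation networks of (Iskakov et al., 2019b), which are far larger.

```python
import torch
import torch.nn as nn

# Placeholders standing in for the algebraic and volumetric triangulation
# networks; only the optimizer setup is illustrated here.
algebraic_net = nn.Linear(10, 10)
volumetric_net = nn.Linear(10, 10)

# Adam with the learning rates used in the experiments
opt_algebraic = torch.optim.Adam(algebraic_net.parameters(), lr=1e-5)
opt_volumetric = torch.optim.Adam(volumetric_net.parameters(), lr=1e-4)
```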
For the three models obtained in the experiments, their performances for human pose estimation
and the subsequent kinematic analysis were analyzed on the original or the face-blurred test set.
The human pose was defined by the 3D coordinates of the 17 keypoints in our human model.
Using these 17 coordinates, joint angles were computed for the lower and upper extremities. For
the joint angles of the lower limbs, we followed the calculation method proposed in (Vafadar
et al., 2022). For the upper limbs, the approach employed in (Plantard et al., 2017b) was
modified to fit the human model defined in this study, and we applied it to establish the local
coordinate systems of the segments and calculate the corresponding joint angles (see Figure 3-4).
Figure 3-4. (a) Schematic diagram of the coordinate systems of the shoulders and trunk. (b) The 17
keypoints and the coordinate systems of the shoulders and trunk projected on two camera views,
where the red, green, and blue axes represent the x-, y-, and z-axes, respectively, each with a length of
20 cm in the global world coordinate system.
The coordinate system of the trunk was defined as follows: the Y-axis is the vector $\overrightarrow{H_C N}$; the X-axis is perpendicular to both the Y-axis and $\overrightarrow{H_R H_L}$; the Z-axis is then calculated from the X-axis and the Y-axis according to the right-hand rule. The origin of the coordinate system was placed at $H_C$, and all axes were normalized to unit vectors.
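The following NumPy sketch implements this construction; the sign convention chosen for the X-axis (from the cross product of the Y-axis and the right-to-left hip vector) is an assumption, as the text does not specify it.

```python
import numpy as np

def trunk_frame(HC, N, HR, HL):
    """Trunk coordinate system from 3D keypoints.

    HC: pelvis, N: neck, HR/HL: right/left hip (arrays of shape (3,)).
    Returns a 3x3 rotation matrix whose columns are the X, Y, Z axes;
    the origin is placed at HC.
    """
    y = N - HC
    y /= np.linalg.norm(y)                 # Y-axis along HC -> N
    hips = HL - HR                         # right-to-left hip vector
    x = np.cross(y, hips)                  # perpendicular to Y and HR->HL
    x /= np.linalg.norm(x)
    z = np.cross(x, y)                     # Z from X and Y (right-hand rule)
    return np.column_stack([x, y, z])
```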
The method employed in (Plantard et al., 2017b) was adopted for the computation of neck
flexion, neck side bend, and elbow flexion, as well as the definition of shoulder coordinate
systems. The rotation matrices of the shoulder coordinate systems relative to the trunk
coordinate system were calculated. As suggested by (Šenk and Chèze, 2006), ZXY
decomposition was performed to calculate the joint angles of the shoulder.
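As a concrete reference, a ZXY decomposition of a rotation matrix R = Rz(α)·Rx(β)·Ry(γ) can be written as follows; this is a generic sketch (gimbal lock is not handled), not the exact routine used in this work.

```python
import numpy as np

def euler_zxy(R):
    """ZXY Euler angles (degrees) of a rotation matrix R = Rz @ Rx @ Ry.

    With this order, R[2,1] = sin(beta), R[0,1] = -sin(alpha)cos(beta),
    R[1,1] = cos(alpha)cos(beta), R[2,0] = -cos(beta)sin(gamma),
    and R[2,2] = cos(beta)cos(gamma).
    """
    beta = np.arcsin(np.clip(R[2, 1], -1.0, 1.0))   # rotation about X
    alpha = np.arctan2(-R[0, 1], R[1, 1])           # rotation about Z
    gamma = np.arctan2(-R[2, 0], R[2, 2])           # rotation about Y
    return np.degrees([alpha, beta, gamma])
```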
3.2.4 Evaluation
The evaluation metric MPJPE (Mean Per Joint Position Error) was chosen in this paper to
examine the performance of different models by comparing the joint positions estimated via the
neural networks against the references acquired from the marker-based motion capture system.
MPJPE was computed as follows (Zheng et al., 2021a):

$$MPJPE = \frac{1}{N} \sum_{k=1}^{N} \left\| \hat{J}_k - J_k \right\| \qquad (3\text{-}1)$$
where $\hat{J}_k$ and $J_k$ denote respectively the estimated and the reference 3D position of keypoint $k$, and $N$ is the total number of keypoints. For each keypoint, the effect of face blurring was analyzed by calculating the maximum variation Δ due to face blurring, i.e., $\Delta_j = |\mu_{max} - \mu_{min}|_j$, where $\mu_{max}$ and $\mu_{min}$ are the maximum and minimum values among the three inference results estimated with the different models for joint $j$.
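In code, Eq. (3-1) and the per-keypoint variation Δ reduce to a few NumPy lines; this minimal sketch assumes the estimated and reference keypoints are stacked in arrays of shape (N, 3).

```python
import numpy as np

def mpjpe(pred, ref):
    """Eq. (3-1): mean Euclidean distance over the N keypoints between
    estimated (pred) and reference (ref) 3D positions, shape (N, 3)."""
    return np.linalg.norm(pred - ref, axis=-1).mean()

def max_variation(mu_per_model):
    """Maximum variation: |mu_max - mu_min| across the three models
    for one keypoint (mu_per_model: iterable of mean errors)."""
    return np.max(mu_per_model) - np.min(mu_per_model)
```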
To analyze the effect of face blurring on the kinematic calculations, we performed statistical tests
on the joint angle differences (with respect to the reference system) calculated using the different
models and datasets.
Since the accuracies of human pose estimation on the original and the face-blurred test sets are
matched subject by subject, a one-way repeated measures ANOVA test was used to examine the effect
of face blurring. In this test, we had one within-subject factor, namely the experiment setup, which
had three levels, i.e., experiments #1, #2, and #3. The root mean square error (RMSE) of the
joint angle of each subject was used as the dependent variable. We assumed that our data were
consistent with the assumption of sphericity since the different joint angle calculations did not
affect each other.
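Such a test can be run, for instance, with statsmodels; the sketch below uses hypothetical RMSE values in long format, one row per subject and experiment level.

```python
import pandas as pd
from statsmodels.stats.anova import AnovaRM

# Hypothetical example data: one joint-angle RMSE per subject and setup
df = pd.DataFrame({
    "subject": ["s01"] * 3 + ["s02"] * 3 + ["s03"] * 3,
    "experiment": ["#1", "#2", "#3"] * 3,
    "rmse": [4.8, 5.4, 4.9, 5.1, 5.6, 5.0, 4.6, 5.2, 4.7],
})

# One within-subject factor (experiment setup) with three levels
res = AnovaRM(df, depvar="rmse", subject="subject",
              within=["experiment"]).fit()
print(res)  # reports the F statistic and p-value for the setup effect
```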
Finally, to further evaluate the strength of these effects, we quantitatively investigated the
variations in joint angle estimation using different models. As a measure of the strength of the
effect, the maximum variation Δ of angle estimation due to face blurring was defined for each
joint j, i.e.,
$$\Delta_j(RMSE) = \left| RMSE_{max} - RMSE_{min} \right|_j \qquad (3\text{-}2)$$
where $RMSE_{max}$ and $RMSE_{min}$ were respectively the maximum and minimum values among all the inference results estimated with the different models for joint angle $j$, and $SD_{max}$ and $SD_{min}$ are the corresponding values for the standard deviations of the differences.
3.3 Results
Joint localization performances from the above experiments are shown in Table 3-1, where the
mean and standard deviation of the prediction errors for each joint are reported. Notably, the
differences of 3D keypoint coordinates in the lower extremities (ankles, knees, hips, and pelvis)
were lower than those in the upper extremities (neck, head, wrists, elbows, and shoulders). The
range of variation in the average difference of all joints was less than 1 mm (see MPJPE column
in Table 3-1). Experiment #3 (on original images) and experiment #1 (on face-blurred images)
achieved comparable performance (MPJPE = 13.3 mm vs. 13.0 mm). On the other hand, the
performance of model #2, which was first trained on the original images and then fine-tuned on
the face-blurred images, did not show any improvement (MPJPE = 13.4 mm).
Table 3-1. Inference results with the models trained on blurred/original images. μ: mean value of joint position differences with the reference Vicon system (mm). σ: standard deviation (mm).

        Feet   Ank.ᵃ  Knees  Hips   Pelvis Neck   Head   Wrists Elbows Sho.ˢ  MPJPE
Inference on face-blurred images, with model #1, trained on face-blurred images
μ       10.1   7.2    11.0   15.8   12.7   11.0   11.1   14.7   15.2   21.7   13.0
2σ      13.6   9.9    11.2   15.0   11.0   11.3   11.9   23.2   19.9   24.1   /
Inference on face-blurred images, with model #2, finetuned on face-blurred images
μ       9.7    7.2    10.7   16.3   12.9   10.4   13.3   15.4   15.4   22.3   13.4
2σ      13.0   9.9    10.5   15.8   11.5   11.6   14.1   34.2   21.4   38.7   /
Inference on the original images, with model #3, trained on the original images
μ       9.8    7.1    11.0   16.4   13.1   10.0   13.8   15.0   15.4   21.8   13.3
2σ      13.1   10.2   12.2   17.0   11.7   11.4   14.3   27.7   24.4   36.0   /
Variation between the maximum and minimum values of μ
Δ       0.4    0.1    0.3    0.6    0.4    1.0    2.7    0.7    0.2    0.6    0.4
ᵃ ankles; ˢ shoulders.
The keypoints with the highest variations were the head and neck, with a Δ of 2.7 mm and 1.0
mm, respectively. For the head, the best- and worst-performing settings were #1 (μ = 11.1) and
#3 (μ = 13.8), respectively. Unlike the head, #1 (μ = 11.0) and #3 (μ = 10.0) were the worst-
and best-performing settings for the neck, respectively.
Figure 3-5 presents an overview of the joint angle estimation differences (the RMSE values being
calculated for each subject in the test set). For all the models, knee flexion, femur abduction,
and femur flexion were the angles estimated with the smallest differences to the reference.
On the other hand, elbow flexion, neck flexion, and shoulder flexion were the angles with the
largest differences. Nevertheless, for neither the upper limbs nor the lower limbs was any variation
larger than 5° observed in the differences of joint angles estimated with the different models.
The significance of the effects of face blurring on the angle estimation differences was then
assessed by statistical tests.
Figure 3-5. Distribution of joint angle differences (RMSEs of the frames for each subject in the
test set; boxes: 75th/25th percentiles). The whiskers indicate the minimum and maximum values
of the data, except for the outliers marked with diamond-shaped black dots.
The statistical test results shown in Table 3-2 are the p-values of the one-way repeated measures
ANOVA tests for each joint angle. In the ANOVA test, our null hypothesis was that there was
no variation in the mean value of the angle calculation RMSE across the different experiments.
Therefore, when the p-value was lower than 0.05, we rejected the null hypothesis, whereby the
effect of face blurring on the corresponding joint angle calculation was considered statistically
significant. From the results, no statistically significant variation was found for any joint angle
calculation except one: elbow flexion was affected with statistical significance (p-value =
0.040) by the different experiment setups.
Table 3-2. Results (p-value) of one-way repeated measures ANOVA tests for the effect of face
blurring on kinematics calculation. Numbers with underscores suggest significant effects (p-
value < 0.05).
Joint Angles p-value
shoulder flexion 0.259
shoulder abduction 0.338
elbow flexion 0.040
neck side bend 0.237
neck flexion 0.896
hip abduction 0.895
knee flexion 0.320
femur flexion 0.931
femur abduction 0.217
pelvis abduction 0.756
ankle flexion 0.106
We have presented the statistical significance of the effect of face blurring on the joint angle
calculation using ANOVA tests. To further investigate the strength of the effect of face blurring,
variations of joint angle computation were quantified.
Table 3-3 provides the root mean square values and standard deviations of the errors of joint
angles estimated on the face-blurred and original test sets via the models obtained from the
experiments. Most RMSE values did not exceed 5°. Only shoulder flexion (RMSE: 5.5°, SD:
5.3°), neck flexion (RMSE: 5.2°, SD: 5.2°), and elbow flexion (RMSE: 7.4°, SD: 6.7°) were
above this limit, and the maximum values came from model #2.
The maximum variations Δ of the joint angles are also presented in Table 3-3. Overall, the maximum
variation Δ of the RMSE (+SD of all the frames) was smaller than 1° for all joint angles. It is
worth pointing out that the largest variation was observed for shoulder flexion: 0.6° (0.6°).
Table 3-3. The RMSE (+SD of all the frames) of joint angles calculated with the joints’ position
from the models trained on blurred/original images (°).
Joint Angles | Model #1: trained on blurred images, inference on blurred images | Model #2: finetuned on blurred images, inference on blurred images | Model #3: trained on original images, inference on original images | Δ
shoulder flexion 4.9 (4.7) 5.5 (5.3) 4.9 (4.7) 0.6 (0.6)
shoulder abduction 2.9 (2.9) 3.1 (3.1) 3.1 (3.1) 0.2 (0.2)
elbow flexion 7.2 (6.5) 7.4 (6.7) 7.1 (6.4) 0.3 (0.3)
neck side-bend 2.7 (2.2) 2.6 (2.2) 2.6 (2.2) 0.1 (0.0)
neck flexion 5.1 (5.1) 5.2 (5.2) 5.2 (5.2) 0.1 (0.1)
hip abduction 3.2 (3.1) 3.2 (3.1) 3.2 (3.1) 0.0 (0.0)
knee flexion 2.6 (2.6) 2.6 (2.6) 2.6 (2.6) 0.0 (0.0)
femur flexion 1.8 (1.8) 1.8 (1.8) 1.8 (1.8) 0.0 (0.0)
femur abduction 1.2 (1.2) 1.3 (1.3) 1.3 (1.3) 0.1 (0.1)
pelvis abduction 2.6 (2.5) 2.6 (2.6) 2.6 (2.5) 0.0 (0.1)
ankle flexion 4.7 (4.6) 5.0 (4.9) 4.8 (4.7) 0.3 (0.3)
3.4 Discussion
In this study, we aimed to assess the significance of the effect of face blurring on landmark
localization performance and on the subsequent kinematic analysis. To that end, a
comparison was conducted between a control model (#3: trained and evaluated on unblurred images)
and models trained or finetuned (#1 and #2) and evaluated on blurred images.
Concerning keypoint localization, regardless of the experiment, the errors in the
upper extremities (neck, head, wrists, elbows, and shoulders) were larger than those in the lower
extremities (ankles, knees, hips, and pelvis). One possible reason is that the annotations of the
lower-limb keypoints in the training set were refined with the 3D reconstructions from the
biplanar radiographs. Another important reason is that arm movements vary more between
subjects, so the algorithm is less robust when predicting the positions of the upper-limb keypoints.
Regarding the average performance, the MPJPE of experiment #1 was comparable to that of
experiment #3. For experiment #2, the performance decreased marginally (the MPJPE increased
slightly), indicating that the model can be trained directly on the face-blurred images without
pre-training on the original images. As expected, the keypoints most impacted by face blurring
were the head and the neck. Surprisingly, the head localization showed lower differences with
model #1 than with the control model #3. On the other hand, the other keypoints were impacted
almost negligibly (the maximum average variation was 0.7 mm).
In addition to the experiments presented in this paper, we also evaluated the performance of the
three models for all possible combinations with both blurred/original test sets. As expected, we
found that when the training data were of different types than the test data, keypoint localization
performance slightly decreased.
One-way repeated measures ANOVA tests revealed that elbow flexion was statistically
significantly affected, whereas for all the other joint angles no statistically significant effect of
the different experimental settings on the kinematic calculations was observed.
However, whether the strength of these effects was within acceptable limits needed to be
analyzed quantitatively. As stated in (McGinley et al., 2009), a joint angle estimation is
“regarded as reasonable” if the difference is less than 5°. Therefore, in this paper, 5° was
adopted as an acceptable difference. In other words, the joint angle estimation was considered
acceptable when the difference between the angle estimated with the marker-less and the
marker-based motion capture systems was less than 5°.
We have demonstrated in Table 3-3 that the maximum variation of the RMSE of the joint angles was
negligible (0.6°), implying that there was little difference between the central values of the joint
angle confidence intervals. Meanwhile, RMSE ± 1.96×SD is the 95% confidence interval of
the joint angle estimate; therefore, we can consider that face blurring does not have a strong
impact on the kinematic analysis of a joint angle if the maximum variation Δ of the SD of this
angle is less than 5°/(1.96×2), i.e., 1.27°. In our results, only slight variations Δ of SD
(less than 1°) were observed for most of the joint angles. Even so, closer inspection of the results
showed that, in accordance with our observations in the previous analyses, the most affected angle
was shoulder flexion, with a variation of 0.6°. Even for elbow flexion, the calculation of
which was deemed to be impacted with statistical significance, the maximum variation
Δ of SD was less than 1°, demonstrating that the impact of face blurring on the calculation
of this angle is still acceptable.
There are also several limitations in the present study. The dataset used in the research was a
gait dataset, and other types of movements were not investigated. Most subjects in our dataset
wore face masks because of the COVID-19 pandemic. Moreover, a single camera
setup was investigated. It would therefore be interesting to examine the effect of face blurring on
other datasets including other motions (Ionescu et al., 2014; Li et al., 2020b), unmasked subjects,
and different camera setups.
3.5 Conclusions
In this study, we present the first comprehensive investigation of the effect of face blurring on
3D human pose estimation. We have performed subject face blurring on an image dataset
acquired in a previous gait study and investigated the impact of face blurring on human pose
estimation and the subsequent kinematic analysis. Following this, we examined the statistical
significance of the effects of face blurring on joint angle calculations with a further analysis of
the strength of these effects. The results show that training the model on face-blurred images
does not have a large impact on the performance of the model. The effects of face blurring are
not found statistically significant on kinematic calculations for all joint angles except one
(elbow flexion; however, this effect is relatively weak and acceptable). Moreover, we can train
the neural network directly on face-blurred images without pre-training on the original images.
Our findings indicate that it is feasible to utilize face-blurred image datasets for human pose
estimation, which can effectively protect the privacy of subjects in training datasets without
loss of performance in the subsequent kinematic analysis, thus facilitating data sharing that can
accelerate the convergence of clinical or ergonomic applications.
Chapter 4 of this dissertation is based in large part on a short paper submitted for publication
in the Springer book series Lecture Notes in Computational Vision and Biomechanics,
presented by the dissertation author at the following international conference:
Having studied the effect of face blurring on the neural network's performance for human
pose estimation in Chapter 3, the following chapter focuses on human pose estimation in
lifting and lowering tasks commonly performed in the workplace. To date, there have been
few accurate lifting/lowering datasets for industrial settings in the open literature.
Therefore, a set of lifting/lowering trials was performed in our laboratory to obtain a
3D-annotated, multi-view, high-accuracy image dataset that can be used for human pose
estimation. Afterward, the subjects' faces were blurred to protect their privacy, as in Chapter 3.
The face-blurred dataset was then employed to train a neural network and identify the 3D
joint positions despite the occlusions that may occur. We also examined the effect of an unseen
cardboard box in the test set on the network's accuracy.
Human pose estimation for lifting/lowering tasks
Abstract: Computer vision-based human pose estimation has a high potential for applications
in the prevention of musculoskeletal disorders among workers. However, there is a lack of
accurate datasets of workers' motion that can be utilized for human pose estimation in
industrial environments. In this chapter, we collected a 3D-annotated, multi-view, high-accuracy
image dataset for human pose estimation in lifting/lowering trials, with all the subjects' faces
blurred to protect their privacy. Furthermore, a neural network was trained on the collected
dataset to evaluate the effect of face blurring on the network's performance for human pose
estimation. Additionally, the effect of an unseen cardboard box in the test set was subsequently
examined. An average MPJPE (mean per-joint position error) of 12.76 ± 4.94 mm was achieved
in estimating the 3D joint positions on the whole test dataset. At the same time, the results
illustrated that slight differences between the actual factory scene and the training set could
have a large impact on the model performance. Our study indicates that improving the
robustness of the model to new environments is of great importance.
4.1 Introduction
Human motion measurement systems can be generally divided into two classes: direct measurement
systems and markerless motion capture systems. Direct measurement systems, which require
sensors or markers to be attached to the human body, are often used in laboratory settings to
avoid occlusion problems and allow relatively accurate estimation of human pose. However,
in a real production system in an industrial environment, direct measurement systems would
obstruct workers' activities, which is why they are usually used as reference systems
in laboratory experiments to evaluate the accuracy of markerless motion capture systems.
Markerless motion capture systems are of two main types: depth cameras and classical cameras
(RGB cameras).
The present work focuses on the classical cameras, or RGB cameras, which are receiving more
and more attention as motion capture systems. Thanks to the development of artificial
intelligence, the accuracy of human pose estimation based on RGB images has been remarkably
improved in recent years. Various algorithms have been continuously proposed, from the initial
2D human pose estimation to the high accuracy 3D human pose estimation. Compared to single-
view 3D human pose estimation, multi-view 3D human pose estimation can achieve higher
accuracy. A great deal of previous research has proposed different models for multi-view 3D
human pose estimation (He et al., 2020; Iskakov et al., 2019a; Qiu et al., 2019; Reddy et al.,
2021). In particular, in the study of (Iskakov et al., 2019a), the researchers proposed a learnable
triangulation method, which combines a 2D backbone network with a subsequent triangulation
module for end-to-end training to obtain accurate 3D human pose estimation. Benefiting from
the rapid improvement in the accuracy of 3D human pose estimation from RGB images, RGB
cameras are also increasingly used for human motion capture in the field of biomechanics and
ergonomics (Li et al., 2020a; Mehrizi et al., 2019; Vafadar et al., 2021). In particular, in (Mehrizi
et al., 2019), the authors utilized two RGB cameras to capture the motion of a worker carrying
a heavy load (MPJPE: 14.72 ± 2.96 mm) and applied inverse dynamics to calculate the joint
forces. In the work by (Vafadar et al., 2021), the researchers used four RGB cameras to capture
human motion on the strength of learnable triangulation methods and then carried out gait
analysis based on the result of human pose estimation.
Despite significant interest in computer vision-based human pose estimation, its application in
the domain of ergonomics remains scarce because there are no adequate, highly accurate datasets
for industrial settings. Furthermore, the images are normally not anonymized, raising an
important privacy concern that needs to be addressed.
4.2.1 Participants
Twelve subjects (7 females and 5 males; 24.2 ± 2.3 years; 172.4 ± 10.1 cm; 65.9 ± 14.7 kg)
participated in the lifting/lowering experiment. The experiment was approved by the ethics
committee (Protocol 06036, Ile de France VI - Groupe Hospitalier Pitié-Salpêtrière). All
subjects signed an informed consent form. Before the experiment, a medical examination
was performed by an orthopedic surgeon to verify that there was no contraindication for the
subjects to perform the experiment.
During the experiment, 57 markers were attached to the subject's body and 11 markers were
attached to a cardboard box. Low-dose radiographs of the subjects with the attached markers were
acquired using the EOS system (EOS imaging, France). A 15-camera marker-based motion
capture system (Vicon Motion Systems Ltd, Oxford, UK) captured the precise locations of the
57 markers. The subject stood on force plates measuring the force and moment between the
subject's feet and the ground. The signal acquisition frequency was set to 100 Hz. Meanwhile,
the experiment was recorded by a multi-view vision system consisting of four cameras (GoPro
Hero 7 Black) (Vafadar et al., 2022, 2021; Jiang et al., 2022a). The cameras were fixed in pairs to
a long aluminum bar, with a stereo vision baseline of 95 cm and an angle of approximately 15°
between the pairs to increase the motion capture volume. Figure 4-1 presents the placement
of the Vicon system and the RGB video cameras; a flashing light was used for the
synchronization between these two systems. The camera mode was set to the linear
field of view, with a resolution of 1920 × 1080 at 100 frames per second.
As a special case, one of the 12 subjects carried a larger cardboard box (weight ≈ 0 kg, size:
60.0 × 40.0 × 40.0 cm). For simplicity, this subject is referred to as S01 in the subsequent sections of
this chapter.
A total of 181,624 frames were acquired in the experiment. Among them, 540 frames with
missing markers and 804 frames with camera occlusion were excluded from the datasets.
Therefore, 180,820 frames were used for model training and testing. Each frame has four
camera views with corresponding camera intrinsic and extrinsic parameters.
Using the EOS images, the 3D models of the bone segments and the attached markers were
reconstructed for the lower limbs, whereby the alignment between the anatomical frames and
the marker-based coordinates was performed and the position of the joint center in the marker-
based coordinate system was then computed (Vafadar et al., 2021). Based on the coordinates of
the markers given by Vicon, aligned with the data from EOS, we calculated the exact 3D
coordinates of 17 key points: ankles, knees, pelvis, hips, shoulders, elbows, neck, L4/L5, C0¹,
and wrists. Using signals from LEDs placed in the motion capture area, the GoPro cameras were
synchronized with the Vicon data (Vafadar et al., 2021). The cameras were calibrated using
the Matlab Computer Vision Toolbox (MathWorks, Natick, MA, USA) to obtain the intrinsic and
extrinsic parameters. With the help of these parameters, the exact 3D positions of the 17 key
points were subsequently projected onto each frame of the GoPro videos synchronized with the
Vicon data, thus enabling the automatic annotation of human joint positions in the 2D RGB
images.
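The projection step can be sketched with OpenCV as follows, assuming the extrinsics are given as a Rodrigues rotation vector and a translation vector per camera; the function name and argument layout are illustrative.

```python
import cv2
import numpy as np

def project_keypoints(points_3d, rvec, tvec, K, dist=None):
    """Project reference 3D keypoints (world frame) onto one camera view.

    points_3d: (17, 3) keypoint coordinates; rvec, tvec: extrinsics
    (Rodrigues rotation vector and translation); K: 3x3 intrinsics;
    dist: distortion coefficients (zeros here, e.g. for a linear-FOV mode).
    Returns (17, 2) pixel coordinates used as automatic 2D annotations.
    """
    pts = np.asarray(points_3d, dtype=np.float64).reshape(-1, 1, 3)
    coeffs = np.zeros(5) if dist is None else dist
    pixels, _ = cv2.projectPoints(pts, rvec, tvec, K, coeffs)
    return pixels.reshape(-1, 2)
```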
In order to protect the privacy of the subjects, the faces of the subjects in the image set were
blurred (Jiang et al., 2022a). Based on the reference data acquired from Vicon, the key points of
the subject's head and neck were projected onto the corresponding image views. A circular
region covering the subject's head was determined from these two key points, and the pixels in
the circular region were blurred using the Gaussian blur algorithm with default parameters in
OpenCV.
In this study, the learnable triangulation algorithm (Iskakov et al., 2019a) was used for human
pose estimation. It is composed of two parts: the first performs two-dimensional human key
point detection and localization for each view, and the second uses this information, together
with the camera parameters, to determine the 3D positions of the human key points. Two steps
were proposed in the learnable triangulation algorithm, namely algebraic and volumetric
triangulation, where the pelvis position from the former is given to the latter as input.
¹ C0 is defined as the midpoint between the neck and the head, which is anatomically close to the odontoid process and C1.
We used the data of 6 subjects (89,025 frames) in our dataset as the training set and held out
the other 6 subjects (91,255 frames) as the test set. Among the 6 subjects in the test set, the
data from S01 were included in the test set but not in the training set. With the weights provided by
(Iskakov et al., 2019a), trained on Human3.6M (Ionescu et al., 2014), as initial weights, the
network was retrained on our training set. We trained the algebraic triangulation for 57 epochs with
a batch size of 8, and the volumetric triangulation for 10 epochs with a batch size of 5. The model
was trained and evaluated on a Linux server running Ubuntu 20.04.1 LTS 64 bits, equipped
with an AMD Ryzen 9 3900X 12-core processor (125 GB RAM) and an NVIDIA TITAN RTX
GPU (24 GB RAM).
4.2.6 Evaluation
The MPJPE (Mean Per Joint Position Error) was utilized to evaluate the human pose estimation
accuracy, where the Euclidean distance between the estimated and the reference values was
calculated for each frame and each key point. The Euclidean distance between two points was
calculated using the formula below:

$$d(p_1, p_2) = \sqrt{(p_{1x} - p_{2x})^2 + (p_{1y} - p_{2y})^2 + (p_{1z} - p_{2z})^2} \qquad (4\text{-}1)$$

where $p_{1x}$, $p_{1y}$, and $p_{1z}$ denote the x, y, and z coordinates of the point $p_1$, respectively, and similarly for $p_2$.
The MPJPE was then computed according to the following formula (Zheng et al., 2021a):

$$MPJPE = \frac{1}{N} \sum_{k=1}^{N} \left\| \hat{J}_k - J_k \right\| \qquad (4\text{-}2)$$
where $J_k$ and $\hat{J}_k$ represent the reference and the estimated position of key point $k$, and $N$ denotes the number of keypoints. It is worth mentioning that, in this work, the evaluation was performed directly on the estimates from the deep learning model, without any data smoothing or filtering. In the test set, five subjects used the same cardboard box as in the training set (size: 42.0 × 32.0 × 36.0 cm) and one used a box of different dimensions (size: 60.0 × 40.0 × 40.0 cm) in order to explore the effect of the cardboard box size. We evaluated the accuracy separately for these five subjects and for the held-out one.
4.3 Results
The average MPJPE of the neural network for human pose estimation over the 5 subjects with
the smaller cardboard box is 11.97 ± 2.10 mm, while the MPJPE for the held-out subject is 22.56
± 12.02 mm.
The MPJPE of the neural network for human pose estimation over the whole test dataset is
12.76 ± 4.94 mm. Figure 4-2 presents the distribution of the 3-dimensional position differences
for each key point between the estimations from the neural network on the test dataset and the
reference results from Vicon. As observed, the largest differences are in L4/L5 (20.55 ± 9.02
mm) and the hips (16.16 ± 8.89 mm), while the differences for the other key points are around or
less than 15 mm, providing good support for the neural network approach.
Figure 4-2. Difference distribution of the 3D position estimation of each key point on the test set.
The middle line and left/right sides of the box represent the median and 75th/25th percentile.
The whiskers indicate the minimum and maximum values of the data.
The qualitative results of the human pose estimation are shown in Figure 4-3. It is noted that
for different subjects at different stages of lifting/lowering trials, the human pose estimations
from the neural network are in good agreement with the reference results from Vicon.
Figure 4-3. Samples of the human pose estimation results on the test set. Each column
represents 2 of the 4 views in a single frame with the 3D positions of the key points projected to
the corresponding image. Green Skeleton: Estimate from multi-view RGB images. Red
Skeleton: Reference from Vicon
Table 4-1 presents the statistics of the differences between the neural network estimations and
the reference values for each key point on the test dataset. The results
show that the differences for the ankles, knees, C0, neck, and head are relatively small compared
to the other key points for all subjects. Moreover, subject S01 has a higher mean and standard
deviation for all key points compared to the other subjects. The MPJPE for S01 is 22.56 ± 12.02 mm,
while for the other subjects, the maximum MPJPE is 13.57 mm (S12) and the maximum standard
deviation is 3.17 mm (S06).
Table 4-1. Mean value (μ) and standard deviation (σ) of the Euclidean distance between the estimated and the reference key point positions on the test set, by key point and by subject (mm). (anks. = ankles, shouls. = shoulders, elbs. = elbows)

Subject     anks.   knees   hips    wrists  shouls. elbs.   pelvis  L4/L5   C0      neck    head    MPJPE
S01   μ     14.2    11.66   31.26   45.89   20.57   24.18   27.81   29.64   8.14    10.6    11.74   22.56
      σ     20.38   11.1    13.52   74.17   8.88    22.03   14.99   11.92   4.28    4.89    5.71    12.02
S06   μ     9.24    10.18   19.36   15.52   14.48   7.88    13.4    16.74   6.41    8.86    12.37   12.42
      σ     5.85    2.54    6.89    6.45    16.99   4.91    6.95    4.77    2.87    4.32    3.43    3.17
S10   μ     5.44    6.55    22.77   11.41   13.54   14.57   17.68   15.58   5.64    8.12    6.97    11.91
      σ     2.86    2.74    2.98    6.45    3.46    4.25    3.9     12.39   2.81    4.14    3.9     1.85
S12   μ     5.35    8.7     14.78   10.99   22.79   19.24   15.4    15.66   8.77    10.5    16.6    13.57
      σ     1.27    2.72    6.59    5.3     4.7     4.47    6.14    5.39    3.97    3.07    6.44    1.81
S13   μ     7.41    8.06    9.57    12.56   12.11   15.8    10.22   26.56   6.81    9.44    7.28    11.25
      σ     4.91    3.31    3.35    6.73    3.59    5.42    3.73    6.21    2.45    3.41    3.54    1.6
S14   μ     4.97    6.67    10.46   15.41   11.98   12.7    10.15   23.01   7.82    7.33    9.12    10.69
      σ     3.79    5.25    3.82    7.6     5.32    6.18    4.22    4.85    2.91    3.87    3.71    2.04
Figure 4-4 shows the distribution of the differences between the estimations from the neural network
and the reference results from Vicon for subject S01 and the other five subjects in the test
set. It should be noted that S01 was holding a larger cardboard box than the remaining subjects.
As observed in the figure, for subject S01, the largest differences were found in the wrists and hips,
while for the other subjects they were in L4/L5 and the hips. For all keypoints, the differences in S01 were
larger than those in the other subjects, and the key points with the largest increase in difference
were the wrists, with a mean value of 45.89 mm and a standard deviation of 74.17 mm. The
qualitative results of human pose estimation on S01 are illustrated in Figure 4-5. It is observed
that the large cardboard box in S01 caused heavy occlusion of the human body.
Figure 4-4. Difference distribution of the 3D position estimation of each key point in S01 and
the other subjects in the test set. The middle line and left/right sides of the box represent the
median and 75th/25th percentile. The whiskers indicate the minimum and maximum values of
the data.
Figure 4-5. Samples of the human pose estimation results on S01. Each column represents 2 of
the 4 views in a single frame. Green Skeleton: Estimate from multi-view RGB Images. Red
Skeleton: Reference from Vicon
4.4 Discussion
First, the faces of the subjects in this dataset were blurred to ensure the privacy of the subjects.
To the best of the author's knowledge, this was the first time that face blurring was applied to
human pose estimation for lifting/lowering movements. It is recalled that face blurring
has little effect on most of the metrics for walking, as demonstrated in Chapter 3. Second,
during the human key point annotation, the markers were attached to the subjects by a
professional surgeon. In addition, X-ray radiographs of the subjects were collected to
improve the annotation accuracy; hence, the key points in the dataset are more suitable for
biomechanical calculations in ergonomic analysis. Furthermore, a neural network was then
trained on this dataset for human pose estimation in lifting/lowering, where the effect of an
unseen larger cardboard box on the accuracy of the network was preliminarily explored.
With regard to the performance of the neural network, an average MPJPE of 12.76 ± 4.94 mm
was achieved in estimating the 3D joint positions on the whole test dataset, indicating that the
model performance in key point localization did not suffer a great loss after face blurring compared
to the results of (Vafadar et al., 2022). It should be mentioned that the average MPJPE reported
for gait analysis was 13.1 mm in (Vafadar et al., 2022) and 13.0 mm in Chapter 3 of the present
manuscript. The key points with the largest differences were L4/L5 and the hips. This can be
attributed to the fact that the vision-based detection of these two key points is susceptible to
soft tissue artifacts.
Concerning subject S01, the box (60.0 × 40.0 × 40.0 cm) used in the experiment for S01 was larger
than the box used by the other subjects in the test dataset, which had the same size as in the training
dataset (42.0 × 32.0 × 36.0 cm). For both S01 and the other subjects, relatively large differences
between the estimates and the reference results were found for the hips, indicating that the
localization of these key points is inherently more difficult compared to the other key points.
As expected, for all key points, the differences in S01 were larger than in the other subjects.
A possible explanation is that the larger box caused more occlusion of the subject's body.
As a matter of fact, the deep learning model had never learned the features of this unseen larger
box from the training dataset; therefore, in some scenarios, the model confused certain points on
the box with human body key points. In S01, the key points with the largest increase in difference
were the wrists, whose standard deviation was also the highest, indicating a large degree of
variability in the wrist estimates. Since the wrists were always close to the box when the subject
was holding it, the localization of these keypoints was the most significantly affected.
4.4.3 Limitations
There were several limitations in our study. For instance, the laboratory setup did not fully
replicate the work scenario in a factory. In addition, the degradation of the performance of the
deep learning model in the presence of an unseen larger box in the test dataset illustrated that slight
differences between the actual factory scene and the training dataset could have a large impact
on the model performance. In an actual work scenario, there might be many differences
from the training dataset, such as the arrangement of machines in the factory, the different
clothing of workers, and the interactions between workers and machines, which are difficult
to fully replicate in the training dataset during the experiment. A further study with more
focus on synthetic image sets² is therefore suggested. Based on the data collected in the
laboratory, computer graphics techniques could be used to synthesize datasets of different
work scenes in the factory, after which the synthetic image sets could be employed to train the
neural network, thus reducing the cost of data collection.
4.5 Conclusion
In this chapter, a multi-view image dataset annotated with high accuracy was presented for
human pose estimation during lifting and lowering, in which all subjects' faces were blurred to
preserve personal privacy. In addition, we trained a neural network on this dataset to evaluate
its performance on the face-blurred dataset in the human pose estimation task. Moreover, the
impact of an unseen cardboard box in the test set was also explored.
With the help of multiple views, the camera occlusion problem was alleviated, and a high
accuracy was achieved for human pose estimation in the lifting/lowering experiment. The
obtained human key point locations will be used for the biomechanical and ergonomic analysis
in the next chapter. In addition, we found that changes in the environment had an impact on
the performance of the neural network; synthetic data could be used to enhance the network
in the future.
² Synthetic images are visual representations created through computer graphics, simulation techniques, or artificial intelligence (AI), aiming to depict reality with a high degree of realism. This technique makes it possible to generate datasets for training AI models.
J. Jiang, W. Skalli, A. Siadat, and L. Gajny, Estimation of intersegmental load at l5-s1 during
lifting/lowering task with markerless motion capture, 28th Congress of the European Society
of Biomechanics (ESBiomech23), Maastricht, The Netherlands, 2023
The contents of this chapter will soon be submitted to a journal for publication.
We start by estimating the 3D positions of the human body key points at each time frame
with the help of deep neural networks. Next, based on the joint positions obtained from the
3D human pose estimation, a kinematic analysis is performed to calculate the velocity
and acceleration of each human segment. In addition, the external forces
and moments exerted on the feet are measured by the force plates. Then, we calculate the net force
and moment at the L5-S1 joint using inverse dynamics. The accuracy of the estimated joint
loads is validated by comparing the results obtained with marker-based and
markerless motion capture systems, and further verified by comparing the results
obtained from the bottom-up and top-down approaches.
Estimation of intersegmental load at L5-S1 during lifting/lowering tasks
Abstract: Accurate estimation of joint load during a lifting/lowering task is necessary for a better
understanding of the pathogenesis and development of MSDs. In particular, the values of the
net force and moment at the L5-S1 joint are considered an important criterion to identify
unsafe lifting/lowering tasks. In this study, the joint load at L5-S1 was estimated from the
motion kinematics acquired by a multi-view markerless motion capture system, both with and
without the external forces and moments measured by the force plates. To this end, the 3D positions
of the human body key points at each time frame were first obtained with the help of deep
neural networks. Based on the joint positions obtained from the 3D human pose estimation, a
kinematic analysis was performed to calculate the velocity and acceleration of each
human segment. Then, we calculated the net force and moment at the L5-S1 joint using inverse
dynamics. Therein, a 3D model of the human body was built for each subject using biplanar X-
ray images, from which the personalized body segment inertial parameters were
obtained. A state-of-the-art level of accuracy was achieved in our study. We found that the
accuracy of the stereo vision system can meet the requirements of the L5-S1 load calculation,
paving the way for the use of cameras in factories to prevent musculoskeletal disorders among workers.
5.1 Introduction
Intersegmental load estimation plays an important role in biomechanical analysis, which is one
of the most important methods for the prediction of musculoskeletal disorders (De Looze et al., 2000;
Mehrizi et al., 2019). The calculation is based on information such as human posture,
external forces applied to the body, anthropometric measurements, and other physiological
information. Through inverse dynamics, we can estimate the intersegmental forces/moments
between adjacent body segments. In lifting tasks, the lower back is the vulnerable
body part where musculoskeletal disorders occur most frequently (Da Costa and Vieira, 2009;
Mohd Nur et al., 2018). Therefore, in our study, we focus on the analysis of lumbar spine loads,
especially the intersegmental forces/moments at the L5-S1 joint.
Intersegmental forces are the net forces between two adjacent segments of the human body,
which can be obtained from inverse dynamics calculations. In these calculations,
the human body is generally modeled as a kinetic chain of rigid body segments, whose
kinematic information can be obtained from the 3D position of each segment at every
instant. Subsequently, the intersegmental force can be calculated once the inertial parameters
of each segment are determined. The estimation of the body segment inertial parameters is
fundamental, and the classic method proposed by (De Leva, 1996) is usually employed to obtain
the inertial information of each segment, such as length, mass, center of mass, and inertia
matrix, using the joint positions, the subject's gender, and the total body mass as inputs.
However, this method does not take into account the personalized anthropometric
characteristics of the subject's body, which are difficult to obtain experimentally.
Therefore, numerous studies (Kollia et al., 2012; Pillet et al., 2010; Robert et al., 2017; Venture
et al., 2019) have proposed non-invasive methods for the calculation of personalized body
segment inertial parameters (BSIPs) that can be used for biomechanical and ergonomic analysis.
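For reference, one recursive step of such an inverse dynamics calculation can be sketched as follows: given the load transmitted at the distal joint of a rigid segment, the Newton-Euler equations yield the load at its proximal joint. This is a generic textbook formulation, not the thesis's exact implementation.

```python
import numpy as np

G = np.array([0.0, 0.0, -9.81])  # gravity, Z-up world frame (assumption)

def segment_proximal_load(m, a_com, I, omega, alpha,
                          F_dist, M_dist, r_prox, r_dist):
    """One Newton-Euler step for a rigid segment.

    m: segment mass; a_com: linear acceleration of the center of mass;
    I: 3x3 inertia tensor about the COM; omega, alpha: angular velocity
    and acceleration; F_dist, M_dist: force/moment at the distal joint;
    r_prox, r_dist: positions of the proximal/distal joints relative to
    the COM. Returns the force and moment at the proximal joint.
    """
    # Newton: F_prox + F_dist + m*g = m*a_com
    F_prox = m * (a_com - G) - F_dist
    # Euler (about the COM): sum of moments = I*alpha + omega x (I*omega)
    M_prox = (I @ alpha + np.cross(omega, I @ omega)
              - M_dist - np.cross(r_dist, F_dist) - np.cross(r_prox, F_prox))
    return F_prox, M_prox
```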
As the L5-S1 joint load is one of the principal criteria to evaluate MSD risk, we focus
on the force/moment calculation at this joint. Two main paradigms exist to calculate the L5-S1
loads, namely the top-down and the bottom-up models. Both paradigms terminate the
load calculation at the L5-S1 joint. The difference between the two approaches lies in
the fact that, in the top-down model, the calculation starts from the upper limbs, while in the
bottom-up model, it starts from the lower limbs. For both paradigms, external
forces are needed to initiate the calculations. In the top-down model, the external force
can be estimated from the kinematic information of the box. In contrast, in the bottom-up model,
the external forces/moments applied to the feet by the ground need to be measured by the
force plates (Mehrizi et al., 2019).
The top-down method is adopted in this study to estimate the load on the worker's lumbar
spine, as it facilitates deployment in a factory environment since a force platform is not
required. It starts from the head and hands, proceeds down along the kinetic chain, and accumulates the
forces/moments of all upper segments to the L5-S1 joint of the lumbar spine. To initiate the
calculation, this method requires information on external forces applied to body segments at
each instant, where the external forces come mainly from the box held by both hands and can
therefore be estimated from its kinematic information and mass.
To validate the accuracy of the joint loads estimated from a force-platform-free markerless
motion capture system, it is common in the literature to use the same paradigm to calculate
reference values of the joint loads with marker-based motion data and then compare the loads
estimated by the two approaches. In this chapter, a different validation scheme is proposed: the
bottom-up paradigm is used to calculate reference values from marker-based motion data,
BSIPs calculated with a geometric model, and force platform data. Comparing these references
with the top-down estimates then provides a more rigorous validation of the correctness and
accuracy of the calculations. Compared to the top-down approach, the L5-S1 load calculated
by the bottom-up approach is closer to the true value (Kingma et al., 1996). Firstly, the center
of mass and BSIPs of the lower limbs can be calculated with higher accuracy. Secondly, in the
bottom-up paradigm the force platform directly measures the ground reaction force, while in
the top-down approach the initial forces are computed indirectly from the kinematic analysis,
which introduces additional errors.
In summary, quantitative motion capture and the associated dynamic analysis is a way to
estimate the intersegmental loads associated with a given posture. While recognized as useful
in ergonomics, it has not been used in the workplace due to several issues. Inverse dynamics
requires force platforms that are not routinely available in a workplace. In the study of (Pillet
et al., 2010), the authors demonstrated that the top-down approach could allow force-platform-
free dynamic analysis, but the evaluation was only performed on the ground reaction values,
comparing the force plate measurements to the estimated ones. Secondly, in inverse dynamics,
the BSIPs of the body segments are of particular importance. These inertial parameters can be
either proportional or subject-specific. Barycentremetry from biplanar X-rays allows accurate
estimation (Dumas et al., 2005; Sandoz et al., 2010); however, it is not usable in a routine
workplace environment. The objective of the present chapter is therefore to propose a force-
plate-free top-down approach to estimate the intersegmental load at L5-S1 during
lifting/lowering tasks. This approach utilizes RGB cameras for motion tracking and uses
proportional BSIPs. The proposed method is validated by comparison with a reference method
using force plates and accurate subject-specific BSIPs, in which the motion capture is
performed with a reference system.
5.2.1 Participants
Twelve subjects (24.2±2.3 years, 172.4±10.1 cm, 65.9±14.7 kg) participated in the experiment.
The experiment was approved by the ethics committee (Protocol 06036, Ile de France VI –
Groupe Hospitalier Pitié-Salpétrière). All subjects signed an informed consent form.
Before the experiment, a medical examination was performed by an orthopedic surgeon to
determine whether the subject was able to perform the experiment (same as in the previous
chapter). The experimental setup and the related dataset preparation were detailed in Chapter 4.
As described in Chapter 4, a neural network of learnable triangulation was trained/evaluated on
the dataset, where 6 subjects were randomly selected as the training set and the rest formed the
test set. Of all the subjects, S01 carried a larger cardboard box (weight ≈ 0 kg, size:
60.0×40.0×40.0 cm) while the others carried a smaller box (weight: 5 kg, size:
42.0×32.0×36.0 cm). The data corresponding to S01 were included in the test set.
In this study, inverse dynamics is employed to calculate the load on body joint L5-S1, where
top down and bottom up paradigms are explored.
Two methods were adopted to calculate the body segment inertial parameters (BSIPs), namely
the proportional method and the geometric method. Having estimated the 16 joints from the
multi-view stereo system, the human body model was built as a linked chain consisting of 10
rigid body segments, with the upper limb segments including the forearms, upper arms, head
and trunk, and the lower limb segments including the thighs and shanks.
In the proportional method, the inertial properties of each body segment were calculated
through the approach proposed by (De Leva, 1996). The inputs to the method are the subject's
gender and body mass, together with the joint positions, which yield the position of the center
of mass of each segment, the segment mass, and the inertia tensor with respect to the central
principal axes of inertia.
In the geometric method, based on the biplanar radiographs collected in the experiment, a
personalized 3D digital model of the human body was created for each subject with the method
provided in (Nérot et al., 2015). The 3D body reconstruction involves deforming a template to
align with the surface profiles in the biplanar radiographs, in three stages: first, a global
morphing scales the body segments of the template and matches the participant's position;
then, a gross deformation and a fine deformation adapt the surface profiles of the template to
the radiographs. The outcome is a 3D human body envelope which can be used to calculate
the BSIPs.
The 3D digital model was segmented into different body segments, following which the mass,
center of mass, and inertia of each segment were calculated based on the human body density
data given by (Dempster, 1955), with the thorax density updated according to (Amabile et al.,
2016). For the lower limbs, accurate 3D models of the human body were built using the X-ray
images in order to calculate the reference values of the L5/S1 loads. For some upper limb
segments, due to experimental limitations, the X-ray images covered only a part of the
segment, with the rest built manually, resulting in less accurate estimates. Therefore, the
inertial parameters of the 3D model of the upper limb segments were not used as reference
values, but only for comparison with the proportional method. A sample of biplanar
radiographs and the corresponding 3D body envelope are shown in Figure 5-1.
Figure 5-1. Sample of biplanar radiographs and the corresponding 3D body envelope. (a)
Sample images of low-dose biplanar X-ray radiographs; (b) 3D digital envelope of human body
based on radiography. Each segment is distinguished by a different color. The red circle on each
segment represents the center of mass, and the axes represent its inertial principal axis, with the
length indicating the relative magnitude of the rotational inertia about the corresponding axis.
The red crosses denote the body key points, which are used for the localization of the center of
mass position.
To establish a local coordinate system for a rigid body, coordinates of at least three points on
the rigid body are required. The use of the markerless motion capture system for human pose
estimation produces only 17 key points, with only two key points on each segment, i.e., the
proximal and distal points. Hence the local coordinate system of the rigid body needs to be
established based on the 3D coordinates of those two points. In this case, while the rotation of
the segment about one of its principal axes cannot be estimated, a local coordinate system for
each segment can still be established for the dynamics calculation as follows: the z' axis of the
local coordinate system of the segment is the unit vector from the proximal end to the distal
end; the y' axis is a unit vector in the yz-plane of the global coordinate system, perpendicular
to the z' axis; from the z' and y' axes, the x' axis is obtained by the right-hand rule.
When calculating the anatomical components of the load of L5-S1 in different directions, we
established a local anatomical coordinate system of the pelvis based on the joint points of the
human body. The coordinate system of the pelvis was defined as follows.
Figure 5-2. The local coordinate system of the pelvis
As depicted in Figure 5-2, the Y-axis was defined as the vector $\overrightarrow{H_C N}$. The X-axis was
perpendicular to the Y-axis and to $\overrightarrow{H_R H_L}$. The Z-axis was calculated from the X-axis and
Y-axis according to the right-hand rule. The origin of the coordinate system was placed at
$H_C$, and all axes were normalized to unit vectors. The frontal, transverse, and sagittal planes
were defined as the planes with the X, Y, and Z axes as their respective normal vectors.
The forces in the anterior-posterior, vertical, and medial-lateral directions were computed as the
respective components along the X, Y, and Z axes. Similarly, the moments in the frontal,
transverse, and sagittal planes were calculated as components along the X, Y, and Z axes,
respectively.
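A minimal sketch of this pelvis frame and of the decomposition of a load into its anatomical components follows; the variable names and the sign convention chosen for the X-axis are our assumptions.

```python
import numpy as np

def pelvis_frame(hc, n, hr, hl):
    """Pelvis anatomical frame from the key points HC, N, HR and HL (sketch)."""
    y = (n - hc) / np.linalg.norm(n - hc)      # Y: from HC towards N
    x = np.cross(y, hl - hr)                   # perpendicular to Y and to HR->HL
    x = x / np.linalg.norm(x)                  # sign convention assumed
    z = np.cross(x, y)                         # Z: right-hand rule
    return x, y, z

def anatomical_components(load, x, y, z):
    """Ant.-post. (X), vertical (Y) and medio-lateral (Z) components of a global-frame load."""
    return np.dot(load, x), np.dot(load, y), np.dot(load, z)
```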
Based on the joint positions obtained from the 3D human pose estimation, we performed the
kinematic analysis to calculate the velocity and acceleration information of each human
segment. Specifically, the motion data were filtered using a fourth-order low-pass Butterworth
filter with a cutoff frequency of 2.5 Hz. The center-of-mass acceleration was calculated by
second-order differentiation of the center-of-mass position with respect to time. For the angular
acceleration, we first calculated the helical axis and the angular velocity using the finite
difference of the rotation matrix, after which the angular velocity was differentiated with
respect to time to obtain the angular acceleration.
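The translational part of this pipeline can be sketched with SciPy as follows; the 100 Hz sampling rate and the use of zero-phase filtering are assumptions, as the text only specifies the filter order and cutoff frequency.

```python
import numpy as np
from scipy.signal import butter, filtfilt

FS = 100.0                                      # sampling rate in Hz (assumed)
B, A = butter(4, 2.5 / (FS / 2), btype="low")   # 4th-order Butterworth, 2.5 Hz cutoff

def com_kinematics(com_positions):
    """Filter a (T, 3) center-of-mass trajectory and differentiate it twice."""
    pos = filtfilt(B, A, com_positions, axis=0)   # zero-phase low-pass filtering
    vel = np.gradient(pos, 1.0 / FS, axis=0)      # velocity
    acc = np.gradient(vel, 1.0 / FS, axis=0)      # acceleration
    return vel, acc
```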
A force exerted on a human segment by any object other than another human segment is
considered an external force. In the process of handling the cardboard box, the main external
forces were the ground reaction force on the subject's feet and the force from the cardboard
box on the hands.
The force exerted by the box on the hands was calculated from the mass and acceleration of
the box. The mass of the box itself was negligible (≈ 0 kg) and was neglected in the subsequent
calculation. During the experiment, a metal disk (mass = 5 kg, radius = 10 cm, height = 2 cm)
was placed inside the box. Hence the mass distribution was relatively concentrated, allowing
us to simplify it to a mass point during the inverse dynamics calculation. The acceleration of
the subject's wrists was calculated and used to approximate the acceleration of the box, since
there was no relative sliding between the hands and the box while the subject was carrying it.
During the lifting of the box, before the box started to move, the support force on the box from
the ground/table equaled the weight of the box; once the box was completely off the ground,
the support force became 0. Although there was a transition phase between these two states,
the change in support force was simplified as completed instantaneously at the moment when
the box started to move. The same simplification was performed for the phase of placing the
box on the ground/table. The time labels at which the box started moving or completely
stopped were manually annotated and verified using the motion data of the markers on the
box.
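A sketch of this external-force model is given below; the axis convention (global y vertical), the variable names, and the boolean carried flag encoding the annotated time labels are our assumptions.

```python
import numpy as np

G = np.array([0.0, -9.81, 0.0])   # gravity, assuming the global y axis is vertical
M_BOX = 5.0                       # mass of the metal disk (kg); box mass neglected

def box_force_on_hands(acc_wrist_l, acc_wrist_r, carried):
    """Per-frame force of the box on the hands, the box treated as a point mass.

    `carried` is a (T,) boolean array derived from the annotated time labels;
    while the box rests on the ground/table, the hands bear no load.
    """
    acc_box = 0.5 * (acc_wrist_l + acc_wrist_r)   # no hand/box sliding assumed
    # Newton's second law on the box: M a = F_hands + M g, so the
    # reaction exerted on the hands is -M (a - g).
    force = -M_BOX * (acc_box - G)
    force[~carried] = 0.0
    return force
```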
The ground reaction force was measured by the force plates (one force plate for each foot).
The measurements of the force plate are the net forces $F_{x'}, F_{y'}, F_{z'}$ and the net moments
(about the plate origin) $M_{x'}, M_{y'}, M_{z'}$ in the force plate coordinate system. The following
equations were employed to calculate the reaction forces and the free moments exerted by the
ground on the feet.
$COP_{z'} = 0$ ( 5-3 )

$\vec{F} = R'\,\vec{F}'$ ( 5-6 )

$\vec{\Gamma} = R'\,\vec{\Gamma}'$ ( 5-7 )

where $COP_{x'}, COP_{y'}, COP_{z'}$ denote the center of pressure in the coordinate system of the
force plate. $\vec{F}$ and $\vec{F}'$ are the ground reaction forces in the global coordinate system and the
force plate coordinate system, respectively. $\vec{\Gamma}$ and $\vec{\Gamma}'$ represent the free moment from the
ground in the corresponding coordinate systems. $R'$ is the rotation matrix that transforms
vectors from the force plate coordinate system into the global coordinate system. As the
subjects' feet did not adhere to the ground (meaning that the foot cannot exert an upward pull or
grasp on the ground), $\Gamma_{x'}$ and $\Gamma_{y'}$ are 0, yet the ground could still exert a free moment
$\Gamma_{z'}$ in the z direction on the feet.
Based on the above calculations, the loads on the subject's joints at L5-S1, including the forces
as well as the moments, were calculated as follows:
$\vec{F}_{L5S1} = \sum_{i=1}^{k} m_i (\vec{a}_i - \vec{g}) - \sum_r \vec{F}_r$ ( 5-8 )

$\vec{M}_{L5S1} = \sum_{i=1}^{k} \left[ (\vec{r}_i - \vec{r}_{L5S1}) \times m_i (\vec{a}_i - \vec{g}) + \vec{\Gamma}_i^{I} \right] - \sum_r \left[ (\vec{r}_r - \vec{r}_{L5S1}) \times \vec{F}_r + \vec{\Gamma}_r \right]$ ( 5-9 )

with $\vec{\Gamma}_i^{I} = R'_i \left( I'_i \vec{\alpha}'_i + \vec{\omega}'_i \times I'_i \vec{\omega}'_i \right)$

where $\vec{F}_{L5S1}$ and $\vec{M}_{L5S1}$ denote the intersegmental force/moment acting on the joint L5-S1.
$\vec{F}_r$ and $\vec{r}_r$ are an external force exerted on the human segments and the vector to its point
of application. $\vec{\Gamma}_r$ is an external free moment. $\vec{r}_i$ and $\vec{r}_{L5S1}$ are the vectors to the center
of mass (COM) of segment i and to the L5-S1 joint, respectively. $m_i$ and $I'_i$ are the mass of
segment i and its inertia tensor in the segment local coordinate system. $\vec{a}_i$ is the translational
acceleration of the COM in the global coordinate system. $\vec{\alpha}'_i$ and $\vec{\omega}'_i$ are the angular
acceleration and velocity of segment i. $\vec{g}$ is the gravitational acceleration. $\vec{\Gamma}_i^{I}$ is the inertial
contribution of segment i in the global coordinate system. $R'_i$ is the rotation matrix that
transforms the coordinates of a vector from the local coordinate system of segment i to the
global coordinate system.
5.2.3 Evaluation
In this study, the bottom-up method was implemented to provide the reference values of the
L5-S1 load, which were then compared with the estimates from the top-down approach. The
computational procedure starts from the feet, proceeds up along the kinetic chain, and
accumulates the forces/moments of all lower segments to the L5-S1 joint of the lumbar spine.
The reaction
forces and the free moments from the ground on the subject's feet, measured by the force plates,
were used to initiate the calculation. To enhance the accuracy of the L5-S1 load reference, the
geometric models reconstructed from the biplanar radiographs were employed for the BSIPs
calculation. In order to evaluate the accuracy of the L5-S1 load estimates, we calculated the
component differences between the estimates and the reference along each direction, as well
as the difference between their Euclidean norms³ (i.e., vector moduli).
5.3 Results
We will present the results in two parts. Firstly, the results in 5.3.1 present the differences
between the L5/S1 load calculated on marker-based and marker-less data for each combination
of different BSIP models (Proportional or Geometric) and calculation paradigm (top-down or
bottom-up approaches). Then, the results in 5.3.2 will validate the estimation of L5/S1 load
against the reference value. The estimates of the L5/S1 load were calculated with the top-down
strategy, De Leva BSIP model, and markerless motion capture data, while the reference values
of the L5/S1 load were calculated with the bottom-up strategy, geometric BSIP model, and
marker-based motion capture data.
5.3.1 Differences between the results calculated on marker-based & marker-less data
Table 5-1 presents the RMSE between the values of the load at L5/S1 calculated on marker-
based motion data and marker-less motion data for each combination of different BSIP models
and calculation paradigm. In terms of norm, based on the proportional BSIP model, the RMSE
of the top-down calculation strategy reached 2.51±0.91 N and 3.24±1.38 Nm, while the RMSE
of the bottom-up strategy was 1.64±0.74 N and 5.7±3.14 Nm. The results with the geometric
model were close to those with the proportional BSIP model.
Table 5-1. RMSE between the L5/S1 load calculation with marker-based motion capture and
marker-less motion capture for each combination of BSIP model and calculation strategy. μ and
σ denote the mean and deviation of the RMSE across all trials, respectively.
                             Force (N)                          Moment (Nm)
                  Ant.-Post.  Vert.  Medi.-Lat.  Norm    Front.  Trans.  Sag.   Norm
Proportional  μ      6.52     4.97     13.71     2.51     4.87    0.65   3.32   3.24
Top Down      σ      3.05     3.7       5.22     0.91     1.88    0.27   1.54   1.38

³ Both forces and moments are vectors, and here we compute their L2 norms. For a vector $\vec{v} = (v_x, v_y, v_z)$, the Euclidean norm is calculated as $\|\vec{v}\| = \sqrt{v_x^2 + v_y^2 + v_z^2}$.
Figure 5-3 to Figure 5-6 show the distribution of the differences between the values of the load
at L5/S1 calculated on marker-based motion data and marker-less motion data for each
combination of different BSIP models and calculation paradigm. It can be found that the
differences in the geometric models were close to those with the proportional models in each
calculation strategy. Furthermore, for all calculation strategies and BSIP models, the difference
in S01 was remarkably higher than in the other subjects, which was particularly true in the top
down strategy. In addition, the difference of the load norm could be smaller compared to the
components in different directions, especially in the force calculations.
Figure 5-3. Distribution of the RMSE of the forces obtained with the top-down calculation
strategy on the marker-based and markerless motion capture data
Figure 5-4. Distribution of the RMSE of the moments obtained with the top-down calculation
strategy on the marker-based and markerless motion capture data
Figure 5-5. Distribution of the RMSE of the forces obtained with the bottom-up calculation
strategy on the marker-based and markerless motion capture data
Figure 5-6. Distribution of the RMSE of the moments obtained with the bottom-up calculation
strategy on the marker-based and markerless motion capture data
In order to validate the accuracy of the estimated L5/S1 load, the differences between the
loads predicted by the different calculation strategies were calculated. Since, in our study,
biplanar radiographs were used to build 3D envelopes with high modeling accuracy for the
lower limbs, and force plates were used in the bottom-up calculation strategy, the calculation
accuracy of the bottom-up strategy was higher than that of the top-down approach. Therefore,
we employed the following configuration to calculate the reference values of the L5/S1 load:
bottom-up strategy, geometric BSIP model, and marker-based motion capture data. For the
L5/S1 load estimates to be examined and validated, considering the implementability in actual
industrial environments, the following configuration was adopted: top-down strategy, De Leva
BSIP model, and markerless motion capture data.
Figure 5-7 and Figure 5-8 show the comparison of the load estimates of L5-S1 with the
reference values during the subject's lifting and lowering trials, respectively, including the
components of force and moment in each orientation and their Euclidean norms. Each of these
results is the loads on a given subject during a specific lifting or lowering trial. It is observed
that the estimated values of the loads are in good agreement with the reference values. In
addition, the components of the force in the vertical and anterior-posterior directions are
relatively large, while the components in the lateral direction are small. The moment has a
relatively large component in the sagittal plane and a moderate component in the coronal plane,
with a considerably smaller component in the transverse plane. At the beginning or end of the
handling action, the component of the force in the vertical direction is relatively large, while
the component of the force in the other directions is approximately zero and the component of
the moment in each direction is also nearly zero. Meanwhile, we note that there are two sharp
variations in the load estimates. For smaller load components in some planes, such as the
moment in the transverse plane, the pattern of the load estimate may differ from that of the
reference value.
Figure 5-7. Example result of the L5-S1 load versus time during a lifting task (subject ID: S08,
task trial: #19). Est. = estimate, Ref. = reference, Ant.-Post. = anteroposterior, Vert. = vertical,
Medi.-Lat. = mediolateral, Front. = frontal plane, Trans. = transverse plane, Sag. = sagittal
plane, Norm = Euclidean norm
Figure 5-8. Example result of L5-S1 load versus time during a lowering task (subject ID: S12,
task trial: #00).
Table 5-2 provides the RMSE of the difference between the reference and estimated values
for each subject. From Table 5-2, it can be found that the difference in S01 remains the highest
compared to the other subjects for each component of the force estimate, but for the moment
components, the difference in subject S01 is not the highest among all subjects. For the norm
of the load, the STD of the difference in subject S01 (force: 34.69 N, moment: 3.89 Nm) was
higher in comparison to the other subjects. As for the mean value of the difference, the highest
values were observed in subject S13 for the force (27.91 N) and in subject S12 for the moment
(11.82 Nm). The average differences of the estimates for force and moment among all subjects
were 14.03±6.91 N and 8.97±2.3 Nm, respectively.
Table 5-2. RMSE of the difference between the reference and estimate of L5-S1 load for each
subject across the trials
                            Force (N)                            Moment (Nm)
subject  statistics  Ant-Post  Vert.  Medi.-Lat  Norm     Front.  Trans.  Sag.   Norm
S01      μ            17.52    17.57    19.23    12.82     10.28   3.27   10.22   8.12
         σ             5.58     3.78    10.22     3.35      2.21   0.57    1.94   1.32
S06      μ             9.39    10.51    13.92     9.71     10.9    6.27    8.41   9.13
         σ             1.29     1.54     2.92     1.91      1.54   2.05    1.06   1.12
Table 5-3 presents the result of the peak value difference between the reference and estimate of
L5-S1 load for each subject across the trials. From the table, it can be noticed that in terms of
peak value, the difference of subject S01 is no longer the highest among all subjects. For the
differences in the peak values of the norm, the highest differences in moments and forces were
found in subject S12 (23.65±6.28Nm) and subject S13 (9.63±6.32N), respectively. The average
peak value differences of the estimates for force and moment among all subjects were
10.79±8.91N and 11.89±9.51Nm, respectively.
Table 5-3. Peak value difference between the reference and estimate of L5-S1 load for each
subject across the trials
                            Force (N)                            Moment (Nm)
subject  statistics  Ant-Post  Vert.  Medi.-Lat  Norm     Front.  Trans.  Sag.   Norm
S01      μ             6.27     5.64    27.61     5.22     12.65   3.34   15.22  14.85
         σ             2.29     2.85    24.36     2.67      5.58   1.54    4.85   5.2
S06      μ             2.75     3.75    10.13     3.01     10.03   7.93   12.98  13.2
         σ             2.81     3.4      6.39     3.65      7      5.91    6.1    6.81
S10      μ            14.85     9.31    10.18    15.09      4.69   7.3    13.93  14.12
         σ            11.87     7.97     6.62    11.29      5.06   2.81   10.03  10.05
S12      μ            11.14     6.02    12.37    10.12      5.34   6.33   23.4   23.65
         σ             6.24     4.55     9.76     5.33      3.72   3.55    6.6    6.28
S13      μ            14.97    21.29    14.72    19.63      9.98   5.51    3.28   3.77
         σ             7.45     5.76     8.1      6.32      4.37   3.33    2.77   3.01
S14      μ             8.84     4.7      7.7      8.96      3.47   6.86    2.83   3.32
         σ             7.05     3.21     6.21     7.32      2.06   6.02    2.09   2.14
All      μ            10.1      8.69    12.4     10.79      7.22   6.51   11.62  11.89
         σ             8.51     8       11.04     8.91      5.67   4.48    9.53   9.51
5.4 Discussion
The objective of this study is to estimate the intersegmental forces/moments at the L5/S1 joint
with a multi-view stereo system and to validate the calculation with the help of BSIPs
computed from accurate 3D human body models.
We first analyze the differences between the results calculated on marker-based and marker-
less data. Since the accuracy assessment of the joint load calculation in the literature usually
calculates the difference between the results calculated on the marker-less & marker-based
motion capture data, we calculated this metric to facilitate the comparison with the results in
the literature. This indicator provides a reflection of the uncertainty of the calculated results.
With the same calculation strategy, the difference with the geometric BSIP model was close to
that with the proportional model. Moreover, we obtained, with both BSIP models, a state-of-
the-art level of accuracy (top-down, force: 2.51±0.91 N, moment: 3.24±1.38 Nm), indicating
that the calculation accuracy of the stereo vision system in our study can meet the L5-S1 load
calculation requirement. Our results surpass the accuracy achieved in (Mehrizi et al., 2019)
(top-down, force: 4.85±4.85 N, moment: 9.06±7.60 Nm), which is considered the study with
the highest L5-S1 load estimation accuracy using 2 RGB cameras. It should be noted that in
our work we utilized a setup of 4 RGB cameras mounted on 2 tripods. In future work, the
potential benefit of using a pair of cameras instead of a single one will be explored in order to
quantify the added value it may offer.
For all combinations of different calculation strategies and BSIP models, the difference in
calculation results was significantly higher in S01 than in other subjects. This is because the
box used by S01 in the experiment was of a larger size and was never present in the training set.
This was particularly true in the top down strategy, most likely because the upper limbs of S01
were the most affected regarding the accuracy of human key point detection. The difference in
the load norm could be smaller than that in the components along the different directions,
especially for the force calculations; a possible reason is that the body mass distribution has a
dominant effect on the force norm when the same calculation strategy is used. The body
segments used in the same calculation strategy were identical; therefore, with the small
difference between the markerless and marker-based motion data, the difference between the
L5-S1 load results lay
mainly in the force direction, while the force norm varied little.
We note that the metric above may have certain limitations in providing a thorough assessment
of the accuracy of the L5-S1 joint load calculation. In fact, when the same calculation strategy
is employed, whether top-down or bottom-up, the difference between the results calculated on
the marker-based and markerless motion data may only reflect the sensitivity of the joint load
calculation to the motion capture method. Therefore, in the present study, completely different
configurations were employed to calculate the reference and the estimate of the L5-S1 load so
that a more rigorous validation could be performed.
The reference value and the estimated value of the load adopted completely independent
calculation strategies, with the considered body segments and the external forces totally
different. Therefore, the good agreement between the estimate and the reference can sufficiently
validate the correctness of the load estimation. During the subject's handling of the box, the
subject's posture was upright at the beginning as well as at the end of the movement, thus the
moment at L5-S1 was close to 0, while the component of force was only non-zero in the vertical
direction, which was approximately half of the subject's weight. However, when the subject
bent over, the force in the anterior-posterior direction and the moment in the sagittal plane
increased significantly. Since there were asymmetric handling movements in the experiment,
the box might be on the subject's left side, which would increase the moment component on the
frontal plane. During the lifting of a box, the entire process from the beginning of the box's
movement until the box completely left the support surface was considered to be instantaneous.
Similarly, when placing the box, the process from the moment the carton began to touch the
support surface until it was fully supported by the surface was also regarded as completed
instantaneously. Thus, no account was taken of
transitional changes in the support force from the floor or table to the box, leading to two
dramatic load changes in the estimates. From the comparison of the estimates with the reference
values, we observe that the effect of this simplification on the calculation of the load is within
the acceptable range. When the load value was small, such as the moment component in the
transverse plane, the load was highly susceptible to noise, thus the estimated load might have a
relatively large pattern difference from the reference, yet this difference had minimal effect on
the Euclidean norm of the load.
A close look at the differences between the calculated bottom-up and top-down results
revealed that although the RMSE of the differences in subject S01 was no longer the highest
on several components, S01 still had the highest RMSE on many components. However,
regarding the peak values, the difference in S01 was not the highest on the vast majority of
components. A possible reason is that since the RMSE takes into account the differences in all
frames, the average accuracy of the human pose estimation over all frames affects the RMSE
results, whereas the peak accuracy is affected only by the few frames around the peak moment.
In addition, the modeling quality of the human 3D envelope reconstruction might also have a
large impact. The differences in moment and force were highest in subjects S12 and S13,
respectively, probably because the BMI of these two subjects was not within the statistical
range of the De Leva model, resulting in inaccurate estimation of the BSIPs. In an actual
industrial environment, the primary concern in ergonomic assessments is often the peak load
at L5/S1 during work. Consequently, while improving the accuracy and robustness of human
pose estimation is important, the accurate reconstruction of the human 3D envelope is essential
as well in future research.
However, the present work has a limitation as it focuses on a population comprising only young
adults. As a result, the observed differences in BSIPs between proportional and subject specific
(geometric) methods are small. It is important to note that workers in real life may exhibit
significant variations in age and weight/height, which particularly affect the upper body and
the abdomen.
Last but not least, it should be mentioned that the accuracy of the BSIPs calculated with the
proportional model is still limited, and the reconstruction of the geometric model with X-rays
exposes workers to radiation. To address this issue, one can perform digital 3D body model
reconstruction of workers based on RGB images, and then calculate the BSIPs from the
personalized body models. With the help of graphics technology, synthetic images could be
generated in simulators such as Unity or Unreal Engine, for any type of motion and any
industrial configuration. Deep learning models can then be trained on the synthetic images.
However, deep learning models trained on synthetic images may have limited performance in
real scenes, where the domain gap between synthetic and real data needs to be addressed.
Figure 5-9. 3D digital human body reconstruction of the subjects based on SMPL-X. The
images in the first row are the input RGB images, with the human faces blurred. The images in
the second row present the reconstructed 3D human models. The images in the third row
demonstrate the projection of the 3D body models onto the corresponding images.
So far, we have reconstructed 3D digital human models for each subject based on the SMPL-X
(Pavlakos et al., 2019) using a single RGB image, as shown in Figure 5-9. Due to time
limitations, however, further research could not be conducted during this Ph.D. study. In future
studies, we propose to use multiple views to enhance the reconstruction accuracy and to
calculate BSIPs for body segments based on the reconstructed 3D models. This will, on the one
hand, improve the computational accuracy of human BSIPs, and on the other hand, the
reconstructed human models can also be employed to generate synthetic images for enhancing
the performance of human pose estimation networks.
5.5 Conclusion
In this study, we utilized a stereo vision system together with artificial intelligence techniques
to capture the motion kinematics of box lifting. In addition, we used radiography to build a 3D
human model of each subject and calculated personalized body segment inertial parameters.
The top-down and bottom-up methods were applied to calculate the L5/S1 loads. The top-down
method was used for estimation from the RGB images, whereas the bottom-up result was
calculated with the data from Vicon and employed as a reference. The load estimate and the
reference adopted completely independent calculation strategies, with the considered body
segments and the external forces being totally different. Encouraging results were therefore
observed for a population of young adults, which validated the correctness of the joint load
estimation method. While further validation on populations of different ages and
weights/heights in industrial settings should be performed, this study establishes the feasibility
of estimating the L5-S1 load with a multi-view markerless motion capture system.
In Chapter 5, we calculated the load at L5-S1 based on the markerless motion capture
system built with RGB cameras, where the load estimates were compared against reference
values to verify their correctness and to evaluate the accuracy of the estimation. The results
indicated that the L5-S1 load estimation can be achieved with high accuracy based on the
stereo vision system.
As we have already described in the literature review, the RULA method is currently the
method mainly used in industrial practice for ergonomic analysis. However, the relationship
between the RULA method and the L5-S1 load has not been clearly investigated. Therefore,
in Chapter 6, we will perform a RULA analysis based on the motion captured by the stereo
vision system, and compare the RULA score with the calculated L5-S1 loads to explore how
to combine the two evaluation metrics more effectively in practical ergonomic analysis.
Application in ergonomic analysis
6.1 Introduction
In the previous chapters, we have verified the validity of the stereo vision system with AI-based
algorithms to capture human motion and perform inverse dynamics analysis in workers’
lifting/lowering tasks. As a first step, we explored the effect of face blurring on human pose
estimation and found that this effect was weak. Following that, we conducted the research
using the face-blurred lifting/lowering image dataset to protect the subjects' privacy. A neural
network was then trained on this dataset, allowing for human pose estimation in the presence
of severe occlusions during lifting/lowering tasks. After that, we calculated, using the
top-down inverse dynamics approach, the intersegmental load at L5-S1 based on the human
motion captured from the stereo vision system, which was then validated in a rigorous way by
the bottom-up approach. The results show that with AI algorithms, the stereo vision marker-
less motion capture system could be used to capture human motion and perform subsequent
inverse dynamics calculations with high accuracy in workers’ lifting/lowering tasks. The
advantage of this stereo vision marker-less motion capture system is that it can be easily
deployed in industrial environments to provide objective biomechanical indicators of workers'
movements without interfering with their daily activities.
In ergonomics, various methods exist to evaluate the MSDs risk of workers with repetitive and
forceful movements. Reviewing the methods we mentioned in the state of the art in Chapter 2,
the most widely used MSDs risk assessment method is the RULA method (Rapid Upper Limb
Assessment) (Dockrell et al., 2012; Kee et al., 2020; McAtamney and Nigel Corlett, 1993). The
principal idea of the RULA is to observe the work process of workers and identify the riskiest
postures, analyzing the joint angles and scoring them. A disadvantage of the RULA method is
that it relies on the ergonomist to assess the joint angle of the subject, and different ergonomists
may give different results. However, the RULA method also presents significant advantages in
that it can be performed conveniently and quickly in the factory without the use of additional
equipment thus leaving the workers' tasks uninterrupted.
Although the RULA was initially developed for ergonomic risk assessment of the upper limb,
it is not limited to the upper limbs, as it takes into account the trunk angle, leg angle, foot
support conditions, as well as various other factors. As a result, it has been successfully applied
to assess the lower back pain risk of workers as well. For instance, the study by (Rachmawati
et al., 2022) employed the RULA for determining the relationship between age, work period,
and work posture on complaints of low back pain in rice mill workers. In (Labbafinejad et al.,
2016), the RULA was utilized to explore the ergonomic risk factors for low back pain and neck
pain in an industry where only light tasks are performed. In the work of (Rezapur-Shahkolai et
al., 2020), the researchers investigated the prevalence of low back pain and its risk factors
among elementary students, where RULA was used to assess posture and psychosocial elements.
In (Hussain et al., 2019), the authors analyzed the working postures of manual workers in small-
scale industries by using the RULA assessment in CATIA V5R20 software in order to assess
the development of work-related MSDs and low back pain. The above studies demonstrated
that the RULA can provide an objective assessment of posture in order to predict low back pain
among workers.
Consequently, two important questions naturally arise. First, since it is now possible to capture
human motion with the help of machine vision, can we automate and objectify the RULA
evaluation through this type of technology? Second, given that both the RULA evaluation and
the L5S1 load calculation consider the movement of the trunk and upper limbs, is there a
correlation between the RULA score and the biomechanical load at L5S1?
Regarding the first question, the answer is positive. With the rapid advancement of artificial
intelligence technology, the subjectivity issue above can be solved effectively. More and more
studies have been trying to assess the MSDs risk of workers using computer vision (Jiang et al.,
2022b; Li et al., 2020b; Mehta et al., 2018; Yu et al., 2018). In this chapter, we will utilize four
RGB cameras to build a stereo vision system to automate and objectify the RULA assessment,
which can improve the accuracy and reliability of MSDs risk assessment.
To answer the second question, the present chapter will explore the correlation between the
computed load at L5S1 and the ergonomics assessment based on the RULA to better estimate
the MSDs risk of workers. Yet, the relationship between lumbar force calculation and
ergonomic evaluation has not been fully studied in the literature. In this chapter, the moment at
the L5S1 joint was first computed from the motion kinematics obtained with the markerless
motion capture system. Afterwards, the RULA score was calculated and a correlation analysis
between the load at L5S1 and the RULA score was performed. Figure 6-1 illustrates the
workflow of this study.
Figure 6-1 Workflow of the study (face blurring, inverse dynamics, force and moment at L5/S1)
6.2 Methods
6.2.1 Participants
A total of twelve participants, with an average age of 24.2±2.3 years, height of 172.4±10.1 cm,
and weight of 65.9±14.7 kg, took part in the study. The ethics committee approved the
experiment (Protocol 06036, Ile de France VI - Groupe Hospitalier Pitié-Salpétrière), and all
subjects provided written consent after being fully informed about the study. Prior to the
experiment, an orthopedic surgeon conducted a medical examination to assess the participants'
ability to perform the study.
In the present study, ergonomic assessment was performed using a stereo vision system.
Readers are referred to Chapter 4 for details of the experimental setup. A multi-view human
pose image dataset of lifting actions was collected, in which six subjects' images with face
blurring were used as the training set (S03, S04, S05, S07, S09, S11), and the remaining six
subjects' images, likewise with face blurring, were used as the test set
(S01, S06, S10, S12, S13, S14). Based on this dataset, a neural network (Iskakov et al., 2019b)
was trained to perform human pose estimation, and subsequently, the ergonomic analysis was
performed. The human pose estimation in our study produced 17 key points, thus the relevant
joint angles needed to be calculated from these 17 key points, with only two key points on each
body segment. Using the RULA assessment worksheet, each joint was scored
according to the joint angles. The scores of each joint were integrated to get the corresponding
scores in Form A, Form B and Form C of the assessment worksheet, with which the final grand
RULA score could be obtained. The RULA score allows an assessment of the worker's
occupational risk, following which the ergonomist can examine whether changes to the worker's
workflow and related operational configuration are required.
Figure 6-2 Definition of the human skeleton model and the key planes, adapted from (Li and
Xu, 2019)
As shown in Figure 6-2, three planes were first defined: the coronal plane P1, the sagittal plane
P2, and the thoracic plane P3, determined by the triangles △HRNHL, △HCNL, and △SRSLHC,
respectively.
The RULA calculation can be performed for the left and right sides separately, and we take the
left limbs to demonstrate the calculation of the RULA joint angles. For the calculation of joint
angles, the method in (Li and Xu, 2019) was adopted with a few adaptations. Among all the
joint angles, several were redefined to align with the human skeleton model utilized in the
present study. These angles included upper arm flexion, neck flexion, angles for detecting neck
lateral bending, angles for detecting trunk twisting, and angles for detecting trunk lateral
bending. For the other joint angles, the calculation method of (Li and Xu, 2019) was followed,
with at most minor adjustments.
According to the definition of the joint angles provided in the RULA, the Upper Arm Flexion
was defined as the angle between the projections of $\overrightarrow{S_L H_L}$ and $\overrightarrow{S_L E_L}$ onto P2. The Upper
Arm Adduction was defined as the angle between the projections of $\overrightarrow{S_L H_L}$ and $\overrightarrow{S_L E_L}$
onto P1 (Figure 6-3). The Shoulder Raise angle was defined as the angle between $\overrightarrow{N S_L}$ and
$\overrightarrow{N H_C}$. The Body Leaning angle was defined as the angle between $\overrightarrow{H_C N}$ and the opposite
direction of gravity.
Figure 6-3 Upper Arm Adduction was defined as the angle between the projections of
$\overrightarrow{S_L H_L}$ and $\overrightarrow{S_L E_L}$ onto P1.
Figure 6-4 The lower arm was considered to cross the middle line of the body if the angle
between the projections of $\overrightarrow{N W_L}$ and $\overrightarrow{N S_L}$ onto P1 was greater than 90 degrees.
To detect whether the lower arm crossed the middle line of the body, we calculated the angle
between the projections of $\overrightarrow{N W_L}$ and $\overrightarrow{N S_L}$ onto P1. The lower arm was considered to
cross the middle line if this angle was greater than 90 degrees (Figure 6-4). We then calculated
the angle formed by the projections of the P3 normal and $\overrightarrow{N H}$ onto P2, the complementary
angle of which was defined as the neck flexion. The neck side bending angle was defined as
the angle between the projections of $\overrightarrow{H_C N}$ and $\overrightarrow{N H}$ onto P1 (Figure 6-5).
Figure 6-5 a) The complementary of the angle formed by the projections of the P3 normal and
$\overrightarrow{N H}$ onto P2 was defined as the neck flexion. b) The neck side bending angle was defined
as the angle between the projections of $\overrightarrow{H_C N}$ and $\overrightarrow{N H}$ onto P1.
For the calculation of the RULA score, both the left and the right limbs should be analyzed,
after which the larger of the two values can be taken as the final score in practical ergonomic
analysis. In practice, ergonomists usually observe the operator throughout the whole work
process and then select the worst-case posture to perform the RULA calculation. However, to
better examine the accuracy of stereo vision for ergonomic analysis, in this study, we calculated
the RULA values for all frames.
The calculation flow of RULA is shown in Figure 6-6, which is adapted from (McAtamney and
Nigel Corlett, 1993). The RULA assessment is performed according to the following steps. The
first step focuses on the development of the method for recording working posture. In this step,
the evaluator determines the postural angles of several body positions. The body was divided
into segments, which were grouped into two groups, named Group A and Group B. Group A
contains the upper arm, lower arm, and wrist, while Group B contains the neck, trunk, and legs.
During the analysis stage, a score is assigned to each posture according to the range of the
movement. Specifically, in Group A, the upper arm movement is scored between 1-6,
depending on the degree of shoulder flexion, together with any adjustment for the shoulder
being elevated or abducted. The lower arm movement is scored between 1-3 based on the
elbow flexion. The wrist movement is scored between 1-4, based on the degree of wrist flexion
or extension, and an adjustment of the score should be considered in the case of wrist deviation.
The wrist twist is scored between 1-2, depending on whether the wrist is in the mid-range or at
or near the end of the range of twist. In Group B, both the neck and trunk scores are
between 1-6, based on the degree of neck/trunk flexion or extension, together with any
adjustment for neck/trunk twisting or side bending. The leg score is between 1-2, depending
on whether the legs and feet are supported and whether the weight is evenly distributed.
The second step involves the development of a system for grouping the body part posture scores.
To establish such a system, the evaluator ranks each posture combination from the least to the
greatest loading based on biomechanical and muscle function criteria. This leads to tables of
consolidated body segment posture scores, called score A and score B, respectively.
Subsequently, muscle use and force scores are assigned to include the additional load on the
musculoskeletal system, calculated for each of Groups A and B. They are added to scores A
and B, respectively, producing two scores called score C and score D.
In the third step of RULA, both score C and score D are incorporated into a single grand score.
The latter can be utilized as a guide for the priority of subsequent actions in order to reduce
excessive loading of the musculoskeletal system and the risk of injury to the operator.
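As an illustration of how a single posture score is derived from a joint angle, the lower-arm score can be sketched as below, following the published RULA ranges (a base score of 1 for 60-100° of elbow flexion, 2 otherwise, plus 1 if the arm works across the midline or out to the side); this is our sketch, not the implementation used in the thesis.

```python
def lower_arm_score(elbow_flexion_deg, crosses_midline=False, out_to_side=False):
    """Lower-arm posture score: 1 for 60-100 degrees of elbow flexion,
    2 otherwise, +1 if the arm works across the midline or out to the side."""
    score = 1 if 60.0 <= elbow_flexion_deg <= 100.0 else 2
    if crosses_midline or out_to_side:
        score += 1
    return score
```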
6.2.6 Evaluation
In Chapter 5, we estimated the lumbar spine load at L5-S1 using the top-down method based
on the markerless motion capture system, which was validated with the bottom-up calculation.
The top-down method, starting from the upper limbs and terminating the load calculation at
the L5/S1 joint, with the external force estimated from the kinematic information of the
cardboard box, can be implemented more conveniently in practical applications and can
provide the lumbar spine load estimate at each instant during the worker's lifting task.
In this chapter, the RULA score estimation was performed based on the markerless motion
capture system. To evaluate the accuracy of the markerless motion capture in estimating RULA
scores, we compared the RULA score reference values with the estimates using several
metrics. The evaluation involves calculating the difference between two integer time series,
where using several evaluation metrics provides a more comprehensive analysis of the
accuracy of the markerless system for estimating RULA scores.
So far, this study has calculated RULA scores and L5-S1 loads, both of which can be employed
as important indicators for predicting lower back pain. The RULA score is widely used in
industrial practice, while the L5-S1 load is a more accurate representation from a biomechanical
point of view to reflect the risk of lumbar spine musculoskeletal disorders, so the study of the
correlation between the two is of great importance to promote the understanding of the causative
factors and the risk prediction of lower back disorders. In this spirit, this study goes on to
explore the correlation between RULA scores and L5-S1 loads, as well as how to apply the
two together for a more comprehensive ergonomic analysis.
In this study, the reference and estimated RULA scores were calculated with the marker-based
and markerless motion capture data, respectively.
To evaluate the performance of the stereo vision system on ergonomic analysis, the RMSE, the
prediction accuracy Acc, and the Cohen's kappa value of the RULA scores among all the frames
were calculated.
$\mathrm{RMSE} = \sqrt{\dfrac{\sum_{i=1}^{N} \left(Score_{ref,i} - Score_{est,i}\right)^2}{N}}$ ( 6-1 )

where $Score_{ref,i}$ and $Score_{est,i}$ denote the reference and estimate of the RULA score for
frame i, respectively, and N denotes the total number of frames.
$Acc = \dfrac{N_{score\ estimate\,=\,reference}}{N_{total\ frames}}$ ( 6-2 )

where $N_{score\ estimate\,=\,reference}$ is the number of frames in which the reference RULA score
is strictly identical to the estimate, and $N_{total\ frames}$ represents the total number of frames.
Since the RULA scores are natural numbers, the estimation of RULA scores can be regarded
as a classification problem, for which Cohen's kappa is a commonly used accuracy metric.
$\kappa = \dfrac{p_0 - p_e}{1 - p_e}$ ( 6-3 )
Cohen's kappa is a metric that evaluates the inter-rater reliability of qualitative/categorical
items, with values ranging between -1 and 1. The κ score indicates the level of agreement
between raters, with a larger value representing better agreement. For the detailed calculation
of $p_0$ and $p_e$, the reader is referred to (Cohen, 1960).
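The three metrics can be computed as in the following sketch, where scikit-learn's cohen_kappa_score implements ( 6-3 ); the helper name is hypothetical.

```python
import numpy as np
from sklearn.metrics import cohen_kappa_score

def score_metrics(score_ref, score_est):
    """RMSE ( 6-1 ), accuracy ( 6-2 ) and Cohen's kappa ( 6-3 ) between two
    integer RULA score series of equal length."""
    ref = np.asarray(score_ref)
    est = np.asarray(score_est)
    rmse = np.sqrt(np.mean((ref - est) ** 2))
    acc = np.mean(ref == est)
    kappa = cohen_kappa_score(ref, est)
    return rmse, acc, kappa
```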
Based on the RULA score and the L5S1 force/moment estimation, the present chapter will
further explore the relationship between L5S1 loads and RULA scores. Towards this goal, we
calculated the Spearman's rank correlation coefficient 𝑟𝑠 , as shown in formula ( 6-4 ), between
the estimates of the total force/moment at L5S1 and the RULA score. In addition, two
representative trials were selected for more detailed analysis.
$r_s = \dfrac{\mathrm{cov}\big(R(Score_{RULA}),\, R(Load_{L5S1})\big)}{\sigma_{R(Score_{RULA})}\, \sigma_{R(Load_{L5S1})}}$ ( 6-4 )

where $R(Score_{RULA})$ and $R(Load_{L5S1})$ denote the ranks of the RULA score and of the L5-S1
total force or moment at each frame, respectively; $\mathrm{cov}(R(Score_{RULA}), R(Load_{L5S1}))$ is the
covariance of the rank values; $\sigma_{R(Score_{RULA})}$ and $\sigma_{R(Load_{L5S1})}$ are the standard deviations
of these rank values.
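In practice, the coefficient and its p-value can be obtained directly with SciPy's spearmanr, which ranks the series internally as in formula ( 6-4 ); the input arrays below are purely illustrative.

```python
import numpy as np
from scipy.stats import spearmanr

# purely illustrative per-frame series (in the study these come from one trial)
rula_scores = np.array([3, 3, 4, 6, 7, 7, 6, 4, 3])
l5s1_moment_norms = np.array([20.0, 25.0, 60.0, 110.0, 150.0, 145.0, 90.0, 40.0, 22.0])

r_s, p_value = spearmanr(rula_scores, l5s1_moment_norms)
```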
6.3 Results
Table 6-1, Table 6-2 and Table 6-3 present the accuracy, Cohen's kappa, and RMSE of the
RULA estimation on the test set, respectively. From the tables, it can be seen that the average
performance of the RULA calculations on the whole test set was 0.87 (left) & 0.89 (right) for
accuracy and 0.82 (left) & 0.85 (right) for Cohen's kappa. As for the average RMSE, 0.56
(left) & 0.49 (right) was achieved on the test set. Among all subjects, the performance for all
metrics on S01 was inferior to that of the other subjects, with 0.7 (left) & 0.78 (right) for
accuracy, 0.58
(left) & 0.72 (right) for Cohen's kappa, and 0.78 (left) & 0.62 (right) for RMSE. Additionally,
among all the score components, the algorithm performed most weakly in the calculation of
the Neck Trunk Leg scores, with an average of 0.87 for accuracy, 0.83 for Cohen's kappa, and
0.72 for RMSE.
Table 6-2. Cohen's kappa of the RULA estimation on the test set

subject  Upper arm      Lower arm      Wrist arm      Neck Trunk  Grand Score
         Left   Right   Left   Right   Left   Right   Leg         Left   Right
S01      0.62   0.74    0.65   0.81    0.58   0.74    0.8         0.58   0.72
S06      0.86   0.93    0.91   0.93    0.89   0.92    0.84        0.85   0.87
S10      0.81   0.88    0.92   0.85    0.92   0.87    0.85        0.84   0.84
S12      0.92   0.94    0.97   0.91    0.94   0.92    0.7         0.78   0.8
S13      0.87   0.91    0.9    0.89    0.9    0.91    0.84        0.81   0.84
S14      0.9    0.93    0.65   0.72    0.86   0.85    0.93        0.91   0.93
All      0.86   0.91    0.86   0.87    0.88   0.88    0.83        0.82   0.85
Figure 6-7 shows an example of the RULA score versus the L5-S1 load. In this example, the RULA score
shows a very strong correlation with the moment and the force. The time window of the RULA
score peak coincides with the peak window of the force and moment.
Figure 6-8 Distribution of the Spearman's rank correlation coefficient between the load at
L5S1 and the RULA estimate for each subject. The middle line and the top/bottom sides of
each box represent the median and the 75th/25th percentiles. The whiskers indicate the
minimum and maximum values of the data, except for the outliers marked with diamond-shaped
black dots.
Regarding the correlation between the RULA estimate and the total joint moment of L5S1, the
Spearman's rank correlation coefficient was greater than 0.7 for all actions in all subjects, and
all p-values were less than 0.0001. Figure 6-8 shows the distribution of the Spearman's rank
correlation coefficient between the moment L5S1 and the RULA estimate for each subject.
From the results, we can find that the correlation coefficient of subject S01 was lower than that
of other subjects, with a median of about 0.82, and the median correlation coefficients for all
other subjects were higher than 0.85.
Figure 6-9 RULA score versus L5S1 load estimates during a specific lowering task (I)
The RULA score versus L5S1 loads estimates during a specific lowering task is shown in Figure
6-9. In this task, the subject was moving a cardboard box (box weight 5 kg) from the table to
the floor. At point a, the subject turned sideways and bent over slightly; the RULA score was
kept at a low level, with a moment of about 50 Nm. At points b and c, the subject's hands were
about to touch the box, the body bending amplitude increased, and the right hand crossed the
midline of the body, so the RULA score and the moment increased. At points d and e, the
bending amplitudes were significantly higher, with both the RULA score and the moment at
high levels. At point f, the subject resumed the initial stance, where the RULA score and the
moment were both at low levels.
From the figure, it can be observed that the peaks of the L5S1 moment largely overlapped in time with those of the RULA score. The RULA score was 7 in both windows, whereas the L5S1 moment differed between the two peaks, with a smaller moment at the first peak (95 Nm) and a larger one at the second peak (150 Nm).
Figure 6-10 RULA score versus L5S1 load estimates during a specific lifting/lowering task (II). Labels a–f mark the instants discussed in the text.
Figure 6-10 compares the RULA score with the L5-S1 load during the lifting/lowering task. In this task, the subject lifted an empty cardboard box (mass ≈ 0 kg) from the floor to the table and then lowered it back down. At point a, the subject stood upright, with both the RULA score and the L5-S1 load at low levels. At points b and c, the subject bent down to grasp the box; the RULA score and the force and moment at L5-S1 increased significantly. At points d and e, the subject stood straight and turned sideways, the right hand crossing the midline of the body and the arm raised holding the box, with shoulder flexion approaching or exceeding 90°; here the moment and force were relatively low, but the RULA score was high. At point f, the L5-S1 load and the RULA score both increased significantly as the subject bent over to place the box. From Figure 6-10, it can be observed that in the time window [b, c], as well as at point f, the RULA score followed the same pattern as the L5-S1 loads, while in the time window [d, e], the levels of the two indicators differed.
6.4 Discussion
The main objective of this chapter was to estimate RULA values based on a markerless motion capture system and subsequently to explore their correlation with the L5S1 loads. To this end, the Spearman's rank correlation coefficient was computed between the estimates of the total force/moment at L5S1 and the RULA score. Moreover, two exemplary tasks were chosen for a more thorough examination.
The average performance of the RULA calculation was, in all metrics, close to or better than the current state-of-the-art results based on RGBD images. In (Abobakr et al., 2019), the RULA calculation performed well on RGBD images, with averages, for the RULA Grand Score, of 0.51 (left) & 0.49 (right) for RMSE, 0.85 (left) & 0.86 (right) for accuracy, and 0.67 (left) & 0.66 (right) for kappa. Our study achieved better performance in all metrics except the RMSE of the left-side RULA score, which was slightly higher than their result. In addition, the larger cardboard box used for subject S01 in our experiment may pose a greater challenge to the accuracy of the RULA estimation. Taking the above into account, the results of the present study indicate that the stereo vision system can achieve high accuracy in the ergonomic assessment of workers during lifting/lowering tasks.
The Spearman's rank correlation coefficient between the estimated RULA value and the L5S1 total joint moment was greater than 0.7 for all actions of all subjects, indicating that the two were highly correlated, at least in lifting/lowering movements. Affected by the size of the box carried by subject S01, the correlation coefficient for this subject was slightly lower than that of the other subjects, but still above 0.7. By the usual interpretation of the Spearman's rank correlation coefficient, a magnitude between 0.7 and 0.9 suggests a high correlation between the variables, and a magnitude greater than 0.9 a very high correlation. This shows that the RULA score and the L5S1 total moment were highly consistent in assessing MSD risk and can be combined in specific applications to enhance the reliability of risk assessment.
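A small helper summarising this interpretation convention (thresholds at 0.7 and 0.9, as stated above) might look as follows; the function name is ours.

```python
def interpret_spearman(rho: float) -> str:
    """Map |rho| to the qualitative labels used above (0.7 and 0.9 thresholds)."""
    magnitude = abs(rho)
    if magnitude > 0.9:
        return "very high correlation"
    if magnitude > 0.7:
        return "high correlation"
    return "below the 'high correlation' range"
```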
In specific applications, for task I, although the RULA metric was highly correlated with the L5S1 loads, the L5S1 loads provide more detailed information for a more accurate estimate of MSD risk. The RULA score was 7 in both windows, namely [b, c] and [d, e], because the RULA assessment takes into account neither the subject's movement acceleration nor, precisely, the weight held by the subject. In contrast, for the L5S1 moment, the subject's posture, acceleration, and the held weight were accurately considered in the inverse dynamics calculation. In this example, it was difficult to tell from the RULA scores which of the two peaks corresponded to the riskier motion, whereas the L5S1 moment analysis indicates that the risk was greater at the second peak (time window [d, e]).
Regarding task II, at points d and e, the force and moment at L5-S1 were comparatively small since the subject's motion acceleration was weak, the body was upright, and the box mass was almost zero. The RULA scores, however, were higher because the relevant risk factors were fully taken into account in the RULA calculation: the subject turned sideways with the right hand crossing the midline of the body and the arm raised holding the box, with shoulder flexion close to or over 90°. Indeed, RULA estimates the risk of MSDs not only in the lumbar spine, but also in the upper body, such as the neck and shoulder, which are not considered in the L5-S1 load estimation. In this example, the RULA score can indicate risks that the L5-S1 load results fail to reflect.
These two tasks also illustrate that the risks identified by the RULA score and the L5S1 load are not of the same nature. While the L5S1 load is a good, detailed indicator of the risk of low back pain, the RULA score is better suited to assessing the risk of MSDs across the upper body as a whole.
Overall, the agreement between these two indicators may not be perfect across all postures, particularly when the task involves intricate movements engaging multiple body segments. This underscores the importance of incorporating several risk indicators when evaluating ergonomic hazards in the workplace, in order to attain a holistic understanding of the potential risks confronting the employee. Thus, our study suggests that, in practical applications, the two indicators, RULA score and L5S1 load, could complement each other across different operating situations for a more accurate estimation of the risk for workers to develop MSDs.
There are still several limitations to the work in this chapter. First, certain combinations of RULA components may have a stronger correlation with the L5-S1 load; identifying which components these are is an issue worth exploring. Second, the RULA algorithm developed in this chapter still needs to be validated against scores from ergonomists; fine-tuning and calibration with their expert knowledge would make it more faithful to practical working scenarios. Third, during the lifting/lowering task, certain data were not yet acquired by computer vision methods but were added to the RULA calculation program as prior information, such as the mass of the box and the contact/detachment instants between the subject's hands and the cardboard box. If the budget allows, one possible solution is to equip the worker with gloves that can measure the contact force; such gloves would not interfere with the worker's daily work, while protecting the hands and measuring the relevant data.
6.5 Conclusion
In this chapter, we captured the motion of workers carrying boxes using a markerless motion capture system consisting of four cameras, based on which the RULA scores were calculated. Compared with the results in the literature, our accuracy reaches the state-of-the-art level. The correlation analysis between the L5S1 loads and RULA suggests that these two indicators could complement each other across different operating situations for a more accurate estimation of the risk for workers to develop MSDs. Combined with biomechanical analysis, the RULA score can greatly contribute to the application of ergonomics in factories and enhance the well-being of workers.
Conclusion and perspectives
In industry, musculoskeletal disorders (MSDs) are very common occupational issues among workers due to the repetition of the same movements over long periods of time. Assessing the risk of MSDs and preventing their development among workers are crucial for enhancing the quality of employees' lives and boosting the performance of enterprises. Risk evaluation often relies on quantitative measurement and associated biomechanical models, which require a sophisticated setup that cannot be used in the work environment. This study has demonstrated the feasibility of using RGB cameras to provide accurate motion capture, despite the occlusions that may occur during lifting tasks, and to perform joint load estimation and ergonomic assessment based on a musculoskeletal model built from the motion kinematics obtained with this quantitative motion capture system.
Towards this goal, we first investigated the effect of face blurring on human pose estimation, in order to protect the privacy of workers. Then, a multi-view face-blurred dataset of lifting/lowering was collected, and a neural network was trained on this dataset for human pose estimation. After that, the load at the L5-S1 joint was calculated with inverse dynamics based on the 3D key points obtained from human pose estimation. Finally, the ergonomic evaluation was performed based on the 3D human pose estimation results.
In the study of the effect of face blurring on kinematic analysis, this effect was found to be negligible and within acceptable limits (<1°). Therefore, in the subsequent studies, we used the face-blurred dataset for human pose estimation in lifting/lowering, guaranteeing subject privacy without degrading the prediction performance of a deep learning model.
Then, we collected a 3D-annotated, multi-view, high-accuracy image dataset for human pose estimation in lifting/lowering, with all subjects' faces blurred to protect their privacy. Furthermore, a neural network was trained on this dataset, and its performance on the face-blurred data was evaluated on the human pose estimation task. Additionally, the effect of an unseen cardboard box in the test set was examined. A state-of-the-art average MPJPE of 12.76 ± 4.94 mm was achieved in estimating the 3D joint positions on the whole test set.
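For clarity, the MPJPE is the mean Euclidean distance between predicted and reference joint positions; a minimal sketch, assuming arrays of shape (frames, joints, 3) in millimetres:

```python
import numpy as np

def mpjpe(pred: np.ndarray, gt: np.ndarray) -> float:
    """Mean Per-Joint Position Error: average Euclidean distance between
    predicted and ground-truth 3D joint positions, in the input unit
    (millimetres here). Expected shape: (n_frames, n_joints, 3)."""
    return float(np.linalg.norm(pred - gt, axis=-1).mean())
```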
Following that, we performed the biomechanical analysis at the L5-S1 joint of the human body. Based on the human pose estimation results, the body was modeled as 10 segments, whose inertial properties were calculated through the proportional and geometric methods. Top-down and bottom-up strategies were applied to calculate the L5-S1 loads: the top-down method was used for estimation from the RGB images, whereas the bottom-up method was computed from the Vicon data and employed as a reference. Our results showed that the load estimates were in good agreement with the reference, the accuracy of the L5-S1 calculation reaching a state-of-the-art level.
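To illustrate the top-down strategy, the sketch below computes the force part of the L5-S1 intersegmental load by summing the inertial and gravitational contributions of the segments above the joint and of the hand-held load. This is a simplified illustration (forces only, no moments), and all names are ours rather than the exact implementation used in this work.

```python
import numpy as np

GRAVITY = np.array([0.0, 0.0, -9.81])  # m/s^2, global frame with z pointing up

def l5s1_force_topdown(segment_masses, segment_accs, load_mass=0.0, load_acc=None):
    """Top-down estimate of the L5-S1 intersegmental force (force part only).
    Sums m_i * (a_i - g) over the segments above L5-S1 (head, trunk, arms, ...)
    and over the hand-held load, if any.
    segment_masses: iterable of segment masses in kg;
    segment_accs: iterable of 3D centre-of-mass accelerations in m/s^2."""
    force = np.zeros(3)
    for mass, acc in zip(segment_masses, segment_accs):
        force += mass * (np.asarray(acc) - GRAVITY)
    if load_mass > 0.0 and load_acc is not None:
        force += load_mass * (np.asarray(load_acc) - GRAVITY)
    return force  # in N, expressed in the global frame
```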
Lastly, the ergonomic assessment was performed based on the human pose estimation results from the stereo vision system. The RULA score was calculated following the RULA assessment workflow, with both the reference and the estimated human motion data. The average performance of the RULA calculation in our study was, in all metrics, close to or better than the current state-of-the-art results. An added advantage is that the stereo vision system can provide a quantitative and objective ergonomic assessment of workers during lifting/lowering tasks.
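As an illustration of the table-based RULA workflow, the sketch below derives the upper-arm posture score from the shoulder flexion angle, following the thresholds of McAtamney and Corlett (1993); the handling of the adjustment flags is simplified, and the function name is ours.

```python
def rula_upper_arm_score(flexion_deg: float, shoulder_raised: bool = False,
                         abducted: bool = False, arm_supported: bool = False) -> int:
    """Upper-arm posture score following McAtamney & Corlett (1993).
    flexion_deg: shoulder flexion (+) / extension (-) in degrees."""
    if -20.0 <= flexion_deg <= 20.0:
        score = 1                      # 20 deg extension to 20 deg flexion
    elif flexion_deg < -20.0 or flexion_deg <= 45.0:
        score = 2                      # >20 deg extension, or 20-45 deg flexion
    elif flexion_deg <= 90.0:
        score = 3                      # 45-90 deg flexion
    else:
        score = 4                      # >90 deg flexion
    # +1 if the shoulder is raised, +1 if the arm is abducted,
    # -1 if the arm is supported or the person is leaning.
    score += int(shoulder_raised) + int(abducted) - int(arm_supported)
    return max(score, 1)
```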
There were still several limitations to this study; for instance, the laboratory setup did not fully replicate the work scenario in a factory. In addition, the degradation of the deep learning model's performance when an unseen, larger box was used in the test set illustrated that slight differences between the actual factory scene and the training set can have a large impact on model performance. An actual work scenario may differ from the training set in many ways, such as the arrangement of machines in the factory, the workers' clothing, and the way workers interact with machines, which are difficult to fully replicate in the training set during an experiment. A further study focusing on synthetic image sets is therefore suggested. With the help of graphics technology, synthetic images could be generated in simulators such as Unity or Unreal Engine, for any type of motion and any industrial configuration. The deep learning model could then be trained on these synthetic images, thereby addressing the challenges posed by real-world variability in industrial environments. It should be noted that deep learning models trained on synthetic images may have limited performance in real scenes, where the domain gap between synthetic and real data needs to be addressed.
In addition, in this work, we used a proportional human model and a geometric human model based on X-ray reconstruction to calculate the BSIPs. However, the accuracy of the BSIP calculation with the proportional 3D human model is limited, and the reconstruction of the geometric model exposes workers to X-ray radiation. A feasible alternative is to reconstruct a digital 3D body model of each worker from RGB images and then calculate the BSIPs from these personalized body models.
Another limitation is that, although the results of this thesis were compared to the corresponding state-of-the-art results, a more detailed comparison with other studies was not conducted due to time constraints. More research results will be gathered in subsequent studies for more extensive comparisons.
Despite these limitations, the current manuscript provides a proof-of-concept study, in the laboratory setting, of the use of RGB cameras for motion capture, as well as for joint load estimation and ergonomic assessment relying on the musculoskeletal model constructed from the data obtained with this quantitative motion capture system. This constitutes a first step towards MSD prediction in the actual work environment in future work.
References
Abobakr, A., Nahavandi, D., Hossny, M., Iskander, J., Attia, M., Nahavandi, S., Smets, M., 2019. RGB-D ergonomic assessment system of adopted working postures. Applied Ergonomics.
Akhmad, S., Arendra, A., Findiastuti, W., Lumintu, I., Pramudita, Y.D., Mualim, 2020. Wearable IMU Wireless Sensors Network for Smart Instrument of Ergonomic Risk Assessment. Presented at the 2020 6th Information Technology International Seminar (ITIS), IEEE.
Amabile, C., Choisne, J., Nérot, A., Pillet, H., Skalli, W., 2016. Determination of a new uniform
thorax density representative of the living population from 3D external body shape
https://doi.org/10.1016/j.jbiomech.2016.03.006
Arjmand, N., Gagnon, D., Plamondon, A., Shirazi-Adl, A., Larivière, C., 2010. A comparative
study of two trunk biomechanical models under symmetric and asymmetric loadings.
Asadi, F., Arjmand, N., 2020. Marker-less versus marker-based driven musculoskeletal models
of the spine during static load-handling activities. Journal of Biomechanics 112, 110043.
https://doi.org/10.1016/j.jbiomech.2020.110043
Ausavanonkulporn, A., Areekul, K., Senavongse, W., Sukjamsri, C., 2019. Lumbar Spinal
Engineering and Technology - ICBET’ 19. Presented at the the 2019 9th International
https://doi.org/10.1145/3326172.3326210
Azari, F., Arjmand, N., Shirazi-Adl, A., Rahimi-Moghaddam, T., 2018. A combined passive and
Bortolini, M., Faccio, M., Gamberi, M., Pilati, F., 2018. Motion Analysis System (MAS) for
Burdorf, A., Laan, J., 1991. Comparison of methods for the assessment of postural load on the
Burdorf, A., van der Beek, A., 1999. Exposure assessment strategies for work-related risk
factors for musculoskeletal disorders. Scand J Work Environ Health 25 Suppl 4, 25–30.
https://doi.org/10.1109/TPAMI.2013.248
Cao, Z., Hidalgo, G., Simon, T., Wei, S.-E., Sheikh, Y., 2019. OpenPose: Realtime Multi-Person
Caputo, F., Greco, A., D‘Amato, E., Notaro, I., Spada, S., 2019. IMU-Based Motion Capture
Wearable System for Ergonomic Assessment in Industrial Environment, in: Ahram, T.Z.
Chaibi, Y., Cresson, T., Aubert, B., Hausselle, J., Neyret, P., Hauger, O., de Guise, J.A., Skalli,
W., 2012. Fast 3D reconstruction of the lower limb using a parametric model and
https://doi.org/10.1080/10255842.2010.540758
Chen, Y., Wang, Z., Peng, Y., Zhang, Z., Yu, G., Sun, J., 2018. Cascaded Pyramid Network for
Multi-person Pose Estimation, in: 2018 IEEE/CVF Conference on Computer Vision and
and Pattern Recognition (CVPR), IEEE, Salt Lake City, UT, pp. 7103–7112.
https://doi.org/10.1109/CVPR.2018.00742
Cheng, B., Xiao, B., Wang, J., Shi, H., Huang, T.S., Zhang, L., 2020. HigherHRNet: Scale-
Cieza, A., Causey, K., Kamenov, K., Hanson, S.W., Chatterji, S., Vos, T., 2020. Global estimates
of the need for rehabilitation based on the Global Burden of Disease study 2019: a
systematic analysis for the Global Burden of Disease Study 2019. The Lancet 396,
2006–2017. https://doi.org/10.1016/S0140-6736(20)32340-0
Cohen, J., 1960. A Coefficient of Agreement for Nominal Scales. Educational and
Da Costa, B.R., Vieira, E.R., 2009. Risk factors for work-related musculoskeletal disorders: a
https://doi.org/10.1002/ajim.20750
Dave, I.R., Chen, C., Shah, M., 2022. SPAct: Self-supervised Privacy Preservation for Action
and Pattern Recognition (CVPR), IEEE, New Orleans, LA, USA, pp. 20132–20141.
https://doi.org/10.1109/CVPR52688.2022.01953
David, G., Woods, V., Li, G., Buckle, P., 2008. The development of the Quick Exposure Check
(QEC) for assessing exposure to risk factors for work-related musculoskeletal disorders.
David, G.C., 2005. Ergonomic methods for assessing exposure to risk factors for work-related
https://doi.org/10.1093/occmed/kqi082
De Looze, M.P., Van Greuningen, K., Rebel, J., Kingma, I., Kuijer, P.P.F.M., 2000. Force
direction and physical load in dynamic pushing and pulling. Ergonomics 43, 377–390.
https://doi.org/10.1080/001401300184477
Delp, S.L., Anderson, F.C., Arnold, A.S., Loan, P., Habib, A., John, C.T., Guendelman, E.,
Thelen, D.G., 2007. OpenSim: Open-Source Software to Create and Analyze Dynamic
https://doi.org/10.1109/TBME.2007.901024
https://www.semanticscholar.org/paper/SPACE-REQUIREMENTS-OF-THE-
SEATED-OPERATOR%2C-AND-OF-
Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., Fei-Fei, L., 2009. ImageNet: A large-scale
hierarchical image database, in: 2009 IEEE Conference on Computer Vision and Pattern
Recognition. Presented at the 2009 IEEE Conference on Computer Vision and Pattern
Dockrell, S., O’Grady, E., Bennett, K., Mullarkey, C., Mc Connell, R., Ruddy, R., Twomey, S.,
Flannery, C., 2012. An investigation of the reliability of Rapid Upper Limb Assessment
Dumas, R., Aissaoui, R., Mitton, D., Skalli, W., deGuise, J.A., 2005. Personalized Body
Segment Parameters From Biplanar Low-Dose Radiography. IEEE Trans. Biomed. Eng.
Eliasson, K., Palm, P., Nyman, T., Forsman, M., 2017. Inter- and intra- observer reliability of
risk assessment of repetitive work without an explicit method. Applied Ergonomics 62,
1–8. https://doi.org/10.1016/j.apergo.2017.02.004
Faccio, M., Ferrari, E., Galizia, F.G., Gamberi, M., Pilati, F., 2019. Real-time assistance to
manual assembly through depth camera and visual feedback. Procedia CIRP 81, 1254–
1259. https://doi.org/10.1016/j.procir.2019.03.303
Fagarasanu, M., Kumar, S., 2002. Measurement instruments and data collection: a
Fan, L., 2019. Practical Image Obfuscation with Provable Privacy, in: 2019 IEEE International
Conference on Multimedia and Expo (ICME). Presented at the 2019 IEEE International
https://doi.org/10.1109/ICME.2019.00140
Fang, H.-S., Xie, S., Tai, Y.-W., Lu, C., 2018. RMPE: Regional Multi-person Pose Estimation.
arXiv:1612.00137 [cs].
Filippeschi, A., Schmitz, N., Miezal, M., Bleser, G., Ruffaldi, E., Stricker, D., 2017. Survey of
Motion Tracking Methods Based on Inertial Sensors: A Focus on Upper Limb Human
Frome, A., Cheung, G., Abdulkader, A., Zennaro, M., Bo Wu, Bissacco, A., Adam, H., Neven,
H., Vincent, L., 2009. Large-scale privacy protection in Google Street View, in: 2009
IEEE 12th International Conference on Computer Vision. Presented at the 2009 IEEE
12th International Conference on Computer Vision (ICCV), IEEE, Kyoto, pp. 2373–
2380. https://doi.org/10.1109/ICCV.2009.5459413
Goulermas, J.Y., Findlow, A.H., Nester, C.J., Liatsis, P., Zeng, X.-J., Kenney, L.P.J., Tresadern,
P., Thies, S.B., Howard, D., 2008. An instance-based algorithm with auxiliary similarity
information for the estimation of gait kinematics from wearable sensors. IEEE Trans
Granzow, R.F., Schall, M.C., Smidt, M.F., Chen, H., Fethke, N.B., Huangfu, R., 2018.
https://doi.org/10.1016/j.apergo.2017.07.013
Haggag, H., Hossny, M., Nahavandi, S., Creighton, D., 2013. Real Time Ergonomic Assessment
for Assembly Operations Using Kinect, in: 2013 UKSim 15th International Conference
Halim, I., Radin Umar, R.Z., 2018. Usability Study of Integrated RULA-KinectTM System for
He, Y., Yan, R., Fragkiadaki, K., Yu, S.-I., 2020. Epipolar Transformers. arXiv:2005.04551 [cs].
Hignett, S., McAtamney, L., 2000. Rapid Entire Body Assessment (REBA). Applied
Humadi, A., Nazarahari, M., Ahmad, R., Rouhani, H., 2021. In-field instrumented ergonomic
risk assessment: Inertial measurement units versus Kinect V2. International Journal of
Hussain, M.M., Qutubuddin, S., Kumar, K.P.R., Reddy, C.K., 2019. Digital human modeling in
Hwang, J., Knapik, G.G., Dufour, J.S., Best, T.M., Khan, S.N., Mendel, E., Marras, W.S., 2016.
Imran, J., Raman, B., Rajput, A.S., 2020. Robust, efficient and privacy-preserving violent
activity recognition in videos, in: Proceedings of the 35th Annual ACM Symposium on
Applied Computing. Presented at the SAC ’20: The 35th ACM/SIGAPP Symposium on
https://doi.org/10.1145/3341105.3373942
Ionescu, C., Papava, D., Olaru, V., Sminchisescu, C., 2014. Human3.6M: Large Scale Datasets
and Predictive Methods for 3D Human Sensing in Natural Environments. IEEE Trans.
Iskakov, K., Burkov, E., Lempitsky, V., Malkov, Y., 2019a. Learnable Triangulation of Human
Iskakov, K., Burkov, E., Lempitsky, V., Malkov, Y., 2019b. Learnable Triangulation of Human
Jacobs, J.A., 1998. Measuring time at work: are self-reports accurate? 12.
Jasiewicz, J.M., Treleaven, J., Condie, P., Jull, G., 2007. Wireless orientation sensors: their
suitability to measure head movement for neck pain assessment. Man Ther 12, 380–385.
https://doi.org/10.1016/j.math.2006.07.005
Jiang, J., Skalli, W., Siadat, A., Gajny, L., 2022a. Effect of Face Blurring on Human Pose
Jiang, J., Skalli, W., Siadat, A., Gajny, L., 2022b. Effect of Face Blurring on Human Pose
Applications 11.
Jiang, T., Camgoz, N.C., Bowden, R., 2021. Skeletor: Skeletal Transformers for Robust Body-
Pose Estimation 9.
Joo, H., Liu, H., Tan, L., Gui, L., Nabbe, B., Matthews, I., Kanade, T., Nobuhara, S., Sheikh,
Y., 2015. Panoptic studio: A massively multiview system for social motion capture.
pp. 3334–3342.
Joshi, M., Deshpande, V., 2019. A systematic review of comparative studies on ergonomic
https://doi.org/10.1016/j.ergon.2019.102865
Karhu, O., Kansi, P., Kuorinka, I., 1977. Correcting working postures in industry: A practical
6870(77)90164-8
Kee, D., Karwowski, W., 2001. LUBA: an assessment technique for postural loading on the
upper body based on joint motion discomfort and maximum holding time. Applied
Kee, D., Na, S., Chung, M.K., 2020. Comparison of the Ovako Working Posture Analysis
System, Rapid Upper Limb Assessment, and Rapid Entire Body Assessment based on
the maximum holding times. International Journal of Industrial Ergonomics 77, 102943.
https://doi.org/10.1016/j.ergon.2020.102943
Kemmlert, K., 1995. A method assigned for the identification of ergonomic hazards — PLIBEL.
Keyserling, W.M., Brouwer, M., Silverstein, B.A., 1992. A checklist for evaluating ergonomic
risk factors resulting from awkward postures of the legs, trunk and neck. International
8141(92)90062-5
Kim, H.-K., Zhang, Y., 2017. Estimation of lumbar spinal loading and trunk muscle forces
https://doi.org/10.1080/00140139.2016.1191679
Kingma, I., de Looze, M.P., Toussaint, H.M., Klijnsma, H.G., Bruijnen, T.B.M., 1996.
Validation of a full body 3-D dynamic linked segment model. Human Movement
Kok, J. de, Vroonhof, P., Snijders, J., Roullis, G., Clarke, M., Peereboom, K., Dorst, P. van,
Isusi, I., 2019. Work-related MSDs: prevalence, costs and demographics in the EU.
Kolahi, A., Hoviattalab, M., Rezaeian, T., Alizadeh, M., Bostan, M., Mokhtarzadeh, H., 2007.
Kollia, A., Pillet, H., Bascou, J., Villa, C., Sauret, C., Lavaste, F., 2012. Validation of a volumic
model to obtain personalized body segment inertial parameters for people sitting in a
209. https://doi.org/10.1080/10255842.2012.713701
Kong, W., Sessa, S., Cosentino, S., Zecca, M., Saito, K., Wang, C., Imtiaz, U., Lin, Z.,
Bartolomeo, L., Ishii, H., Ikai, T., Takanishi, A., 2013. Development of a real-time IMU-
based motion capture system for gait rehabilitation, in: 2013 IEEE International
Kovacic, I., Radomirovic, D., Zukovic, M., Pavel, B., Nikolic, M., 2018. Characterisation of
tree vibrations based on the model of orthogonal oscillations. Scientific Reports 8, 8558.
https://doi.org/10.1038/s41598-018-26726-5
Labbafinejad, Y., Imanizade, Z., Danesh, H., 2016. Ergonomic Risk Factors and Their
Association With Lower Back and Neck Pain Among Pharmaceutical Employees in Iran.
Lebel, K., Boissy, P., Nguyen, H., Duval, C., 2017. Inertial measurement systems for segments
0347-6
Li, G., Buckle, P., 1999. Current techniques for assessing physical exposure to work-related
695. https://doi.org/10.1080/001401399185388
Li, L., Martin, T., Xu, X., 2020a. A novel vision-based real-time method for evaluating postural
risk factors associated with musculoskeletal disorders. Applied Ergonomics 87, 103138.
https://doi.org/10.1016/j.apergo.2020.103138
Li, L., Xie, Z., Xu, X., 2020b. MOPED25: A multimodal dataset of full-body pose and motion
https://doi.org/10.1016/j.jbiomech.2020.110086
Li, L., Xu, X., 2019. A deep learning-based RULA method for working posture assessment.
Proceedings of the Human Factors and Ergonomics Society Annual Meeting 63, 1090–
1094. https://doi.org/10.1177/1071181319631174
Li, W., Liu, H., Ding, R., Liu, M., Wang, P., 2021. Lifting Transformer for 3D Human Pose
Lin, J.-H., Kirlik, A., Xu, X., 2018. New technologies in human factors and ergonomics
https://doi.org/10.1016/j.apergo.2017.08.012
Malaise, A., Maurice, P., Colas, F., Ivaldi, S., 2019. Activity Recognition for Ergonomics
Assessment of Industrial Tasks With Automatic Feature Selection. IEEE Robot. Autom.
McAtamney, L., Nigel Corlett, E., 1993. RULA: a survey method for the investigation of work-
6870(93)90080-s
McGinley, J.L., Baker, R., Wolfe, R., Morris, M.E., 2009. The reliability of three-dimensional
kinematic gait measurements: A systematic review. Gait & Posture 29, 360–369.
https://doi.org/10.1016/j.gaitpost.2008.09.003
Mehrizi, R., Peng, X., Metaxas, D.N., Xu, X., Zhang, S., Li, K., 2019. Predicting 3-D Lower
Back Joint Load in Lifting: A Deep Pose Estimation Approach. IEEE Trans. Human-
Mehrizi, R., Xu, X., Zhang, S., Pavlovic, V., Metaxas, D., Li, K., 2017. Using a marker-less
method for estimating L5/S1 moments during symmetrical lifting. Applied Ergonomics
Mehta, D., Rhodin, H., Casas, D., Fua, P., Sotnychenko, O., Xu, W., Theobalt, C., 2017.
Monocular 3d human pose estimation in the wild using improved cnn supervision.
Presented at the 2017 international conference on 3D vision (3DV), IEEE, pp. 506–516.
Mehta, D., Sotnychenko, O., Mueller, F., Xu, W., Sridhar, S., Pons-Moll, G., Theobalt, C., 2018.
Single-shot multi-person 3d pose estimation from monocular rgb. Presented at the 2018
Mohd Nur, N., Mohamed Salleh, M.A.S., Minhat, M., Mahmud Zuhudi, N.Z., 2018. Load
Lifting and the Risk of Work-Related Musculoskeletal Disorders among Cabin Crews.
899X/370/1/012026
Mündermann, L., Corazza, S., Andriacchi, T.P., 2006. The evolution of methods for the capture
https://doi.org/10.1186/1743-0003-3-6
Nam Bach, T., Junger, D., Curio, C., Burgert, O., 2022. Towards Human Action Recognition
during Surgeries using De-identified Video Data: De-identification Prototype for Visual
112. https://doi.org/10.1515/cdbme-2022-0028
Nérot, A., Choisne, J., Amabile, C., Travert, C., Pillet, H., Wang, X., Skalli, W., 2015. A 3D
reconstruction method of the body envelope from biplanar X-rays: Evaluation of its
https://doi.org/10.1016/j.jbiomech.2015.10.044
Ning, X., Guo, G., 2013. Assessing Spinal Loading Using the Kinect Depth Sensor: A
https://doi.org/10.1109/JSEN.2012.2230252
Nordander, C., Hansson, G.-Å., Ohlsson, K., Arvidsson, I., Balogh, I., Strömberg, U., Rittner,
R., Skerfving, S., 2016. Exposure–response relationships for work-related neck and
OCCHIPINTI, E., 1998. OCRA: a concise index for the assessment of exposure to repetitive
https://doi.org/10.1080/001401398186315
OpenCV, n.d. Image filtering. OpenCV documentation. https://docs.opencv.org/4.x/d4/d86/group__imgproc__filter.html#gac05a120c1ae92a6
Pavlakos, G., Choutas, V., Ghorbani, N., Bolkart, T., Osman, A.A., Tzionas, D., Black, M.J.,
2019. Expressive Body Capture: 3D Hands, Face, and Body From a Single Image, in:
https://doi.org/10.1109/CVPR.2019.01123
Pfister, A., West, A.M., Bronner, S., Noah, J.A., 2014. Comparative abilities of Microsoft
Kinect and Vicon 3D motion capture for gait analysis. Journal of Medical Engineering
Pillet, H., Bonnet, X., Lavaste, F., Skalli, W., 2010. Evaluation of force plate-less estimation of
the trajectory of the centre of pressure during gait. Comparison of two anthropometric
Plantard, P., Muller, A., Pontonnier, C., Dumont, G., Shum, H.P.H., Multon, F., 2017a. Inverse
https://doi.org/10.1016/j.ergon.2017.05.010
Plantard, P., Shum, H.P.H., Le Pierres, A.-S., Multon, F., 2017b. Validation of an ergonomic
assessment method using Kinect data in real workplace conditions. Applied Ergonomics
Plantard, P., Shum, H.P.H., Multon, F., 2017c. Usability of corrected Kinect measurement for
https://doi.org/10.1504/IJHFMS.2017.087018
Prakash, C., Gupta, K., Mittal, A., Kumar, R., Laxmi, V., 2015. Passive Marker Based Optical
System for Gait Kinematics for Lower Extremity. Procedia Computer Science 45, 176–
185. https://doi.org/10.1016/j.procs.2015.03.116
Prakash, K.C., Neupane, S., Leino-Arjas, P., von Bonsdorff, M.B., Rantanen, T., von Bonsdorff,
M.E., Seitsamo, J., Ilmarinen, J., Nygård, C.-H., 2017. Work-Related Biomechanical
Exposure and Job Strain as Separate and Joint Predictors of Musculoskeletal Diseases:
1267. https://doi.org/10.1093/aje/kwx189
Qiu, H., Wang, C., Wang, J., Wang, N., Zeng, W., 2019. Cross View Fusion for 3D Human Pose
Rachmawati, S., Suryadi, I., Pitanola, R.D., 2022. Low back pain: Based on Age, Working
Period and Work Posture. KEMAS: Jurnal Kesehatan Masyarakat; Vol 17, No 2 (2021).
https://doi.org/10.15294/kemas.v17i2.26313
Reddy, N.D., Guigues, L., Pishchulin, L., Eledath, J., Narasimhan, S.G., 2021. TesseTrack: End-
Ren, Z., Lee, Y.J., Ryoo, M.S., 2018. Learning to Anonymize Faces for Privacy Preserving
Action Detection, in: Ferrari, V., Hebert, M., Sminchisescu, C., Weiss, Y. (Eds.),
01246-5_38
Rezapur-Shahkolai, F., Gheysvandi, E., Tapak, L., Dianat, I., Karimi-Shahanjarini, A.,
Heidarimoghadam, R., 2020. Risk factors for low back pain among elementary school
students in western Iran using penalized logistic regression. Epidemiology and Health
42.
Rhén, I.-M., Forsman, M., 2020. Inter- and intra-rater reliability of the OCRA checklist method
https://doi.org/10.1016/j.apergo.2019.103025
Ribaric, S., Ariyaeeinia, A., Pavesic, N., 2016. De-identification for privacy protection in
https://doi.org/10.1016/j.image.2016.05.020
Robert, T., Leborgne, P., Abid, M., Bonnet, V., Venture, G., Dumas, R., 2017. Whole body
segment inertia parameters estimation from movement and ground reaction forces: a
S175–S176. https://doi.org/10.1080/10255842.2017.1382919
Rodrigues, T.B., Salgado, D.P., Catháin, C.Ó., O’Connor, N., Murray, N., 2020. Human gait
Roetenberg, D., Luinge, H., Slycke, P., 2013. Xsens MVN: Full 6DOF Human Motion Tracking
Roodbandi, A.J., Choobineh, A., Feyzi, V., 2015. The investigation of intra-rater and inter-rater
Sanchez-Lite, A., Garcia, M., Domingo, R., Angel Sebastian, M., 2013. Novel Ergonomic
https://doi.org/10.1371/journal.pone.0072703
Sandoz, B., Laporte, S., Skalli, W., Mitton, D., 2010. Subject-specific body segment parameters’
https://doi.org/10.1080/10255841003717608
Sazonova, N., Schuckers, S., Johnson, P., Lopez-Meyer, P., Sazonov, E., Hornak, L., 2011.
Impact of out-of-focus blur on iris recognition, in: Southern, S.O., Montgomery, K.N.,
Taylor, C.W., Weigl, B.H., Vijaya Kumar, B.V.K., Prabhakar, S., Ross, A.A. (Eds.), .
Presented at the SPIE Defense, Security, and Sensing, Orlando, Florida, United States,
p. 80291S. https://doi.org/10.1117/12.887052
Schall, M.C., Fethke, N.B., Chen, H., 2016. Working postures and physical activity among
https://doi.org/10.1016/j.apergo.2016.01.008
Schall, M.C., Sesek, R.F., Cavuoto, L.A., 2018. Barriers to the Adoption of Wearable Sensors
Schaub, K., Caragnano, G., Britzke, B., Bruder, R., 2013. The European Assembly Worksheet.
https://doi.org/10.1080/1463922X.2012.678283
Šenk, M., Chèze, L., 2006. Rotation sequence as an important factor in shoulder kinematics.
Siaw, T.U., Han, Y.C., Wong, K.I., 2023. A Low-Cost Marker-Based Optical Motion Capture
https://doi.org/10.1109/LSENS.2023.3239360
Steven Moore, J., Garg, A., 1995. The Strain Index: A Proposed Method to Analyze Jobs For
Takala, E.-P., Pehkonen, I., Forsman, M., Hansson, G.-Å., Mathiassen, S.E., Neumann, W.P.,
Sjøgaard, G., Veiersted, K.B., Westgaard, R.H., Winkel, J., 2010. Systematic evaluation
Tomei, M., Baraldi, L., Bronzin, S., Cucchiara, R., 2021. Estimating (and fixing) the Effect of
Vision and Pattern Recognition Workshops (CVPRW). Presented at the 2021 IEEE/CVF
https://doi.org/10.1109/CVPRW53098.2021.00364
Vafadar, S., Skalli, W., Bonnet-Lebrun, A., Assi, A., Gajny, L., 2022. Assessment of a novel
deep learning-based marker-less motion capture system for gait study. Gait & Posture
Vafadar, S., Skalli, W., Bonnet-Lebrun, A., Khalifé, M., Renaudin, M., Hamza, A., Gajny, L.,
2021. A novel dataset and deep learning-based approach for marker-less motion capture
Venture, G., Bonnet, V., Kulic, D., 2019. Creating Personalized Dynamic Models, in: Venture,
https://doi.org/10.1007/978-3-319-93870-7_5
Wang, D., Dai, F., Ning, X., 2015. Risk Assessment of Work-Related Musculoskeletal Disorders
7862.0000979
WATERS, T.R., PUTZ-ANDERSON, V., GARG, A., FINE, L.J., 1993. Revised NIOSH
equation for the design and evaluation of manual lifting tasks. Ergonomics 36, 749–776.
https://doi.org/10.1080/00140139308967940
https://doi.org/10.37268/mjphm/vol.20/no.Special1/art.707
Wijsman, P.J.M., Molenaar, L., van‘t Hullenaar, C.D.P., van Vugt, B.S.T., Bleeker, W.A.,
https://doi.org/10.1007/s00464-019-06678-1
Winnemuller, L.L., Spielholz, P.O., Daniell, W.E., Kaufman, J.D., 2004. Comparison of
Wu, Q., Xu, G., Wei, F., Chen, L., Zhang, S., 2021. RGB-D Videos-Based Early Prediction of
Infant Cerebral Palsy via General Movements Complexity. IEEE Access 9, 42314–
42324. https://doi.org/10.1109/ACCESS.2021.3066148
Yahya, M., Shah, J.A., Kadir, K.A., Yusof, Z.M., Khan, S., Warsi, A., 2019. Motion capture
sensing techniques used in human upper limb motion: a review. SR 39, 504–511.
https://doi.org/10.1108/SR-10-2018-0270
Yang, K., Yau, J.H., Fei-Fei, L., Deng, J., Russakovsky, O., 2022. A Study of Face Obfuscation
Yu, Y., Li, H., Yang, X., Umer, W., 2018. Estimating Construction Workers’ Physical Workload
by Fusing Computer Vision and Smart Insole Technologies. Presented at the 34th
https://doi.org/10.22260/ISARC2018/0168
Zhang, Z., 2012. Microsoft Kinect Sensor and Its Effect. IEEE MultiMedia 19, 4–10.
https://doi.org/10.1109/MMUL.2012.24
Zheng, C., Wu, W., Yang, T., Zhu, S., Chen, C., Liu, R., Shen, J., Kehtarnavaz, N., Shah, M.,
Zheng, C., Zhu, S., Mendieta, M., Yang, T., Chen, C., Ding, Z., 2021b. 3D Human Pose
Zhu, B., Fang, H., Sui, Y., Li, L., 2020. Deepfakes for Medical Video De-Identification: Privacy
Conference on AI, Ethics, and Society. Presented at the AIES ’20: AAAI/ACM
Conference on AI, Ethics, and Society, ACM, New York NY USA, pp. 414–420.
https://doi.org/10.1145/3375627.3375849
Zhu, W., Ma, X., Liu, Z., Liu, L., Wu, W., Wang, Y., 2023. Learning Human Motion
Appendix
Figure A-1 Placement of the markers for human movement data acquisition in this
study, where the red dots represent anatomical markers, the green dots represent
technical markers, and the blue dots represent static markers
International Conferences
[1]. J. Jiang, W. Skalli, A. Siadat, and L. Gajny, Kinematic parameters estimation during gait
based on a multi-view markerless motion capture system, European Society for Movement
Analysis in Adults and Children (ESMAC 2022), Dublin, Ireland, 2022.
[2]. J. Jiang, W. Skalli, A. Siadat, and L. Gajny, Towards biomechanical analysis in workplace
ergonomics using marker-less motion capture, 18th International Symposium on Computer
Methods in Biomechanics and Biomedical Engineering (CMBBE 2023), Paris, France,
2023.
[3]. J. Jiang, W. Skalli, A. Siadat, and L. Gajny, Estimation of intersegmental load at l5-s1
during lifting/lowering task with markerless motion capture, 28th Congress of the
European Society of Biomechanics (ESBiomech23), Maastricht, The Netherlands, 2023.
Conference Proceedings
[1]. J. Jiang, W. Skalli, A. Siadat, and L. Gajny, “Kinematic parameters estimation during gait
based on a multi-view markerless motion capture system,” in Gait & Posture, vol. 97,
Elsevier, 2022, S17–S18 [Q1, Impact Factor: 2.746].
Journal Articles
[1]. J. Jiang, W. Skalli, A. Siadat, and L. Gajny, “Effect of face blurring on human pose
estimation: Ensuring subject privacy for medical and occupational health applications,”
Sensors, vol. 22, no. 23, p. 9376, 2022 [Q1, Impact Factor: 3.847].
[2]. J. Jiang, J. Wu, Q. Chen, G. Chatzigeorgiou, and F. Meraghni, “Physically informed deep
homogenization neural network for unidirectional multiphase/multi-inclusion
thermoconductive composites,” Computer Methods in Applied Mechanics and
Engineering, vol. 409, p. 115972, 2023 [Q1, Impact Factor: 6.588].
[3]. J. Jiang, J. Zhao, S. Pang, F. Meraghni, A. Siadat, and Q. Chen, “Physics-informed deep
neural network enabled discovery of size-dependent deformation mechanisms in
nanostructures,” International Journal of Solids and Structures, vol. 236, p. 111320, 2022
[Q1, Impact Factor: 3.667].
Résumé
Dans un environnement industriel, les opérateurs sont souvent amenés à répéter des gestes spécifiques
sur le poste de travail, ce qui peut provoquer des troubles musculo-squelettiques (TMS). Cependant,
en raison de la complexité de l'environnement industriel, il n'existe pas actuellement de système
automatique capable d'effectuer une analyse biomécanique quantitative des opérateurs sans interférer
avec leur travail quotidien au poste de travail. L'objectif de ce manuscrit est d'explorer la possibilité
d'effectuer des analyses biomécaniques et ergonomiques via la vision par ordinateur. Un système a été
développé en utilisant 4 caméras numériques pour estimer la position humaine en 3D des opérateurs.
Grâce à l'approche de vision par ordinateur multi-vues, les mouvements des opérateurs pendant des
tâches de levage et d'abaissement peuvent être capturés dans un environnement d'usine simulé
complexe. La charge au niveau de l'articulation L5-S1 ainsi que l'évaluation ergonomique sont
calculées par la suite. Les résultats de l'étude montrent que les systèmes basés sur la vision stéréo ont
un fort potentiel pour l'analyse biomécanique et ergonomique quantitative automatisée dans les
environnements industriels.
Abstract
In an industrial environment, workers are often required to repeat specific gestures in the workstation,
which can cause musculoskeletal disorders (MSDs). However, due to the complexity of the factory
environment, an automatic system that can perform quantitative biomechanical analyses of workers
without interfering with their daily work at the workstation is not currently available. The objective of
this manuscript is to explore the possibility of conducting biomechanical and ergonomic analysis using
computer vision. A framework was developed based on 4 digital cameras to estimate the 3D human
pose of workers. With the multi-view computer vision approach, the worker’s motion during
lifting/lowering tasks can be accurately captured in a complex simulated factory environment. The
load at L5-S1 joint as well as the ergonomics evaluation were calculated subsequently. The results of
the study show that stereo-vision-based systems have the potential for automated quantitative
biomechanical and ergonomic analysis in industrial environments.
Key Words: Lifting/lowering task; L5/S1 load estimation; Stereovision system;
Biomechanics; Ergonomics; Inverse dynamics; Deep learning