fitting, while Zhou et al. [22] use CNNs for 2D joint detection and offline Expectation-Maximization over an entire sequence for 3D pose. Due to the monocular input, these methods are subject to depth ambiguity.

Trumble et al. [15] use convolutional neural networks on multi-view video data to perform real-time motion capture. However, this requires extensive training from multi-view video data, and the axial rotation of the limbs cannot be recovered since the input is based on visual hulls. Furthermore, controlled capture conditions are required for background segmentation. In contrast, our method requires minimal, simple training of the pose prior, while using a pre-trained CPM detector for 2D detections. By incorporating IMU data, our method is able to recover axial rotation of the limbs while handling dynamic backgrounds and occlusions. In subsequent work, Trumble et al. [16] combined video and IMU input in a deep learning framework, including using an LSTM (long short-term memory, [9]) for temporal prediction to reduce noise, but still require at least four cameras and relatively controlled capture conditions for visual hull estimation.

Other recent approaches to real-time body tracking use other types of capture hardware, for example Kinect (RGBD) cameras [20, 10], Kinect plus IMUs [8], or HTC Vive infra-red VR controllers strapped to the limbs [1]. Our work performs real-time, online, full-body markerless tracking in unconstrained environments using multiple-view video with as few as two cameras and 6 IMUs as input, recovering the full DoFs including axial rotation and drift-free global position.

3. Method

3.1. Notation and skeleton parametrization

The kinematic skeleton consists of a pre-defined hierarchy of n_b rigid bones, b, attached at joints. The root bone b = 1 (i.e. the hips) has a global position, t_1, and orientation, R_1. Each child bone, b ∈ [2, n_b], is attached to its parent with a fixed translational offset, t_b, and a pose-varying rotation, R_b, w.r.t. the parent bone coordinates. In this work, n_b = 21 bones are used. The total degrees of freedom (DoF) are d = 3 + 3 × 21 = 66, consisting of the root translation and 3 rotational degrees of freedom per joint. We encode the pose of the skeleton as a single 66-dimensional vector θ containing the 3D global translation of the root, followed by the stacked local joint rotations of each bone (including the root), represented as 3D angle-axis vectors (i.e. the axis of rotation multiplied by the angle of rotation in radians). This parameter vector is the variable which is optimized, with the root translation t_1 and joint rotations R_b being extracted and used in calculations as applicable.

For each bone, b, the global rigid body transform T^g_b is computed by concatenating bone offset and joint rotation transforms along the kinematic chain as follows:

T^g_b(θ) = ∏_{b'∈P(b)} [ R_{b'}  t_{b'} ; 0  1 ]    (1)

where P(b) is the ordered set of parent joints of bone b.

We define a set of n_i IMU track targets, i, each attached to a bone b_i. The rotational and translational offsets of the IMU w.r.t. the bone are denoted R^b_i and t^b_i, respectively. The rotational transform between each IMU reference frame and the global coordinates is denoted R^g_i. IMU orientation measurements (w.r.t. the IMU inertial reference frame) and acceleration measurements (w.r.t. the IMU device frame) are denoted R_i and a_i, respectively. Likewise, we define a set of n_p positional track targets, p, each attached to a bone b_p with translational offset t^b_p w.r.t. the bone. Note that here we use the term 'track target' to refer to a specific point on the body for which motion is estimated, not a physical optical marker. In our approach, 2D joint positions are estimated from natural images and no visual markers are required.

Finally, we define a set of n_c cameras, c, with calibrated 3 × 4 projection matrices P_c, and let t^c_p denote the 2D position measurement for track target p in the local coordinates of camera c.

3.2. Pose optimization

The following pose optimization energy is used:

E(θ) = E_R(θ) + E_P(θ) + E_A(θ) + E_PP(θ) + E_PD(θ)    (2)

where the first three terms are data terms: E_R(θ), E_P(θ) and E_A(θ) contain orientation, position and acceleration constraints, respectively. The last two terms are priors: E_PP(θ) and E_PD(θ) are pose projection and pose deviation priors, respectively. The data and prior constraints are visualized in Figure 1. Each term is described in the following subsections, where solved values carry a 'ˆ' circumflex and their dependence on θ is omitted for clarity. Unless otherwise specified, values are for the current frame, t.

3.2.1 Orientation term

For each IMU, i, an orientation constraint is added which seeks to minimize the relative orientation between the measured and solved global bone orientation (Figure 1).

The measured global bone orientation, R^g_{b_i}, is obtained from the IMU measurement R_i using the IMU-bone offset R^b_i and the IMU reference frame-to-global offset as follows:

R^g_{b_i} = R^g_i · R_i · (R^b_i)^{−1}    (3)
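To make the skeleton parametrization and Equation 1 concrete, the following minimal sketch (Python/numpy with scipy for the angle-axis conversion; it assumes bones are stored with parent indices such that parents precede children, and the function and variable names are ours, not the paper's implementation) converts a 66-dimensional pose vector θ into global bone transforms by walking the kinematic chain:

```python
import numpy as np
from scipy.spatial.transform import Rotation

def bone_transform(rotation, translation):
    """Build a 4x4 rigid transform from a 3x3 rotation and a 3-vector offset."""
    T = np.eye(4)
    T[:3, :3] = rotation
    T[:3, 3] = translation
    return T

def global_bone_transforms(theta, parents, offsets):
    """Equation 1 sketch: concatenate bone offset and joint rotation transforms
    along the kinematic chain.  theta = [root translation (3), angle-axis joint
    rotations (3 per bone)]; parents[b] is the parent index (-1 for the root);
    offsets[b] is the fixed translational offset t_b (unused for the root)."""
    n_bones = len(parents)
    root_t = theta[:3]
    rotations = Rotation.from_rotvec(theta[3:].reshape(n_bones, 3)).as_matrix()

    T_global = [None] * n_bones
    for b in range(n_bones):
        t_b = root_t if parents[b] < 0 else offsets[b]
        T_local = bone_transform(rotations[b], t_b)
        T_global[b] = T_local if parents[b] < 0 else T_global[parents[b]] @ T_local
    return T_global
```

The upper-left 3×3 blocks of these transforms give the solved global bone rotations used by the orientation term.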
Figure 1: Visualization of data and prior terms in the cost function (Equation 2). (a) Orientation term; (b) position term; (c) acceleration term; (d) PCA prior (figure adapted from [10]).
The solved global bone orientation, R̂^g_{b_i}, is obtained using the kinematic chain, ignoring translations:

R̂^g_{b_i} = ∏_{b'∈P(b_i)} R_{b'}    (4)

and the orientation cost is

E_R(θ) = ρ_R( λ_R Σ_{i∈[1,n_i]} || ψ( (R̂^g_{b_i})^{−1} · R^g_{b_i} ) ||₂² )    (5)

where ψ(·) extracts the vector part of the quaternion representation of a rotation matrix, λ_R is the orientation constraint weighting factor and ρ_R(·) is a loss function. Discussion of the weightings and loss functions is deferred to Section 3.2.5.
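The orientation constraint of Equations 3-5 can be illustrated with the following sketch (Python with scipy; the helper names and the convention of storing rotations as 3×3 matrices are assumptions for illustration, not the paper's code):

```python
import numpy as np
from scipy.spatial.transform import Rotation

def measured_bone_orientation(R_imu, R_imu_to_global, R_imu_to_bone):
    """Equation 3: map a raw IMU orientation reading into a global bone orientation."""
    return R_imu_to_global @ R_imu @ np.linalg.inv(R_imu_to_bone)

def orientation_residual(R_bone_solved, R_bone_measured):
    """Equations 4-5 (single IMU): vector part of the quaternion of the relative
    rotation between the solved and measured global bone orientations."""
    R_rel = R_bone_solved.T @ R_bone_measured      # (R_hat)^-1 * R_measured
    q = Rotation.from_matrix(R_rel).as_quat()      # (x, y, z, w) convention
    return q[:3]                                   # psi(.): quaternion vector part
```

The full cost E_R(θ) sums the squared norms of these per-IMU residuals, scales them by λ_R and applies the robust loss ρ_R(·).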
3.2.2 Position term

For each positional track target, p, the solved global position t̂^g_p is obtained by applying the global transform of its bone to the target offset:

t̂^g_p = τ_t( T^g_{b_p}(θ) · τ_T(t^b_p) )    (6)

where the operators τ_T(·) and τ_t(·) are shorthand for creating a transform matrix from a translation vector and extracting the translation vector from a transform matrix, respectively. This global target position is projected into each camera to obtain 2D solved targets t̂^c_p in camera coordinates:

t̂^c_p = dh(P_c · t̂^g_p)    (7)

where the operator dh(·) performs de-homogenization of a homogeneous vector.

The position cost is defined as

E_P(θ) = ρ_P( λ_P Σ_{p∈[1,n_p]} Σ_{c∈[1,n_c]} c^c_p || t̂^c_p − t^c_p ||₂² )    (8)

where c^c_p ∈ [0, 1] is a confidence weighting for constraint p obtained from the image-based position measurement mechanism (Section 3.3), λ_P is a position constraint weighting factor and ρ_P(·) is a loss function (see Section 3.2.5). The confidence weighting and loss function enable robust output pose estimates in spite of persistently high levels of noise and frequent outliers in the input position detections.

In these experiments, the track targets are located on a subset of the joints and thus have zero offset w.r.t. the bone (t^b_p = 0). In general, positional targets could be offset from the joint locations (this would be the case if the positions were to come from optical markers attached to the surface of the body, rather than markerless joint detections).
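A sketch of the per-target, per-camera residual behind Equations 7 and 8 (numpy; the argument names are illustrative). The square root of the confidence is applied so that the squared norm of the residual carries the weight c^c_p:

```python
import numpy as np

def dehomogenize(v):
    """dh(.): convert a homogeneous vector to Euclidean coordinates."""
    return v[:-1] / v[-1]

def position_residual(P_cam, t_target_global, t_detected_2d, confidence):
    """Equations 7-8 (single target/camera pair): confidence-weighted 2D
    reprojection residual between the projected solved target position and
    the detected joint position."""
    t_hom = np.append(t_target_global, 1.0)    # homogeneous 3D point
    t_proj = dehomogenize(P_cam @ t_hom)       # Equation 7 (P_cam is 3x4)
    return np.sqrt(confidence) * (t_proj - t_detected_2d)
```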
3.2.3 Acceleration term

The solved global acceleration of each IMU at the previous frame, â^g_i(t−1), is obtained by central differences of the solved IMU positions over frames t−2, t−1 and t, where the solved IMU positions t̂^g_i are computed analogously with Equation 6 (replacing subscripts p with i) and Δt is the frame period used in the finite differences (in our case, 16.7 ms).

The measured local accelerations from the previous frame of IMU data (the previous frame is used because the central differences require the solved position at the current frame, t) are converted to global coordinates as follows:

a^g_i(t−1) = R^g_i · R_i(t−1) · a_i(t−1) − a^g    (10)

where a^g = [0, 9.8707, 0]^T is the acceleration due to gravity, which needs to be subtracted. The acceleration cost E_A(θ) then penalizes the difference between the solved and measured global accelerations, â^g_i(t−1) and a^g_i(t−1), in the same weighted, robustified form as the orientation and position costs.
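Under the same assumptions (numpy; names illustrative, not from the paper's code), the acceleration constraint for one IMU could be sketched as follows, pairing the gravity-compensated measurement of Equation 10 with a central-difference estimate from the solved positions:

```python
import numpy as np

GRAVITY = np.array([0.0, 9.8707, 0.0])   # a^g as given in the text

def measured_global_acceleration(R_imu_to_global, R_imu_prev, a_imu_prev):
    """Equation 10: rotate the local IMU acceleration (frame t-1) into global
    coordinates and subtract gravity."""
    return R_imu_to_global @ R_imu_prev @ a_imu_prev - GRAVITY

def acceleration_residual(p_t2, p_t1, p_t, a_meas_global, dt=1.0 / 60.0):
    """Central-difference solved acceleration at frame t-1 compared with the
    measured global acceleration (dt is the frame period, here ~16.7 ms)."""
    a_solved = (p_t - 2.0 * p_t1 + p_t2) / dt**2
    return a_solved - a_meas_global
```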
The CPM detector is able to detect multiple people within a single image while maintaining its computation time [7]. We propose to increase the detection throughput by packing ROIs from multiple cameras (and, optionally, frames) into a single image. The detection is performed on the packed image and the resulting detections are assigned to the originating camera and frame (Figure 2). The ROIs for each camera are updated at every frame to an expanded bounding box of the current detections. In the event of missed detections, the corresponding ROI is reverted to the full image. In practice, the subject cannot be too small in the frame or the detector will fail. Packing 8 ROIs was found to be satisfactory (e.g. 1 frame from 8 cameras or 2 frames from 4 cameras).

Figure 2: Visualization of the ROI packing process for efficient multi-camera/frame CPM detection. Panels: last detections and source ROIs (frame A); frame B (unseen); frame C (unseen); packed ROI image for CPM detection (from frames B and C).
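The packing step can be sketched as follows (Python/numpy; the grid layout, cell size and function names are illustrative assumptions, and each ROI crop is assumed to have already been resized to fit its cell):

```python
import numpy as np

def pack_rois(crops, grid=(2, 4), cell=368):
    """Pack up to grid[0]*grid[1] ROI crops (each at most cell x cell pixels)
    into a single image so that one CPM forward pass serves several
    cameras/frames.  Returns the packed image and each crop's cell origin."""
    rows, cols = grid
    packed = np.zeros((rows * cell, cols * cell, 3), dtype=np.uint8)
    origins = []
    for k, crop in enumerate(crops):
        r, c = divmod(k, cols)
        h, w = crop.shape[:2]
        packed[r * cell:r * cell + h, c * cell:c * cell + w] = crop
        origins.append((c * cell, r * cell))      # (x, y) of the cell in the packed image
    return packed, origins

def unpack_detection(xy_packed, cell_origin, roi_origin):
    """Map a joint detection from packed-image coordinates back to the
    coordinates of the originating camera image."""
    return (xy_packed[0] - cell_origin[0] + roi_origin[0],
            xy_packed[1] - cell_origin[1] + roi_origin[1])
```

Detections found in the packed image are mapped back through the recorded cell origin and the ROI origin in the source camera image.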
3.3.2 Temporal sub-sampling

To increase the frame-rate of our solver in spite of relatively long CPM detection times, we propose to perform the CPM detections on a subset of the input frames, resulting in temporally sparse position measurements. For the intervals of frames without positional constraints, global motion is still produced because of the acceleration term, which essentially performs 'dead-reckoning'. In Section 4.1.2, different sub-sampling strategies are evaluated.

4. Results and evaluation

The approach was tested using an existing indoor dataset, Total Capture [16], containing ground-truth data from a commercial mo-cap system, as well as on a new outdoor dataset, Outdoor 1. The solver can easily be configured to take an arbitrary subset of the available IMUs and positional constraints to evaluate the effect of camera and IMU sparsity. Note that in this work, all positional constraint information is obtained from the multiple-view video based on per-view CPM as discussed in Section 3.3, and no optical markers or visible targets are used.

First, quantitative results are presented showing the relative performance with various configurations of IMUs and cameras and with different sub-sampling configurations of the position detections, as well as the contribution of each term in the cost function. Next, further quantitative results are presented for multiple sequences of the Total Capture dataset. Finally, qualitative results are presented for the Outdoor 1 dataset, which does not contain ground truth data. Videos of the results are presented in the supplementary material.

Throughout the experiments, the same weightings were used for the cost function terms, namely λ_R = 1, λ_P = 1 × 10^−3, λ_A = 7 × 10^−4, λ_PP = 0.9, λ_PD = 0.08. These values were arrived at by a gradient-based parameter optimization over 200 frames of one motion sequence.
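To make the structure of Equation 2 and these weightings concrete, the following sketch assembles square-root-weighted residual blocks and hands them to an off-the-shelf robust nonlinear least-squares routine (scipy is used purely for illustration; the residual functions are placeholders and the single soft_l1 loss stands in for the per-term losses ρ(·), so this is not the paper's implementation):

```python
import numpy as np
from scipy.optimize import least_squares

# Term weights as reported in Section 4 (lambda_R, lambda_P, lambda_A, lambda_PP, lambda_PD).
WEIGHTS = {"R": 1.0, "P": 1e-3, "A": 7e-4, "PP": 0.9, "PD": 0.08}

def stacked_residuals(theta, frame_data, terms):
    """Concatenate sqrt-weighted residuals of all active cost terms so that
    the squared norm reproduces the weighted sum of Equation 2."""
    blocks = []
    for name, residual_fn in terms.items():       # e.g. "R", "P", "A", "PP", "PD"
        r = residual_fn(theta, frame_data)        # 1D residual vector for this term
        blocks.append(np.sqrt(WEIGHTS[name]) * r)
    return np.concatenate(blocks)

def solve_frame(theta_init, frame_data, terms):
    """Solve one frame of pose optimization; a robust loss down-weights
    outlier detections (a stand-in for the per-term rho losses)."""
    result = least_squares(stacked_residuals, theta_init,
                           args=(frame_data, terms), loss="soft_l1")
    return result.x
```

In an online setting, each frame's solve would be warm-started from the previous frame's solution, keeping the number of iterations low.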
4.1. Indoor capture results

The Total Capture dataset includes five subjects (S) performing various motions including range of motion (ROM), walking (W), acting (A) and 'freestyle' (FS). The subjects were recorded simultaneously using 13 Xsens MTw IMUs, 8 HD video cameras and a commercial infra-red motion capture system consisting of 16 cameras and a dense set of retro-reflecting markers worn by the subject. The marker-based input is not used in the runtime solver and is only used in this work as a 'ground truth' reference for evaluation.

4.1.1 Sparse IMUs and cameras

It is desirable to have a minimal capture hardware setup in order to reduce cost as well as actor setup time. We simulate sparser hardware setups using either the full set of 13 IMUs or a reduced set of 6 IMUs; between 2 and 8 of the available cameras are used in these tests.

Figure 3 compares the error using the sparse set of 6 IMUs with the full set of 13, using between 2 and 8 cameras. With the sparse set of IMUs, position and orientation error both decrease as more cameras are added. With the full set of IMUs, the position error is lowest for intermediate numbers of cameras, while the orientation error hardly varies with the number of cameras. An intermediate number of cameras, 4, is used to evaluate on additional sequences in Section 4.1.4.

Figure 3: Position and orientation error with different sensor configurations, 13 or 6 IMUs and 2-8 cameras. Sequence: S2 - FS1, SS 1/1.

4.1.2 Temporal sub-sampling of position

We use the following notation for the temporal sub-sampling (SS) of the position detection: N_o/N_p, where position detection is performed on the first N_o frames of every N_p frames in the sequence. For example, SS 1/2 is every other frame and SS 2/4 is two out of every four frames. While SS 1/10 and SS 2/20 require the same amount of computation, SS 1/10 provides a shorter interval between detections and, in turn, a shorter time with no detections than SS 2/20. This has an effect on the quality of the solved motion, as shown in Figure 4, where a range of sub-sampling rates was used with N_o ∈ {1, 2, 3}.

Figure 4: Position error under a range of detection sub-sampling rates using N_o ∈ {1, 2, 3} successive frames. Note that N_o = 2 yields the lowest position error across the sub-sampling range. Sequence: S2 - FS1, 13 IMUs, 8 cameras.

These results suggest that it is optimal to use N_o = 2. Having detections for two successive frames results in a more reliable motion trajectory than having a single frame more frequently. With three successive frames, N_o = 3, the interval with no detections is too long and the error increases. With a quarter of the frames detected (SS 2/8), the error is still reasonably low while the processing time is reduced, increasing the output frame-rate. This decimation rate is used to evaluate on additional sequences in Section 4.1.4.
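As a concrete illustration of the SS N_o/N_p notation (a hypothetical helper, not part of the paper), the frames on which CPM detection runs can be listed as follows:

```python
def detection_frames(n_o, n_p, num_frames):
    """Frames selected for CPM detection under sub-sampling SS n_o/n_p:
    the first n_o frames of every block of n_p frames."""
    return [f for f in range(num_frames) if f % n_p < n_o]

# Example: SS 2/8 over 24 frames -> [0, 1, 8, 9, 16, 17]
print(detection_frames(2, 8, 24))
```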
4.1.3 Contribution of cost terms

Table 1 shows the relative error in solved bone position and orientation with selected terms in the cost function disabled. The results are shown with 4 cameras and 2/8 detection sub-sampling.

                        13 IMUs           6 IMUs
Terms omitted           Pos.    Ori.      Pos.    Ori.
IMU (E_R, E_A)          1.97    4.82      1.27    2.38
Ori. (E_R)              2.63    6.27      1.54    2.89
Acc. (E_A)              1.11    0.99      1.01    0.97
Pos. (E_P)              188.58  1.00      194.82  1.05
Prior (E_PP, E_PD)      1.50    4.68      1.42    4.33
Prior Proj. (E_PP)      2.26    6.29      1.63    6.46
Prior Dev. (E_PD)       1.16    2.86      1.46    3.24

Table 1: Position and orientation error with various terms in the cost function disabled, relative to the error using the full cost function, Equation 2 (Sequence: S2 - FS1, 4 Cam., SS 2/8).

The orientation term from the IMUs has a strong effect on both position and orientation error, while the acceleration term has a limited effect, helping with the position in the 13 IMU case. The position term does not improve the orientation error, but omitting it causes the solved global position to drift, increasing the relative position error by roughly two orders of magnitude (Table 1).

4.1.4 Further results

In Table 2, further quantitative results are provided for several sequences from the Total Capture dataset, covering the 5 subjects and a range of motion types, from slow ROM motion to challenging sequences including fast motion and unusual poses such as lying on the floor (see Figure 6 and refer to the supplementary video). Figure 5 shows the robustness of our approach to typical misdetections from the CPM joint detector.

Figure 5: Solved (blue) and ground truth (yellow) skeletons overlaid on an input image, showing CPM detections in yellow and the corresponding locations on the solved skeleton in blue. Note the robustness to the outlier detection on the leg. Sequence: S5, FS1.

Figure 6: (a) S1, FS3; (b) S2, ROM3; (c) S3, FS3.

Four configurations were used: high quality, 'HQ' (8 cameras, SS 1/1) and high speed, 'HS' (4 cameras, SS 2/8), each with both the 13 and 6 IMU sets. The average position error using 13 IMUs is 6.2 cm using HQ mode, degrading slightly to 6.8 cm using HS mode, while the orientation error is maintained between HQ and HS (7.8 deg). Our approach outperforms Trumble et al. [16] across the test sequences both for HQ and HS modes. The errors for the 6 IMU case are larger, at 9.1 cm, 12.5 deg for HQ and 14.2 cm, 15 deg for HS.

                    S1     S2     S2     S3     S3     S4     S5     S5     Mean
Ours, 6 IMU, HS     18.3   10.9   10.6   16.2   19.7   14.8   14.3   15.1   15.0

Table 2: Mean error in position (cm) and orientation (deg) for sequences from the Total Capture dataset using high quality (HQ) and high speed (HS) settings, compared to the approach of Trumble et al. [16].

4.2. Outdoor capture results

The Outdoor 1 dataset was recorded outdoors in challenging, uncontrolled conditions with a moving background and varying illumination. A set of 6 cameras was placed in a 120° arc around the subject, and the subject wore 13 Xsens IMUs. No ground truth data is available for this dataset. Figure 7 shows a selection of solved frames overlaid on the input image, and full sequences are shown in the supplementary video.

Figure 7: Selection of solved frames from the Outdoor 1 dataset. (a) Camera layout; (b) camera views (freestyle); (c) multiple frames (prop interaction).
4.3. Computation time

Figure 8 shows the real-time online frame-rate achieved using the approach as a function of the sub-sampling rate (with the CPM detection running in parallel with the main solver thread). The computing hardware is a standard desktop PC with an Intel i7 3.6 GHz CPU and an NVIDIA GTX 1080 GPU. A frame rate of 30 fps can be achieved with SS 2/8, while a rate in excess of 60 fps can be achieved with more aggressive sub-sampling (SS 2/40). In practice, 30 fps is sufficient for most applications, and the increase in speed is not worth the increase in error. Although not tested here, it should also be possible to use a lower decimation rate by running two CPM detectors in parallel on two GPUs.

Figure 8: Output frame-rate of our solver (including detections) as a function of sub-sampling level. Sequence: S2 - FS1, 4 cameras, N_o = 2.

5. Conclusion and further work

We have presented an approach for real-time, online, full-body motion capture with minimal camera and IMU hardware requirements. It is capable of recovering the full 6-DoF pose, without drift in global position, and can operate both in constrained studio environments and in unconstrained setups such as outdoor scenes with varying illumination, moving backgrounds and occlusion. The solver can handle missing or outlier joint detections and even short periods of complete occlusion because of the inclusion of the IMU input, degrading gracefully as the hardware is reduced. Reducing the number of cameras has less of an effect on quality than reducing the number of IMUs. Future work includes optimizing the code and using multiple GPUs to increase CPM detection throughput. It would also be possible to extend the approach to handle multiple subjects.

Acknowledgements

This work was supported by the Innovate UK Total Capture project (grant 102685) and the EU H2020 Visual Media project (grant 687800). We wish to thank Anna Korzeniowska, Evren Imre, Joao Regateiro and Armin Mustafa for their help with data capture.
References

[1] IKinema Orion. https://ikinema.com/orion.
[2] OptiTrack Motive. http://www.optitrack.com.
[3] Perception Neuron. http://www.neuronmocap.com.
[4] Vicon Blade. http://www.vicon.com.
[5] S. Agarwal, K. Mierle, and others. Ceres solver. http://ceres-solver.org.
[6] S. Andrews, I. Huerta, T. Komura, L. Sigal, and K. Mitchell. Real-time physics-based motion capture with sparse sensors. In Proceedings of the 13th European Conference on Visual Media Production (CVMP 2016), 2016.
[7] Z. Cao, T. Simon, S.-E. Wei, and Y. Sheikh. Realtime multi-person 2D pose estimation using part affinity fields. In Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
[8] T. Helten, M. Muller, H.-P. Seidel, and C. Theobalt. Real-time body tracking with one depth camera and inertial sensors. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), pages 1105–1112, 2013.
[9] S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural Computation, 9:1735–1780, 1997.
[10] A. E. Ichim and F. Tombari. Semantic parametric body shape estimation from noisy depth sequences. Robotics and Autonomous Systems, 75:539–549, 2016.
[11] M. Loper, N. Mahmood, J. Romero, G. Pons-Moll, and M. J. Black. SMPL: A skinned multi-person linear model. ACM Transactions on Graphics (Proc. SIGGRAPH Asia), 34(6):248:1–248:16, Oct. 2015.
[12] D. Mehta, S. Sridhar, O. Sotnychenko, H. Rhodin, M. Shafiei, H.-P. Seidel, W. Xu, D. Casas, and C. Theobalt. VNect: Real-time 3D human pose estimation with a single RGB camera. ACM Transactions on Graphics, 36, 2017.
[13] D. Roetenberg, H. Luinge, and P. Slycke. Xsens MVN: Full 6DOF human motion tracking using miniature inertial sensors. Technical report, 2013.
[14] D. Tome, C. Russell, and L. Agapito. Lifting from the deep: Convolutional 3D pose estimation from a single image. In Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
[15] M. Trumble, A. Gilbert, A. Hilton, and J. Collomosse. Deep convolutional networks for marker-less human pose estimation from multiple views. In Proceedings of the 13th European Conference on Visual Media Production (CVMP 2016), 2016.
[16] M. Trumble, A. Gilbert, C. Malleson, A. Hilton, and J. Collomosse. Total Capture: 3D human pose estimation fusing video and inertial sensors. In British Machine Vision Conference (BMVC), 2017.
[17] T. von Marcard, G. Pons-Moll, and B. Rosenhahn. Human pose estimation from video and IMUs. IEEE Transactions on Pattern Analysis and Machine Intelligence, 38(8):1533–1547, Aug. 2016.
[18] T. von Marcard, B. Rosenhahn, M. Black, and G. Pons-Moll. Sparse Inertial Poser: Automatic 3D human pose estimation from sparse IMUs. In Eurographics 2017, volume 36, 2017.
[19] S.-E. Wei, V. Ramakrishna, T. Kanade, and Y. Sheikh. Convolutional pose machines. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 4724–4732, 2016.
[20] X. Wei, P. Zhang, and J. Chai. Accurate realtime full-body motion capture using a single depth camera. ACM Transactions on Graphics, 31(6), 2012.
[21] Z. Zhang. Flexible camera calibration by viewing a plane from unknown orientations. In ICCV, 1999.
[22] X. Zhou, M. Zhu, S. Leonardos, K. G. Derpanis, and K. Daniilidis. Sparseness meets deepness: 3D human pose estimation from monocular video. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 4966–4975, 2016.