fitting, while Zhou et al. [22] use CNNs for 2D joint detection and offline Expectation-Maximization over an entire sequence for 3D pose. Due to the monocular input, these methods are subject to depth ambiguity.

Trumble et al. [15] use convolutional neural networks on multi-view video data to perform real-time motion capture. However, this requires extensive training from multi-view video data, and the axial rotation of the limbs cannot be recovered since the input is based on visual hulls. Furthermore, controlled capture conditions are required for background segmentation. In contrast, our method requires minimal, simple training of the pose prior, while using a pre-trained CPM detector for 2D detections. By incorporating IMU data, our method is able to recover axial rotation of the limbs while handling dynamic backgrounds and occlusions. In subsequent work, Trumble et al. [16] combined video and IMU input in a deep learning framework, including using an LSTM (long short-term memory, [9]) for temporal prediction to reduce noise, but still require at least four cameras and relatively controlled capture conditions for visual hull estimation.

Other recent approaches to real-time body tracking use other types of capture hardware, for example Kinect (RGBD) cameras [20, 10], Kinect plus IMUs [8], or HTC Vive infra-red VR controllers strapped to the limbs [1]. Our work performs real-time, online, full-body markerless tracking in unconstrained environments using multiple-view video with as few as two cameras and 6 IMUs as input, recovering the full DoFs including axial rotation and drift-free global position.

3. Method

3.1. Notation and skeleton parametrization

The kinematic skeleton consists of a pre-defined hierarchy of n_b rigid bones, b, attached at joints. The root bone b = 1 (i.e. the hips) has a global position, t_1, and orientation, R_1. Each child bone, b ∈ [2, n_b], is attached to its parent with a fixed translational offset, t_b, and a pose-varying rotation, R_b, w.r.t. the parent bone coordinates. In this work, n_b = 21 bones are used. The total degrees of freedom (DoF) are d = 3 + 3 × 21 = 66, consisting of the root translation and 3 rotational degrees of freedom per joint. We encode the pose of the skeleton as a single 66-dimensional vector θ containing the 3D global translation of the root, followed by the stacked local joint rotations of each bone (including the root), represented as 3D angle-axis vectors (i.e. the axis of rotation multiplied by the angle of rotation in radians). This parameter vector is the variable which is optimized, with the root translation t_1 and joint rotations R_b being extracted and used in calculations as applicable.

For each bone, b, the global rigid body transform T^g_b is computed by concatenating bone offset and joint rotation transforms along the kinematic chain as follows:

T^g_b(θ) = ∏_{b'∈P(b)} [ R_{b'}  t_{b'} ; 0  1 ]    (1)

where P(b) is the ordered set of parent joints of bone b.

We define a set of n_i IMU track targets, i, each attached to a bone b_i. The rotational and translational offsets of the IMU w.r.t. the bone are denoted R^b_i and t^b_i, respectively. The rotational transform between each IMU reference frame and the global coordinates is denoted R^g_i. IMU orientation measurements (w.r.t. the IMU inertial reference frame) and acceleration measurements (w.r.t. the IMU device frame) are denoted R_i and a_i, respectively. Likewise, we define a set of n_p positional track targets, p, each attached to a bone b_p with translational offset t^b_p w.r.t. the bone. Note that here we use the term 'track target' to refer to a specific point on the body for which motion is estimated, not a physical optical marker. In our approach, 2D joint positions are estimated from natural images and no visual markers are required.

Finally, we define a set of n_c cameras, c, with calibrated 3 × 4 projection matrices P_c, and let t^c_p denote the 2D position measurement for track target p in the local coordinates of camera c.

3.2. Pose optimization

The following pose optimization energy is used:

E(θ) = E_R(θ) + E_P(θ) + E_A(θ) + E_PP(θ) + E_PD(θ)    (2)

where the first three terms are data terms: E_R(θ), E_P(θ) and E_A(θ) contain orientation, position and acceleration constraints, respectively. The last two terms are priors: E_PP(θ) and E_PD(θ) are pose projection and pose deviation priors, respectively. The data and prior constraints are visualized in Figure 1. Each term is described in the following subsections, where solved values carry a 'ˆ' circumflex and their dependence on θ is omitted for clarity. Unless otherwise specified, values are for the current frame, t.

3.2.1 Orientation term

For each IMU, i, an orientation constraint is added which seeks to minimize the relative orientation between the measured and solved global bone orientation (Figure 1).

The measured global bone orientation, R^g_{b_i}, is obtained from the IMU measurement R_i using the IMU-bone offset R^b_i and the IMU reference frame-to-global offset as follows:

R^g_{b_i} = R^g_i · R_i · (R^b_i)^{−1}    (3)
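To make the skeleton parametrization and Equation 1 concrete, the following minimal sketch (Python/numpy with scipy for the angle-axis conversion; it assumes bones are stored with parent indices such that parents precede children, and the function and variable names are ours, not the paper's implementation) converts a 66-dimensional pose vector θ into global bone transforms by walking the kinematic chain:

```python
import numpy as np
from scipy.spatial.transform import Rotation

def bone_transform(rotation, translation):
    """Build a 4x4 rigid transform from a 3x3 rotation and a 3-vector offset."""
    T = np.eye(4)
    T[:3, :3] = rotation
    T[:3, 3] = translation
    return T

def global_bone_transforms(theta, parents, offsets):
    """Equation 1 sketch: concatenate bone offset and joint rotation transforms
    along the kinematic chain.  theta = [root translation (3), angle-axis joint
    rotations (3 per bone)]; parents[b] is the parent index (-1 for the root);
    offsets[b] is the fixed translational offset t_b (unused for the root)."""
    n_bones = len(parents)
    root_t = theta[:3]
    rotations = Rotation.from_rotvec(theta[3:].reshape(n_bones, 3)).as_matrix()

    T_global = [None] * n_bones
    for b in range(n_bones):
        t_b = root_t if parents[b] < 0 else offsets[b]
        T_local = bone_transform(rotations[b], t_b)
        T_global[b] = T_local if parents[b] < 0 else T_global[parents[b]] @ T_local
    return T_global
```

The upper-left 3×3 blocks of these transforms give the solved global bone rotations used by the orientation term.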
Figure 1: Visualization of data and prior terms in the cost function (Equation 2). (a) Orientation term; (b) position term; (c) acceleration term; (d) PCA prior (figure adapted from [10]).
The solved global bone orientation, R̂^g_{b_i}, is obtained using the kinematic chain, ignoring translations:

R̂^g_{b_i} = ∏_{b'∈P(b_i)} R_{b'}    (4)

and the orientation cost is

E_R(θ) = ρ_R( λ_R Σ_{i∈[1,n_i]} || ψ( (R̂^g_{b_i})^{−1} · R^g_{b_i} ) ||₂² )    (5)

where ψ(·) extracts the vector part of the quaternion representation of a rotation matrix, λ_R is the orientation constraint weighting factor and ρ_R(·) is a loss function. Discussion of the weightings and loss functions is deferred to Section 3.2.5.
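The orientation constraint of Equations 3-5 can be illustrated with the following sketch (Python with scipy; the helper names and the convention of storing rotations as 3×3 matrices are assumptions for illustration, not the paper's code):

```python
import numpy as np
from scipy.spatial.transform import Rotation

def measured_bone_orientation(R_imu, R_imu_to_global, R_imu_to_bone):
    """Equation 3: map a raw IMU orientation reading into a global bone orientation."""
    return R_imu_to_global @ R_imu @ np.linalg.inv(R_imu_to_bone)

def orientation_residual(R_bone_solved, R_bone_measured):
    """Equations 4-5 (single IMU): vector part of the quaternion of the relative
    rotation between the solved and measured global bone orientations."""
    R_rel = R_bone_solved.T @ R_bone_measured      # (R_hat)^-1 * R_measured
    q = Rotation.from_matrix(R_rel).as_quat()      # (x, y, z, w) convention
    return q[:3]                                   # psi(.): quaternion vector part
```

The full cost E_R(θ) sums the squared norms of these per-IMU residuals, scales them by λ_R and applies the robust loss ρ_R(·).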
3.2.2 Position term

For each positional track target, p, the solved global position t̂^g_p is obtained by applying the global transform of its bone to the target offset:

t̂^g_p = τ_t( T^g_{b_p}(θ) · τ_T(t^b_p) )    (6)

where the operators τ_T(·) and τ_t(·) are shorthand for creating a transform matrix from a translation vector and extracting the translation vector from a transform matrix, respectively. This global target position is projected into each camera to obtain 2D solved targets t̂^c_p in camera coordinates:

t̂^c_p = dh(P_c · t̂^g_p)    (7)

where the operator dh(·) performs de-homogenization of a homogeneous vector.

The position cost is defined as

E_P(θ) = ρ_P( λ_P Σ_{p∈[1,n_p]} Σ_{c∈[1,n_c]} c^c_p || t̂^c_p − t^c_p ||₂² )    (8)

where c^c_p ∈ [0, 1] is a confidence weighting for constraint p obtained from the image-based position measurement mechanism (Section 3.3), λ_P is a position constraint weighting factor and ρ_P(·) is a loss function (see Section 3.2.5). The confidence weighting and loss function enable robust output pose estimates in spite of persistently high levels of noise and frequent outliers in the input position detections.

In these experiments, the track targets are located on a subset of the joints and thus have zero offset w.r.t. the bone (t^b_p = 0). In general, positional targets could be offset from the joint locations (this would be the case if the positions were to come from optical markers attached to the surface of the body, rather than markerless joint detections).
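A sketch of the per-target, per-camera residual behind Equations 7 and 8 (numpy; the argument names are illustrative). The square root of the confidence is applied so that the squared norm of the residual carries the weight c^c_p:

```python
import numpy as np

def dehomogenize(v):
    """dh(.): convert a homogeneous vector to Euclidean coordinates."""
    return v[:-1] / v[-1]

def position_residual(P_cam, t_target_global, t_detected_2d, confidence):
    """Equations 7-8 (single target/camera pair): confidence-weighted 2D
    reprojection residual between the projected solved target position and
    the detected joint position."""
    t_hom = np.append(t_target_global, 1.0)    # homogeneous 3D point
    t_proj = dehomogenize(P_cam @ t_hom)       # Equation 7 (P_cam is 3x4)
    return np.sqrt(confidence) * (t_proj - t_detected_2d)
```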
3.2.3 Acceleration term

The solved global acceleration of each IMU at the previous frame, â^g_i(t−1), is obtained by central differences of the solved IMU positions over frames t−2, t−1 and t, where the solved IMU positions t̂^g_i are computed analogously with Equation 6 (replacing subscripts p with i) and Δt is the frame period used in the finite differences (in our case, 16.7 ms).

The measured local accelerations from the previous frame of IMU data (the previous frame is used because the central differences require the solved position at the current frame, t) are converted to global coordinates as follows:

a^g_i(t−1) = R^g_i · R_i(t−1) · a_i(t−1) − a^g    (10)

where a^g = [0, 9.8707, 0]^T is the acceleration due to gravity, which needs to be subtracted. The acceleration cost E_A(θ) then penalizes the difference between the solved and measured global accelerations, â^g_i(t−1) and a^g_i(t−1), in the same weighted, robustified form as the orientation and position costs.
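Under the same assumptions (numpy; names illustrative, not from the paper's code), the acceleration constraint for one IMU could be sketched as follows, pairing the gravity-compensated measurement of Equation 10 with a central-difference estimate from the solved positions:

```python
import numpy as np

GRAVITY = np.array([0.0, 9.8707, 0.0])   # a^g as given in the text

def measured_global_acceleration(R_imu_to_global, R_imu_prev, a_imu_prev):
    """Equation 10: rotate the local IMU acceleration (frame t-1) into global
    coordinates and subtract gravity."""
    return R_imu_to_global @ R_imu_prev @ a_imu_prev - GRAVITY

def acceleration_residual(p_t2, p_t1, p_t, a_meas_global, dt=1.0 / 60.0):
    """Central-difference solved acceleration at frame t-1 compared with the
    measured global acceleration (dt is the frame period, here ~16.7 ms)."""
    a_solved = (p_t - 2.0 * p_t1 + p_t2) / dt**2
    return a_solved - a_meas_global
```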
The CPM detector is able to detect multiple people within a single image while maintaining its computation time [7]. We propose to increase the detection throughput by packing ROIs from multiple cameras (and, optionally, frames) into a single image. The detection is performed on the packed image and the resulting detections are assigned to the originating camera and frame (Figure 2). The ROIs for each camera are updated at every frame to an expanded bounding box of the current detections. In the event of missed detections, the corresponding ROI is reverted to the full image. In practice, the subject cannot be too small in the frame or the detector will fail. Packing 8 ROIs was found to be satisfactory (e.g. 1 frame from 8 cameras or 2 frames from 4 cameras).

Figure 2: Visualization of the ROI packing process for efficient multi-camera/frame CPM detection. Panels: last detections and source ROIs (frame A); frame B (unseen); frame C (unseen); packed ROI image for CPM detection (from frames B and C).
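The packing step can be sketched as follows (Python/numpy; the grid layout, cell size and function names are illustrative assumptions, and each ROI crop is assumed to have already been resized to fit its cell):

```python
import numpy as np

def pack_rois(crops, grid=(2, 4), cell=368):
    """Pack up to grid[0]*grid[1] ROI crops (each at most cell x cell pixels)
    into a single image so that one CPM forward pass serves several
    cameras/frames.  Returns the packed image and each crop's cell origin."""
    rows, cols = grid
    packed = np.zeros((rows * cell, cols * cell, 3), dtype=np.uint8)
    origins = []
    for k, crop in enumerate(crops):
        r, c = divmod(k, cols)
        h, w = crop.shape[:2]
        packed[r * cell:r * cell + h, c * cell:c * cell + w] = crop
        origins.append((c * cell, r * cell))      # (x, y) of the cell in the packed image
    return packed, origins

def unpack_detection(xy_packed, cell_origin, roi_origin):
    """Map a joint detection from packed-image coordinates back to the
    coordinates of the originating camera image."""
    return (xy_packed[0] - cell_origin[0] + roi_origin[0],
            xy_packed[1] - cell_origin[1] + roi_origin[1])
```

Detections found in the packed image are mapped back through the recorded cell origin and the ROI origin in the source camera image.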
3.3.2 Temporal sub-sampling

To increase the frame-rate of our solver in spite of relatively long CPM detection times, we propose to perform the CPM detections on a subset of the input frames, resulting in temporally sparse position measurements. For the intervals of frames without positional constraints, global motion is still produced because of the acceleration term, which essentially performs 'dead-reckoning'. In Section 4.1.2, different sub-sampling strategies are evaluated.

4. Results and evaluation

The approach was tested using an existing indoor dataset, Total Capture [16], containing ground-truth data from a commercial mo-cap system, as well as on a new outdoor dataset, Outdoor 1. The solver can easily be configured to take an arbitrary subset of the available IMUs and positional constraints to evaluate the effect of camera and IMU sparsity. Note that in this work, all positional constraint information is obtained from the multiple-view video based on per-view CPM as discussed in Section 3.3, and no optical markers or visible targets are used.

First, quantitative results are presented showing the relative performance with various configurations of IMUs and cameras and with different sub-sampling configurations of the position detections, as well as the contribution of each term in the cost function. Next, further quantitative results are presented for multiple sequences of the Total Capture dataset. Finally, qualitative results are presented for the Outdoor 1 dataset, which does not contain ground truth data. Videos of the results are presented in the supplementary material.

Throughout the experiments, the same weightings were used for the cost function terms, namely λ_R = 1, λ_P = 1 × 10^−3, λ_A = 7 × 10^−4, λ_PP = 0.9, λ_PD = 0.08. These values were arrived at by a gradient-based parameter optimization over 200 frames of one motion sequence.
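To make the structure of Equation 2 and these weightings concrete, the following sketch assembles square-root-weighted residual blocks and hands them to an off-the-shelf robust nonlinear least-squares routine (scipy is used purely for illustration; the residual functions are placeholders and the single soft_l1 loss stands in for the per-term losses ρ(·), so this is not the paper's implementation):

```python
import numpy as np
from scipy.optimize import least_squares

# Term weights as reported in Section 4 (lambda_R, lambda_P, lambda_A, lambda_PP, lambda_PD).
WEIGHTS = {"R": 1.0, "P": 1e-3, "A": 7e-4, "PP": 0.9, "PD": 0.08}

def stacked_residuals(theta, frame_data, terms):
    """Concatenate sqrt-weighted residuals of all active cost terms so that
    the squared norm reproduces the weighted sum of Equation 2."""
    blocks = []
    for name, residual_fn in terms.items():       # e.g. "R", "P", "A", "PP", "PD"
        r = residual_fn(theta, frame_data)        # 1D residual vector for this term
        blocks.append(np.sqrt(WEIGHTS[name]) * r)
    return np.concatenate(blocks)

def solve_frame(theta_init, frame_data, terms):
    """Solve one frame of pose optimization; a robust loss down-weights
    outlier detections (a stand-in for the per-term rho losses)."""
    result = least_squares(stacked_residuals, theta_init,
                           args=(frame_data, terms), loss="soft_l1")
    return result.x
```

In an online setting, each frame's solve would be warm-started from the previous frame's solution, keeping the number of iterations low.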
4.1. Indoor capture results

The Total Capture dataset includes five subjects (S) performing various motions including range of motion (ROM), walking (W), acting (A) and 'freestyle' (FS). The subjects were recorded simultaneously using 13 Xsens MTw IMUs, 8 HD video cameras and a commercial infra-red motion capture system consisting of 16 cameras and a dense set of retro-reflecting markers worn by the subject. The marker-based input is not used in the runtime solver and is only used in this work as a 'ground truth' reference for evaluation.

4.1.1 Sparse IMUs and cameras

It is desirable to have a minimal capture hardware setup in order to reduce cost as well as actor setup time. We simulate sparser hardware setups using either the full set of 13 IMUs or a reduced set of 6 IMUs; between 2 and 8 of the available cameras are used in these tests.

Figure 3 compares the error using the sparse set of 6 IMUs with the full set of 13, using between 2 and 8 cameras. With the sparse set of IMUs, position and orientation error both decrease as more cameras are added. With the full set of IMUs, the position error is lowest for intermediate numbers of cameras, while the orientation error hardly varies with the number of cameras. An intermediate number of cameras, 4, is used to evaluate on additional sequences in Section 4.1.4.

Figure 3: Position and orientation error with different sensor configurations, 13 or 6 IMUs and 2-8 cameras. Sequence: S2 - FS1, SS 1/1.

4.1.2 Temporal sub-sampling of position

We use the following notation for the temporal sub-sampling (SS) of the position detection: N_o/N_p, where position detection is performed on the first N_o frames of every N_p frames in the sequence. For example, SS 1/2 is every other frame and SS 2/4 is two out of every four frames. While SS 1/10 and SS 2/20 require the same amount of computation, SS 1/10 provides a shorter interval between detections and, in turn, a shorter time with no detections than SS 2/20. This has an effect on the quality of the solved motion, as shown in Figure 4, where a range of sub-sampling rates was used with N_o ∈ {1, 2, 3}.

Figure 4: Position error under a range of detection sub-sampling rates using N_o ∈ {1, 2, 3} successive frames. Note that N_o = 2 yields the lowest position error across the sub-sampling range. Sequence: S2 - FS1, 13 IMUs, 8 cameras.

These results suggest that it is optimal to use N_o = 2. Having detections for two successive frames results in a more reliable motion trajectory than having a single frame more frequently. With three successive frames, N_o = 3, the interval with no detections is too long and the error increases. With a quarter of the frames detected (SS 2/8), the error is still reasonably low while the processing time is reduced, increasing the output frame-rate. This decimation rate is used to evaluate on additional sequences in Section 4.1.4.
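As a concrete illustration of the SS N_o/N_p notation (a hypothetical helper, not part of the paper), the frames on which CPM detection runs can be listed as follows:

```python
def detection_frames(n_o, n_p, num_frames):
    """Frames selected for CPM detection under sub-sampling SS n_o/n_p:
    the first n_o frames of every block of n_p frames."""
    return [f for f in range(num_frames) if f % n_p < n_o]

# Example: SS 2/8 over 24 frames -> [0, 1, 8, 9, 16, 17]
print(detection_frames(2, 8, 24))
```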
4.1.3 Contribution of cost terms

Table 1 shows the relative error in solved bone position and orientation with selected terms in the cost function disabled. The results are shown with 4 cameras and 2/8 detection sub-sampling.

                        13 IMUs           6 IMUs
Terms omitted           Pos.    Ori.      Pos.    Ori.
IMU (E_R, E_A)          1.97    4.82      1.27    2.38
Ori. (E_R)              2.63    6.27      1.54    2.89
Acc. (E_A)              1.11    0.99      1.01    0.97
Pos. (E_P)              188.58  1.00      194.82  1.05
Prior (E_PP, E_PD)      1.50    4.68      1.42    4.33
Prior Proj. (E_PP)      2.26    6.29      1.63    6.46
Prior Dev. (E_PD)       1.16    2.86      1.46    3.24

Table 1: Position and orientation error with various terms in the cost function disabled, relative to the error using the full cost function, Equation 2 (Sequence: S2 - FS1, 4 Cam., SS 2/8).

The orientation term from the IMUs has a strong effect on both position and orientation error, while the acceleration term has a limited effect, helping with the position in the 13 IMU case. The position term does not improve the orientation error, but omitting it causes the solved global position to drift, increasing the relative position error by roughly two orders of magnitude (Table 1).

4.1.4 Further results

In Table 2, further quantitative results are provided for several sequences from the Total Capture dataset, covering the 5 subjects and a range of motion types, from slow ROM motion to challenging sequences including fast motion and unusual poses such as lying on the floor (see Figure 6 and refer to the supplementary video). Figure 5 shows the robustness of our approach to typical misdetections from the CPM joint detector.

Figure 5: Solved (blue) and ground truth (yellow) skeletons overlaid on an input image, showing CPM detections in yellow and the corresponding locations on the solved skeleton in blue. Note the robustness to the outlier detection on the leg. Sequence: S5, FS1.

Figure 6: (a) S1, FS3; (b) S2, ROM3; (c) S3, FS3.

Four configurations were used: high quality, 'HQ' (8 cameras, SS 1/1) and high speed, 'HS' (4 cameras, SS 2/8), each with both the 13 and 6 IMU sets. The average position error using 13 IMUs is 6.2 cm using HQ mode, degrading slightly to 6.8 cm using HS mode, while the orientation error is maintained between HQ and HS (7.8 deg). Our approach outperforms Trumble et al. [16] across the test sequences both for HQ and HS modes. The errors for the 6 IMU case are larger, at 9.1 cm, 12.5 deg for HQ and 14.2 cm, 15 deg for HS.

                    S1     S2     S2     S3     S3     S4     S5     S5     Mean
Ours, 6 IMU, HS     18.3   10.9   10.6   16.2   19.7   14.8   14.3   15.1   15.0

Table 2: Mean error in position (cm) and orientation (deg) for sequences from the Total Capture dataset using high quality (HQ) and high speed (HS) settings, compared to the approach of Trumble et al. [16].

4.2. Outdoor capture results

The Outdoor 1 dataset was recorded outdoors in challenging, uncontrolled conditions with a moving background and varying illumination. A set of 6 cameras was placed in a 120° arc around the subject, and the subject wore 13 Xsens IMUs. No ground truth data is available for this dataset. Figure 7 shows a selection of solved frames overlaid on the input image, and full sequences are shown in the supplementary video.

Figure 7: Selection of solved frames from the Outdoor 1 dataset. (a) Camera layout; (b) camera views (freestyle); (c) multiple frames (prop interaction).
4.3. Computation time

Figure 8 shows the real-time online frame-rate achieved using the approach as a function of the sub-sampling rate (with the CPM detection running in parallel with the main solver thread). The computing hardware is a standard desktop PC with an Intel i7 3.6 GHz CPU and an NVIDIA GTX 1080 GPU. A frame rate of 30 fps can be achieved with SS 2/8, while a rate in excess of 60 fps can be achieved with more aggressive sub-sampling (SS 2/40). In practice, 30 fps is sufficient for most applications, and the increase in speed is not worth the increase in error. Although not tested here, it should also be possible to use a lower decimation rate by running two CPM detectors in parallel on two GPUs.

Figure 8: Output frame-rate of our solver (including detections) as a function of sub-sampling level. Sequence: S2 - FS1, 4 cameras, N_o = 2.

5. Conclusion and further work

We have presented an approach for real-time, online, full-body motion capture with minimal camera and IMU hardware requirements. It is capable of recovering the full 6-DoF pose, without drift in global position, and can operate both in constrained studio environments and in unconstrained setups such as outdoor scenes with varying illumination, moving backgrounds and occlusion. The solver can handle missing or outlier joint detections and even short periods of complete occlusion because of the inclusion of the IMU input, degrading gracefully as the hardware is reduced. Reducing the number of cameras has less of an effect on quality than reducing the number of IMUs. Future work includes optimizing the code and using multiple GPUs to increase CPM detection throughput. It would also be possible to extend the approach to handle multiple subjects.

Acknowledgements

This work was supported by the Innovate UK Total Capture project (grant 102685) and the EU H2020 Visual Media project (grant 687800). We wish to thank Anna Korzeniowska, Evren Imre, Joao Regateiro and Armin Mustafa for their help with data capture.
References

[1] IKinema Orion. https://ikinema.com/orion.
[2] OptiTrack Motive. http://www.optitrack.com.
[3] Perception Neuron. http://www.neuronmocap.com.
[4] Vicon Blade. http://www.vicon.com.
[5] S. Agarwal, K. Mierle, and others. Ceres solver. http://ceres-solver.org.
[6] S. Andrews, I. Huerta, T. Komura, L. Sigal, and K. Mitchell. Real-time physics-based motion capture with sparse sensors. In Proceedings of the 13th European Conference on Visual Media Production (CVMP 2016), 2016.
[7] Z. Cao, T. Simon, S.-E. Wei, and Y. Sheikh. Realtime multi-person 2D pose estimation using part affinity fields. In Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
[8] T. Helten, M. Muller, H.-P. Seidel, and C. Theobalt. Real-time body tracking with one depth camera and inertial sensors. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), pages 1105–1112, 2013.
[9] S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural Computation, 9:1735–1780, 1997.
[10] A. E. Ichim and F. Tombari. Semantic parametric body shape estimation from noisy depth sequences. Robotics and Autonomous Systems, 75:539–549, 2016.
[11] M. Loper, N. Mahmood, J. Romero, G. Pons-Moll, and M. J. Black. SMPL: A skinned multi-person linear model. ACM Transactions on Graphics (Proc. SIGGRAPH Asia), 34(6):248:1–248:16, Oct. 2015.
[12] D. Mehta, S. Sridhar, O. Sotnychenko, H. Rhodin, M. Shafiei, H.-P. Seidel, W. Xu, D. Casas, and C. Theobalt. VNect: Real-time 3D human pose estimation with a single RGB camera. ACM Transactions on Graphics, 36, 2017.
[13] D. Roetenberg, H. Luinge, and P. Slycke. Xsens MVN: Full 6DOF human motion tracking using miniature inertial sensors. Technical report, 2013.
[14] D. Tome, C. Russell, and L. Agapito. Lifting from the deep: Convolutional 3D pose estimation from a single image. In Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
[15] M. Trumble, A. Gilbert, A. Hilton, and J. Collomosse. Deep convolutional networks for marker-less human pose estimation from multiple views. In Proceedings of the 13th European Conference on Visual Media Production (CVMP 2016), 2016.
[16] M. Trumble, A. Gilbert, C. Malleson, A. Hilton, and J. Collomosse. Total Capture: 3D human pose estimation fusing video and inertial sensors. In British Machine Vision Conference (BMVC), 2017.
[17] T. von Marcard, G. Pons-Moll, and B. Rosenhahn. Human pose estimation from video and IMUs. IEEE Transactions on Pattern Analysis and Machine Intelligence, 38(8):1533–1547, Aug. 2016.
[18] T. von Marcard, B. Rosenhahn, M. Black, and G. Pons-Moll. Sparse Inertial Poser: Automatic 3D human pose estimation from sparse IMUs. In Eurographics 2017, volume 36, 2017.
[19] S.-E. Wei, V. Ramakrishna, T. Kanade, and Y. Sheikh. Convolutional pose machines. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 4724–4732, 2016.
[20] X. Wei, P. Zhang, and J. Chai. Accurate realtime full-body motion capture using a single depth camera. ACM Transactions on Graphics, 31(6), 2012.
[21] Z. Zhang. Flexible camera calibration by viewing a plane from unknown orientations. In ICCV, 1999.
[22] X. Zhou, M. Zhu, S. Leonardos, K. G. Derpanis, and K. Daniilidis. Sparseness meets deepness: 3D human pose estimation from monocular video. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 4966–4975, 2016.