Improved Estimation of Hand Postures Using Depth Images
Abstract—Hand pose estimation is the task of deriving a hand's articulation from sensory input, here depth images in particular. A novel approach casts pose estimation as an optimization problem: a high-dimensional hypothesis space is constructed from a hand model, in which particle swarms search for the best pose hypothesis. We propose various additions to this approach. Our extended hand model includes anatomical constraints of hand motion by applying principal component analysis (PCA). This allows us to treat pose estimation as a problem with variable dimensionality. The most important benefit becomes visible once our PCA-enhanced model is combined with biased particle swarms. Several experiments show that accuracy and performance of pose estimation improve significantly.

I. INTRODUCTION

The human hand is highly articulated. Humans use hands to manipulate objects in their surroundings and to communicate with other people. Capturing exact hand postures is an important step for Human-Robot Interaction and the development of natural interfaces. Computer vision (CV) can provide cheap and unobtrusive solutions to this problem, especially compared to data gloves.

Solving CV-based hand pose estimation without markers in single-camera setups is a very challenging task, because hands can take on vastly different shapes in images. The number of degrees of freedom (DOFs) contributes to a high-dimensional problem. The problem is further complicated by self-occlusions of the hand, which inevitably occur during the projection onto 2D images.

Following the taxonomy of Erol et al. [1], the approach discussed here belongs to the class of model-based tracking methods that follow a single hypothesis over time. In this context, single hypothesis means that only one satisfying solution is searched for and kept for the initialization of the next frame.

Significant progress in this area was made by Oikonomidis et al. [2]. They formulate pose estimation as an optimization problem. An internal hand model defines the parameters (DOFs) that make up a hand pose. This high-dimensional space is searched by a particle swarm for a suitable solution. As a particle moves, it renders an artificial depth image of its current hand pose hypothesis, which is compared to the actual observation from the Kinect. A target function is used to measure the discrepancy between rendered image and observation.

Oikonomidis et al.'s [2] method deals very well with the high dimensionality and self-occlusions of the human hand. However, their approach is still computationally demanding. They report that their algorithm can run at about 15 FPS on a high-end PC. This is only half the rate at which the Kinect provides images. Our goal was to improve the performance, possibly to the point of running in real time. At the same time we did not want to sacrifice any accuracy. We addressed this by exploiting biases in certain variants of particle swarms. We will show that the optimization behavior of these variants can be aligned with a priori knowledge about how humans perform hand motions. The result was an overall improved convergence behavior, leading to better pose estimation in less time.

The idea to use a priori information has already been applied successfully to hand pose estimation by Bianchi et al. [3]. They determined statistical properties of hand motion and used these to improve the noisy measurements of a low-cost data glove. Our method differs in the way a priori knowledge is used. We use it to transform the search space of all hand postures, such that certain variants of particle swarm optimization (PSO) perform better due to biases in their behavior. We also do not require an existing pose estimation.

This paper is organized as follows: Section II covers our image preprocessing. Its purpose is to segment images into hand and non-hand parts. Section III introduces our new hand model and how its parameter space is altered through principal component analysis. Particle swarm optimization and the target function mentioned above are covered in Section IV. We will also explain our motivation for using a PSO variant with certain biases. In Section V we detail our experiments with the new method and provide an evaluation of the data. A final discussion and an outlook for future research are given in Section VI.

II. HAND DETECTION & TRACKING

Detecting hands in images is a necessary step, because pose estimation is not capable of performing this segmentation by itself. We have separated the task into two steps: first, an initial one-time detection of hands based on depth images and shape recognition, and second, subsequent tracking of the hand region with an adaptive skin color model.

For the first step, we restrict the detection to a specific hand posture that has a distinctive shape. A hand has to be open and face the sensor, with the fingers spread out a
little. We perform foreground segmentation on depth images
to reduce the region of interest. After that, edge detection in the
foreground depth image provides a set of candidate contours.
To support classification and the ability for generalization, we
use Fourier descriptors with 12 complex-valued coefficients to represent contours. These provide desirable invariance properties against common affine transformations (e. g. scale or rotation). Furthermore, the contour information is condensed in these 12 coefficients. Finally, soft-margin support vector machines are used to separate hands from non-hands.

Fig. 1. Hand model in different configurations. It consists of two types of objects: ellipsoids (in blue) and elliptical cylinders (in green, red, orange and yellow).
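The descriptor step can be sketched as follows. This is an illustrative reconstruction, not the authors' implementation; the particular normalization scheme (dropping the DC term, dividing by the first harmonic, taking magnitudes) is one common way to obtain the invariances mentioned above and is our assumption:

```python
import numpy as np

def fourier_descriptor(contour_xy, n_coeffs=12):
    """Translation-, scale- and rotation-invariant Fourier descriptor
    of a closed 2D contour given as an (N, 2) array of points."""
    # Treat contour points as complex numbers x + iy.
    z = contour_xy[:, 0] + 1j * contour_xy[:, 1]
    coeffs = np.fft.fft(z)
    # Dropping the DC term (coeffs[0]) removes translation; keep the
    # n_coeffs lowest non-DC frequencies (positive and negative).
    kept = np.concatenate([coeffs[1:1 + n_coeffs // 2],
                           coeffs[-(n_coeffs // 2):]])
    # Dividing by the first harmonic's magnitude removes scale;
    # taking magnitudes discards phase and hence rotation/start point.
    return np.abs(kept) / np.abs(coeffs[1])

# Example: a circle and a scaled, shifted copy yield the same descriptor.
t = np.linspace(0, 2 * np.pi, 64, endpoint=False)
circle = np.stack([np.cos(t), np.sin(t)], axis=1)
shifted = 3.0 * circle + np.array([5.0, -2.0])
assert np.allclose(fourier_descriptor(circle), fourier_descriptor(shifted))
```

Such fixed-length descriptor vectors are a convenient input for an SVM classifier, since they condense contours of arbitrary length into 12 numbers.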
Based on the color that is enclosed by a hand contour, we
learn the parameters of an elliptical boundary model (EBM)
[4] of the skin color distribution. In all subsequent frames after
successful detection by shape, this model is used to retrieve
the hand.
This two-step scheme can properly distinguish between hands and other skin-colored objects in the scene. It comes at the cost of requiring a specific hand posture for detection, but this restriction is lifted as soon as the distribution parameters of the skin color have been learned.
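The skin-color tracking step can be illustrated with a minimal elliptical decision rule in the spirit of the EBM [4]. This is a sketch under assumptions, not the reference formulation: the chromaticity space, the fitting via sample covariance and the threshold value are illustrative choices of ours.

```python
import numpy as np

class EllipticalBoundaryModel:
    """Minimal sketch of an elliptical skin-color classifier: a pixel's
    chromaticity counts as skin if it lies inside an ellipse fitted to
    the color distribution of the detected hand region."""

    def fit(self, chroma):
        # chroma: (N, 2) chromaticity samples taken from the region
        # enclosed by the hand contour after successful shape detection.
        self.center = chroma.mean(axis=0)
        diff = chroma - self.center
        self.cov_inv = np.linalg.inv(np.cov(diff.T))
        return self

    def predict(self, chroma, threshold=4.0):
        # threshold is an illustrative value trading off precision
        # against recall of the skin mask.
        diff = chroma - self.center
        d2 = np.einsum('ni,ij,nj->n', diff, self.cov_inv, diff)
        return d2 < threshold

# Usage: fit once on hand-region pixels, then mask subsequent frames.
rng = np.random.default_rng(0)
skin = rng.normal([0.45, 0.35], 0.02, size=(500, 2))
model = EllipticalBoundaryModel().fit(skin)
mask = model.predict(skin)
```

Because the model is learned online from the detected hand, it adapts to the current user and lighting, which is what allows the posture restriction of the first step to be dropped afterwards.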
Component   Variance   Ratio    Cum. Ratio
 1          1.324      42.7%    42.7%
 2          0.466      15%      57.7%
 3          0.325      10.5%    68.2%
 4          0.259      8.3%     76.5%
 5          0.178      5.7%     82.2%
 6          0.13       4.2%     86.5%
 7          0.118      3.8%     90.3%
 8          0.09       2.9%     93.2%
 9          0.063      2%       95.2%
10          0.057      1.8%     97%
11          0.035      1.1%     98.1%
12          0.022      0.7%     98.9%
13          0.017      0.5%     99.4%
14          0.012      0.4%     99.8%
15          0.004      0.1%     99.9%
16          0.003      0.08%    100%
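Per-component figures like those in the table follow from an eigendecomposition of the covariance matrix of the joint-angle samples. A generic sketch (the data here is random and purely illustrative, not the hand-motion data used for the table):

```python
import numpy as np

def explained_variance(samples):
    """Return per-component variances, their ratios of the total, and
    the cumulative ratio, sorted in descending order as in the table."""
    centered = samples - samples.mean(axis=0)
    cov = np.cov(centered.T)
    eigvals = np.linalg.eigvalsh(cov)[::-1]   # descending eigenvalues
    ratio = eigvals / eigvals.sum()
    return eigvals, ratio, np.cumsum(ratio)

# Illustrative: 16 correlated dimensions driven by a few latent factors,
# mimicking how few principal components capture most hand motion.
rng = np.random.default_rng(1)
latent = rng.normal(size=(1000, 4))
mixing = rng.normal(size=(4, 16))
samples = latent @ mixing + 0.1 * rng.normal(size=(1000, 16))
var, ratio, cum = explained_variance(samples)
# The cumulative ratio always ends at 100%.
assert np.isclose(cum[-1], 1.0)
```

Truncating the basis after the first k components is what makes the dimensionality of the hand model a tunable parameter.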
Fig. 4. Comparison of our method (blue, magenta) with Oikonomidis et al. [2] (red, green). A single measurement indicates the angle error (equation 6) averaged over all frames in the test video. (Axes: mean error in degrees vs. number of PSO generations.)

Fig. 5. Dependency between DOFs and mean error. (Axes: mean error in degrees vs. number of DOFs after PCA.)
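For the constriction-factor PSO evaluated in this section, χ is commonly derived from φ = φc + φs (for φ > 4) via χ = 2/|2 − φ − √(φ² − 4φ)|. The following sketch checks the parameter values used here; the formula is the standard constriction relation, and identifying it with the text's equation (3) is our assumption:

```python
import math

def constriction_factor(phi_c, phi_s):
    """Standard PSO constriction factor for phi = phi_c + phi_s > 4.
    The velocity update (cf. equation 2 in the text) then scales the
    whole velocity term by this chi."""
    phi = phi_c + phi_s
    return 2.0 / abs(2.0 - phi - math.sqrt(phi * phi - 4.0 * phi))

# Parameter choice used in the experiments: phi_c = 2.8, phi_s = 1.3.
chi = constriction_factor(2.8, 1.3)
print(round(chi, 2))  # 0.73
```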
correspond to the hand location in 3D space, which was not considered for evaluation. In contrast to Oikonomidis et al. [2], we chose to stay close to the actual representation of hand poses as a vector of mostly angles. They derived locations of phalanx endpoints and used them to measure accuracy.

This evaluation did not take into consideration PSO parameters other than the number of particles and generations. In particular, the effect of the cognitive and social factors in the particles' velocity equation (2) was not analyzed. To conduct the experiments, we set the values to φc = 2.8 and φs = 1.3 [2], i. e. the constriction factor χ was 0.73 (equation 3). Most combinations of φc and φs perform well, as long as φc + φs = 4.1 holds true [14].

A. Direct Comparison

The vertical axis in Fig. 4 shows the mean absolute angle error e_a. A single measurement is the mean error over all frames in the entire video for the given PSO parameters. The original method shows a strong dependency on the number of generations. To keep the error below an average of 9°, at least 64 particles and 22 generations had to be used. For our method, on the other hand, 32 particles and 14 generations were already sufficient. In general, we observed much faster convergence after enabling the PCA and biased PSO. The curves for our method are less steep in Fig. 4. This directly translates to improved performance, because less effort was required to achieve a certain maximum error. Using 64 particles and 25 generations has been suggested before [2]. We reached the same error at 32 particles and 18 generations, which is roughly 2.8 times faster.

B. Dimensionality Reduction

The experiments above used our hand model with 22 DOFs. We applied the PCA but did not remove any dimensions afterwards. Figure 5 depicts the same experiment (32 particles, 20 generations), but with a varying number of dimensions. The lowest possible number of dimensions is 7: 6 for the global pose and just one dimension for all joint angles. The data indicate that there was no benefit in removing dimensions. Starting from the right, the mean error first stagnates and then starts rising. Thus, our method performs optimally when all dimensions are left in place. The biased PSO did not seem to be influenced negatively when insignificant dimensions were present.

C. Qualitative Results

When it comes to the visually perceived accuracy of our pose estimation, there were only minor discrepancies compared to the real hand posture. Figure 6 shows eight different postures alongside the model, articulated according to the estimation. Most errors stem from thumb estimation. Particularly in Fig. 6(d), the thumb does not point away from the hand; it is actually inside the other fingers, which is also the case in Fig. 6(f). This happened quite often, because we did not perform collision detection or model any physical constraints. Figures 6(e) and (g) show postures with severe out-of-plane rotations that still resulted in proper estimations. We have found these kinds of postures to be especially problematic, because the hand occludes large parts of itself.

VI. DISCUSSION & FUTURE WORK

In this paper we presented an improved method for the problem of full-DOF hand pose estimation, based on the method by Oikonomidis et al. [2], which has been extended to take a priori knowledge about hand motion into consideration. We achieved this by first applying a common relationship between DIP and PIP joints, followed by a change of basis to eigenvectors. This way, biases in particle swarms can be exploited, leading to much improved convergence behavior. We performed several experiments with partially synthetic data to provide evidence for this claim. Other experiments revealed that our method retains its optimal accuracy when all dimensions are kept.

We discussed our hand model from three different perspectives: shape, DOFs and constraints. Several similar works [2], [15], [16] focus almost exclusively on the shape of the model, while we put more emphasis on the DOFs of the model instead. With the terminology of Lin et al. [6], only level 1 constraints have been imposed on joint angles in the relevant literature [2], [15], [16]. In this paper, constraints of level 2 and 3 were considered; some have been modelled with closed formulas, while the majority is included through PCA. This also introduced the dimensionality of the hand model as a parameter instead of a fixed value. The PCA played a major role in the improved properties of our method.