Real-Time Gesture Recognition Using Deterministic Boosting
Raymond Lockton and Andrew W. Fitzgibbon
Department of Engineering Science
University of Oxford.
1 Introduction
Gesture recognition is an active area of research in computer vision. The prospect of
a user-interface in which natural gestures can be used to enhance human-computer interac-
tion brings visions of more accessible computer systems, and ultimately of higher bandwidth
interactions than will be possible using keyboard and mouse alone.
This paper describes a system for automatic real-time control of a windowed operating
system entirely based on one-handed gesture recognition. Using a computer-mounted video
camera as the sensor, the system can reliably interpret a 46-element gesture set at 15 frames
per second on a 600MHz notebook PC. The gesture set, shown in figure 1, comprises the
36 letters and digits of the American Sign Language fingerspelling alphabet [9], three ‘mouse
buttons’, and some ancillary gestures. The accompanying video, and figure 4, show a tran-
script of about five minutes of system operation in which files are created, renamed, moved,
and edited—entirely under gesture control. This corresponds to a per-image recognition rate of over 99%, which exceeds that of any system reported to date, real-time or not. This significant improvement in performance is the outcome of three factors:
1. The general engineering of the system means that preprocessing is reliable on every
frame. Lighting is controlled with just enough care to ensure that most skin pixels are de-
tected using simple image processing. The user wears a coloured wrist band which allows the
orientation of the hand to be easily computed.
2. An exemplar-based classifier[11, 19] ensures that recognition of the gesture label from
a preprocessed image uses a rich, informative model, allowing a large gesture vocabulary to
be employed.
3. A new deterministic boosting algorithm, introduced in section 4, allows what is essentially a brute-force template matcher to run in real time with this large vocabulary.
Figure 1: The gesture set to be recognized. The letters and digits of American sign language
have been modified as follows: the dynamic gestures ‘J’ and ‘Z’ are replaced with static
versions which can be recognized in a single frame; the digits whose sign is identical to a
letter—and which therefore are distinguished by human signers based on context—have been
modified to be distinguishable without such context; and ten additional gestures have been
added for operations such as ‘delete’, ‘enter’ and mouse button actions.
Figure 2: Preprocessing stages. Webcam images (a) are thresholded (b) in RGB for speed.
Detected wristband and hand centroids (c) are used to transform to canonical frame (d). Note
that in the final application, the canonical frame image is never explicitly formed—the small
subset of pixels queried is pulled directly from the input image. Note also that several pixels
on the hand have not been classified as skin: the recognition stage will learn to ignore these.
Full details of the preprocessing stages are given in the accompanying technical report [2]. Figure 2 illustrates the procedure for a typical input image.
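Although the full details are deferred to [2], the caption of figure 2 is enough to sketch the pipeline in code. The following is a minimal sketch in Python: the RGB bounds, the canonical-frame size, and the resampling scheme are illustrative assumptions, not values from the paper, and recall that the deployed system never actually forms the canonical image (it queries the needed pixels directly).

```python
import numpy as np

def binarize(rgb, lo=(90, 40, 30), hi=(255, 190, 170)):
    """Threshold a frame in RGB (illustrative bounds, not the authors'
    values); returns a boolean skin mask of shape (H, W)."""
    lo, hi = np.asarray(lo), np.asarray(hi)
    return np.all((rgb >= lo) & (rgb <= hi), axis=-1)

def centroid(mask):
    ys, xs = np.nonzero(mask)
    return np.array([xs.mean(), ys.mean()])

def canonical_frame(skin, band, size=64):
    """Resample the skin mask into a size x size canonical frame whose
    orientation and scale are fixed by the wristband-to-hand axis
    (figure 2c-d). `band` is the wristband mask, assumed found by a
    similar colour threshold."""
    c_band, c_hand = centroid(band), centroid(skin)
    axis = c_hand - c_band
    scale = 2.0 * np.linalg.norm(axis) / size   # hand axis spans half the frame
    ang = np.arctan2(axis[1], axis[0])
    c, s = np.cos(ang), np.sin(ang)
    # Inverse-map each canonical pixel to an input pixel (nearest neighbour).
    u = (np.arange(size) - size / 2.0) * scale
    ux, uy = np.meshgrid(u, u)
    X = np.clip(np.round(c_hand[0] + c * ux - s * uy).astype(int), 0, skin.shape[1] - 1)
    Y = np.clip(np.round(c_hand[1] + s * ux + c * uy).astype(int), 0, skin.shape[0] - 1)
    return skin[Y, X]
```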
For the remainder of the paper, images are represented as 1D column vectors $x \in \mathbb{R}^N$ formed by concatenating the image columns. All images are taken to be the binarized, canonicalized versions of figure 2d. We wish to recognise gestures based on a previously acquired training set. Let us assume that we have $K$ gestures, and that for the $k$-th gesture we have obtained $M_k$ training examples. Throughout the rest of the discussion, we assume without loss of generality that all gestures have the same number of training images $M$, so $M_k = M \;\forall k$. In our application, $K = 46$ and $M = 100$, so there are a total of 4600 training examples; $N$ is the number of image pixels, here $320 \times 240$.
The training examples are denoted $x_{mk}$ for $m = 1 \ldots M$ and $k = 1 \ldots K$. By convention, the correct label for the example $x_{mk}$ is $k$. A classifier is a function $c(x) = l$ which takes a test image $x$ and returns a label $l \in \{1 \ldots K\}$. A classifier which explains the training data perfectly will obey
$$c(x_{mk}) = k \quad \forall m, k.$$
We trust in ingenuity and a representative training set to find classifiers for which good perfor-
mance on the training data implies good performance on test examples. The task of this paper
is to define an accurate classifier which may be computed rapidly enough to allow real time
operation.
A soft classifier does not return a label, but instead assigns a likelihood to a given (image, label) combination: $c(x, l) : \mathbb{R}^N \times \{1 \ldots K\} \mapsto [0, 1]$. A likelihood close to one implies that the image is likely to have that label; a likelihood close to zero says that the assignment is unlikely. A perfect soft classifier will assign ones on the training set for the correct label, and zeros otherwise, i.e. $c(x_{mk}, l) = \delta_{lk} \;\forall l, m, k$, where $\delta$ is the Kronecker delta. We note also that the soft classifier is not required to return probabilities; in particular, we do not require that $\sum_{l=1}^{K} c(x, l) = 1$.
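In code, this notation amounts to the following minimal sketch (labels are 0-indexed here for convenience; the array sizes are those quoted above):

```python
import numpy as np

# Sizes from the text: K gestures, M examples per gesture, N pixels.
K, M, N = 46, 100, 320 * 240

# x[m, k] is the m-th binarized, canonical training image of gesture k,
# flattened column-wise into a vector in {0, 1}^N.
x = np.zeros((M, K, N), dtype=np.uint8)

def explains_training_data(c):
    """A (hard) classifier c(image) -> label explains the training data
    perfectly iff c(x_mk) = k for all m, k."""
    return all(c(x[m, k]) == k for m in range(M) for k in range(K))

def is_perfect_soft(c, tol=1e-9):
    """A perfect soft classifier c(image, label) -> [0, 1] returns the
    Kronecker delta on the training set; note it need not sum to 1
    over labels."""
    return all(abs(c(x[m, k], l) - (1.0 if l == k else 0.0)) < tol
               for m in range(M) for k in range(K) for l in range(K))
```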
3 Background
This paper draws from a few strands of research: gesture recognition, exemplar-based track-
ing, and pattern classification. This section discusses the related literature, and emphasizes the
areas in which this paper innovates.
Gesture recognition: Much work has been done on gesture recognition over the years, and
this section does not attempt a full literature review, but rather points to some prototypical approaches. Broadly, two strategies recur: fitting parametric models of the hand, and matching the image against stored exemplars.
Comparing the two strategies, some general observations can be made: it is often easy to
provide a distance metric, but hard to build a parametric model. On the other hand, parametric
models can generalize from small amounts of training data. For parametric models, the set of
parameters which generate physically valid models may be difficult to characterize: given a
PDM with sufficient variability to model the full set of gestures in figure 1, one would expect
to find that the set of parameters which generate valid gestures live in a complex nonlinear sub-
space of the linear vector space. The final difficulty lies in automatic initialization and training
of these techniques, which remains an open research problem. Exemplars, on the other hand,
are trivially trained and at runtime the templates are easy to extract from the image stream,
but the techniques require large training sets, and initially required significant computational
effort. Gavrila and Philomin’s hierarchical matching [11] has made one step towards real time
recognition from large databases; the deterministic boosting algorithm introduced in this work
is another. In summary, exemplars allow the construction of reliable and robust systems, and
can now be made fast as well.
Boosting: Boosting [10] is one of a class of techniques which allows the combination of
several statistical classifiers in order to generate a consensus classifier which attains high reli-
ability and accuracy. The component weak classifiers are typically fast but of low quality, classifying only slightly better than random for two-class problems. More correctly, “boosting”
refers to one of Freund and Schapire’s AdaBoost algorithms [10] for training such combined
systems and selection of the set of weak classifiers.
The basic idea of boosting is simple. We have a classification problem, and access to a
weak classifier—say for example, a neural network. Most importantly, we have a way to tell
the weak classifier to concentrate on getting certain examples in the training set right. We
train the weak classifier on our training data, favouring no particular examples. As expected,
the result is a classifier, call it $c_1$, which explains the training data but whose performance may be only slightly better than random. The key step follows: the weak classifier is retrained, concentrating on getting the “hard” examples right. The original classifier is not discarded; a combined classifier is built as a weighted average of the results of $c_1$ and the newly trained classifier, $c_2$. The combined classifier now has a new set of “hard” examples, hopefully smaller than either of the “hard” sets of $c_1$ and $c_2$. Continuing the process can be shown to yield a consistently high-accuracy classifier, provided each weak classifier performs better than random. For details of how the classifier performance on the training set is
converted to favour hard examples, and of how the weak classifiers are combined, the reader
is referred to [10]. The well-known disadvantage of boosting is that as the hard sets get harder,
the likelihood of finding a good classifier drops, so that training is notoriously slow.
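For concreteness, the reweighting loop can be sketched as follows. This is a minimal two-class AdaBoost in the spirit of [10], with labels in {-1, +1}; `train_weak` stands in for any weak learner that accepts per-example weights, and the multi-class variants used for problems like ours differ in detail.

```python
import numpy as np

def adaboost(X, y, train_weak, rounds=50):
    """Two-class AdaBoost sketch: y in {-1, +1}. train_weak(X, y, w)
    must return h with h(X) -> {-1, +1}, trained to respect the
    example-weight distribution w."""
    n = len(y)
    w = np.full(n, 1.0 / n)                    # initially favour no example
    weak, alpha = [], []
    for _ in range(rounds):
        h = train_weak(X, y, w)
        miss = h(X) != y
        err = w[miss].sum()                    # weighted error this round
        if err >= 0.5:                         # no better than random: stop
            break
        a = 0.5 * np.log((1.0 - err) / max(err, 1e-12))
        w *= np.exp(np.where(miss, a, -a))     # concentrate on the hard examples
        w /= w.sum()
        weak.append(h)
        alpha.append(a)
    # Consensus classifier: a weighted vote of the weak classifiers.
    return lambda X: np.sign(sum(a * h(X) for a, h in zip(alpha, weak)))
```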
In a recent computer vision application of boosting, Viola and Jones [20] combined very simple face detectors to build a real-time face detection system with excellent accuracy. Their simple sensors are based on thresholding of Haar wavelet responses, which may be computed extremely quickly. In this paper the sensors used are even simpler: a single pixel's skin/non-skin
status is all that is queried. This paper’s new deterministic boosting algorithm speeds up the
boosting of extremely weak classifiers such as these.
4 The algorithm
In order to outline the algorithm used in this paper, we first describe it in terms of nearest-
neighbour template matching, and then describe the techniques which render it computationally feasible. In nearest-neighbour matching, a new binary image $x$ is compared against all the training examples, and the label of the closest example is reported. Affinity between two binary images $x$ and $y$ is a centered correlation,
$$a(x, y) = \frac{4}{N} \sum_{n=1}^{N} \left(x_n - \tfrac{1}{2}\right)\left(y_n - \tfrac{1}{2}\right),$$
whose value is $+1$ when $x$ and $y$ are identical and reaches a minimum of $-1$ when $y = 1 - x$. We may define a distance measure as $d(x, y) = 1 - a(x, y)$. The nearest-neighbour classifier $c_{\mathrm{NN}}$ is
$$c_{\mathrm{NN}}(x) = \operatorname*{argmin}_{k \in \text{gestures}} \;\min_{m \in \text{examples}}\; d(x, x_{mk}),$$
and its soft-classifier form is $c_{\mathrm{NN}}(x, l) = \max_m \tfrac{1}{2} a(x, x_{ml}) + \tfrac{1}{2}$. The computational complexity of template matching is $O(MKN)$: linear in the number of images, gestures and pixels.
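A direct implementation of this matcher takes only a few lines. In the sketch below, `exemplars` is assumed to be the (MK, N) stack of binary training vectors and `labels` an array of the corresponding gesture indices:

```python
import numpy as np

def affinity(x, exemplars):
    """a(x, y) = (4/N) * sum_n (x_n - 1/2)(y_n - 1/2) against every row
    of `exemplars`; +1 for identical images, -1 for complements."""
    N = x.shape[0]
    return (4.0 / N) * (exemplars - 0.5) @ (x - 0.5)

def c_nn(x, exemplars, labels):
    """Hard nearest-neighbour classifier: O(MKN) per test image."""
    return labels[np.argmax(affinity(x, exemplars))]

def c_nn_soft(x, exemplars, labels, l):
    """Soft form: c_NN(x, l) = max_m a(x, x_ml)/2 + 1/2, in [0, 1]."""
    return 0.5 * affinity(x, exemplars)[labels == l].max() + 0.5
```

Note that maximizing affinity and minimizing $d = 1 - a$ pick the same exemplar, so the distance never needs to be formed explicitly.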
To reduce this cost, the training examples $T_k$ of each gesture $k$ are first clustered: we seek a subset of exemplars $C_k \subset T_k$ such that every example in $T_k$ has affinity at least $\alpha$ with some exemplar in $C_k$. Although choosing a subset $C_k$ which satisfies this threshold and has the smallest possible number of elements is an NP-hard problem, we have found that a greedy algorithm¹ (sketched below) provides adequate results.
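A minimal sketch of the footnote's greedy selection, assuming $T_k$ is stored as an (M, N) stack of one gesture's binary training vectors:

```python
import numpy as np

def greedy_exemplars(Tk, alpha=0.95):
    """Take the head of the remaining examples as an exemplar, discard
    everything with affinity above alpha to it, and repeat (footnote 1)."""
    N = Tk.shape[1]
    Ck, remaining = [], list(range(len(Tk)))
    while remaining:
        head = remaining[0]
        Ck.append(head)
        # Affinity of every remaining example to the new exemplar;
        # the exemplar itself has affinity 1 and is removed too.
        a = (4.0 / N) * (Tk[remaining] - 0.5) @ (Tk[head] - 0.5)
        remaining = [i for i, ai in zip(remaining, a) if ai <= alpha]
    return Tk[Ck]
```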
The algorithm was applied to a set of 100 examples of each of 46 gestures. The threshold
on minimum affinity α was set to 0.95. The numbers of exemplars for the gestures ‘A’ through
‘E’ were reduced from 100 each to 1, 1, 3, 4, 8 respectively and the total for all 46 gestures
was reduced from 4600 to 183.
Finally, the cluster contents are summarized for each exemplar by a single coherence map, defined as follows. Each input example is assigned to the cluster with whose centre it has highest affinity. This defines a set of assigned images $S_{ik}$ for each cluster centre $x_{m_i,k}$ in $C_k$. The coherence map $v_{m_i,k}$ is just the pixelwise mean of each cluster:
$$v_{m_i,k} = \frac{1}{\#S_{ik}} \sum_{x \in S_{ik}} x.$$
Each coherence map encodes, for each pixel, the fraction of training images in the cluster in which that pixel was detected as skin. Specifically, for a coherence map $v$, we have $v_n = 1$ if pixel $n$ was skin in every image in the cluster, $v_n = 0$ if it was always background, and intermediate values if the pixel was detected as both, due to within-class variation. Figure 3 shows coherence maps for one cluster of each of the gestures ‘M’ and ‘N’. The nearest-neighbour classifier $c_{\mathrm{NN}}$ is defined exactly as before, albeit with real-valued rather than binary exemplars. This phase reduced recognition accuracy on the training set from 100%, as
¹ Set $C_k = \{\}$. While $T_k$ is not empty: add $x = \mathrm{head}(T_k)$ to $C_k$, then remove from $T_k$ all $y$ for which $a(y, x) > \alpha$.
Figure 3: Left (m,n): Within-class variation map v for each of two gestures. Black pixels
are detected as skin in each training image, white pixels are always background, and gray
pixels are those whose classification varied through the training set. Middle (pseudo-loss):
Pixels which distinguish m and n. These are reliable (black or white) for both gestures, and
distinctive (black for m, white for n or vice versa). This is a map of p for this 2-gesture set
(§4.2). Right (boosted): The final set N of 1199 pixels used for classification over the 46
gestures. Notice that pixels near the wrist are ignored—although reliably segmented as skin,
they are not distinctive as all gestures share them.
is always achieved on the training set with nearest-neighbour classifiers, to about 99.5%, and speed is improved by a factor of 20. The training set is now represented by a set of coherence maps $v_i$ for $i = 1 \ldots M'$, where $M' = \sum_{k=1}^{K} r_k$ (with $r_k = \#C_k$ the number of exemplars retained for gesture $k$), and an associated label assignment $k_i$ which indicates the gesture to which $v_i$ corresponds.
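The coherence maps are then just per-cluster pixelwise means. A minimal sketch, for the examples and retained centres of a single gesture:

```python
import numpy as np

def coherence_maps(examples, centres):
    """Assign each example to the centre of highest affinity, then return
    the pixelwise mean of each cluster: entries are the fraction of
    cluster images in which the pixel was skin."""
    N = examples.shape[1]
    a = (4.0 / N) * (examples - 0.5) @ (centres - 0.5).T  # (n_ex, n_centres)
    assign = np.argmax(a, axis=1)
    return np.stack([examples[assign == i].mean(axis=0)
                     for i in range(len(centres))])
```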
Therefore, after training in order to find N, we obtain the combined recognition algorithm $R$, which assigns a gesture label to a canonical-frame image $x$ as follows: $R(x) = \operatorname*{argmax}_k cc(x, k)$.
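The definition of the combined soft classifier cc does not survive in this excerpt. One plausible reconstruction, consistent with the statement that only a small subset of pixels is ever queried, scores each of gesture k's coherence maps by the affinity of this section restricted to the query pixels; the sketch below is that hypothetical reading, not the paper's confirmed definition.

```python
import numpy as np

def cc(x, k, query, maps, map_label):
    """Hypothetical cc(x, k): affinity of x to gesture k's coherence maps,
    computed over the boosted query pixels only (a reconstruction; the
    original definition is not in this excerpt)."""
    xq = x[query] - 0.5
    vq = maps[map_label == k][:, query] - 0.5
    a = (4.0 / len(query)) * (vq @ xq)
    return 0.5 * a.max() + 0.5

def R(x, query, maps, map_label, K=46):
    """R(x) = argmax_k cc(x, k)."""
    return max(range(K), key=lambda k: cc(x, k, query, maps, map_label))
```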
The remaining detail is how to choose the set of query pixels N from the set of all pixels.
One option would be simple random sampling; another would be to use AdaBoost. However,
in this work we can do better than either alternative. We define a quality function for each
weak classifier, analogous to AdaBoost's pseudo-loss function. The quality function for $c_n$ has a high value if pixel $n$ is a good discriminant across the training set, and if it splits the dataset well. Thus if we take a certain pixel $n$ and consider the set of coherence-map values at that pixel, $V = \{[v_1]_n, \ldots, [v_{M'}]_n\}$, we wish to minimize the pseudo-loss
$$p_n = \left| \sum_{v \in V} (v - 0.5) \right| - \sum_{v \in V} |v - 0.5|,$$
which is most negative when the values in $V$ are individually reliable (each $|v - 0.5|$ large) and split evenly between skin and background (the signed sum near zero). The image $p$ formed by reassembling all the $p_n$ over a two-gesture set is shown in figure 3.
We may now deterministically select the weak classifiers: sorting the set of all pixels by $p_n$ yields an ordered set $P$ of the single-pixel classifiers, from most to least effective at
classifying the training set. Choosing the first few hundred of these would be expected to
yield good recognition performance at significantly lower computational cost than matching
over all pixels. However, one final refinement is required. We wish each of our single-pixel
classifiers to classify the gesture set into different partitions. Thus, we cluster the classifiers
into subsets which distinguish the same gestures. To perform this clustering, we require a metric which measures whether two pixel classifiers perform the same job. Again, we can derive an efficient function to compute this metric:
$$D(n_1, n_2) = \sum_i \left| [v_i]_{n_1} - [v_i]_{n_2} \right|,$$
which is low if pixels $n_1$ and $n_2$ have similar values for each exemplar in the training set. Greedy clustering using this metric produces the final set N of query pixels. We call this
combination of sorting and clustering “deterministic boosting”. Its primary characteristics are
fast training and repeatable results. However, its downside is that one must be able to compute the loss function $p_n$ and the similarity metric $D$ deterministically, which is possible only for simple weak classifiers such as the single-pixel classifiers used here.
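Putting the two steps together, deterministic boosting amounts to a sort followed by a single greedy pass. A minimal sketch, assuming the pixel kept from each cluster is its best-ranked member:

```python
import numpy as np

def deterministic_boost(maps, d_thresh=2.0):
    """Rank pixels by pseudo-loss p_n (most negative first), then keep a
    pixel only if it does a different job from every pixel already kept,
    i.e. D(n, m) > d_thresh for all kept m."""
    V = maps - 0.5
    p = np.abs(V.sum(axis=0)) - np.abs(V).sum(axis=0)  # pseudo-loss per pixel
    kept = []
    for n in np.argsort(p):                            # best classifiers first
        if all(np.abs(maps[:, n] - maps[:, m]).sum() > d_thresh for m in kept):
            kept.append(n)
    return np.array(kept)
```

Unlike AdaBoost's iterative reweighting, nothing here is stochastic or repeated beyond the single greedy pass, which is what makes training fast and its results repeatable.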
In our experiments, with the threshold on D set to the rather tight value of 2, the number
of query pixels was reduced from 34788 to 1199, yielding a 30-fold reduction in complexity.
Figure 3 illustrates the set of query pixels finally used.
In summary, this paper has demonstrated gesture-based full text entry and mouse operation using computer vision alone.
The performance of the system is achieved by a combination of factors. First, careful engineering of the acquisition means that accurate lighting control is replaced by a lightweight passive wristband on the user, a requirement less intrusive than gloves or finger bands. Secondly,
a novel adaptation of the emerging computer vision techniques of exemplar-based recognition
and boosting allows a system which is essentially a brute-force template matcher to operate in
real time with a large vocabulary.
On the other hand, much work remains. The reason for developing a high-speed training
Figure 4: Demo transcript. Six frames from a 10-minute demo in which a windowed operat-
ing system is fed UI events generated by the gesture recognition system described in this paper (see
http://www.robots.ox.ac.uk/~awf/bmvc02 for the demo). Control is exclusively via hand
gestures. Transcript: A right click opens a menu (a), a new text file is created and named my demo.txt
(b). A new folder is created, named demo folder (c), and the text file is dragged into the new folder
(d). Navigating to the new folder (e), the text file is double-clicked, text is entered, and the file is saved (f). In total, six errors were made, two of which were operator error, and all of which were correctable
using the backspace/undo gesture.
algorithm such as deterministic boosting was to allow online retraining: every time a gesture
is misrecognized, it would be useful to add the misrecognized example to the training set and
retrain. Because both of the clustering algorithms are greedy, it is trivial to add new examples
to them. Replacing the greedy clustering algorithms with more sophisticated versions would
reduce the number of exemplars, and increase speed further.
Although the system does not require careful lighting, it does depend on the lighting being the same for test and training examples, because many gestures are distinguished by fairly subtle shadowing effects. If lighting conditions are expected to vary, the training set should be extended to include examples under all conditions, or the system should switch between training sets captured under the different conditions.
In the future, removal of the wrist band is an obvious candidate enhancement. This will make the problem considerably harder, particularly if the user is wearing short sleeves, or loose sleeves which move along the arm. In fact, it is fair to say that most existing gesture-recognition systems make an implicit "wrist band" assumption; this paper simply makes it explicit.
Acknowledgements This work was supported by the Royal Society and the Department of
Engineering Science, University of Oxford.
References
[1] T. Ahmad, C. J. Taylor, A. Lanitis, and T. F. Cootes. Tracking and recognising hand gestures using statistical
shape models. Image and Vision Computing, 15(5):345–352, 1997.
[2] Anonymous. Hand gesture recognition using computer vision. Technical report, Institution, 2002.
[3] H. Birk, T. Moeslund, and C. Madsen. Real-time recognition of hand alphabet gestures using principal compo-
nent analysis. In Proceedings, SCIA, 1997.
[4] R. Bowden, T. A. Mitchell, and M. Sarhadi. Non-linear statistical models for the 3d reconstruction of human
pose and motion from monocular image sequences. Image and Vision Computing, 18(9):729–737, 2000.
[5] R. Bowden and M. Sarhadi. Building temporal models for gesture recognition. In Proc. BMVC., volume 1,
pages 32–41, 2000.
[6] M. Brand, N. Oliver, and A. Pentland. Coupled hidden Markov models for complex action recognition. In Proc.
CVPR, 1997.
[7] R. Cipolla and A. Pentland. Computer Vision for Human Machine Interaction. Cambridge University Press,
1998.
[8] T. F. Cootes, C. J. Taylor, D. H. Cooper, and J. Graham. Active shape models—their training and application.
CVIU, 61(1):38–59, 1995.
[9] E. Costello. American Sign Language Dictionary. Random House, 1997.
[10] Y. Freund and R. E. Schapire. Experiments with a new boosting algorithm. In Machine Learning: Proceedings,
1996.
[11] D. Gavrila and V. Philomin. Real-time object detection for “smart” vehicles. In Proc. ICCV, pages 87–93, 1999.
[12] T. Heap and D. Hogg. Towards 3D hand tracking using a deformable model. In Intl. Conf. on Automatic Face
and Gesture Recognition, pages 140–145, 1996.
[13] J. Martin and J. L. Crowley. An appearance-based approach to gesture recognition. In Proceedings,
ICIAP, pages 340–347, 1997.
[14] V. Pavlovic, R. Sharma, and T. S. Huang. Visual interpretation of hand gestures for human-
computer interaction: A review. IEEE PAMI, 19(7):677–695, 1997.
[15] J. Rehg and T. Kanade. DigitEyes: vision-based human hand tracking. Technical Report CMU-CS-93-220, Carnegie Mellon University, Dec 1993.
[16] S. Romdhani, S. Gong, and A. Psarrou. Multi-view nonlinear active shape model using kernel PCA. In Proc.
BMVC., pages 13–16, 1999.
[17] T. Starner, J. Weaver, and A. Pentland. Real-time American Sign Language recognition using desk- and wearable
computer-based video. IEEE PAMI, 20(12):1371–1375, 1998.
[18] B. Stenger, P. R. S. Mendonça, and R. Cipolla. Model based 3D tracking of an articulated hand. In Proc. CVPR,
pages 310–315, 2001.
[19] K. Toyama and A. Blake. Probabilistic tracking in a metric space. In Proc. ICCV, pages II, 50–57, 2001.
[20] P. Viola and M. Jones. Rapid object detection using a boosted cascade of simple features. In Proc. CVPR, pages
??–??, 2001.
[21] C. Vogler and D. N. Metaxas. Parallel hidden Markov models for American Sign Language recognition. In Proc.
ICCV, pages 116–122, 1999.