AdaBoost with Totally Corrective Updates for Fast Face Detection

Abstract

An extension of the AdaBoost learning algorithm is proposed and brought to bear on the face detection problem. In each weak classifier selection cycle, the novel totally corrective algorithm aggressively reduces the upper bound on the training error by correcting the coefficients of all weak classifiers. The correction steps are proven to lower the upper bound on the error without increasing the computational complexity of the resulting detector. We show experimentally that for the face detection problem, where large training sets are available, the technique does not overfit.

A cascaded face detector of the Viola-Jones type is built using AdaBoost with the Totally Corrective Update. The same detection and false positive rates are achieved with a detector that is 20 % faster and consists of only a quarter of the weak classifiers needed for a classifier trained by standard AdaBoost. The latter property facilitates hardware implementation; the former opens scope for increasing the search space, e.g. the range of scales at which faces are sought.

1. Introduction

Face detection has numerous applications and a range of algorithms has been proposed [8, 10, 12, 4]. In many applications, real-time performance is required. Recently, Viola and Jones [12] introduced an impressive face detection system capable of detecting faces in real time with both a high detection rate and a very low false positive rate. The desirable properties are attributed especially to the efficiently computable features used, the AdaBoost learning algorithm, and a cascade technique adopted for decision making. In this paper, an improvement of the AdaBoost algorithm is proposed and its utility for cascade building in the context of face detection is shown.

The Viola and Jones detector consists of several classifiers trained by the AdaBoost algorithm [1] that are organised into a decision cascade. Each cascade stage classifier is set to reach a very high detection rate and an "acceptably" low false positive rate. Since it is trained on the data classified as a face by the previous stages, the final false positive rate is very low (equal to the product of the false positive rates of all stages, see Algorithm 3) and the final detection rate remains high.

The cascade evaluation is equivalent to a sequential classification using a degenerate decision tree. When the current stage classifier labels a region in an image as a non-face, the decision process is terminated. Otherwise, the next stage classifier is run. A region is declared a face if it is accepted by all classifiers in the cascade.

Face detection is done by moving the cascade detector across the image at multiple scales and locations. A typical image contains only a small number of face regions compared to the number of regions scanned. Due to early termination of the decision process in non-face regions, only a few stages of the cascade are evaluated on average [12, 5]. Hence, the speed of evaluation depends heavily on the computational complexity and rejection rates of the first few stages. The enhanced AdaBoost learning algorithm proposed in this paper produces a classifier that, for given detection and false positive rates, is more likely to make a decision early in the evaluation of the cascade.
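The sequential decision can be sketched in a few lines of Python; the function and argument names below are illustrative only and do not reproduce the authors' implementation:

def evaluate_cascade(stages, thresholds, x):
    """Sequential cascade decision for one scanned image region x.

    stages     -- list of stage classifiers, each a list of (alpha, h) pairs,
                  where h(x) returns +1 (face-like) or -1
    thresholds -- per-stage decision thresholds (adjusted ex post, see Section 3)
    """
    for stage, theta in zip(stages, thresholds):
        score = sum(alpha * h(x) for alpha, h in stage)
        if score < theta:      # early termination: region labelled non-face
            return False
    return True                # accepted by every stage: region declared a face

Because most scanned regions fail an early stage, only the first few stages are usually executed, which is exactly the source of the speed-up discussed above.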
AdaBoost constructs the classifier as a linear combination of "weak" classifiers chosen from a given, finite or infinite, set. Its goal is to choose a small number of weak classifiers and assign them proper coefficients. The linear combination can be seen as a decision hyper-plane in the weak classifier space. Hence, AdaBoost can be viewed as an optimization procedure that operates in the space of weak classifier coefficients, starting with a zero vector and ending with a vector with only a small number of non-zero elements.

The standard (discrete) AdaBoost is a greedy algorithm that in each step sets one zero-valued coefficient to a non-zero value. Because of its greedy character, neither the found weak classifiers nor their coefficients are optimal.

A totally corrective algorithm with coefficient updates (TCAcu) proposed in this paper differs from the standard AdaBoost in two main aspects. Firstly, the coefficients of already found weak classifiers are updated repetitively during the learning process. Secondly, in the standard AdaBoost,
a newly added weak classifier can be shown to be "independent", in a precisely defined way, of the previously added weak classifier. The TCAcu algorithm finds a new weak classifier that is independent of all weak classifiers selected so far. It is shown that these modifications minimise the classification error upper bound more aggressively and that shorter classifiers are found.

The term "totally corrective algorithm" was introduced by Kivinen and Warmuth [3]. However, the Kivinen and Warmuth algorithm did not update the coefficients of already found weak classifiers. The algorithm thus lost the important property of minimisation of the upper bound on the training error. Kivinen and Warmuth made no empirical evaluation of the algorithm. It was experimentally tested by Oza on several standard problems with poor results [6].

Another attempt to shorten the final classifier was proposed by Li et al. [4] and was motivated by the feature selection view of AdaBoost. In case the weak classifiers correspond directly to the features, as in the Viola and Jones face detection framework, changing one coefficient to a non-zero value effectively selects this feature [12]. Li et al. proposed FloatBoost, a modification of AdaBoost where some of the already non-zero coefficients are set back to zero when this leads to a lower upper bound on the classification error. Instead of the greedy feature selection, the sequential floating forward selection (SFFS) technique [7] is used. Li et al. show that this modification leads to shorter classifiers.

The main contributions of this paper are (1) a modification of the AdaBoost algorithm which leads to shorter classifiers and a speedup of classification, and (2) the introduction of a totally corrective algorithm to face detection training. It is shown that the resulting classifier performance is comparable to standard AdaBoost and that the resulting classifier runs faster.

The paper is structured as follows. In Section 2 the totally corrective algorithm with coefficient updates is described in the framework of the standard AdaBoost. Then, in Section 3, necessary details of the Viola and Jones work are given. Experimental results are shown in Section 4 and the paper is concluded in Section 5.

2. Totally corrective algorithm

In this section, the standard AdaBoost algorithm is described and motivation for the totally corrective step (TCS) is given. Then TCS is explained and its role in AdaBoost learning is discussed.

2.1. Standard AdaBoost

The totally corrective algorithm with coefficient updates (TCAcu) is based on AdaBoost [1] and its structure is depicted in Algorithm 1. Schapire and Singer's [9] notation is used and the algorithm differs from Schapire and Singer's one only by the additional Step 5. The standard AdaBoost, i.e. Steps 1 to 4, is described first.

Given: (x_1, y_1), ..., (x_m, y_m); x_i ∈ X, y_i ∈ {−1, 1}
Initialize weights D_1(i) = 1/m
For t = 1, ..., T:

  1. Find h_t = arg min_{h_j ∈ H} ε_j;  ε_j = (1/2)[1 − Σ_{i=1}^{m} D_t(i) y_i h_j(x_i)]

  2. If ε_t ≥ 1/2 then stop

  3. Set α_t = (1/2) log((1 − ε_t)/ε_t)

  4. Update
       D_{t+1}(i) = D_t(i) exp(−α_t y_i h_t(x_i)) / Z_t

  5. Totally corrective step (see Algorithm 2)

Output the final classifier:
  H(x) = sign( Σ_{t=1}^{T} α_t h_t(x) )

Algorithm 1: TCAcu: Totally Corrective Algorithm with coefficient updates.

The goal of AdaBoost is to train a classifier using a set of examples. First, a weight D_1(i) is assigned to each training example. Learning then proceeds in a simple loop. At time t, the algorithm selects a weak classifier h_t minimising the weighted error on the training set (Step 1). The loop is terminated if this error exceeds 1/2 (Step 2). The value of α_t is computed next (Step 3) and the weights are updated according to the exponential rule (Step 4). In Step 4, Z_t is a normalisation factor which assures that D_{t+1} remains a distribution. The final decision rule is a linear combination of the selected weak classifiers weighted by their coefficients. The classifier decision is given by the sign of the linear combination.
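The loop of Algorithm 1 translates almost directly into code. The following minimal Python sketch assumes a finite pool of weak classifiers evaluated densely and is illustrative only, not the authors' implementation; the optional Step 5 hook corresponds to the totally corrective step of Algorithm 2, sketched in Section 2.2:

import numpy as np

def adaboost_train(X, y, weak_pool, T, totally_corrective_step=None):
    """Discrete AdaBoost (Steps 1-4 of Algorithm 1) with an optional Step 5 hook.

    X, y      -- training examples and labels; y is a numpy array in {-1, +1}
    weak_pool -- candidate weak classifiers; each h(X) returns a {-1, +1} vector
    """
    m = len(y)
    D = np.full(m, 1.0 / m)                          # D_1(i) = 1/m
    selected, alphas = [], []
    for _ in range(T):
        # Step 1: weak classifier with the smallest weighted error
        errors = [0.5 * (1.0 - np.sum(D * y * h(X))) for h in weak_pool]
        best = int(np.argmin(errors))
        eps = max(errors[best], 1e-12)               # guard against log(0)
        if eps >= 0.5:                               # Step 2: stop
            break
        alpha = 0.5 * np.log((1.0 - eps) / eps)      # Step 3
        h = weak_pool[best]
        D = D * np.exp(-alpha * y * h(X))            # Step 4: exponential update
        D = D / D.sum()                              # division by Z_t
        selected.append(h)
        alphas.append(alpha)
        if totally_corrective_step is not None:      # Step 5 (Algorithm 2)
            D, alphas = totally_corrective_step(D, X, y, selected, alphas)
    # final classifier: sign of the weighted vote of the selected weak classifiers
    return lambda Xq: np.sign(sum(a * h(Xq) for a, h in zip(alphas, selected)))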
There are two properties of AdaBoost exploited in this paper. First, as has been shown in [9], the algorithm minimises an upper bound on the classification error ε_tr(H) on the training set,

  ε_tr(H) ≤ ∏_{t=1}^{T} Z_t = 2^T ∏_{t=1}^{T} √(ε_t (1 − ε_t))        (1)

This upper bound is minimised by selecting a weak classifier with the smallest weighted error ε_t on the training set, as done in Step 1, and by setting its coefficient as done in Step 3.
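As a brief aside, following the standard computation in Schapire and Singer [9], the per-round factor in (1) is the normalisation constant of Step 4:

  Z_t = Σ_{i=1}^{m} D_t(i) exp(−α_t y_i h_t(x_i)) = (1 − ε_t) e^{−α_t} + ε_t e^{α_t} = 2√(ε_t (1 − ε_t)),

since the correctly and incorrectly classified examples carry weight masses 1 − ε_t and ε_t respectively, and α_t is taken from Step 3. Because 2√(ε(1 − ε)) ≤ 1 with equality only at ε = 1/2, every added weak classifier with ε_t ≠ 1/2 strictly decreases the bound; this is the property the totally corrective step exploits below.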
2.1. Standard AdaBoost This upper bound is minimised by selecting a weak classi-
The totally corrective algorithm with coefficient updates fier with the smallest weighted error t on the training set
(TCAcu) is based on AdaBoost [1] and its structure is de- as done in Step 1 and by setting its coefficient as done in
picted in Algorithm 1. Schapire and Singer’s [9] notation is Step 3.
used and the algorithm differs from Schapire and Singer’s Second, the re-weighting scheme assures that the up-
2
  Σ_{i=1}^{m} D_{t+1}(i) u_{t,i} = 0        (2)

where u_{t,i} = h_t(x_i) y_i.

Step 1 of the AdaBoost algorithm at time t + 1 can also be written as

  h_{t+1} = arg max_{h_q ∈ H} Σ_{i=1}^{m} D_{t+1}(i) u_{q,i}        (3)

Employing equation (2), it is evident that the selected h_{t+1} is "maximally independent" of the mistakes made by h_t [9]. Moreover, for the weighted error ε_t^{t+1} of h_t, where the upper index indicates that the error is measured on the weights used at time t + 1,

  ε_t^{t+1} = (1/2)(1 − Σ_{i=1}^{m} D_{t+1}(i) u_{t,i}) = 1/2        (4)

The weak classifier h_t is therefore equivalent to a random guess on the weights D_{t+1}.

Summarising equations (1)-(4), the AdaBoost algorithm minimises the upper bound on the classification error, selects weak classifiers with the smallest weighted error, and the selected weak classifier at time t is maximally independent of the mistakes made by the weak classifier selected at time t − 1.

2.2. Totally corrective step

The independence property discussed in Section 2.1 is very attractive from the feature selection point of view. A question arises whether a new weak classifier, maximally independent of all the already selected ones, can be found. In such a case, the distribution D_{t+1} must satisfy

  Σ_{i=1}^{m} D_{t+1}(i) u_{q,i} = 0    for q = 1, ..., t        (5)

where u_{q,i} = h_q(x_i) y_i, or equivalently ε_q^{t+1} = 1/2 for q = 1, ..., t.

There is no closed-form solution to the system of equations (5) and sometimes even an exact solution does not exist [3]. This is a consequence of the non-negativity constraint on D_{t+1}, which is a distribution. Therefore, TCS is designed as an iterative optimisation algorithm.

Initialize D̂_0 = D_t
For j = 1, 2, ..., J_max:

  1. q_j = arg max_{q=1,...,t} |ε̂_q − 1/2|, where ε̂_q is the weighted error of h_q measured on D̂_j

  2. If |ε̂_{q_j} − 1/2| < ∆_min, exit the loop

  3. Set α̂_j = (1/2) log((1 − ε̂_{q_j})/ε̂_{q_j})

  4. Reweight
       D̂_{j+1}(i) = D̂_j(i) exp(−α̂_j u_{q_j,i}) / Z_j

  5. α_{q_j} = α_{q_j} + α̂_j

Assign D_{t+1} = D̂_j

Algorithm 2: The Totally Corrective Step.

In AdaBoost, equation (5) holds at time t (after reweighting, Step 4) only for q = t. A typical situation is depicted in Figure 1. Weak classifier errors are shown after step t = 10 of AdaBoost. The weak classifier errors differ from 0.5 except for the last (10th) weak classifier.

Figure 1: Weighted errors ε_t^{11} (eq. (4)) of weak classifiers h_t after ten iterations (T = 10) for AdaBoost (circles) and TCAcu (crosses). In AdaBoost, ε_t^{11} for all but the last weak classifier are arbitrary. In TCAcu, all errors satisfy |ε_t^{11} − 0.5| < ∆_min. Note that the selected weak classifiers may be different for AdaBoost and TCAcu.

Another observation can be made about the change of the upper bound. From equation (1), we see that the upper bound is reduced if the error of a newly added weak classifier differs from 0.5. The bigger the difference, the bigger the reduction of the upper bound. It follows that the upper bound can be further reduced by formally adding an already used weak classifier, if its error differs from 0.5. This addition has two important consequences.

First, because of the linear combination form of the final classifier, addition of an already used weak classifier h_r, r < t, requires only a change of the α_r coefficient, not a change of the final classifier size. The new coefficient is computed as α_r = α_r^r + α_r^{t+1}, where the upper indexes express the cycle in which the coefficient was computed.
Second, a new distribution obtained by this addition satisfies equation (5) for q = r, but not for any other q. If equation (5) is approximately satisfied for all q, the goal is reached. If not, another q is selected and h_q is "virtually added". Each such addition will lower the upper bound.

TCS is formally summarised in Algorithm 2. At time t, the distribution D_t is used to initialize the algorithm. In each iteration, a weak classifier is selected from the already used ones so that the absolute difference between its error and 0.5 is maximised. The standard scheme is used to find α̂_j and the new distribution D̂_{j+1}. The value α̂_j is added to the corresponding coefficient and the loop is repeated.

Since an exact solution may not exist, the computation is terminated if a close enough solution is found or if the maximum allowed number of iterations is reached. The final distribution is then used in cycle t + 1 of AdaBoost learning. A typical result of the algorithm is depicted in Figure 1.
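A compact implementation of Algorithm 2 might look as follows; this is a Python sketch with illustrative names and default constants, compatible with the Step 5 hook of the AdaBoost sketch in Section 2.1, and not the authors' code:

import numpy as np

def totally_corrective_step(D, X, y, selected, alphas, j_max=20, delta_min=1e-3):
    """One run of the totally corrective step over the weak classifiers chosen so far.

    D        -- current distribution D_t over the m training examples (numpy array)
    selected -- weak classifiers h_1, ..., h_t (each h(X) returns a {-1, +1} vector)
    alphas   -- their coefficients alpha_1, ..., alpha_t
    """
    D_hat = D.copy()
    alphas = list(alphas)
    U = np.array([y * h(X) for h in selected])       # u_{q,i} = h_q(x_i) y_i
    for _ in range(j_max):
        errs = 0.5 * (1.0 - U @ D_hat)               # weighted errors measured on D_hat
        q = int(np.argmax(np.abs(errs - 0.5)))       # Step 1: error furthest from 1/2
        if abs(errs[q] - 0.5) < delta_min:           # Step 2: all close to random guessing
            break
        eps = min(max(errs[q], 1e-12), 1.0 - 1e-12)
        a_hat = 0.5 * np.log((1.0 - eps) / eps)      # Step 3: the standard coefficient
        D_hat = D_hat * np.exp(-a_hat * U[q])        # Step 4: reweight
        D_hat = D_hat / D_hat.sum()                  # normalisation (Z_j)
        alphas[q] += a_hat                           # Step 5: merge into alpha_q
    return D_hat, alphas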
Convergence properties of the TCS step and of standard AdaBoost are the same. The only difference is in the set of weak classifiers, which in TCS is limited to the ones already selected in the main AdaBoost loop.

A similar algorithm was proposed by Kivinen and Warmuth [3]. TCAcu differs from the Kivinen and Warmuth algorithm in two important aspects: (1) the coefficients of weak classifiers are updated repetitively, (2) the property of minimisation of the upper bound is kept. The Kivinen and Warmuth algorithm was experimentally tested by Oza on several standard problems [6] with poor results.

3. Face detection and AdaBoost

The totally corrective algorithm was applied to the face detection problem using the framework introduced by Viola and Jones [12]. To train a classifier, Viola and Jones select from a large number of very efficiently computable features (see [12] for a detailed description). Every weak classifier implements a simple threshold function on one of the features. Having such a large set of weak classifiers, AdaBoost learning is used to choose a small number of weak classifiers and to combine them into a classifier deciding whether an image is a face or a non-face.
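Such a thresholded weak classifier can be sketched as follows; the Python below is illustrative, and the efficient Haar-like feature computation on integral images described in [12] is abstracted away as a callable:

import numpy as np

def make_stump(feature, threshold, polarity=1):
    """Weak classifier: a simple threshold function on one feature.

    feature   -- callable returning the (Haar-like) feature value of a region
    threshold -- decision threshold on that value
    polarity  -- +1 or -1, selecting the direction of the inequality
    """
    def h(x):
        # returns +1 / -1; works elementwise if feature(x) yields an array of values
        return polarity * np.where(feature(x) > threshold, 1, -1)
    return h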
Due to its greedy character, AdaBoost is able to cope with very large sets of weak classifiers. However, for face detection, a very large training set has to be explored as well in order to build a high-quality classifier. To overcome the infeasibility of this problem, a bootstrapping technique [11] is commonly used. Viola and Jones proposed another technique to cope with this problem.

Cascade building. Instead of training a single classifier, a cascade of classifiers is built. An image window (region) is passed to the first classifier. It is either classified as a non-face or the decision is deferred and the image is passed to the second, etc., classifier. The goal of each classifier is to prune the training set for the next stage classifier of the cascade. Since easily recognisable non-face images are classified in the early stages, classifiers of the later stages of the cascade can be trained rapidly on the harder, but smaller, part of the non-face training set.

Input: allowed stage false positive rate f and detection rate d; final false positive rate f_final
F_0 = 1, D_0 = 1
While F_i > f_final:

  1. Train a classifier until f_reached < f and d_reached > d on the validation set

  2. F_{i+1} = F_i × f_reached

  3. D_{i+1} = D_i × d_reached

  4. Throw away misclassified faces and generate new non-face data from non-face images

Algorithm 3: Building the cascade.

The cascade building is described in Algorithm 3. Inputs to the algorithm are: the desired false positive rate f and detection rate d of the cascade stages, and the final false positive rate of the cascade. Each stage is trained until f and d are reached. Since AdaBoost is designed neither to reach low false positive rates nor high detection rates, a threshold is adjusted ex post.

In the cascaded classifier, the overall false positive and detection rates are products of the rates of the individual stages. The pruning process is asymmetric and concentrates on the non-face images. The stage false positive rate f is usually set to higher values. The multiplication in Step 2 guarantees an exponential reduction of the overall false positive rate. The detection rate must be set close to one to ensure that the final D is high.
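In code, the outer loop of Algorithm 3 can be sketched as follows; train_stage, stage_accepts and sample_nonfaces are placeholders for the stage training, stage decision and bootstrapping steps, chosen here for illustration and not taken from the authors' implementation:

def build_cascade(train_stage, stage_accepts, sample_nonfaces,
                  faces, nonfaces, f=0.4, d=0.999, f_final=1e-4):
    """Cascade construction in the spirit of Algorithm 3 (all names illustrative).

    train_stage(faces, nonfaces, f, d) -> (stage, f_reached, d_reached)
        trains one boosted stage until f_reached < f and d_reached > d on the
        validation set (threshold adjusted ex post)
    stage_accepts(stage, x) -> bool    -- stage decision for one region
    sample_nonfaces(cascade)           -- fresh non-face windows not rejected
                                          by the current cascade
    """
    cascade = []
    F, D = 1.0, 1.0                         # F_0 = D_0 = 1
    while F > f_final:
        stage, f_reached, d_reached = train_stage(faces, nonfaces, f, d)
        cascade.append(stage)
        F *= f_reached                      # Step 2: overall false positive rate
        D *= d_reached                      # Step 3: overall detection rate
        # Step 4: throw away misclassified faces, bootstrap new non-face data
        faces = [x for x in faces if stage_accepts(stage, x)]
        nonfaces = sample_nonfaces(cascade)
    return cascade, F, D

With f = 0.4 each added stage multiplies the overall false positive rate by at most 0.4, so roughly ten stages suffice to reach f_final = 0.0001, while the overall detection rate stays close to 0.999 per stage.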
4. Experiments

The performance of TCAcu and AdaBoost was compared on the face detection problem. The training dataset, the training process and the obtained results are discussed next. The performance evaluation concentrates on the speed and complexity of the learned cascaded classifiers.

Training data. The data for training were collected from various sources. Face images are taken from the MPEG7 face dataset [2]. The dataset contains face images of variable quality, with different facial expressions, taken under a wide range of lighting conditions, with uniform or complex backgrounds. The pose of the heads is generally frontal with slight rotation in all directions. The eyes and the nose tip are aligned in all images. The dataset contains 3176 images; one image was removed due to severe distortion.
Figure 2: Selectivity comparison. Horizontal axis: the complexity of the cascaded classifier expressed by the number of weak classifiers used. Vertical axis: the number of weak classifier evaluations on the MIT+CMU dataset.

Pose variability was added synthetically to the data. The images were randomly rotated by up to 5°, shifted by up to one pixel, and the bounding box was scaled by a factor of 1 ± 0.05. Two datasets, training and validation, of the same size as the original dataset were created by the perturbations.
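A sketch of one such perturbation, assuming SciPy's ndimage helpers for the geometric transforms (the library choice is ours, not the authors'):

from scipy.ndimage import rotate, shift

def perturb_face(img, rng):
    """Generate one synthetically perturbed copy of an aligned face image."""
    angle = rng.uniform(-5.0, 5.0)              # random rotation, up to 5 degrees
    dx, dy = rng.uniform(-1.0, 1.0, size=2)     # shift of up to one pixel
    scale = rng.uniform(0.95, 1.05)             # bounding box scale factor 1 +/- 0.05
    out = rotate(img, angle, reshape=False, mode='nearest')
    out = shift(out, (dy, dx), mode='nearest')
    return out, scale                           # scale is applied when cropping the box

Applying this once per training face (e.g. with rng = numpy.random.default_rng()) yields perturbed training and validation sets of the same size as the original dataset.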
Non-face images were collected from the web. Images of diverse scenes were included. The dataset contains images of animals, plants, countryside, man-made objects, etc. More than 3000 images were collected and random sub-windows were used as non-face examples.

Training process. During the training process, the training and validation datasets are updated for each stage (cf. Algorithm 3). The non-face part of the training and validation datasets consists of 5000 randomly selected regions from the non-face images. Only regions that were not rejected by previous stages of the cascade are included. The face set remains almost the same over the whole training. The faces rejected by some of the stage classifiers are removed, but the cascade is built to ensure that these false rejects are just a small fraction of the face data.

The process is driven by the stage false positive, detection and final false positive rates. In the reported experiments, the values were set to a 0.4 stage false positive rate, a 0.999 detection rate and a 0.0001 final false positive rate. The final false positive rate was reached in stage eight for TCAcu and in stage ten for AdaBoost.

4.1. Results

The classifiers were tested on the MIT+CMU dataset [8]. This dataset has been widely used for comparison of face detectors [8, 10, 12]. The main objective of the experiments is to demonstrate the detection speedup in comparison with the classical Viola-Jones approach, rather than an improvement of the detection rate per se. This means that we did not try to find, e.g., the optimal sets of weak classifiers, since this is not important for a fair comparison of AdaBoost and TCAcu.

The results for the cascades trained by the two variants of AdaBoost are summarized in Table 1. For each number of stages in the cascade, the following quantities are recorded (left to right in Table 1): the number of weak classifiers forming a stage of the cascade, the total number of evaluations in each stage, and the false negatives and false positives on the MIT+CMU dataset.

It can be observed that for both algorithms the complexity of the stages increases gradually, except for two small fluctuations. At the beginning, the growth for TCAcu is slower; this changes after four stages. However, the complexity is not the only important factor determining the speed of face detection. The number of regions marked as a potential face in each stage is also significant. It can be seen that TCAcu discards many more regions in the early stages than AdaBoost. This early pruning influences the false positives and false negatives shown in the last two columns, which measure the performance of the cascaded classifier. The table shows that both algorithms lead to similar false positive and false negative rates, but TCAcu converges much faster.

To compare the speed of the cascades trained by TCAcu and AdaBoost, the number of weak classifiers evaluated on the MIT+CMU dataset was measured. All regions have to be evaluated by the first stage classifier. The number of evaluations is consequently the product of the number of regions and the length of the first stage classifier. The same holds for the second (and higher) stage classifiers, but only regions not rejected by the first (previous) stage(s) are evaluated. Summing the numbers of evaluations of the first and the second stage gives the number of evaluations of the two-stage cascade classifier. The result for all lengths of the cascade and for both algorithms is depicted in Figure 2.

Figure 2 demonstrates two important phenomena. First, the complexity of the cascades with comparable false negative and false positive rates is up to four times smaller for the TCAcu algorithm (six-stage TCAcu vs. eleven-stage AdaBoost). Second, the number of evaluations needed by AdaBoost is higher by 20 % than by TCAcu.
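The evaluation counts reported in Table 1 and Figure 2 follow directly from this description; a small helper of the following form could reproduce them from per-stage statistics (illustrative only):

def cascade_evaluations(stage_lengths, regions_entering):
    """Total number of weak-classifier evaluations of a cascade.

    stage_lengths    -- number of weak classifiers in each stage
    regions_entering -- number of regions reaching each stage (all scanned regions
                        reach stage 1; later stages see only the surviving regions)
    """
    return sum(n * l for n, l in zip(regions_entering, stage_lengths))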
Table 1: Comparison of the cascades trained by AdaBoost (AB) and TCAcu: stage classifier length, total number of weak classifier evaluations in each stage, and false negatives and false positives on the MIT+CMU dataset.

Number of   Stage classif. length   Number of evaluations    False negatives   False positives
stages          AB     TCAcu            AB        TCAcu         AB    TCAcu       AB      TCAcu
   1             6       6          12431151   12431151          0      0      3930473   3682307
   2            10       9           4009205    3757632          0      0      1598933   1054019
   3            14      11           1643072    1083123          0      1       795262    492881
   4            15      10            823246     512004          2      5       415751    173341
   5            17      22            435483     183962          4     16       189902     44287
   6            22      33            201982      49814         11     39        92226      6488
   7            23      36            100887       9052         17     83        46499      1584
   8            25      55             52867       3173         26    143        22966       183
   9            31       -             27504          -         35      -        11262         -
  10            40       -             14818          -         53      -         5070         -
  11            47       -              7568          -         59      -         2193         -
  12            55       -              4194          -         73      -          905         -
  13            49       -              2602          -         84      -          426         -

5. Conclusions

A new extension of the AdaBoost algorithm was proposed and compared with the state-of-the-art Viola and Jones face detection algorithm. The proposed TCAcu algorithm finds the final classifier by aggressive minimisation of the upper bound on the training error and produces a significantly shorter classifier. The obtained results are comparable to the Viola and Jones method in terms of detection and false positive rates. The classifier trained by the novel method was about 20 % faster and consisted of only a quarter of the weak classifiers needed by a classifier trained with standard AdaBoost.

The algorithm can be applied with other weak classifiers suitable for face detection and in conjunction with FloatBoost-like feature selection techniques. The reduction of the number of weak classifiers can be important in areas where the weak classifiers are expensive to compute or to implement, e.g. on smart cards or other special purpose hardware.

References

[4] S. Li, L. Zhu, Z. Zhang, A. Blake, H. Zhang, and H. Shum. Statistical learning of multi-view face detection. In ECCV, page IV: 67 ff., 2002.

[5] R. Lienhart, A. Kuranov, and V. Pisarevsky. Empirical analysis of detection cascades of boosted classifiers for rapid object detection. In DAGM, Magdeburg, Germany, September 2003.

[6] N. C. Oza. Boosting with averaged weight vectors. In Multiple Classifier Systems, pages 15-24, 2003.

[7] P. Pudil, J. Novovicova, and J. Kittler. Floating search methods in feature selection. Pattern Recognition Letters, 15:1119-1125, 1994.

[8] H. Rowley, S. Baluja, and T. Kanade. Neural network-based face detection. PAMI, 20(1):23-38, January 1998.