

General framework for object detection

Conference Paper · February 1998


DOI: 10.1109/ICCV.1998.710772 · Source: IEEE Xplore



A General Framework for Object Detection
Constantine P. Papageorgiou Michael Oren Tomaso Poggio

Center for Biological and Computational Learning


Artificial Intelligence Laboratory
MIT
Cambridge, MA 02139
{cpapa,oren,tp}@ai.mit.edu

Abstract

This paper presents a general trainable framework for object detection in static images of cluttered scenes. The detection technique we develop is based on a wavelet representation of an object class derived from a statistical analysis of the class instances. By learning an object class in terms of a subset of an overcomplete dictionary of wavelet basis functions, we derive a compact representation of an object class which is used as an input to a support vector machine classifier. This representation overcomes both the problem of in-class variability and provides a low false detection rate in unconstrained environments.

We demonstrate the capabilities of the technique in two domains whose inherent information content differs significantly. The first system is face detection and the second is the domain of people which, in contrast to faces, vary greatly in color, texture, and patterns. Unlike previous approaches, this system learns from examples and does not rely on any a priori (hand-crafted) models or motion-based segmentation. The paper also presents a motion-based extension to enhance the performance of the detection algorithm over video sequences. The results presented here suggest that this architecture may well be quite general.

1 Introduction

This paper presents a novel framework for object detection in cluttered scenes, based on the use of an overcomplete dictionary of basis functions combined with statistical learning techniques. The detection of real-world objects of interest, such as faces and people, poses challenging problems: these objects are difficult to model, there is significant variety in color and texture, and the backgrounds against which the objects lie are unconstrained. In contrast to the case of pattern classification, where we need to decide between well-defined classes, the detection problem requires us to differentiate between the object class and the rest of the world. As a result, the class model must accommodate the intra-class variability without compromising the discriminative power in distinguishing the object within cluttered scenes. We also cannot assume that there are a certain number of objects, if any, in the image; MAP or maximum likelihood methods will not work since the classification of each pattern in an image is done independently. This paper also introduces an extension that uses motion cues to improve detection accuracy over video sequences. This motion module is a general one that can be used with many detection algorithms and does not compromise the ability of the system to detect non-moving objects.

Initial work on the detection of rigid objects in static images, such as street signs or faces (Betke & Makris[1], Yuille, et al.[21]), used template matching approaches with a set of rigid templates or hand-crafted parameterized curves. These approaches are difficult to extend to more complex objects such as people, since they involve a significant amount of prior information and domain knowledge. In recent research, more closely related to our system, the detection problem is solved using learning-based techniques that are data driven. This approach was used by Sung & Poggio[16] and Vaillant, et al.[18] for the detection of frontal faces in cluttered scenes, with similar architectures presented by Moghaddam and Pentland[9], Rowley, et al.[14], and Osuna et al.[11].

Most previous systems that detect objects in video sequences focused on using motion and 3D models or constraints to find people: Tsukiyama & Shirai[17], Leung & Yang[6], Hogg[4], Rohr[13], Wren, et al.[20], Heisele, et al.[3], McKenna & Gong[8]. These systems suffer from restrictive assumptions on the scene structure, for instance, a single object in the scene or a stationary camera and a sequence of frames. In some of these motion-based systems, the focus is on model fitting, tracking and motion interpretation. In contrast, our work addresses the issue of detection in single static images in unconstrained environments with cluttered backgrounds, while making no assumption on the scene structure.

One of the major issues in developing a system that will handle complex classes of objects is finding an appropriate image representation. To illustrate the importance of an appropriate visual coding, Figure 1 shows images of people and their corresponding edge maps. It is clear that both the pixel and edge-based representations are inadequate; the pedestrian images vary greatly in color and texture and the edge maps
Figure 3: Examples of faces used for training. The images are gray level of size 19 x 19 pixels.

Figure 4: Ensemble average values of the wavelet coefficients for faces, coded using color. Each basis function is displayed as a single square in the images above. Coefficients whose values are close to the average value of 1 are coded gray, the ones which are above the average are coded using red and those below the average are coded using blue. We can observe strong features in the eye areas and the nose. Also, the cheek area is an area of almost uniform intensity, i.e. below-average coefficients. (a)-(c) vertical, horizontal and diagonal coefficients of scale 4 x 4 of images of faces. (d)-(f) vertical, horizontal and diagonal coefficients of scale 2 x 2 of images of faces.
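The ensemble-average analysis visualized in Figure 4 can be sketched in code. The following is a minimal illustration under stated assumptions, not the authors' implementation: `haar_coefficients` (a hypothetical helper) computes absolute vertical, horizontal, and diagonal Haar responses via an integral image, and `ensemble_averages` normalizes each example so that the coefficients of a random pattern average roughly 1 before averaging over the class, so values well above 1 flag consistent boundaries and values well below 1 flag uniform regions.

```python
import numpy as np

def haar_coefficients(img, scale):
    """Absolute vertical, horizontal, and diagonal Haar responses at every
    position, for a square support of side `scale` (assumed even)."""
    s = scale // 2
    # integral image: ii[y, x] = sum of img[:y, :x], so any box sum is O(1)
    ii = np.pad(img, ((1, 0), (1, 0))).cumsum(0).cumsum(1)
    def box(y, x, h, w):
        return ii[y + h, x + w] - ii[y, x + w] - ii[y + h, x] + ii[y, x]
    ny, nx = img.shape[0] - scale + 1, img.shape[1] - scale + 1
    vert, horiz, diag = (np.empty((ny, nx)) for _ in range(3))
    for y in range(ny):
        for x in range(nx):
            vert[y, x] = abs(box(y, x, scale, s) - box(y, x + s, scale, s))
            horiz[y, x] = abs(box(y, x, s, scale) - box(y + s, x, s, scale))
            diag[y, x] = abs(box(y, x, s, s) + box(y + s, x + s, s, s)
                             - box(y, x + s, s, s) - box(y + s, x, s, s))
    return np.stack([vert, horiz, diag])

def ensemble_averages(examples, scale):
    """Class-averaged coefficients, normalized per image so that the mean
    coefficient is 1; >> 1 marks a consistent boundary, << 1 a uniform region."""
    coeffs = [haar_coefficients(img, scale) for img in examples]
    coeffs = [c / c.mean() for c in coeffs]
    return sum(coeffs) / len(coeffs)
```

On a class of synthetic examples sharing a vertical boundary, the vertical coefficients straddling that boundary average well above 1, while coefficients over uniform regions average near 0.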

the normalized coefficients over the entire set of examples. The normalization has the property that the average value of coefficients of random patterns will be 1. If the average value of a coefficient is much greater than 1, this indicates that the coefficient is encoding a boundary between two regions that is consistent along the examples of the class; similarly, if the average value of a coefficient is much smaller than 1, that coefficient encodes a uniform region.

To illustrate this analysis, we code the coefficients' values using grey-scale in Figure 4, where each coefficient, or basis function, is drawn as a distinct square in the image. The arrangement of the squares corresponds to the spatial location of the basis functions, where strong coefficients (large average values) are coded by darker grey levels and weak coefficients (small average values) are coded by lighter grey levels. It is important to note that in Figure 4, a basis function corresponds to a single square in each image and not the entire image. It is interesting to observe how the different types of wavelets - vertical, horizontal, and diagonal - capture various facial features, such as the eyes, nose, and mouth.

From this statistical analysis, we derive a set of 37 coefficients, from both the coarse and finer scales, that capture the significant features of the face. These significant bases consist of 12 vertical, 14 horizontal, and 3 diagonal coefficients at the scale of 4 x 4 and 3 vertical, 2 horizontal, and 3 corner coefficients at the scale of 2 x 2. Figure 5 shows a typical human face from our training database with the significant 37 coefficients drawn in the proper configuration.

Figure 5: The significant basis functions for face detection that are uncovered through our learning strategy, overlayed on an example image of a face.

For the task of pedestrian detection, we use a database of 924 color images of people (Figure 1). A similar analysis of the average values of the coefficients was done for the pedestrian class and Figure 6 shows the grey-scale coding similar to Figure 4. We refer the interested reader to [10] for the details. It is interesting to observe that for the pedestrian class, there are no strong internal patterns as in the face class; rather, the significant basis functions are along the exterior boundary of the class, indicating a different type of significant visual information. Through the same type of analysis, we choose 29 significant coefficients from the initial, overcomplete set of 1326 wavelet coefficients. These basis functions are shown overlayed on an example pedestrian in Figure 7.

It should be observed that, from the viewpoint of the classification task, we could use the whole set of coefficients as a feature vector. However, using all the wavelet functions that describe a window of 128 x 64 pixels in the case of pedestrians would yield vectors of very high dimensionality, as we mentioned earlier. The training of a classifier with such a high dimensionality, on the order of 1000, would in turn require too large
Figure 6: Ensemble average values of the wavelet coefficients coded using gray level. Coefficients whose values are above the template average are darker, those below the average are lighter. (a) vertical coefficients of random scenes. (b)-(d) vertical, horizontal and corner coefficients of scale 32 x 32 of images of people. (e)-(g) vertical, horizontal and corner coefficients of scale 16 x 16 of images of people.

an example set. This dimensionality reduction stage serves to select the basis functions relevant for this task and to reduce their number considerably.

Figure 7: The significant basis functions for pedestrian detection that are uncovered through our learning strategy, overlayed on an example image of a pedestrian.

3.2 Stage 2: Learning the Class Model

Once we have identified the important basis functions, we can use various classification techniques to learn the relationships between the wavelet coefficients that define the object class. The classification technique we use is the support vector machine (SVM) developed by Vapnik et al.[2][19]. This recently developed technique has the appealing features of having very few tunable parameters and using structural risk minimization, which minimizes a bound on the generalization error (see [11][12]).

We train our systems using databases of positive examples gathered from outdoor and indoor scenes. The initial negative examples in the training database are patterns from natural scenes not containing people or faces. While the target class is well-defined, there are no typical examples of the negative class. To overcome the problem of defining this extremely large negative class, we use the idea of "bootstrapping" training [16]. In the context of the pedestrian detection system, after the initial training, we run the system over arbitrary images that do not contain any people, adding false detections into the training set as examples of the negative class, and retraining the classifier (Figure 8). This incremental refinement of the decision surface is iterated until satisfactory performance is achieved.

Figure 8: Incremental bootstrapping to improve the system performance.
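The bootstrapping cycle of Figure 8 can be sketched as follows. This is a toy illustration under stated assumptions, not the paper's implementation: a tiny linear SVM trained by hinge-loss subgradient descent stands in for the polynomial-kernel SVM actually used, and `mine_windows`, the +1/-1 labeling, and all hyperparameters are hypothetical.

```python
import numpy as np

def train_linear_svm(X, y, lam=0.01, lr=0.05, epochs=500):
    """Hinge-loss subgradient descent; a stand-in for the paper's SVM."""
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            if yi * (xi @ w + b) < 1:          # margin violated: push toward xi
                w = (1 - lr * lam) * w + lr * yi * xi
                b += lr * yi
            else:                              # margin satisfied: only regularize
                w = (1 - lr * lam) * w
    return w, b

def bootstrap(pos, neg, mine_windows, rounds=5):
    """Train, scan object-free images, add false detections as negatives,
    and retrain until no false detections remain (or `rounds` is reached)."""
    for _ in range(rounds):
        X = np.vstack([pos, neg])
        y = np.hstack([np.ones(len(pos)), -np.ones(len(neg))])
        w, b = train_linear_svm(X, y)
        cand = mine_windows()                  # windows known to contain no objects
        fp = cand[cand @ w + b > 0]            # false detections of the current model
        if len(fp) == 0:
            break                              # decision surface is satisfactory
        neg = np.vstack([neg, fp])             # grow the negative class
    return w, b, neg
```

The key design point is that the negative class is never enumerated up front; it is refined incrementally from the classifier's own mistakes.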

(a) Face Detection System (b) People Detection System

Figure 9: ROC curves for the detection systems. The detection rate is plotted against the false detection rate, measured on a logarithmic scale. The false detection rate is defined as the number of false detections per inspected window. (a) Face Detection: System A was trained with equal penalty for missed positive examples and false detections; systems B and C were trained with penalties for missed positive examples that were 1 and 2 orders of magnitude greater than the penalty for false detections. (b) People Detection: System A penalizes incorrect classifications of positive and negative examples equally; system B penalizes incorrectly classified positive examples 5 times more than negative examples.
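The ROC curves in Figure 9 are traced by varying the classifier threshold over every scanned window; Section 4 describes the underlying multi-scale scan. A minimal sketch of such a scan follows, under stated assumptions: the real system shifts in wavelet-coefficient space rather than reclassifying raw pixels, `classify_window` is a hypothetical stand-in for the SVM's real-valued score, and nearest-neighbor subsampling stands in for proper image rescaling.

```python
import numpy as np

def multiscale_detect(image, classify_window, win=(19, 19),
                      scales=(0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0)):
    """Slide a fixed-size window over the image at several scales; varying
    the threshold on `score` (fixed at 0 here) traces out an ROC curve."""
    wh, ww = win
    H, W = image.shape
    detections = []
    for s in scales:
        h, w = int(H * s), int(W * s)
        if h < wh or w < ww:
            continue                          # image too small for the window
        # nearest-neighbor resize: a crude stand-in for proper rescaling
        ys = np.minimum((np.arange(h) / s).astype(int), H - 1)
        xs = np.minimum((np.arange(w) / s).astype(int), W - 1)
        scaled = image[np.ix_(ys, xs)]
        for y in range(h - wh + 1):
            for x in range(w - ww + 1):
                score = classify_window(scaled[y:y + wh, x:x + ww])
                if score > 0:
                    # map the window back to original-image coordinates
                    detections.append((int(y / s), int(x / s), s, score))
    return detections
```

The default scale list mirrors the face experiments (0.2 to 1.0 in steps of 0.1); the pedestrian system extends it to 2.0.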

4 The Experimental Results

The system detects objects in arbitrary positions in the image and at different scales. Once the training phase in Section 3 is complete, the system can detect objects at arbitrary positions by scanning all possible locations in the image by shifting the detection window. This is combined with iteratively resizing the image to achieve multi-scale detection. For our experiments with faces, we detected faces from the minimal size of 19 x 19 to 5 times this size by scaling the novel image from 0.2 to 1.0 times its original size, at increments of 0.1. For pedestrians, the image is scaled from 0.2 to 2.0 times its original size, again in increments of 0.1. At any given scale, instead of recomputing the wavelet coefficients for every window in the image, we compute the transform for the whole image and do the shifting in the coefficient space.

4.1 Face Detection

To evaluate the face detection system performance, we start with a database of 2429 positive examples and 1000 negative examples. To understand the effect of different penalties in the Support Vector training (see [11][12]), we train several systems using different penalties for misclassification. The systems undergo the bootstrapping cycle detailed in Section 3, and end up with between 4500 and 9500 negative examples. Out-of-sample performance is evaluated using a set of 131 faces, and the rate of false detections is determined by running the system over approximately 900,000 patterns from images of natural scenes that do not contain either faces or people. To give a complete characterization of the systems, we generate ROC curves that illustrate the accuracy/false detection rate tradeoffs, rather than give a single performance result. This is accomplished by varying the classification threshold in the support vector machine. The ROC curves are shown in Figure 9a and indicate that even higher penalties for missed positive examples may result in better performance. We can see that, if we allow one false detection per 7,500 windows examined, the rate of correctly detected faces reaches 75%.

In Figure 10 we show the results of running the face detection system over example images. The missed detections are due to higher degrees of rotation than were present in the training database; with further training on an appropriate set of rotated examples, these types of rotations could be detected. In the image in the lower right, there are several incorrect detections. Again, we expect that with further training, these can be eliminated.

Figure 10: Results from the face detection system. The missed instances are due to higher degrees of rotation than were present in the training database; false detections can be eliminated with additional training.

4.2 People Detection

The frontal and rear pedestrian detection system starts with 924 positive examples and 789 negative examples and goes through 9 bootstrapping steps, ending up with a set of 9726 patterns that define the non-pedestrian class. We measure performance on novel data using a set of 105 pedestrian images that are close to frontal or rear views; it should be emphasized that we do not choose test images of pedestrians in perfect frontal or rear poses; rather, many of these test images represent slightly rotated or walking views of pedestrians. We use a set of 2,800,000 patterns from natural scenes to measure the false detection rate. We give the ROC curves for the pedestrian detection system in Figure 9b; as with faces, these curves indicate

that even larger penalty terms for missed positive examples may improve accuracy significantly. From the curve, we can see, for example, that if we have a tolerance of one false positive for every 15,000 windows examined, we can achieve a detection rate of 70%.

Figure 11 exhibits some typical images that are processed by the pedestrian detection system; the images are very cluttered scenes crowded with complex patterns. These images show that the architecture is able to effectively handle detection of people with different clothing under varying illumination conditions.

Considering the complexity of these scenes and the difficulties of object detection in cluttered scenes, we consider the above detection rates to be high. We believe that additional training and refinement of the current systems will reduce the false detection rates further.

5 Motion Extension

In the case of video sequences, we can utilize motion information to enhance the robustness of the detection; we use the pedestrian detection system as a testbed. We compute the optical flow between consecutive images and detect discontinuities in the flow field that indicate probable motion of objects relative to the background. We then grow these regions of discontinuity using morphological operators to define the full regions of interest. In these regions of motion, the likely class of objects is limited, so we can relax the strictness of the classifier. It is important to observe that, unlike most person detection systems, we do not assume a static camera, nor do we need to recover camera ego-motion; rather, we use the dynamic motion information to assist the classifier. Additionally, the use of motion information does not compromise the ability of the system to detect non-moving people. Figure 12 demonstrates how the motion cues enhance the performance of the system.

We test the system over a sequence of 208 frames; the detection results are shown in Table 1. Out of a possible 827 pedestrians in the video sequence - including side views, for which the system is not trained - the base system correctly detects 360 (43.5%) of them, with a false detection rate of 1 per 236,500 windows. The system enhanced with the motion module detects 445 (53.8%) of the pedestrians, a 23.7% increase in detection accuracy, while maintaining a false detection rate of 1 per 90,000 windows. It is important to reiterate that the detection accuracy for non-moving objects is not compromised; in the areas of the image where there is no motion, the classifier simply runs as before. Furthermore, the majority of the false positives in the motion-enhanced system were partial body detections, i.e. a detection with the head cut off, which were still counted as false detections. Taking this factor into account, the false detection rate is even lower.

This relaxation paradigm has difficulties when there are a large number of moving bodies in the frame or when the pedestrian motion is very small compared to the camera motion. Based on our results, though, we feel that this integration of a trained classifier with the module that provides motion cues could be extended to other systems as well.
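The region-of-interest step of the motion module can be sketched as follows. This is a simplified, dependency-free illustration rather than the paper's method: thresholded frame differencing stands in for optical-flow discontinuity detection, repeated 3 x 3 dilation plays the role of the morphological region growing, and `detect_with_motion` with its two thresholds is hypothetical.

```python
import numpy as np

def motion_regions(prev, curr, thresh=0.2, grow=2):
    """Regions of probable object motion: threshold the inter-frame change
    (a crude proxy for flow-field discontinuities), then grow the regions
    by repeated dilation with a 3x3 structuring element."""
    moving = np.abs(curr - prev) > thresh          # motion discontinuity mask
    for _ in range(grow):
        p = np.pad(moving, 1)                      # pad with False, then OR shifts
        moving = (p[1:-1, 1:-1]
                  | p[:-2, 1:-1] | p[2:, 1:-1] | p[1:-1, :-2] | p[1:-1, 2:]
                  | p[:-2, :-2] | p[:-2, 2:] | p[2:, :-2] | p[2:, 2:])
    return moving

def detect_with_motion(windows, scores, in_motion_flags,
                       base_t=1.0, relaxed_t=0.5):
    """Relax the classifier threshold inside motion regions; elsewhere the
    classifier runs exactly as before, so static objects are still found."""
    return [w for w, s, m in zip(windows, scores, in_motion_flags)
            if s > (relaxed_t if m else base_t)]
```

Because the threshold is only lowered where motion is present, a window with no motion faces exactly the original decision rule, matching the paper's claim that static-object detection is not compromised.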
Figure 11: Results from the pedestrian detection system. These are typical images of relatively complex scenes
that are used to test the system. Missed examples of pedestrians are usually due to the figure being merged with
the background.
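As a worked check of the numbers reported for the video experiment (Section 5 and Table 1), not part of the paper: the detection rates follow from 360/827 and 445/827, and the quoted 23.7% relative gain corresponds to the ratio of the rounded rates; the raw counts give (445 - 360)/360 = 23.6%.

```python
total_pedestrians = 827                  # pedestrians in the 208-frame sequence
base_hits, motion_hits = 360, 445

base_rate = round(100 * base_hits / total_pedestrians, 1)      # 43.5
motion_rate = round(100 * motion_hits / total_pedestrians, 1)  # 53.8
# the paper's "23.7% increase" is the ratio of the rounded rates
relative_gain = round(100 * (motion_rate / base_rate - 1), 1)  # 23.7
```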

System             Detection Rate   False Positive Rate (per window)
Base system        43.5%            1:236,500
Motion extension   53.8%            1:90,000

Table 1: Performance of the pedestrian detection system with the motion-based extensions, compared to the base system.

6 Conclusion

In this paper, we describe the idea of an overcomplete wavelet representation and demonstrate how it can be learned and used for object detection in a cluttered scene. This representation yields not only a computationally efficient algorithm but an effective learning scheme as well.

We have decomposed the learning of an object class into a two-stage learning process. In the first stage, we perform a dimensionality reduction where we identify the most important basis functions from an original overcomplete set of basis functions. The relationships between the basis functions which define the class model are learned in the second stage using a support vector machine (SVM). Without this dimensionality reduction stage, the training on the original overcomplete set would be difficult, if not intractable. Most of the basis functions in the original full set do not necessarily convey relevant information about the object class we are learning, but, by starting with a large overcomplete dictionary, we would not sacrifice details or spatial accuracy. The learning step extracts the most prominent features and results in a significant dimensionality reduction.

We also present an extension that uses motion cues to improve pedestrian detection accuracy over video sequences. This module is appealing in that, unlike most systems, it does not totally rely on motion to accomplish detection; rather, it takes advantage of the a priori knowledge that the class of moving objects is limited while not compromising performance in detecting non-moving pedestrians.

The strength of our system comes from the expressive power of the overcomplete set of basis functions - this representation effectively encodes the intensity relationships of certain pattern regions that define a complex object class. The encouraging results of our system in two different domains, faces and people, suggest that the approach described in this paper may well generalize to several other object detection tasks.

References

[1] M. Betke and N. Makris. Fast object recognition in
Figure 12: The sequence of steps in the motion-based module showing, from left to right, static detection results, motion discontinuities, full motion regions, and improved detection results.

noisy images using simulated annealing. In Proceedings of the Fifth International Conference on Computer Vision, pages 523-30, 1995.
[2] B. Boser, I. Guyon, and V. Vapnik. A training algorithm for optimal margin classifiers. In Proceedings of the Fifth Annual ACM Workshop on Computational Learning Theory, pages 144-52. ACM, 1992.
[3] B. Heisele, U. Kressel, and W. Ritter. Tracking non-rigid, moving objects based on color cluster flow. In CVPR '97, 1997. To appear.
[4] D. Hogg. Model-based vision: a program to see a walking person. Image and Vision Computing, 1(1):5-20, 1983.
[5] M. Leung and Y.-H. Yang. Human body motion segmentation in a complex scene. Pattern Recognition, 20(1):55-64, 1987.
[6] M. Leung and Y.-H. Yang. A region based approach for human body analysis. Pattern Recognition, 20(3):321-39, 1987.
[7] S. Mallat. A theory for multiresolution signal decomposition: The wavelet representation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 11(7):674-93, July 1989.
[8] S. McKenna and S. Gong. Non-intrusive person authentication for access control by visual tracking and face recognition. In J. Bigun, G. Chollet, and G. Borgefors, editors, Audio- and Video-based Biometric Person Authentication, pages 177-183. IAPR, Springer, 1997.
[9] B. Moghaddam and A. Pentland. Probabilistic visual learning for object detection. Technical Report 326, Media Laboratory, Massachusetts Institute of Technology, 1995.
[10] M. Oren, C. Papageorgiou, P. Sinha, E. Osuna, and T. Poggio. Pedestrian detection using wavelet templates. In Computer Vision and Pattern Recognition, pages 193-99, 1997.
[11] E. Osuna, R. Freund, and F. Girosi. Support vector machines: Training and applications. A.I. Memo 1602, MIT A.I. Lab, 1997.
[12] E. Osuna, R. Freund, and F. Girosi. Training support vector machines: An application to face detection. In Computer Vision and Pattern Recognition, pages 130-36, 1997.
[13] K. Rohr. Incremental recognition of pedestrians from image sequences. In Computer Vision and Pattern Recognition, pages 8-13, 1993.
[14] H. Rowley, S. Baluja, and T. Kanade. Human face detection in visual scenes. Technical Report CMU-CS-95-158, School of Computer Science, Carnegie Mellon University, July/November 1995.
[15] K.-K. Sung. Learning and Example Selection for Object and Pattern Detection. PhD thesis, Artificial Intelligence Laboratory, Massachusetts Institute of Technology, December 1995.
[16] K.-K. Sung and T. Poggio. Example-based learning for view-based human face detection. A.I. Memo 1521, Artificial Intelligence Laboratory, Massachusetts Institute of Technology, December 1994.
[17] T. Tsukiyama and Y. Shirai. Detection of the movements of persons from a sparse sequence of TV images. Pattern Recognition, 18(3/4):207-13, 1985.
[18] R. Vaillant, C. Monrocq, and Y. Le Cun. Original approach for the localisation of objects in images. IEE Proc.-Vis. Image Signal Processing, 141(4), August 1994.
[19] V. Vapnik. The Nature of Statistical Learning Theory. Springer Verlag, 1995.
[20] C. Wren, A. Azarbayejani, T. Darrell, and A. Pentland. Pfinder: Real-time tracking of the human body. Technical Report 353, Media Laboratory, Massachusetts Institute of Technology, 1995.
[21] A. Yuille, P. Hallinan, and D. Cohen. Feature extraction from faces using deformable templates. International Journal of Computer Vision, 8(2):99-111, 1992.
