A General Framework for Object Detection
Tomaso A. Poggio et al., Massachusetts Institute of Technology
Figure 3: Examples of faces used for training. The images are gray-level, of size 19 x 19 pixels.
4 The Experimental Results

The system detects objects at arbitrary positions in the image and at different scales. Once the training phase in Section 3 is complete, the system can detect objects at arbitrary positions by scanning all possible locations in the image by shifting the detection window. This is combined with iteratively resizing the image to achieve multi-scale detection. For our experiments with faces, we detected faces from the minimal size of 19 x 19 up to 5 times this size by scaling the novel image from 0.2 to 1.0 times its original size, at increments of 0.1. For pedestrians, the image is scaled from 0.2 to 2.0 times its original size, again in increments of 0.1. At any given scale, instead of recomputing the wavelet coefficients for every window in the image, we compute the transform for the whole image and do the shifting in the coefficient space.

4.1 Face Detection

To evaluate the face detection system performance, we start with a database of 2429 positive examples and 1000 negative examples. To understand the effect of different penalties in the Support Vector training (see [11] [12]), we train several systems using different penalties for misclassification. The systems undergo the bootstrapping cycle detailed in Section 3, and end up with between 4500 and 9500 negative examples. Out-of-sample performance is evaluated using a set of 131 faces, and the rate of false detections is determined by running the system over approximately 900,000 patterns from images of natural scenes that do not contain either faces or people. To give a complete characterization of the systems, we generate ROC curves that illustrate the accuracy/false detection rate tradeoffs, rather than give a single performance result. This is accomplished by varying the classification threshold in the support vector machine. The ROC curves are shown in Figure 9a and indicate that even higher penalties for missed positive examples may result in better performance. We can see that, if we allow one false detection per 7,500 windows examined, the rate of correctly detected faces reaches 75%.

In Figure 10 we show the results of running the face detection system over example images. The missed detections are due to higher degrees of rotation than were present in the training database; with further training on an appropriate set of rotated examples, these types of rotations could be detected. In the image in the lower right, there are several incorrect detections. Again, we expect that with further training, these can be eliminated.

4.2 People Detection

The frontal and rear pedestrian detection system starts with 924 positive examples and 789 negative examples and goes through 9 bootstrapping steps, ending up with a set of 9726 patterns that define the non-pedestrian class. We measure performance on novel data using a set of 105 pedestrian images that are close to frontal or rear views; it should be emphasized that we do not choose test images of pedestrians in perfect frontal or rear poses; rather, many of these test images represent slightly rotated or walking views of pedestrians. We use a set of 2,800,000 patterns from natural scenes to measure the false detection rate. We give the ROC curves for the pedestrian detection system in Figure 9b; as with faces, these curves indicate
that even larger penalty terms for missed positive examples may improve accuracy significantly. From the curve, we can see, for example, that if we have a tolerance of one false positive for every 15,000 windows examined, we can achieve a detection rate of 70%.

Figure 10: Results from the face detection system. The missed instances are due to higher degrees of rotation than were present in the training database; false detections can be eliminated with additional training.

Figure 11 exhibits some typical images that are processed by the pedestrian detection system; the images are very cluttered scenes crowded with complex patterns. These images show that the architecture is able to effectively handle detection of people with different clothing under varying illumination conditions.

Considering the complexity of these scenes and the difficulties of object detection in cluttered scenes, we consider the above detection rates to be high. We believe that additional training and refinement of the current systems will reduce the false detection rates further.

5 Motion Extension

In the case of video sequences, we can utilize motion information to enhance the robustness of the detection; we use the pedestrian detection system as a testbed. We compute the optical flow between consecutive images and detect discontinuities in the flow field that indicate probable motion of objects relative to the background. We then grow these regions of discontinuity using morphological operators to define the full regions of interest. In these regions of motion, the likely class of objects is limited, so we can relax the strictness of the classifier. It is important to observe that, unlike most person detection systems, we do not assume a static camera, nor do we need to recover camera ego-motion; rather, we use the dynamic motion information to assist the classifier. Additionally, the use of motion information does not compromise the ability of the system to detect non-moving people. Figure 12 demonstrates how the motion cues enhance the performance of the system.

We test the system over a sequence of 208 frames; the detection results are shown in Table 1. Out of a possible 827 pedestrians in the video sequence - including side views, for which the system is not trained - the base system correctly detects 360 (43.5%) of them with a false detection rate of 1 per 236,500 windows. The system enhanced with the motion module detects 445 (53.8%) of the pedestrians, a 23.7% increase in detection accuracy, while maintaining a false detection rate of 1 per 90,000 windows. It is important to reiterate that the detection accuracy for non-moving objects is not compromised; in the areas of the image where there is no motion, the classifier simply runs as before. Furthermore, the majority of the false positives in the motion-enhanced system were partial body detections, i.e., a detection with the head cut off, which were still counted as false detections. Taking this factor into account, the false detection rate is even lower.

This relaxation paradigm has difficulties when there are a large number of moving bodies in the frame or when the pedestrian motion is very small compared to the camera motion. Based on our results, though, we feel that this integration of a trained classifier with the module that provides motion cues could be extended to other systems as well.
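The motion module above can be illustrated with a short sketch. This is only a toy stand-in: simple frame differencing replaces the optical-flow discontinuity detection, a hand-rolled neighborhood expansion replaces the morphological operators, and the threshold values and toy frames are hypothetical, not taken from the original system.

```python
# Sketch of the motion-based relaxation module (hypothetical values).
# Frame differencing stands in for optical-flow discontinuities.

def motion_mask(prev, curr, diff_thresh=1):
    """Mark pixels whose intensity changed between consecutive frames."""
    h, w = len(curr), len(curr[0])
    return [[abs(curr[y][x] - prev[y][x]) > diff_thresh for x in range(w)]
            for y in range(h)]

def dilate(mask, steps=1):
    """Grow the motion regions, mimicking a morphological dilation."""
    h, w = len(mask), len(mask[0])
    for _ in range(steps):
        mask = [[mask[y][x] or
                 any(mask[y + dy][x + dx]
                     for dy in (-1, 0, 1) for dx in (-1, 0, 1)
                     if 0 <= y + dy < h and 0 <= x + dx < w)
                 for x in range(w)] for y in range(h)]
    return mask

def detect(score, in_motion, base=1.0, relaxed=0.5):
    """Relax the classifier threshold inside motion regions; elsewhere
    the classifier simply runs as before, so static people are kept."""
    return score > (relaxed if in_motion else base)
```

The key design point is the last function: the motion cue never vetoes a detection, it only lowers the acceptance threshold where motion was found, which is why detection of non-moving people is not compromised.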
Figure 11: Results from the pedestrian detection system. These are typical images of relatively complex scenes
that are used to test the system. Missed examples of pedestrians are usually due to the figure being merged with
the background.
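The detections in such images are produced by the multi-scale scan described at the start of Section 4: shift a 19 x 19 window over iteratively rescaled copies of the image. In the sketch below, `classify_window` is a hypothetical stand-in for the trained support vector classifier, and the nearest-neighbor resize is only for illustration; the real system works in wavelet-coefficient space rather than re-extracting pixel patches.

```python
# Multi-scale sliding-window scan (sketch; classify_window is hypothetical).

WIN = 19  # the detection window is 19 x 19 pixels

def rescale(img, s):
    """Nearest-neighbor resize of a 2-D list image by factor s."""
    h, w = len(img), len(img[0])
    nh, nw = max(1, int(h * s)), max(1, int(w * s))
    return [[img[min(h - 1, int(y / s))][min(w - 1, int(x / s))]
             for x in range(nw)] for y in range(nh)]

def scan(img, classify_window, scales):
    """Yield (scale, x, y) for every window the classifier accepts.
    Scaling the image DOWN finds objects LARGER than 19 x 19."""
    for s in scales:
        scaled = rescale(img, s)
        for y in range(len(scaled) - WIN + 1):
            for x in range(len(scaled[0]) - WIN + 1):
                patch = [row[x:x + WIN] for row in scaled[y:y + WIN]]
                if classify_window(patch):
                    yield (s, x, y)

# Faces: scales 0.2 to 1.0 in increments of 0.1, as in the text.
FACE_SCALES = [round(0.2 + 0.1 * i, 1) for i in range(9)]
```

Scanning at scale 0.2 makes a face five times the minimal size fill the 19 x 19 window, which is how the stated 19 x 19 to 5x size range is covered.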
Figure 12: The sequence of steps in the motion-based module showing, from left to right, static detection results, motion discontinuities, full motion regions, and improved detection results.
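The operating points quoted throughout Sections 4 and 5 (for example, 75% detection at one false detection per 7,500 windows) are obtained by varying the classification threshold of the support vector machine. A minimal sketch of that sweep follows; the score lists are small hypothetical examples, not outputs of the real classifier.

```python
# Sweep a decision threshold to trace out ROC-style operating points.
# The score lists used here are hypothetical, not real classifier outputs.

def roc_points(pos_scores, neg_scores, n_windows):
    """Return (threshold, detection rate, false detections per window)
    for every distinct score used as a candidate threshold."""
    points = []
    for t in sorted(set(pos_scores + neg_scores)):
        detection_rate = sum(s >= t for s in pos_scores) / len(pos_scores)
        false_per_window = sum(s >= t for s in neg_scores) / n_windows
        points.append((t, detection_rate, false_per_window))
    return points
```

Lowering the threshold raises the detection rate at the cost of more false detections per window scanned; plotting these pairs gives exactly the tradeoff curves of Figures 9a and 9b.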
noisy images using simulated annealing. In Proceedings of the Fifth International Conference on Computer Vision, pages 523-30, 1995.
[2] B. Boser, I. Guyon, and V. Vapnik. A training algorithm for optimal margin classifiers. In Proceedings of the Fifth Annual ACM Workshop on Computational Learning Theory, pages 144-52. ACM, 1992.
[3] B. Heisele, U. Kressel, and W. Ritter. Tracking non-rigid, moving objects based on color cluster flow. In CVPR '97, 1997. To appear.
[4] D. Hogg. Model-based vision: a program to see a walking person. Image and Vision Computing, 1(1):5-20, 1983.
[5] M. Leung and Y.-H. Yang. Human body motion segmentation in a complex scene. Pattern Recognition, 20(1):55-64, 1987.
[6] M. Leung and Y.-H. Yang. A region based approach for human body analysis. Pattern Recognition, 20(3):321-39, 1987.
[7] S. Mallat. A theory for multiresolution signal decomposition: The wavelet representation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 11(7):674-93, July 1989.
[8] S. McKenna and S. Gong. Non-intrusive person authentication for access control by visual tracking and face recognition. In J. Bigun, G. Chollet, and G. Borgefors, editors, Audio- and Video-based Biometric Person Authentication, pages 177-183. IAPR, Springer, 1997.
[9] B. Moghaddam and A. Pentland. Probabilistic visual learning for object detection. Technical Report 326, Media Laboratory, Massachusetts Institute of Technology, 1995.
[10] M. Oren, C. Papageorgiou, P. Sinha, E. Osuna, and T. Poggio. Pedestrian detection using wavelet templates. In Computer Vision and Pattern Recognition, pages 193-99, 1997.
[11] E. Osuna, R. Freund, and F. Girosi. Support vector machines: Training and applications. A.I. Memo 1602, MIT A.I. Lab., 1997.
[12] E. Osuna, R. Freund, and F. Girosi. Training support vector machines: An application to face detection. In Computer Vision and Pattern Recognition, pages 130-36, 1997.
[13] K. Rohr. Incremental recognition of pedestrians from image sequences. Computer Vision and Pattern Recognition, pages 8-13, 1993.
[14] H. Rowley, S. Baluja, and T. Kanade. Human face detection in visual scenes. Technical Report CMU-CS-95-158, School of Computer Science, Carnegie Mellon University, July/November 1995.
[15] K.-K. Sung. Learning and Example Selection for Object and Pattern Detection. PhD thesis, Artificial Intelligence Laboratory, Massachusetts Institute of Technology, December 1995.
[16] K.-K. Sung and T. Poggio. Example-based learning for view-based human face detection. A.I. Memo 1521, Artificial Intelligence Laboratory, Massachusetts Institute of Technology, December 1994.
[17] T. Tsukiyama and Y. Shirai. Detection of the movements of persons from a sparse sequence of TV images. Pattern Recognition, 18(3/4):207-13, 1985.
[18] R. Vaillant, C. Monrocq, and Y. Le Cun. Original approach for the localisation of objects in images. IEE Proc.-Vis. Image Signal Processing, 141(4), August 1994.
[19] V. Vapnik. The Nature of Statistical Learning Theory. Springer Verlag, 1995.
[20] C. Wren, A. Azarbayejani, T. Darrell, and A. Pentland. Pfinder: Real-time tracking of the human body. Technical Report 353, Media Laboratory, Massachusetts Institute of Technology, 1995.
[21] A. Yuille, P. Hallinan, and D. Cohen. Feature extraction from faces using deformable templates. International Journal of Computer Vision, 8(2):99-111, 1992.