Automatic Detection of Cars in Real Roads Using Haar-Like Features
M. Oliveira, V. Santos
Abstract: This paper describes a computer vision based system designed for the detection of cars in real world
environments. The system uses the Haar-like features method first introduced by Viola and Jones, which is
known for its fast processing and good detection rates. The process requires representative data sets for training
and validation, including positive (presence of objects to detect) and negative (absence of objects to detect)
image samples. Therefore, several example images of cars were hand labeled for training and performance
calculation purposes. Preliminary results show that the method can be very effective at detecting cars at fast
rates and that it generalizes well. Despite some occasional false detections, and because the method is quite fast,
it can act as a first filter of promising regions of the image, where more effective yet more time demanding tests
can later be employed.
Keywords: Haar Features, Computer Vision, Automatic Detection of Cars, Transportation Systems
learning system and consequently improving the performance of the object detection system" [5]. The features
proposed by Viola and Jones (the basic set) and later by Lienhart and Maydt (the extended set) are shown in
Figure 1. It is important to emphasize that the features of Figure 1 are mere prototypes: they are scaled
independently in the horizontal and vertical directions in order to obtain an over complete set of features (for the
24x24 window proposed in [4], the amount of possible features is around 180000). The value produced by
applying a feature to a particular image region is given by the sum of the pixels that lie within the black
rectangles of the feature subtracted by the sum of the ones overlapping the white rectangles, i.e. a weighted sum
of rectangle sums:

feature = \sum_{i=0}^{N-1} W_i \cdot RecSum(r_i)    (1)

The values of N, W_i and r_i are arbitrarily chosen. In the case of [5], it has been defined that N = 2, that black
rectangles (r_0) have a negative weight W_0 and that white ones (r_1) have a positive weight W_1. Furthermore,
the relationship between the weights is given by the difference of area occupied by the black and white
rectangles:

-W_0 \cdot Area(r_0) = W_1 \cdot Area(r_1)    (2)

Assuming W_0 = -1, one can obtain:

W_1 = \frac{Area(r_0)}{Area(r_1)}    (3)

Consequently, for example for feature (2a) of Figure 1, with a height h = 2 and a width w = 6, the outcome of
applying the feature to a rectangular region positioned at (x, y) would be:

feature_{2a} = -1 \cdot RecSum(x, y, 6, 2) + \frac{6 \times 2}{2 \times 2} \cdot RecSum(x + 2, y, 2, 2)    (4)

In order to compute the value of each feature very rapidly, an intermediate image representation is calculated.
This representation is called the integral image or Summed Area Table (SAT). The value of the integral image at
coordinates (x, y) is given by the sum of all the pixels in the image that are above and to the left of (x, y):

SAT(x, y) = \sum_{x' \le x,\; y' \le y} I(x', y')    (5)

where I(x', y') is the value of the image's pixel at coordinates (x', y'). The value of any RecSum(x, y, w, h) can
then be obtained with only four lookups in the SAT:

RecSum(x, y, w, h) = SAT(x + w, y + h) - SAT(x + w, y) - SAT(x, y + h) + SAT(x, y)    (6)

This procedure is shown in Figure 2.

Figure 2. Fast RecSum(r_i) calculation.
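For illustration purposes, the following minimal Python/NumPy sketch computes a Summed Area Table and
evaluates feature (2a) through four-lookup RecSum calls, as in equations (4) to (6). It is only an example of the
technique; the function names and the toy 24x24 window are ours and do not correspond to the implementation
used in this work.

```python
import numpy as np

def summed_area_table(image):
    """Integral image (eq. 5): sat[y, x] = sum of all pixels above and to the left of (x, y).
    A zero row and column are prepended so lookups at x = 0 or y = 0 need no special case."""
    sat = image.astype(np.int64).cumsum(axis=0).cumsum(axis=1)
    return np.pad(sat, ((1, 0), (1, 0)), mode="constant")

def rec_sum(sat, x, y, w, h):
    """Sum of the pixels inside the w-by-h rectangle with top-left corner (x, y),
    obtained with only four lookups in the SAT (eq. 6)."""
    return sat[y + h, x + w] - sat[y, x + w] - sat[y + h, x] + sat[y, x]

def feature_2a(sat, x, y):
    """Example feature (2a) at position (x, y) (eq. 4): a 6x2 black rectangle with weight -1
    and a 2x2 white rectangle with weight (6*2)/(2*2)."""
    return -1 * rec_sum(sat, x, y, 6, 2) + (6 * 2) / (2 * 2) * rec_sum(sat, x + 2, y, 2, 2)

# Toy usage on a random 24x24 grayscale window
window = np.random.randint(0, 256, (24, 24), dtype=np.uint8)
sat = summed_area_table(window)
print(feature_2a(sat, 0, 0))
```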
Viola and Jones [4] set up a framework to combine several features into a cascade, i.e. a sequence of tests on the
image or on particular regions of interest, organized into several stages, each based on the results of one or more
different Haar features. For an object to be recognized, it must pass through all of the stages of the cascade. The
cascade is built by supplying a set of positive and negative examples to the training algorithm. The algorithm
used is AdaBoost, known for its high performance in what concerns generalization speed [4]. At each stage of
the cascade, the machine learning algorithm selects the feature, or combination of features, that best separates
negative from positive examples by tuning the threshold of the classification function. There is a trade-off
between the number of stages in a cascade and of features in each stage, and the amount of time it takes to
process the cascade. Viola and Jones define, for each stage, a target for the minimum reduction in false positives
and for the maximum allowed decrease in detection rate. The mentioned rates are obtained by using a validation
set made up of positive and negative examples. In order to improve the time performance of the algorithm, the
same authors have also presented the notion of attentional cascade. The idea consists of using the first stages of
the cascade to discard most of the regions of the image that contain no objects. This is done by adjusting the
classifier's threshold so that the false negative rate is close to zero. By discarding many candidate regions early
in the cascade, Viola and Jones significantly improve the method's performance. In fact, it makes sense for the
detection system to quickly discard obvious negative regions of an image, spending the valuable time on much
more promising regions by submitting them to the higher level stages of the cascade, which employ more
complex features.
It has already been said that Haar-like features have been applied especially to face detection. However, the
framework is general purpose, and other approaches have successfully used it, for example, for pedestrian
detection [6].
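As an illustration of how a trained cascade is applied at detection time, the fragment below uses the OpenCV
Python bindings. It is a generic usage sketch rather than the setup of this work: the cascade file cars.xml and the
image names are placeholders, and the scaleFactor argument plays the role of the scaling factor sf discussed
later in the performance tests.

```python
import cv2

# Load a trained Haar cascade from an XML file (cars.xml is a placeholder name).
cascade = cv2.CascadeClassifier("cars.xml")

# Read the test image and convert it to grayscale, since Haar features work on intensity values.
image = cv2.imread("road_scene.png")
gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)

# Slide the detection window over the image at several scales.
# scaleFactor controls how much the reference window grows between scales (the sf parameter).
detections = cascade.detectMultiScale(gray, scaleFactor=1.05, minNeighbors=3)

# Draw every detection as a bounding box and save the result.
for (x, y, w, h) in detections:
    cv2.rectangle(image, (x, y), (x + w, y + h), (0, 255, 0), 2)

cv2.imwrite("detections.png", image)
```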
The California Institute of Technology dataset is composed of 1156 images in png format, though many are very
similar (Figure 3). Image resolution is variable. This dataset is used for training and will henceforth be named
training dataset 1 (TDS1).

Figure 6. Authors' own car dataset with poor weather conditions.

3.2 Performance Datasets Description

For the purpose of testing, three separate datasets are used. The first dataset was built by Brad Philip and Paul
Updike (Figure 7).
This dataset will henceforth be mentioned as performance dataset 1 (PDS1). Performance dataset 2 (PDS2) is
taken from the footage that provided the images for TDS3. The images are not the same, although they are
similar. PDS2 consists of 105 images, 752x512 pixels of resolution, saved in png format. No demanding weather
conditions, city environments or gas station images were included; the idea was to use a simplified version of
the footage. Finally, performance dataset 3 (PDS3) is an extension of PDS2 obtained by including all kinds of
images: poor weather, city, gas stations, bridges, etc. (Figure 8).

Figure 8. Complex images in PDS3.

PDS3 is a much harder set. It consists of 232 images with the same resolution and format as the ones of PDS2.

Table 2. Performance datasets description.

Name    Nº Img    Resolution    Location      Authors
PDS1    530       320x240       California    Philip, Updike
PDS2    105       752x512       Portugal      Oliveira, Santos
PDS3    232       752x512       Portugal      Oliveira, Santos

3.3 Cascades Description

Having ensured a wide variety of examples, hand labeling was performed over all images, both in the training
and in the performance sets. A semi-automatic hand labeling application was developed to ease the process by
enabling fast mouse selection. It also generates a text file where the ROI or ROIs (i.e. the regions where the car
or cars can be found) are defined for every image. The Intel Open Source Computer Vision Library (Opencv) [7]
provides a tool that creates training samples by clipping the defined ROIs from the TDS images, converting
them to grayscale, rescaling them to the window size and inserting them into a random background image. The
background image pool, or negative set, has not yet been described. A negative set consists of a set of images
where no objects (cars) exist. Negative sets have not been mentioned so far because they are paired with their
respective TDS, i.e., every TDS also has a set of negative examples, usually road images where no cars are
present.
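The ROI text files and the sample clipping step can be pictured with the short sketch below. The annotation
format assumed here (one line per image: file name, number of cars, then x y w h for each ROI) and the 24x24
window size are illustrative choices, not necessarily those of the labeling application described above.

```python
import cv2

WINDOW_SIZE = (24, 24)  # assumed training window size; several sizes are attempted in this work

def load_positive_samples(info_file):
    """Parse an annotation file with lines of the form:
    <image path> <number of ROIs> <x1> <y1> <w1> <h1> [<x2> <y2> <w2> <h2> ...]
    and return the clipped, grayscale, rescaled car patches."""
    samples = []
    with open(info_file) as f:
        for line in f:
            fields = line.split()
            if not fields:
                continue
            path, count = fields[0], int(fields[1])
            gray = cv2.cvtColor(cv2.imread(path), cv2.COLOR_BGR2GRAY)
            for i in range(count):
                x, y, w, h = map(int, fields[2 + 4 * i: 6 + 4 * i])
                roi = gray[y:y + h, x:x + w]                    # clip the hand-labeled ROI
                samples.append(cv2.resize(roi, WINDOW_SIZE))    # rescale to the window size
    return samples

patches = load_positive_samples("tds1_cars.txt")  # hypothetical annotation file name
print(len(patches), "positive samples")
```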
Table 3 summarizes the setup used for the several cascades trained with different combinations of TDS, number
of stages, features pool (BASIC for the Viola and Jones feature collection, ALL for the Lienhart and Maydt
extended set) and number of positive and negative samples generated from the dataset (T. Samples). Several
window sizes were also attempted. Cascades will henceforth be named by C followed by their respective
number.

4 PERFORMANCE TESTS AND RESULTS

Opencv [7] provides a tool for cascade performance testing. The tool applies the cascade to all test images and
compares the algorithm's outcome to the report generated by hand labeling. Hit rate (HR) and false detection (or
false alarm) rate (FDR) are computed from this comparison. In order to accept a given detection as the one
described in the report, some tolerances are assumed. The tolerances are related to the disparities in position and
size between the current detection and the one manually generated for comparison. For every detection made, a
search in the corresponding PDS is executed to see whether the detection is true or false. In order to allow for
easy performance comparison, the tolerances employed are the default values of the mentioned tool. PDS1 was
tested with several cascades and several values of the scaling factor, sf, which is a Haar detection parameter that
indicates how much the reference window should be scaled up. HRs are quite good (some above 95%), though
FDRs are quite high (Table 4).
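A minimal sketch of such tolerance-based scoring is given below. It is not the Opencv performance tool itself:
the matching rule (relative deviation in position and size) and its thresholds are assumptions used only to show
how HR and FDR can be derived from detections and hand-labeled ROIs. With this definition, FDR is the
number of false detections per labeled car, which is consistent with the ratios in the tables below (e.g.
400 / (508 + 18) ≈ 0.760 for C1 with sf = 1.05 in Table 4).

```python
def matches(det, truth, pos_tol=0.3, size_tol=0.3):
    """A detection (x, y, w, h) is accepted as the hand-labeled car `truth` when its centre and
    its size deviate by less than the given fractions of the labeled width/height.
    The rule and thresholds are illustrative, not the defaults of the Opencv tool."""
    dx, dy, dw, dh = det
    tx, ty, tw, th = truth
    centre_ok = (abs((dx + dw / 2) - (tx + tw / 2)) < pos_tol * tw and
                 abs((dy + dh / 2) - (ty + th / 2)) < pos_tol * th)
    size_ok = abs(dw - tw) < size_tol * tw and abs(dh - th) < size_tol * th
    return centre_ok and size_ok

def score(detections_per_image, labels_per_image):
    """Hit rate (hits / labeled cars) and false detection rate (false detections / labeled cars)
    over a whole performance dataset."""
    hits = false_detections = total_labels = 0
    for dets, labels in zip(detections_per_image, labels_per_image):
        total_labels += len(labels)
        matched = set()
        for det in dets:
            found = next((i for i, lab in enumerate(labels)
                          if i not in matched and matches(det, lab)), None)
            if found is None:
                false_detections += 1
            else:
                matched.add(found)
                hits += 1
    return hits / total_labels, false_detections / total_labels

# hr, fdr = score(all_detections, all_ground_truth)
```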
Table 4. Performance results for PDS1.

Name   sf     Hits   Missed   False Detect.   HR      FDR
C1     1.05   508    18       400             0.966   0.760
C1     1.9    370    156      263             0.703   0.500
C2     1.05   501    25       444             0.952   0.844
C2     1.5    501    25       193             0.952   0.367
C2     2.9    311    215      500             0.591   0.951
C3     1.05   440    86       4840            0.837   9.202
C3     1.9    436    90       1529            0.829   2.907
C5     1.05   449    77       2676            0.854   5.087
C5     1.9    495    31       953             0.941   1.812
C5     2.9    397    129      874             0.755   1.662
C6     1.05   442    84       5799            0.840   11.025
Although the FDR values presented in Table 4, Table 5 and Table 6 may seem quite high, the charts in Figure 9,
Figure 11 and Figure 13 provide a much better view of the HR versus FDR relationship. A close analysis of
those figures clearly shows that a small loss in HR can bring a considerable reduction in FDR.

The cascades that perform best are C1 (sf = 1.05) and C2 (sf = 1.5). The ROC curves for both are presented in
Figure 9.

Figure 9. ROC of the best performing cascades on PDS1: comparison between C1 (sf = 1.05) and C2 (sf = 1.5),
hit rate versus false alarm rate.

Figure 9 shows that cascade C2 (sf = 1.5) performs better than C1 (sf = 1.05): it can achieve the same HR as
C1 (sf = 1.05) at a lower cost, i.e., a lower FDR. Analyzing Figure 9, one could assume the optimum point to be
HR = 0.92 and FDR = 0.18. This would imply that 92% of all cars were detected, yielding only 18 false
detections per every 100 truthful ones. Some examples of detections made by C2 (sf = 1.5) can be seen in
Figure 10.

Figure 10. Examples of detections made by C2 (sf = 1.5) on PDS1.

Regarding PDS2, fewer tests were executed. Table 5 clearly shows a much higher average FDR.

Table 5. Performance results for PDS2.

Name   sf     Hits   Missed   False Detect.   HR      FDR
C4     1.05   116    28       2150            0.806   14.931
C5     1.05   79     65       1213            0.549   8.424
C6     1.05   84     60       1081            0.583   7.507

While C4 (sf = 1.05) yields the best HR, it also has an FDR of about 14, i.e. for every detection that should be
made, roughly 14 false detections occur. This number may appear high if the cascade is used for actual
detection, but it may lose relevance if the cascade is to be used as a simple attention mechanism or if further
validation tests are to be implemented.

Figure 11. Performance comparison between C4 (sf = 1.05), C5 (sf = 1.05) and C6 (sf = 1.05) tested on PDS2
(hit rate versus false alarm rate).

In Figure 11, the optimum point of cascade C4 (sf = 1.05) presents values of HR = 0.78 and FDR = 0.75. Tests
with PDS2 were not entirely satisfactory in what concerns FDRs. However, this is a very difficult set and some
additional procedures could have been implemented to ease the FDR and also, in some cases, improve the HR.

Figure 12. PDS2's apparent problems.

First of all, the images from PDS3 could be clipped without loss of reliable extrapolation of the algorithm's
performance (Figure 12). The upper and lower parts of these images contain no information on the road (sky,
rear-view mirror and car interior panel). This clipping operation would considerably lower the FDR, since many
of these false detections occur in those areas of the images. Also, many of the detections are very close to each
other. An algorithm for merging overlapping detections (or a fine tuning of the performance calculation
tolerances mentioned at the beginning of this chapter) could easily be implemented, thus decreasing the FDR
even further. In the case of Figure 12, one would go from a situation with 15 false detections to none if these
procedures were implemented, which would dramatically lower the FDR.
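As an illustration of the merging step mentioned above, the sketch below fuses detections whose bounding boxes
overlap strongly. The overlap criterion (intersection over union above a threshold) and the threshold value are
assumptions made for illustration, not the procedure adopted in this work; OpenCV users could also rely on
cv2.groupRectangles for a similar effect.

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x, y, w, h)."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    ix = max(0, min(ax + aw, bx + bw) - max(ax, bx))
    iy = max(0, min(ay + ah, by + bh) - max(ay, by))
    inter = ix * iy
    union = aw * ah + bw * bh - inter
    return inter / union if union else 0.0

def merge_detections(boxes, iou_threshold=0.4):
    """Greedily merge boxes that overlap more than iou_threshold into a single averaged box,
    so that a cluster of nearby detections of the same car counts as one."""
    merged = []
    for box in boxes:
        for i, kept in enumerate(merged):
            if iou(box, kept) > iou_threshold:
                merged[i] = tuple((k + b) // 2 for k, b in zip(kept, box))  # average the two boxes
                break
        else:
            merged.append(box)
    return merged

# Example: two overlapping detections of the same car collapse into one box.
print(merge_detections([(100, 80, 60, 40), (104, 82, 58, 42), (300, 90, 50, 36)]))
```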
Taking the previous considerations into account, it seemed interesting to test some cascades on PDS3, even
knowing that it is even more demanding than PDS2. The results are outlined in Table 6.
Table 6. Performance results for PDS3.

Name   sf     Hits   Missed   False Detect.   HR      FDR
C1     1.5    78     249      156             0.239   0.477
C2     1.5    89     238      612             0.272   1.872
C3     1.5    269    58       1510            0.823   4.618
C4     1.05   256    71       4837            0.783   14.792
C4     1.5    204    123      2028            0.624   6.202
C5     1.5    158    169      1252            0.483   3.829
C5     1.05   79     65       1213            0.549   8.424
C6     1.5    158    169      1457            0.483   4.456
C7     1.5    115    212      1649            0.352   5.043
C8     1.05   295    32       4119            0.902   12.596
C8     1.9    30     297      43              0.092   0.131
C8     2.9    189    138      837             0.578   2.560
In this particularly difficult dataset, most of the cascades present a low HR. However, cascades C3 (sf = 1.5),
C4 (sf = 1.05) and particularly C8 (sf = 1.05) have acceptable HRs. The FDRs are, of course, considerable, but
there is the conviction that these rates can be substantially reduced by means of the already mentioned clipping
and merging techniques.

Figure 13. ROC curve of C4 (sf = 1.05) tested on PDS3 (hit rate versus false alarm rate).

Performance data was not extracted for C3 (sf = 1.5) nor for C8 (sf = 1.05), so Figure 13 presents only the
results of C4 (sf = 1.05). The optimum point of detection performance for cascade C4 (sf = 1.05) would,
nonetheless, yield an acceptable HR = 0.72 and FDR = 1. Bearing in mind that the FDRs could be overstated
and that PDS3 is a set of high complexity, including images in the rain, in the city and with other tricky
obstacles, the HR of C8 (sf = 1.05) is quite acceptable (Figure 14).

Figure 14. Some detections of PDS3.

5 CONCLUSIONS AND FUTURE WORK

This paper presented a method based on Haar-like features for the automatic detection of cars in real road
images. Several training datasets (TDS) were used for the training of the cascades. Also, three different PDS
were employed for performance testing. The best achieved results show [HR = 0.92; FDR = 0.18],
[HR = 0.78; FDR = 0.75] and [HR = 0.72; FDR = 1], respectively, for PDS1, PDS2 and PDS3.

Some methods for decreasing the FDRs were suggested and will be further explored in the future. Most of the
cascades were tested against all PDSs, which may provide relevant information regarding the influence of
variables such as window size, training sample size and variability, usage of rotated Haar features and others on
cascade performance.

In the case of PDS1, C2 (sf = 1.5) performed better than C1 (sf = 1.05), which seems to corroborate the idea
that a larger detection window may better describe an object (review Table 3). Regarding PDS2, C4 (sf = 1.05)
is more efficient than both C5 (sf = 1.05) and C6 (sf = 1.05). The increase in performance may be due to the
usage of both simple and rotated Haar features. The processing of the cascades is quite fast, which enables the
future implementation of this method in a real time system.

6 REFERENCES

[1] R. Cancela, M. Neta, M. Oliveira, V. Santos. ATLAS III: Um Robô com Visão Orientado para Provas em
    Condução Autónoma. Robótica, nº 62, 2006, p. 8 (ISSN: 0874-9019).
[2] M. Oliveira, V. Santos. A Vision-based Solution for the Navigation of a Mobile Robot in a Road-like
    Environment. Robótica, nº 69, 2007, p. 8 (ISSN: 0874-9019).
[3] M. Oliveira, V. Santos. Combining View-based Object Recognition with Template Matching for the
    Identification and Tracking of Fully Dynamic Targets. 7th Conference on Mobile Robots and Competitions,
    Festival Nacional de Robótica 2007, Paderne, Algarve, 27/04/2007.
[4] P. Viola, M. Jones. Rapid Object Detection using a Boosted Cascade of Simple Features. IEEE Conference
    on Computer Vision and Pattern Recognition (CVPR), Hawaii, December 9-14, 2001.
[5] R. Lienhart, J. Maydt. An Extended Set of Haar-like Features for Rapid Object Detection. IEEE ICIP 2002,
    Vol. 1, pp. 900-903, Sep. 2002.
[6] G. Monteiro, P. Peixoto, U. Nunes. Vision-based Pedestrian Detection using Haar-like Features. Encontro
    Científico, Festival Nacional de Robótica 2006.
[7] Opencv, Intel Open Source Computer Vision, found at http://www.intel.com/technology/computing/Opencv/
    on January 2007.