Intel® Technology Journal
Compute-Intensive, Highly Parallel Applications and Uses
More information, including current and past issues of Intel Technology Journal, can be found at:
http://developer.intel.com/technology/itj/index.htm
Learning-Based Computer Vision with Intel’s Open Source
Computer Vision Library
Gary Bradski, Corporate Technology Group, Intel Corporation
Adrian Kaehler, Enterprise Platforms Group, Intel Corporation
Vadim Pisarevsky, Software and Solutions Group, Intel Corporation
Index words: computer vision, face recognition, road recognition, optimization, open source, OpenCV
OpenCV was designed for enablement and infrastructure; many groups who could make use of vision were prevented from doing so due to lack of expertise. OpenCV was released in Alpha in 2000, Beta in 2003, and will be released in official version 1.0 in Q4 2005. If the Intel Integrated Performance Primitives (IPP) library [2] is installed, OpenCV automatically takes advantage of it; Table 1 shows the approximate speed-ups obtained.
Table 1: Approximate speed-ups using assembly-optimized IPP over the embedded optimized C in OpenCV

    Function                                        Speed-up range (OpenCV/IPP exec. time)
    Gaussian pyramids                               ~3
    Morphology                                      ~3-7
    Median filter                                   ~2.1-18
    Linear convolution (with a small kernel)        ~2-8
    Template matching                               ~1.5-4
    Color conversion (RGB to/from grayscale,
      HSV, Luv)                                     ~1-3
    Image moments                                   ~1.5-3
    Distance transform                              ~1.5-2
    Image affine and perspective transformations    ~1-4
    Corner detection                                ~1.8
    DFT/FFT/DCT                                     ~1.5-3
    Math functions (exp, log, sin, cos, ...)        3-10

In OpenCV 1.0, support for more IPP functions, such as face detection and optical flow, will be added.
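A quick way to check at run time whether the IPP-accelerated paths are in use is cxcore's cvUseOptimized switch, which enables or disables the optimized implementations and returns the number of optimized functions loaded. A minimal sketch:

    #include "cxcore.h"
    #include <stdio.h>

    int main( void )
    {
        /* enable the IPP-optimized implementations (if IPP is present)
           and report how many optimized functions were loaded */
        int loaded = cvUseOptimized( 1 );
        printf( "optimized low-level functions loaded: %d\n", loaded );
        return 0;
    }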
FACE DETECTION

Introduction/Theory

Object detection, and in particular face detection, is an important element of various computer vision areas, such as image retrieval, shot detection, video surveillance, etc. The goal is to find an object of a pre-defined class in a static image or video frame. Sometimes this task can be accomplished by extracting certain image features, such as edges, color regions, textures, contours, etc., and then using some heuristics to find configurations and/or combinations of those features specific to the object of interest. But for complex objects, such as human faces, it is hard to find features and heuristics that will handle the huge variety of instances of the object class (e.g., faces may be slightly rotated in all three directions; some people wear glasses; some have moustaches or beards; often one half of the face is in the light and the other is in shadow, etc.). For such objects, a statistical model (classifier) may be trained instead and then used to detect the objects.

Statistical model-based training takes multiple instances of the object class of interest, or "positive" samples, and multiple "negative" samples, i.e., images that do not contain objects of interest. Positive and negative samples together make a training set. During training, different features are extracted from the training samples, and the distinctive features that can be used to classify the object are selected. This information is "compressed" into the statistical model parameters. If the trained classifier does not detect an object (misses the object) or mistakenly detects an absent object (i.e., gives a false alarm), it is easy to make an adjustment by adding the corresponding positive or negative samples to the training set.

OpenCV uses such a statistical approach for object detection, an approach originally developed by Viola and Jones [4] and then analyzed and extended by Lienhart [5, 6]. This method uses simple Haar-like features (so called because they are computed similarly to the coefficients in Haar wavelet transforms) and a cascade of boosted tree classifiers as a statistical model. In [4] and in OpenCV this method is tuned and primarily used for face detection. Therefore, we discuss face detection below, but a classifier for an arbitrary object class can be trained and used in exactly the same way.

The classifier is trained on images of fixed size (Viola uses 24x24 training images for face detection), and detection is done by sliding a search window of that size through the image and checking whether an image region at a certain location "looks like a face" or not. To detect faces of different sizes it is possible to scale the image, but the classifier has the ability to "scale" as well. Fundamental to the whole approach are Haar-like features and a large set of very simple "weak" classifiers that each use a single feature to classify the image region as face or non-face.

Each feature is described by the template (shape of the feature), its coordinate relative to the search window origin, and the size (scale factor) of the feature. In [4], eight different templates were used, and in [5, 6] the set was extended to 14 templates, as shown in Figure 1.
Figure 1: Extended set of Haar-like features

Each feature consists of two or three joined "black" and "white" rectangles, either upright or rotated by 45°. The Haar feature's value is calculated as a weighted sum of two components: the pixel sum over the black rectangle and the sum over the whole feature area (all black and white rectangles). The weights of these two components are of opposite signs and, for normalization, their absolute values are inversely proportional to the areas. For example, the black rectangle of feature 3(a) in Figure 1 covers one ninth of the whole feature's area, so weight_black = -9 x weight_whole.
In real classifiers, hundreds of features are used, so direct computation of pixel sums over multiple small rectangles would make the detection very slow. But Viola [4] introduced an elegant method to compute the sums very fast. First, an integral image, or Summed Area Table (SAT), is computed over the whole image I:

    SAT(X, Y) = sum over {x < X, y < Y} of I(x, y)

The pixel sum over a rectangle r = {(x, y): x_0 <= x < x_0+w, y_0 <= y < y_0+h} can then be computed from the SAT using just the four corners of the rectangle, regardless of its size:

    RecSum(r) = SAT(x_0+w, y_0+h) - SAT(x_0+w, y_0) - SAT(x_0, y_0+h) + SAT(x_0, y_0)

This is for upright rectangles. For rotated rectangles, a separate "rotated" integral image must be used.
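To make the four-lookup trick concrete, here is a minimal plain-C sketch of building and querying a SAT (illustrative only; in OpenCV the integral image is computed by the cvIntegral function):

    /* Build the SAT for an h x w 8-bit image; sat has (h+1)*(w+1)
       entries so that the first row and column are zero. */
    void build_sat( const unsigned char* img, int w, int h, double* sat )
    {
        int x, y;
        for( x = 0; x <= w; x++ )
            sat[x] = 0;                          /* SAT(*, 0) = 0 */
        for( y = 1; y <= h; y++ ) {
            double rowsum = 0;
            sat[y*(w+1)] = 0;                    /* SAT(0, *) = 0 */
            for( x = 1; x <= w; x++ ) {
                rowsum += img[(y-1)*w + (x-1)];  /* sum of row y-1, cols < x */
                sat[y*(w+1) + x] = sat[(y-1)*(w+1) + x] + rowsum;
            }
        }
    }

    /* Pixel sum over {(x, y): x0 <= x < x0+w0, y0 <= y < y0+h0}:
       four lookups regardless of the rectangle size. */
    double rec_sum( const double* sat, int w,
                    int x0, int y0, int w0, int h0 )
    {
        return sat[(y0+h0)*(w+1) + (x0+w0)] - sat[y0*(w+1) + (x0+w0)]
             - sat[(y0+h0)*(w+1) + x0]      + sat[y0*(w+1) + x0];
    }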
The computed feature value x_i = w_{i,0} * RecSum(r_{i,0}) + w_{i,1} * RecSum(r_{i,1}) is then used as input to a very simple decision tree classifier that usually has just two terminal nodes, that is:

    f_i = +1 if x_i >= t_i;  -1 if x_i < t_i

or three terminal nodes:

    f_i = +1 if t_{i,0} <= x_i < t_{i,1};  -1 otherwise

where the response +1 means "face" and -1 means "non-face." Every such classifier, called a weak classifier, is not able to detect a face by itself; rather, it reacts to some simple feature in the image that may relate to the face. For example, in many face images the eyes are darker than the surrounding regions, and so feature 3(a) in Figure 1, centered at one of the eyes and properly scaled, will likely give a large response (assuming that weight_black < 0).
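In code, the two-terminal-node weak classifier is just a threshold test on the feature value; a minimal sketch:

    /* Minimal sketch of a two-terminal-node weak classifier ("stump").
       rec0/rec1 are RecSum values over the feature's two rectangles. */
    int stump_classify( double w0, double rec0,
                        double w1, double rec1, double threshold )
    {
        double x = w0*rec0 + w1*rec1;     /* feature value x_i */
        return x >= threshold ? +1 : -1;  /* +1 = face, -1 = non-face */
    }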
In the next step, a complex and robust classifier is built out of multiple weak classifiers using a procedure called boosting, introduced by Freund and Schapire [7]. The boosted classifier is built iteratively as a weighted sum of weak classifiers:

    F = sign(c_1 f_1 + c_2 f_2 + ... + c_n f_n)

On each iteration, a new weak classifier f_i is trained and added to the sum. The smaller the error f_i gives on the training set, the larger is the coefficient c_i that is assigned to it. The weights of all the training samples are then updated, so that on the next iteration the role of those samples that are misclassified by the already built F is emphasized. It is proven in [7] that if f_i is even slightly more selective than just a random guess, then F can achieve an arbitrarily high (<1) hit rate and an arbitrarily small (>0) false-alarm rate, if the number of weak classifiers in the sum (ensemble) is large enough.
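A compact sketch of one boosting round as just described (discrete AdaBoost, simplified from [7]; this is not the haartraining implementation): given the precomputed +/-1 predictions of a pool of candidate weak classifiers on the training set, pick the one with the lowest weighted error, assign it a vote weight, and re-weight the samples:

    #include <math.h>

    /* pred[t][i]: prediction (+1/-1) of candidate t on sample i;
       y[i]: true label; w[i]: sample weights (normalized to sum 1).
       Returns the chosen candidate's index; stores its weight in *c. */
    int boost_round( int T, int n, const int* const* pred,
                     const int* y, double* w, double* c )
    {
        int best = -1, t, i;
        double best_err = 1.0, z = 0;

        /* pick the weak classifier with the lowest weighted error */
        for( t = 0; t < T; t++ ) {
            double err = 0;
            for( i = 0; i < n; i++ )
                if( pred[t][i] != y[i] )
                    err += w[i];
            if( err < best_err ) { best_err = err; best = t; }
        }

        /* its vote weight c grows as its error shrinks */
        *c = 0.5 * log( (1.0 - best_err) / best_err );

        /* re-weight the samples: emphasize those still misclassified */
        for( i = 0; i < n; i++ ) {
            w[i] *= exp( -(*c) * y[i] * pred[best][i] );
            z += w[i];
        }
        for( i = 0; i < n; i++ )
            w[i] /= z;    /* renormalize to a distribution */

        return best;
    }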
However, in practice, that would require a very large training set as well as a very large number of weak classifiers, resulting in a slow processing speed. Instead, Viola [4] suggests building several boosted classifiers F_k with constantly increasing complexity and chaining them into a cascade, with the simpler classifiers going first. During the detection stage, the current search window is analyzed subsequently by each of the F_k classifiers, any of which may reject it or let it go through, as depicted in Figure 2.

[Figure 2 flow chart: the search window is passed through classifiers F_1, F_2, ..., F_N in turn; each stage either passes the window on toward "Face" or rejects it as "Not a Face."]

Figure 2: Object (face) detection cascade of classifiers, where rejection can happen at any stage
That is, the F_k (k = 1..N) are applied subsequently to the face candidate until it gets rejected by one of them or until it passes them all. In experiments, about 70-80% of candidates are rejected in the first two stages, which use the simplest features (about 10 weak classifiers each), so this technique speeds up detection greatly. Most of the detection time, therefore, is spent on real faces. Another advantage is that each of the stages need not be perfect; in fact, the stages are usually biased toward higher hit-rates rather than toward small false-alarm rates. By choosing the desired hit-rate and false-alarm rate at every stage, and by choosing the number of stages accurately, it is possible to achieve very good detection performance. For example, if each of the stages gives a 0.999 hit-rate and a 0.5 false-alarm rate, then stacking 20 stages into a cascade yields a hit-rate of 0.999^20 ≈ 0.98 and a false-alarm rate of 0.5^20 ≈ 10^-6!
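The detection-time logic can be sketched as follows; the data layout and names here are illustrative, not OpenCV's internals (which are exercised through cvHaarDetectObjects, shown in the next section):

    typedef struct {
        int    n_weak;      /* number of weak classifiers in this stage */
        double threshold;   /* stage passes if the weighted vote sum >= this */
    } Stage;

    /* Evaluate the cascade at one search-window position;
       weak_vote[k][i] stands for c_i*f_i of stage k's i-th weak
       classifier evaluated on the current window. */
    int window_passes( const Stage* stages, int n_stages,
                       const double* const* weak_vote )
    {
        int k, i;
        for( k = 0; k < n_stages; k++ ) {
            double sum = 0;
            for( i = 0; i < stages[k].n_weak; i++ )
                sum += weak_vote[k][i];
            if( sum < stages[k].threshold )
                return 0;   /* rejected: most windows exit in early stages */
        }
        return 1;           /* passed every F_k: report a face candidate */
    }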
Face Detection with OpenCV

OpenCV provides low-level and high-level APIs for face/object detection. The low-level API allows users to check an individual location within the image, using the classifier cascade, to find whether it contains a face or not. Helper functions calculate integral images, scale the cascade to a different face size (by scaling the coordinates of all rectangles of the Haar-like features), etc. Alternatively, the higher-level function cvHaarDetectObjects does all of this automatically, and it is enough in most cases. Below is a sample of how to use this function to detect faces in a specified image:

    // usage: facedetect --cascade=<path> image_name
    #include "cv.h"
    #include "highgui.h"
    #include <string.h>

    int main( int argc, char** argv )
    {
        CvHaarClassifierCascade* cascade;
        CvMemStorage* storage;
        IplImage* image;
        CvSeq* faces;
        int optlen = strlen("--cascade=");
        int i;

        if( argc != 3 || strncmp(argv[1], "--cascade=", optlen) )
            return -1;

        /* the next few lines (loading the cascade and image, creating
           the memory storage, and opening the cvHaarDetectObjects call)
           were lost from the printed listing and are reconstructed here */
        cascade = (CvHaarClassifierCascade*)cvLoad( argv[1] + optlen, 0, 0, 0 );
        image = cvLoadImage( argv[2], 1 );
        if( !cascade || !image )
            return -1;
        storage = cvCreateMemStorage(0);

        faces = cvHaarDetectObjects( image, cascade, storage,
            1.2, /* scale the cascade by 20% after each pass */
            2,   /* groups of 3 (2+1) or more neighbor face rectangles are
                    joined into a single "face"; smaller groups are rejected */
            CV_HAAR_DO_CANNY_PRUNING, /* use the Canny edge detector to
                    reduce the number of false alarms */
            cvSize(0, 0) /* start from the minimum face size allowed by
                    the particular classifier */
            );

        // for each face draw the bounding rectangle
        for( i = 0; i < (faces ? faces->total : 0); i++ ) {
            CvRect* r = (CvRect*)cvGetSeqElem( faces, i );
            CvPoint pt1 = { r->x, r->y };
            CvPoint pt2 = { r->x + r->width, r->y + r->height };
            cvRectangle( image, pt1, pt2, CV_RGB(255,0,0), 3, 8, 0 );
        }

        // create window and show the image with outlined faces
        cvNamedWindow( "faces", 1 );
        cvShowImage( "faces", image );
        cvWaitKey(0);

        // after a key is pressed, release data
        cvReleaseImage( &image );
        cvReleaseHaarClassifierCascade( &cascade );
        cvReleaseMemStorage( &storage );
        return 0;
    }

If the above program is built as facedetect.exe, it may be invoked as follows (typed on a single line):

    facedetect.exe --cascade="c:\program files\opencv\data\haarcascades\haarcascade_frontalface_default.xml" "c:\program files\opencv\samples\c\lena.jpg"

assuming that OpenCV is installed in c:\program files\opencv. Figure 3 shows example results of using a trained face detection model that ships with OpenCV.
A detailed description of the object detection functions can be found in the OpenCV reference manual (opencvref_cv.htm, Object Detection section).

Training the Classifier Cascade

Once there is a trained classifier cascade stored in an XML file, it can be easily loaded using the cvLoad function and then used by cvHaarDetectObjects or by the low-level object detection functions. The question remains as to how to create such a classifier if/when the standard cascades shipped with OpenCV fail on some images, or when one wants to detect a different object class, like eyes, cars, etc. OpenCV includes a haartraining application that creates a classifier from a training set of positive and negative samples. The usage scheme is the following (for more details, refer to the haartraining reference manual supplied with OpenCV):

1. Collect a database of positive samples. Put them into one or more directories and create an index file of the following format:

       filename_1 count_1 x11 y11 w11 h11 x12 y12 ...
       filename_2 count_2 x21 y21 w21 h21 x22 y22 ...
       ...

   That is, each line starts with the file name (including subdirectories) of an image, followed by the number of objects in it and the bounding rectangle of every object (x and y coordinates of the top-left corner, width, and height, in pixels). For example, if a database of eyes resides in a directory eyes_base, the index file eyes.idx may look like this:

       eyes_base/eye_000.jpg 2 30 100 15 10 55 100 15 10
       eyes_base/eye_001.jpg 4 15 20 10 6 30 20 10 6 ...
       ...

   Notice that the performance of a trained classifier strongly depends on the quality of this database. For example, for face detection, faces need to be aligned so that the relative locations of the eyes (the most distinctive features) are the same. The eyes need to be on the same horizontal level (i.e., the faces need to be properly rotated), etc. Another example is the detection of profile faces. These are non-symmetric, and it is reasonable to train the classifier only on right profiles (so that the variance inside the object class is smaller) and, at the detection stage, to run it twice: once on the original images and a second time on the flipped images.

2. Build a vec-file out of the positive samples using the createsamples utility. While the training procedure might be repeated many times with different parameters, the same vec-file may be re-used. Example:

       createsamples -vec eyes.vec \
                     -info eyes.idx -w 20 -h 15

   The above builds eyes.vec out of the database described in eyes.idx (see above): all the positive samples are extracted from the images, normalized, and resized to the same size (20x15 in this case). createsamples can also create a vec-file out of a single positive sample (e.g., some company logo) by applying different geometrical transformations, adding noise, altering colors, etc. See haartraining in the OpenCV reference HTML manual for details.

3. Collect a database of negative samples. Make sure the database does not contain instances of the object class of interest. You can make negative samples out of arbitrary images: they can be downloaded from the Internet, bought on CD, or shot with your digital camera. Put the images into one or more directories, and make an index file: that is, a plain list of image filenames, one per line. For example, an image index file called "backgrounds.idx" might contain:

       backgrounds/img0001.jpg
       backgrounds/my_img_02.bmp
       backgrounds/the_sea_picture.jpg
       ...

4. Run haartraining. Below is an example (type it at a command-line prompt as a single line, or create a batch file):

       haartraining
         -data eyes_classifier_take_1
         -vec eyes.vec -w 20 -h 15
         -bg backgrounds.idx
         -nstages 15
         -nsplits 1
         [-nonsym]
         -minhitrate 0.995
         -maxfalsealarm 0.5

In this example, the classifier will be stored in eyes_classifier_take_1.xml. eyes.vec is used as the set of positive samples (of size 20x15), and random images from backgrounds.idx are used as negative samples. The cascade will consist of 15 (-nstages) stages; every stage is trained
to have the specified hit-rate (-minhitrate) or higher and a false-alarm rate (-maxfalsealarm) or lower. Every weak classifier will have just one (-nsplits) non-terminal node (one-split trees are called "stumps"). Once trained, the cascade is used just like the standard ones, as sketched below.
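A minimal sketch, assuming the run above produced eyes_classifier_take_1.xml:

    #include "cv.h"

    /* Sketch: run the newly trained eye detector on an image.
       The file name assumes the haartraining invocation shown above. */
    CvSeq* detect_eyes( IplImage* image, CvMemStorage* storage )
    {
        CvHaarClassifierCascade* eyes = (CvHaarClassifierCascade*)
            cvLoad( "eyes_classifier_take_1.xml", 0, 0, 0 );
        if( !eyes )
            return 0;
        /* the minimum window matches the 20x15 training sample size */
        return cvHaarDetectObjects( image, eyes, storage,
                                    1.1, 2, 0, cvSize(20, 15) );
    }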
The training procedure may take several hours to complete even on a fast machine. The main reason is that there are quite a lot of different Haar features within the search window that need to be tried. However, this is essentially a parallel algorithm, and it can benefit (and does benefit) from SMP-aware implementations. haartraining supports OpenMP via the Intel Compiler, and this parallel version is shipped with OpenCV.

We have discussed the use of an object detection/recognition algorithm built into OpenCV. In the next section, we discuss using OpenCV functions to recognize abstract objects such as roads.
ROAD SEGMENTATION

The Intel OpenCV library has been used for the vision system of an autonomous robot. This robot is built from a commercial off-road vehicle, and the vision system is used to detect and follow roads. In this system, the problem was to use a nearby road, identified by scanning laser range finders, to initialize vision algorithms that extend the initial road segmentation out as far as possible. The roads in question were not limited to marked and paved streets; they were typically rocky trails, fire roads, and other poor-quality dirt trails. Figure 4 shows "easy" and "hard" roads.

Based on laser scanner point clouds in the near field, it was possible to estimate what sections of nearby visible terrain might be road. In this case, "near" is approximately ten meters (varying by terrain type and other ambient conditions). Once projected into the visual camera images, this region could then be used to train a classifier that would extrapolate out beyond the range of the lasers into the far visual field. In many cases the method is successful at extrapolating the road all of the way to the horizon. This amounts to a practical range of as much as one hundred meters. The ability to see into the visual far field is crucial for path planning for high-speed operation.
[Figure 5 flow chart. Inputs: Incoming Color Image (to Shadow Removal) and Incoming Laser Data (to Compute Candidate Polygon); these feed, in order: Kmeans, Learn Gaussian Model; Categorize All Points; Adapting Threshold; Candidate Selection; Model Based Validation.]

Figure 5: Overview of data flow

The core algorithm outlined in Figure 5 is as follows. First, the flat terrain that the lasers find immediately in front of the vehicle is converted to a single polygon and projected into the camera coordinates. Shadow regions are marked out of consideration, as shown in Figure 6. The projected polygon represents our best starting guess for determining which pixels in the overall image contribute to road. It is of course possible that there is no road at all. The method is to extrapolate outward to find the largest patch to which we can extend what the lasers have given us, and only thereafter to ask whether that patch might be a road. This final determination is made based on the shape of the area found: only a relatively small set of shapes can correspond to a physical road of approximately constant width disappearing into the distance.
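The published flow chart comes without source code, but as a rough illustration of the "Kmeans, Learn Gaussian Model" stage, one might cluster the colors of the candidate-polygon pixels with OpenCV's cvKMeans2 (the function name and parameters below are ours, not the original system's), so that per-cluster Gaussian color models can then be fitted:

    #include "cv.h"

    /* Sketch: cluster the colors of the n pixels inside the candidate
       road polygon into k groups; out_labels[i] receives the cluster
       index of pixel i. pixels is an n x 3 array of float RGB values. */
    void cluster_road_colors( const float* pixels, int n, int k,
                              int* out_labels )
    {
        CvMat samples = cvMat( n, 3, CV_32FC1, (void*)pixels );
        CvMat labels  = cvMat( n, 1, CV_32SC1, out_labels );
        cvKMeans2( &samples, k, &labels,
                   cvTermCriteria( CV_TERMCRIT_EPS + CV_TERMCRIT_ITER,
                                   10, 1.0 ) );
    }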
With learning-based vision, one just "points" the algorithm at the data, and useful models for detection, segmentation, and identification can often be formed. Learning can also easily fuse or incorporate other sensing modalities such as sound, vibration, or heat. Since cameras and sensors are becoming cheap and powerful, and learning algorithms have a vast appetite for computational threads, Intel is very interested in enabling geometric and learning-based vision routines in its OpenCV library, since such routines are vast consumers of computational power.

REFERENCES

[1] Open Source Computer Vision Library, http://www.intel.com/research/mrl/research/opencv

[2] Intel® Integrated Performance Primitives, http://www.intel.com/software/products/perflib

[3] Stewart Taylor, Intel® Integrated Performance Primitives: How to Optimize Software Applications Using Intel® IPP, Intel Press, http://www.intel.com/intelpress/sum_ipp.htm

[4] Paul Viola and Michael J. Jones, "Rapid Object Detection using a Boosted Cascade of Simple Features," IEEE CVPR, 2001.

[5] Rainer Lienhart and Jochen Maydt, "An Extended Set of Haar-like Features for Rapid Object Detection," IEEE ICIP, 2002.

[6] Alexander Kuranov, Rainer Lienhart, and Vadim Pisarevsky, "An Empirical Analysis of Boosting Algorithms for Rapid Objects With an Extended Set of Haar-like Features," Intel Technical Report MRL-TR-July02-01, 2002.

[7] Yoav Freund and Robert E. Schapire, "Experiments with a new boosting algorithm," in Machine Learning: Proceedings of the Thirteenth International Conference, Morgan Kaufmann, San Francisco, pp. 148-156, 1996.

[8] Gary Bradski, "Computer Vision Face Tracking For Use in a Perceptual User Interface," Intel Technology Journal, Q2 1998, http://developer.intel.com/technology/itj/q21998/articles/art_2.htm

AUTHORS' BIOGRAPHIES

Gary Rost Bradski is a principal engineer and manager of the Machine Learning group for Intel Research. His current interests are learning-based vision and sensor fusion in world models. Gary received a B.S. degree from U.C. Berkeley in May 1981. He received his Ph.D. degree in Cognitive and Neural Systems (mathematical modeling of biological perception) in May 1994 from the Boston University Center for Adaptive Systems. He started, and was the technical content director of, OpenCV, working closely with Vadim. Currently, he consults on OpenCV and machine learning content with the performance primitives group. His e-mail is Garybradski at gmail.com.

Adrian Kaehler is a senior software engineer working in the Enterprise Platforms Group. His interests include machine learning, statistical modeling, and computer vision. Adrian received a B.A. degree in Physics from the University of California at Santa Cruz in 1992 and his Ph.D. degree in Theoretical Physics from Columbia University in 1998. Currently, Adrian is involved with a variety of vision-related projects in and outside of Intel. His e-mail is Adrian.l.Kaehler at intel.com.

Vadim Pisarevsky is a software engineer in the Computational Software Lab at Intel. His interests are in image processing, computer vision, machine learning, algorithm optimization, and programming languages. Vadim received a Masters degree in Mathematics from Nizhny Novgorod State University in 1998. He has been involved in different software projects related to multimedia processing since 1996. In 2000, he joined the Intel Russia Research Center, where he led the OpenCV development team for over four years. Currently, he is working in the software department and continues to improve OpenCV and its integration with the Intel Integrated Performance Primitives. His e-mail is Vadim.Pisarevsky at intel.com.

Copyright © Intel Corporation 2005. This publication was downloaded from http://developer.intel.com/. Legal notices at http://www.intel.com/sites/corporate/tradmarx.htm.