Object Detection
Related Concepts
– Object Recognition
– Image Classification
Definition
Object detection involves detecting instances of objects from one or several
classes in an image.
Background
The goal of object detection is to detect all instances of objects from one or
several known classes, such as people, cars or faces in an image. Typically only
a small number of objects are present in the image, but there is a very large
number of possible locations and scales at which they can occur and that need
to somehow be explored.
Each detection is reported with some form of pose information. This could
be as simple as the location of the object, a location and scale, a bounding
box, or a segmentation mask. In other situations the pose information is more
detailed and contains the parameters of a linear or non-linear transformation.
For example a face detector may compute the locations of the eyes, nose and
mouth, in addition to the bounding box of the face. An example of a bicycle
detection that specifies the locations of certain parts is shown in Figure 1. The
pose could also be defined by a three-dimensional transformation specifying the
location of the object relative to the camera.
Object detection systems construct a model for an object class from a set
of training examples. In the case of a fixed rigid object only one example may
be needed, but more generally multiple training examples (often hundreds or
thousands) are necessary to capture certain aspects of class variability. Broadly
speaking, less training data is needed when more information about class vari-
ability can be explicitly built into the model. However, it may be difficult to
specify models that capture the vast variability found in images. An alternative
approach is to use methods such as convolutional neural networks [1] that learn
about class variability from large datasets.
Object detection approaches typically fall into one of two major categories,
generative methods (see, e.g., [2,3,4,5,6]) and discriminative methods (see, e.g.,
[7,8,9,10,11]). A generative method consists of a probability model for the pose
variability of the objects, an appearance model, i.e. a probability model for
the image appearance conditional on a given pose, and a model for the
background, i.e. non-object images. The model parameters can be estimated
from training data and the decisions are based on ratios of posterior probabili-
ties. A discriminative method typically builds a classifier that can discriminate
between images (or sub-images) containing instances of the target object classes
and those not containing them. The parameters of the classifier are selected to
minimize mistakes on the training data, often with a regularization bias to avoid
overfitting.
Theory
Images of objects from a particular class are highly variable. One source
of variation is the actual imaging process. Changes in illumination, changes in
camera position as well as digitization artifacts, all produce significant variations
in image appearance, even in a static scene. The second source of variation is due
to the intrinsic appearance variability of objects within a class, even assuming
no variation in the imaging process. For example, people have different shapes
and wear a variety of clothes, while the handwritten digit 7 can be written with
or without a line through the middle, with different slants, stroke widths, etc.
The challenge is to develop detection algorithms that are invariant with respect
to these variations and are computationally efficient.
Invariance
The brute force approach to invariance assumes training data is plentiful and
represents the entire range of object variability. Invariance is implicitly learned
from the data while training the models. In recent years, with the increase in
annotated dataset size and computational acceleration using GPUs, this has
been the approach of choice in the context of the multi-layer convolutional neural
network paradigm, as discussed later in this article.
When training data is limited it is necessary to build invariance into the
models. There are two complementary methods to achieve this. One involves
computing invariant functions and features, the other involves searching over
latent variables. Most algorithms contain a combination of these approaches.
For example many algorithms choose to apply local transformations to pixel in-
tensities in such a way that the transformed values are invariant to a range of
illumination conditions and small geometric variations. These local transforma-
tions lead to features and the array of feature values is the feature map. More
significant transformations are often handled through explicit search of latent
variables or by learning the remaining variability from training data.
Invariant functions and features
This method constructs functions of the data
that are invariant with respect to the types of variability described above and
can still distinguish between object and background images. This may prove
difficult if object variability is extensive. Invariant functions that produce the
same output no matter the pose and appearance of the object necessarily have
less discriminative power.
There are two common types of operations leading to invariant functions.
The first involves computing local features that are invariant to certain image
transformations. The second operation involves computing geometric quantities
that are invariant to some or all three-dimensional pose variation. For example
the cross-ratio among distinguished points is a projective invariant that has been
used to recognize rigid objects (see, e.g., [12]).
An example of a local feature, invariant to certain photometric variations
and changes in illumination, is the direction of the image gradient, from which
a variety of edge features can be computed. More complex features capture the
appearance of small image patches and are often computed from edge features.
An example is the histogram of oriented gradients (HOG) features [10]. Local
features are usually computed at a dense grid of locations in the image, leading to
a dense feature map. Features such as HOG were designed by practitioners based
on a variety of considerations involving the desired invariance, and at times were
motivated by certain analogies to biological processing in the visual system. An
alternative approach, implemented in multi-layer convolutional neural networks,
learns the local features as well as intermediate and higher-level features as
part of the training process. Interestingly, the low-level features learned by such
networks often resemble oriented edge detectors, like the designed features.
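To make the notion of a dense feature map concrete, the following sketch computes a grid of gradient-orientation histograms in the spirit of HOG. It is only an illustration: the cell size, the number of orientation bins, and the function name are arbitrary choices, and full HOG implementations add block normalization and interpolation that are omitted here.

    import numpy as np

    def gradient_orientation_histograms(image, cell=8, bins=9):
        """Dense HOG-like feature map: one orientation histogram per cell."""
        gy, gx = np.gradient(image.astype(float))      # image gradients
        mag = np.hypot(gx, gy)                          # gradient magnitude
        ang = np.mod(np.arctan2(gy, gx), np.pi)         # unsigned orientation in [0, pi)

        h, w = image.shape
        hc, wc = h // cell, w // cell
        feat = np.zeros((hc, wc, bins))
        for i in range(hc):
            for j in range(wc):
                m = mag[i * cell:(i + 1) * cell, j * cell:(j + 1) * cell]
                a = ang[i * cell:(i + 1) * cell, j * cell:(j + 1) * cell]
                # Magnitude-weighted histogram of orientations within the cell.
                hist, _ = np.histogram(a, bins=bins, range=(0, np.pi), weights=m)
                feat[i, j] = hist
        return feat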
Local pooling of features is commonly used to introduce some degree of in-
variance to small geometric variations. A typical example is the max or sum
operation [13,2]. In this case a quantity that is to be computed at a pixel is
replaced by the maximum or sum of the quantity in a neighborhood of the pixel.
When the region is extended over the entire window the result is a bag of features
model [14], which counts the number of binary features of different types that
occur within a window. In this case all spatial information is lost, leading to
models that are invariant to fairly large geometric transformations.
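A minimal sketch of these two pooling regimes is given below, assuming feature maps stored as NumPy arrays; the block size and function names are illustrative. Non-overlapping max pooling keeps coarse spatial layout, while pooling over the entire window reduces to a bag-of-features count vector.

    import numpy as np

    def max_pool(feature_map, k=2):
        """Replace each non-overlapping k x k neighborhood by its maximum."""
        h, w = feature_map.shape
        h, w = h - h % k, w - w % k                      # crop to a multiple of k
        blocks = feature_map[:h, :w].reshape(h // k, k, w // k, k)
        return blocks.max(axis=(1, 3))

    def bag_of_features(binary_feature_maps):
        """Pool each binary feature map over the entire window: a vector of
        feature counts with all spatial information discarded."""
        return np.array([m.sum() for m in binary_feature_maps])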
For computational reasons it is often useful to sparsify the feature map by
applying local decisions to find a small set of interest points. The assumption
is that only certain features are useful (or necessary) for object detection. The
approach yields sparse feature maps that can be processed much more efficiently.
Examples of commonly used sparse features are SIFT descriptors [15], corner
detectors and edge conjunctions [2]. One drawback of sparse features is that
hard decisions are being made on their presence, and if some are missed an
algorithm may fail to detect an instance of the object.
Note that it is possible to predefine a very large family of features that is
never fully computed; rather, during training an informative subset is selected
that can produce the required classification for a particular object class. One
example is the family of Haar features, which compute differences of intensity
averages in adjacent rectangles of varying sizes and locations [8]. Another
example is the family of geometric edge arrangements of increasing complexity.
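As an illustration, the sketch below evaluates a two-rectangle Haar-like feature using an integral image, which makes any rectangle sum a matter of four array lookups. The function names and the particular left-minus-right configuration are illustrative, not a specific published detector.

    import numpy as np

    def integral_image(image):
        """Cumulative sums so that any rectangle sum needs only four lookups."""
        return image.astype(float).cumsum(axis=0).cumsum(axis=1)

    def rect_sum(ii, top, left, height, width):
        """Sum of pixels in the rectangle [top, top+height) x [left, left+width)."""
        a = ii[top + height - 1, left + width - 1]
        b = ii[top - 1, left + width - 1] if top > 0 else 0.0
        c = ii[top + height - 1, left - 1] if left > 0 else 0.0
        d = ii[top - 1, left - 1] if top > 0 and left > 0 else 0.0
        return a - b - c + d

    def haar_horizontal(ii, top, left, height, width):
        """Two-rectangle Haar-like feature: left half minus right half
        (assumes an even width)."""
        half = width // 2
        return (rect_sum(ii, top, left, height, half)
                - rect_sum(ii, top, left + half, height, half))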
The most common approach to object detection reduces the problem to one
of classification. Consider the problem of detecting instances from one object
class of fixed size but varying positions in the image. Let W denote a reference
window size that an instance of the object would occupy. Let L denote a grid
of locations in the image. Let Xs+W denote the image features in a window
(sub-image) with top-left corner at s ∈ L. One can reduce the object detection
problem to binary classification as follows. For each location s ∈ L classify
Xs+W into two possible classes corresponding to windows that contain an object
and windows that do not contain an object. The sliding-window approach to
object detection involves explicitly considering and classifying every possible
window. Note that the same approach can be used to detect objects of different
sizes by considering different window sizes or alternatively windows of fixed size
at different levels of resolution in an image pyramid. Clearly the number of
windows where the classifier needs to be computed can be prohibitive. Many
computational approaches find ways to narrow down the number of windows
where the classifier is implemented.
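The sliding-window reduction can be written in a few lines. In the sketch below, classify_window stands in for any trained binary classifier returning a score, and the stride and the convention that positive scores mean object are arbitrary choices; different object sizes would be handled by repeating the same loop over the levels of an image pyramid.

    def sliding_window_detect(feature_map, window, classify_window, stride=4):
        """Classify every window of size `window` placed on a grid of locations.
        Windows whose score exceeds zero are reported as detections."""
        wh, ww = window
        h, w = feature_map.shape[:2]
        detections = []
        for top in range(0, h - wh + 1, stride):
            for left in range(0, w - ww + 1, stride):
                score = classify_window(feature_map[top:top + wh, left:left + ww])
                if score > 0:
                    detections.append((top, left, score))
        return detections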
Generative Models
A general framework for object detection using generative models involves mod-
eling two distributions. A distribution p(θ; ηp ) is defined on the possible latent
pose parameters θ ∈ Θ. This distribution captures assumptions on which poses
are more or less likely. An appearance model defines the distribution of the im-
age features in a window conditional on the pose, p(Xs+W |object, θ; ηa ). The
appearance model might be defined by a template specifying the probability
of observing certain features at each location in the detection window under a
canonical choice for the object pose, while θ specifies a transformation of the
template. Warping the template according to θ leads to probabilities for observ-
ing certain features at each location in Xs+W [2,3,5].
Training data with images of the object are used to estimate the parame-
ters ηp and ηa . Note that the images do not normally come with information
about the latent pose variables θ, unless annotation is provided. Estimation
thus requires inference methods that handle unobserved variables, for example
the different variants of the expectation maximization algorithm [5,4]. In some
cases a probability model for background images is estimated as well using large
numbers of training examples of images not containing the object.
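To illustrate estimation with unobserved pose variables, here is a small EM sketch for a deliberately simple model: an independent-Bernoulli feature template whose latent pose is a discrete shift with a uniform prior. It is not any particular published model; the circular shifts and the fixed list of candidate shifts are simplifications.

    import numpy as np

    def em_template(images, shifts, n_iters=10, eps=1e-6):
        """EM for a Bernoulli template with an unobserved discrete shift.

        `images` is an (n, h, w) array of binary feature windows; `shifts` is a
        small list of (dy, dx) latent poses with a uniform prior.
        """
        n, h, w = images.shape
        template = images.mean(axis=0).clip(eps, 1 - eps)
        for _ in range(n_iters):
            # E-step: posterior over the latent shift for every training image.
            loglik = np.zeros((n, len(shifts)))
            for k, (dy, dx) in enumerate(shifts):
                t = np.roll(np.roll(template, dy, axis=0), dx, axis=1)
                loglik[:, k] = (images * np.log(t)
                                + (1 - images) * np.log(1 - t)).sum(axis=(1, 2))
            resp = np.exp(loglik - loglik.max(axis=1, keepdims=True))
            resp /= resp.sum(axis=1, keepdims=True)
            # M-step: refit the template from images aligned by their soft poses.
            acc = np.zeros((h, w))
            for k, (dy, dx) in enumerate(shifts):
                aligned = np.roll(np.roll(images, -dy, axis=1), -dx, axis=2)
                acc += (resp[:, k, None, None] * aligned).sum(axis=0)
            template = (acc / n).clip(eps, 1 - eps)
        return template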
The basic detection algorithm then scans each candidate window in the im-
age, computes the most likely pose under the object model and obtains the
‘posterior odds’, i.e. the ratio between the conditional probability of the window
under the object hypothesis at the optimal pose, and the conditional probability
of the window under the background hypothesis. This ratio is then compared to
a threshold τ to decide if the window contains an instance of the object.
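A sketch of this decision rule for a single window is shown below, in a form that also includes the pose prior in the odds; log_p_object, log_p_background, log_p_pose, and the discretized pose set are hypothetical stand-ins for the trained appearance, background, and pose models.

    def detect_window(x_window, poses, log_p_object, log_p_background, log_p_pose, tau):
        """Generative decision rule: maximize over pose under the object model,
        form the log odds against the background model, and threshold at tau."""
        best = max(poses, key=lambda th: log_p_object(x_window, th) + log_p_pose(th))
        log_odds = (log_p_object(x_window, best) + log_p_pose(best)
                    - log_p_background(x_window))
        return log_odds > tau, best, log_odds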
When no background model has been trained offline, a simple adaptive back-
ground model can be estimated online for each window being tested. In this
case no background training data is needed [5]. Alternative background models
involve sub-collections of parts of the object model [16].
Discriminative Models
If no explicit latent pose variables are used the underlying assumption is that
the training data is sufficiently rich to provide a sample of the entire variation of
object appearance. The discriminative approach trains a standard classifier to
discriminate between image windows containing the target object classes and a
broad background class. This is done using large amounts of data from the object
classes and from background. Many classifier types have been used, including
neural networks, SVMs, boosted decision trees and radial basis functions.
Cascades
Because of the large size of the background population and its complexity,
discriminative methods are often organized in cascades [8]. An initial
classifier is trained to distinguish between the object and a manageable amount
of background data. The classifier is designed to have very few false negatives
at the price of a larger number of false positives. Then a large number of back-
ground examples are evaluated and the misclassified ones are collected to form
a new background data set. Once a sufficient number of such false positives is
accumulated a new classifier is trained to discriminate between the original ob-
ject data and the new ‘harder’ background data. Again this classifier is designed
to have no false negatives. This process can be continued several times.
At detection time the classifiers in the cascade are applied sequentially. Once
a window is classified as background the testing terminates with the background
label. If the object label is chosen, the next classifier in the cascade is applied.
Only windows that are classified as object by all classifiers in the cascade are
labelled as object by the cascade.
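The following sketch captures both the bootstrapping of harder background data during training and the early-rejection behavior at detection time. train_stage is a stand-in for any classifier-training routine tuned for a very low false-negative rate, and filtering a fixed background pool is a simplification of drawing fresh background windows at each round.

    def cascade_classify(x_window, stages):
        """Apply cascade stages in order; `stages` is a list of (score_fn, threshold)
        pairs.  A window is labelled object only if it passes every stage."""
        for score_fn, threshold in stages:
            if score_fn(x_window) < threshold:
                return False                 # rejected: later stages are never run
        return True

    def train_cascade(positives, background_pool, train_stage, n_stages):
        """Bootstrapping sketch: each stage is trained on the positives and on the
        background windows that all previous stages still misclassify as object."""
        stages, hard_negatives = [], list(background_pool)
        for _ in range(n_stages):
            stages.append(train_stage(positives, hard_negatives))
            hard_negatives = [x for x in hard_negatives if cascade_classify(x, stages)]
            if not hard_negatives:
                break
        return stages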
Pose variables
Certain discriminative models can also be implemented with latent pose
parameters [11]. Assume a generic classifier defined in terms of a space
of classifier functions f (x; u) parameterized by u. Usual training of a discrimi-
native model consists of solving an equation of the form
    min_u  Σ_{i=1}^{n} D(yi, f(xi; u)) + C(u),
for some regularization term C(u) which prevents overfitting and a loss function
D measuring the distance between the classifier output f (xi ; u) and the ground
truth label yi = 1 for object and yi = 0 for background.
The minimization above can be replaced by
    min_u  Σ_{i: yi=1} min_{θ∈Θ} D(1, f(θ(xi); u)) + Σ_{i: yi=0} max_{θ∈Θ} D(0, f(θ(xi); u)) + C(u).
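Here θ(xi) denotes the features of example xi extracted under pose θ, so positive examples are scored at their best pose while negative examples must look like background under every pose. A common way to approach this objective is to alternate between choosing poses and refitting the classifier, as in the sketch below; fit_classifier and score are hypothetical stand-ins, the pose set is assumed to be discretized, and real systems such as latent SVMs add machinery that is omitted here.

    def train_latent(positives, negatives, poses, fit_classifier, score, n_rounds=5):
        """Alternating sketch for the latent-pose objective above.

        For the current parameters u, each positive example is assigned its
        best-scoring pose (the inner min of the loss over theta) and each negative
        contributes its highest-scoring pose (the inner max of the loss over theta);
        the classifier is then refit on these pose-resolved examples.
        """
        u = fit_classifier([(x, poses[0]) for x in positives],
                           [(x, poses[0]) for x in negatives])   # arbitrary initial poses
        for _ in range(n_rounds):
            pos = [(x, max(poses, key=lambda th: score(u, x, th))) for x in positives]
            neg = [(x, max(poses, key=lambda th: score(u, x, th))) for x in negatives]
            u = fit_classifier(pos, neg)
        return u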
[Figure: Region Proposal Network architecture — convolutional layers produce a feature map; a sliding window over the feature map predicts 2k scores and 4k box coordinates for k anchor boxes at each location, and proposals are passed through RoI pooling to a classifier. Example detections with class scores are shown.]
Computational Methods
The basic detection process consists of searching over pose parameters to classify
each hypothesis. At a minimum this usually involves searching over locations and
sizes and is clearly a very intensive computation. There are a number of methods
to make it more efficient.
Sparse features
When sparse features are used it is possible to focus the computation only
in regions around features. The two main approaches that take
advantage of this sparsity are alignment [22] and the generalized Hough trans-
form. Alignment uses information regarding the relative locations of the features
on the object. In this case the locations of some features determine the possible
locations of the other features. Various search methods enable a quick decision
on whether a sufficient number of features were found to declare object, or not.
The Hough transform typically uses information on the location of each feature
type relative to some reference point in the object. Each detected feature votes
with some weight for a set of candidate locations of the reference point. Loca-
tions with a sufficiently large sum of weighted votes determine detections. This
idea can be generalized to include identification of scale as well. The vot-
ing weights can be obtained either through discriminative training or through
generative training [2].
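A sketch of the voting step is given below; the representation of detected features as (type, location) triples, the learned offset table, and the threshold are illustrative assumptions, and real implementations typically vote over scale as well and smooth the accumulator before picking maxima.

    import numpy as np

    def hough_detect(detected_features, offsets_by_type, image_shape, vote_threshold):
        """Generalized Hough voting: each detected feature casts weighted votes for
        candidate locations of the object reference point."""
        accumulator = np.zeros(image_shape)
        for ftype, y, x in detected_features:
            for dy, dx, weight in offsets_by_type.get(ftype, []):
                ry, rx = y + dy, x + dx
                if 0 <= ry < image_shape[0] and 0 <= rx < image_shape[1]:
                    accumulator[ry, rx] += weight
        ys, xs = np.where(accumulator >= vote_threshold)
        return list(zip(ys.tolist(), xs.tolist())), accumulator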
Cascades
As mentioned above, the cascade method trains a sequence of classifiers
with successively more difficult background data. Each such classifier is
usually designed to be simple and computationally efficient. When the data in
the window Xs+W is declared background by any classifier of the cascade the
decision is final and the computation proceeds to the next window. Since most
background windows are rejected early in the cascade most of the windows in
the image are processed very quickly.
Coarse to fine
The cascade method can be viewed as a coarse to fine decomposition of
background that gradually makes finer and finer discriminations between
object and background images that have significant resemblance to the object.
An alternative is to create a coarse to fine decomposition of object poses [9].
In this case it is possible to train classifiers that can rule out a large subset of
the pose space in a single step. A general setting involves a rooted tree where
the leaves correspond to individual detections and internal nodes store classifiers
that quickly rule out all detections below a particular node. The idea is closely
related to branch-and-bound methods [14] that use admissible lower-bounds to
search a space of transformations or hypotheses.
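A minimal sketch of this search is shown below, assuming a hypothetical node interface whose test is negative only when every pose below the node can be ruled out; leaves carry individual pose hypotheses.

    def coarse_to_fine_detect(x, node):
        """Recursively search a tree over nested subsets of the pose space.
        A negative test at an internal node prunes all poses beneath it."""
        if not node.test(x):
            return []                        # every pose under this node ruled out
        if not node.children:
            return [node.pose]               # a surviving individual detection
        detections = []
        for child in node.children:
            detections.extend(coarse_to_fine_detect(x, child))
        return detections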
Fig. 3. Bounding boxes and segmentation masks produced by Mask R-CNN. The
system is trained to detect and segment 80 object categories from the COCO
dataset. (Reproduced with permission from [21].)
Fig. 4. Bounding boxes, segmentation masks, and joint positions produced by
Mask R-CNN. The system is trained to detect, segment, and predict pose on the
person categories from the COCO dataset. (Reproduced with permission from
[21].)

Recommended Readings
[1] LeCun, Y., Boser, B., Denker, J.S., Henderson, D., Howard, R.E., Hubbard,
W., Jackel, L.D. (1989). Backpropagation applied to handwritten zip code
recognition. Neural computation 1(4) 541–551
[2] Amit, Y. (2002). 2D Object Detection and Recognition: Models, Algorithms
and Networks. MIT Press
[3] Felzenszwalb, P., Huttenlocher, D. (2005). Pictorial structures for object
recognition. International Journal of Computer Vision 61(1) 55–79
[4] Fergus, R., Perona, P., Zisserman, A. (2003). Object class recognition by
unsupervised scale-invariant learning. IEEE CVPR 2003
[5] Amit, Y., Trouvé, A. (2007). POP: Patchwork of parts models for object
recognition. International Journal of Computer Vision 75(2) 267–282
[6] Jin, Y., Geman, S. (2006). Context and hierarchy in a probabilistic image
model. IEEE CVPR 2006
[7] Rowley, H.A., Baluja, S., Kanade, T. (1998). Neural network-based face
detection. IEEE Transactions on Pattern Analysis and Machine Intelligence
20(1) 23–38
[8] Viola, P., Jones, M.J. (2004). Robust real-time face detection. International
Journal of Computer Vision 57(2) 137–154
[9] Fleuret, F., Geman, D. (2001). Coarse-to-fine face detection. International
Journal of Computer Vision 41(1-2) 85–107
[10] Dalal, N., Triggs, B. (2005). Histograms of oriented gradients for human
detection. IEEE CVPR 2005
[11] Felzenszwalb, P., Girshick, R., McAllester, D., Ramanan, D. (2010). Object
detection with discriminatively trained part based models. IEEE Transac-
tions on Pattern Analysis and Machine Intelligence 32(9) 1627–1645
[12] Reiss, T. (1993). Recognizing Planar Objects Using Invariant Image Fea-
tures. Springer-Verlag
[13] Riesenhuber, M., Poggio, T. (2000). Models of object recognition. Nature
neuroscience 3(11s) 1199–1204
[14] Lampert, C., Blaschko, M., Hofmann, T. (2009). Efficient subwindow
search: A branch and bound framework for object localization. IEEE Trans-
actions on Pattern Analysis and Machine Intelligence 31(12) 2129–2142
[15] Lowe, D. (2004). Distinctive image features from scale-invariant keypoints.
International Journal of Computer Vision 60(2) 91–110
[16] Chang, L.B., Jin, Y., Zhang, W., Borenstein, E., Geman, S. (2011). Context,
computation, and optimal ROC performance in hierarchical models. Interna-
tional Journal of Computer Vision 93(2) 117–140
[17] Krizhevsky, A., Sutskever, I., Hinton, G. (2012). ImageNet classification
with deep convolutional neural networks. Advances in Neural Information
Processing Systems
[18] Girshick, R., Donahue, J., Darrell, T., Malik, J. (2014). Rich feature hi-
erarchies for accurate object detection and semantic segmentation. IEEE
CVPR 2014
[19] Liu, W., Anguelov, D., Erhan, D., Szegedy, C., Reed, S., Fu, C.Y., Berg,
A.C. (2016). SSD: Single shot multibox detector. ECCV 2016
[20] Ren, S., He, K., Girshick, R., Sun, J. (2015). Faster R-CNN: Towards real-
time object detection with region proposal networks. Advances in Neural
Information Processing Systems
[21] He, K., Gkioxari, G., Dollár, P., Girshick, R. (2017). Mask R-CNN. ICCV
2017
[22] Ullman, S. (1996). High-Level Vision. MIT Press, Cambridge, MA.
[23] Lin, T.Y., Dollár, P., Girshick, R., He, K., Hariharan, B., Belongie, S.
(2017). Feature pyramid networks for object detection. IEEE CVPR 2017
[24] Lin, T.Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D.,
Dollár, P., Zitnick, C.L. (2014). Microsoft COCO: Common objects in
context. ECCV 2014