R-CNN minus R
Karel Lenc, http://www.robots.ox.ac.uk/~karel
Andrea Vedaldi, http://www.robots.ox.ac.uk/~vedaldi
Department of Engineering Science, University of Oxford, Oxford, UK
Abstract
Deep convolutional neural networks (CNNs) have had a major impact in most ar-
eas of image understanding. In object category detection, however, the best results have
been obtained by techniques such as R(egion)-CNN that combine CNNs with cues from
image segmentation, using techniques such as selective search to propose possible ob-
ject locations in images. However, the role of segmentation in CNN detectors remains
controversial. On the one hand, segmentation may be a necessary modelling compo-
nent, carrying essential geometric information not contained in the CNN; on the other
hand, it may be merely a way of accelerating detection, by focusing the CNN classi-
fier on promising image areas. In this paper, we answer this question by developing a
detector that uses a trivial region generation scheme, constant for each image. While
such region proposals approximate objects poorly, we show that a bounding box regres-
sor using intermediate convolutional features can recover sufficiently accurate bounding
boxes, demonstrating that, indeed, the required geometric information is contained in the
CNN itself. Combined with convolutional feature pooling, we also obtain an excellent
and fast detector that does not require processing an image with any algorithm other than
the CNN itself. We also streamline and simplify the training of CNN-based detectors by
integrating several learning steps in a single algorithm, as well as by proposing a number
of improvements that accelerate detection.
1 Introduction
Object detection is one of the core problems in image understanding. Until recently, the
best performing detectors in standard benchmarks such as PASCAL VOC were based on a
combination of handcrafted image representations such as SIFT, HOG, and the Fisher Vector
and a form of structured output regression, from sliding window to deformable parts mod-
els. Recently, however, these pipelines have been outperformed significantly by the ones
based on deep learning that induce representations automatically from data using Convolu-
tional Neural Networks (CNNs). Currently, the best CNN-based detectors are based on the
R(egion)-CNN construction of [10]. Conceptually, R-CNN is remarkably simple: it samples
image regions using a proposal mechanism such as Selective Search (SS; [21]) and classifies
them as foreground or background using a CNN. Looking more closely, however, R-CNN
leaves open several interesting questions.
The first question is whether a CNN contains sufficient geometric information to localise
objects, or whether the latter must be supplemented by an external mechanism, such as region
Figure 1: Some examples of the bounding box regressor outputs. The dashed box is the
image-agnostic proposal, correctly selected despite the poor overlap, and the solid box is the
result of improving it with the pose regressor. Both steps use the same CNN, but the first
uses the geometrically-invariant fully-connected layers and the second the geometry-sensitive
convolutional layers. In this manner, accurate object locations can be recovered without using
complementary mechanisms such as selective search.
proposal generation. There are in fact two hypotheses. The first one is that the only role of
proposal generation is to cut down computation by restricting the evaluation of the expensive
CNN to a small number of image regions. If this is the case, as other speedups
such as SPP-CNN [11] become available, proposal generation becomes less important and
could ultimately be removed. The second hypothesis is that, instead, proposal generation
provides geometric information which is not represented in the CNN and which is required
for accurate object localisation. This is not unlikely, given that CNNs are usually trained to
be invariant to large geometric deformations and hence may not be sensitive to an object’s
location. This question is answered in Section 3.1 by showing that the convolutional layers
of standard CNNs contain sufficient information to localise objects (Figure 1).
The second question is whether the R-CNN pipeline can be simplified. While conceptually
straightforward, R-CNN in fact comprises many practical steps that need to be carefully
implemented and tuned to obtain good performance. To start with, R-CNN builds on a
CNN pre-trained on an image classification task such as ImageNet ILSVRC [6]. This CNN
is ported to detection by: i) learning an SVM classifier for each object class on top of the
last fully-connected layer of the network, ii) fine-tuning the CNN on the task of discrim-
inating objects and background, and iii) learning a bounding box regressor for each object
class. Section 3.2 simplifies these steps, which require running a mix of different software on
cached data, by training a single CNN addressing all required tasks. A similar simplification
can also be found in a very recent evolution of R-CNN, namely Fast R-CNN [9].
The third question is whether R-CNN can be accelerated. A substantial speedup was
already obtained in spatial pyramid pooling (SPP) by [11] by realising that convolutional
features can be shared among different regions rather than being recomputed. However,
this does not accelerate training, and in testing the region proposal generation mechanism
becomes the new bottleneck. The combination of dropping proposal generation and of the
other simplifications is shown in Section 4 to provide a substantial detection speedup, and
this for the overall system, not just the CNN part. This is an alternative to the very recent
Faster R-CNN [17] scheme, which instead uses the convolutional features to construct a new
efficient proposal generation scheme. Our findings are summarised in Section 5.
Related work. The basis of our work is the current generation of deep CNNs for image un-
derstanding, pioneered by [14]. One of the first frameworks for object detection with CNNs
is the OverFeat framework [18], which tackles object detection by sliding a window over a
feature map produced by the CNN layers, followed by bounding box regression (BBR). Even
though the authors introduce a way to efficiently increase the number of evaluated locations,
their sliding window approach is limited to single aspect ratio bounding boxes (before BBR).
For object detection, our method builds directly on the R-CNN approach of [10] as well as
the SPP extension proposed in [11]. All such methods rely not only on CNNs, but also on
a region proposal generation mechanism such as SS [21], CPMC [3], multi-scale combina-
torial grouping [2], and edge boxes [27]. These methods, which are extensively reviewed
in [12], originate in the idea of “objectness” proposed by [1]. Interestingly, [12] showed that
a good region proposal scheme is essential for R-CNN to work well. Here, we show that
this is in fact not the case provided that bounding box locations are corrected by a strong
CNN-based bounding box regressor, a step that was not evaluated for R-CNNs in [12]. The
R-CNN and SPP-CNN detectors build on years of research in object detection. Both can
be seen as accelerated sliding window detectors [5, 24]. The two-stage computation using
region proposal generation is a form of cascade detector [24] or jumping window [20, 23].
However, they differ from part-based detectors such as [8] in that they do not explicitly model
object parts in learning; instead parts are implicitly captured in the CNN. As noted above,
ideas similar or alternative to Section 3.2 have been recently introduced in [9] and [17].
2 CNN-based detectors
This section summarises the R-CNN (Section 2.1) and SPP-CNN (Section 2.2) detectors.
2.1 R-CNN detector
SVM training. The R-CNN detector [10] scores each candidate region R with one SVM per
object class, trained on the CNN features φRCNN (x; R). In training, a candidate region R of
image xi is labelled as a positive example for class c̄ if overlap(R, R̄) ≥ τ with any ground
truth region R̄ ∈ Rgt (xi ) of the same class c(R̄) = c̄.
The remaining regions are labelled as negative. Here overlap(A, B) = |A ∩ B|/|A ∪ B| is the
intersection-over-union overlap measure, and the threshold is set to τ = 0.3. The SVM takes
the form φSVM ◦ φRCNN (x; R), where φSVM is a linear predictor ⟨wc , φRCNN ⟩ + bc learned using
an SVM solver to minimise the regularised empirical hinge-loss risk on the training regions.
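For concreteness, this stage amounts to fitting a one-vs-rest linear SVM per class on the cached descriptors. A minimal sketch using scikit-learn; the function name, solver, and value of C are illustrative, not those of [10]:

```python
import numpy as np
from sklearn.svm import LinearSVC

def train_region_svms(features, labels, num_classes, C=1e-3):
    """Fit one binary hinge-loss SVM per object class on cached CNN features.

    features: (N, D) array of phi_RCNN(x; R) descriptors, one row per region.
    labels:   (N,) integers in {0, ..., num_classes}, with 0 = background.
    """
    svms = []
    for c in range(1, num_classes + 1):
        y = (labels == c).astype(int)       # one-vs-rest: class c vs the rest
        svm = LinearSVC(C=C, loss='hinge')  # regularised empirical hinge risk
        svm.fit(features, y)
        svms.append(svm)
    return svms

# Scoring a region is then the linear predictor <w_c, phi> + b_c:
# score = svms[c - 1].decision_function(phi.reshape(1, -1))
```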
Bounding box regression. Candidate bounding boxes are refitted to detected objects by
using a CNN-based regressor as detailed in [10]. Given a candidate bounding box R =
(x, y, w, h), where (x, y) are its centre and (w, h) its width and height, a linear regressor es-
timates an adjustment d = (dx , dy , dw , dh ) that yields the new bounding box d[R] = (w dx +
x, h dy + y, w e^{dw} , h e^{dh} ). In order to train this regressor, one collects for each ground truth re-
gion R∗ all the candidates R that overlap sufficiently with it (with an overlap of at least 0.5).
Each pair (R∗ , R) of regions is converted into a training input/output pair (φcnv (x, R), d) for the
regressor, where d is the adjustment required to transform R into R∗ , i.e. R∗ = d[R]. The
pairs are then used to train the regressor using ridge regression with a large regularisation
constant. The regressor itself takes the form d = Q⊤c φcnv (resize(x|R )) + tc , where φcnv de-
notes the CNN restricted to the convolutional layers, as further discussed in Section 2.2. The
regressor is further improved by retraining it after removing the 20% of the examples with
worst regression loss – as found in the publicly-available implementation of SPP-CNN.
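The following sketch illustrates the target encoding and the ridge-regression fit; the function names and regulariser value are illustrative, and the 20% pruning-and-retraining step is omitted:

```python
import numpy as np

def bbr_targets(R, R_star):
    """Adjustment d with d[R] = R_star, for boxes (x, y, w, h), centre (x, y).

    Inverting d[R] = (w dx + x, h dy + y, w exp(dw), h exp(dh)) gives:
    """
    x, y, w, h = R
    xs, ys, ws, hs = R_star
    return np.array([(xs - x) / w, (ys - y) / h,
                     np.log(ws / w), np.log(hs / h)])

def fit_bbr(Phi, D, lam=1000.0):
    """Ridge regression for (Q_c, t_c) with a large regulariser lam.

    Phi: (N, K) pooled convolutional features, one row per candidate R.
    D:   (N, 4) regression targets from bbr_targets.
    """
    N, K = Phi.shape
    Phi1 = np.hstack([Phi, np.ones((N, 1))])  # absorb the bias t_c
    A = Phi1.T @ Phi1 + lam * np.eye(K + 1)
    W = np.linalg.solve(A, Phi1.T @ D)        # (K + 1, 4)
    return W[:-1], W[-1]                      # Q_c and t_c
```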
Post-processing. The refined bounding boxes are passed to non-maxima suppression be-
fore being evaluated. Non-maxima suppression eliminates duplicate detections prioritising
regions with higher SVM score φSVM ◦ φRCNN (x; R). Starting from the highest ranked region
in an image, other regions are iteratively removed if they overlap by more than 0.3 with any
region retained so far.
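A minimal sketch of this greedy procedure, assuming boxes in corner format (x1, y1, x2, y2) rather than the centre format used above:

```python
def iou(a, b):
    """Intersection-over-union of two boxes in corner format (x1, y1, x2, y2)."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    return inter / (area(a) + area(b) - inter)

def nms(boxes, scores, thresh=0.3):
    """Greedy non-maxima suppression: keep the highest-scoring box,
    drop boxes overlapping a kept box by more than `thresh`, repeat."""
    order = sorted(range(len(boxes)), key=lambda i: -scores[i])
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) <= thresh for j in keep):
            keep.append(i)
    return keep
```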
CNN fine-tuning. The quality of the CNN features, ported from an image classification
task, can be improved by fine-tuning the network on the target data. In order to do so, the
CNN φRCNN (x; R) is concatenated with additional layers φsftmx (a linear projection followed
by softmax normalisation) to obtain a predictor over the C + 1 classes (the C object classes plus background). The new CNN
φsftmx ◦ φRCNN (x; R) is then trained as a classifier by minimising its empirical logistic risk on
a training set of labelled regions. This is analogous to the procedure used to learn the CNN
in the first place, but with a reduced learning rate and a different (and smaller) training set
similar to the one used to train the SVM. In this dataset, a region R, either ground-truth or
candidate, is assigned the class c(R; τ+ , τ− ) = c(R̄∗ ) of the closest ground-truth region R̄∗ =
argmaxR̄∈Rgt (x) overlap(R, R̄), provided that overlap(R, R̄∗ ) ≥ τ+ . If instead overlap(R, R̄∗ ) <
τ− , then the region is labelled as c(R; τ+ , τ− ) = 0 (background), and the remaining regions
as ambiguous and ignored. By default τ+ and τ− are both set to 1/2, resulting in a much
more relaxed training set than for the SVM. Since the dataset is strongly biased towards
background regions, during CNN training it is rebalanced by sampling with 25% probability
regions such that c(R) > 0 and with 75% probability regions such that c(R) = 0.
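As an illustration, the labelling rule and the rebalanced sampling can be sketched as follows (reusing the iou helper from the non-maxima suppression sketch above; the batch size is an assumption):

```python
import random

def label_region(R, gt_boxes, gt_classes, tau_pos=0.5, tau_neg=0.5):
    """Fine-tuning label c(R; tau+, tau-): class of the closest ground truth
    if the best overlap is >= tau_pos, 0 (background) if it is < tau_neg,
    None (ambiguous, ignored) otherwise."""
    if not gt_boxes:
        return 0
    overlaps = [iou(R, G) for G in gt_boxes]
    best = max(range(len(gt_boxes)), key=lambda i: overlaps[i])
    if overlaps[best] >= tau_pos:
        return gt_classes[best]
    if overlaps[best] < tau_neg:
        return 0
    return None  # only reachable when tau_neg < tau_pos

def sample_batch(fg_regions, bg_regions, batch_size=128):
    """Rebalance the background-heavy dataset: sample 25% foreground
    and 75% background regions per training batch."""
    n_fg = min(batch_size // 4, len(fg_regions))
    return (random.sample(fg_regions, n_fg) +
            random.sample(bg_regions, batch_size - n_fg))
```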
2.2 SPP-CNN detector
Spatial pyramid pooling. The SPP-CNN detector [11] avoids recomputing the CNN features
for every candidate region: the convolutional layers are evaluated once per image, and a
spatial pooling step then captures the appearance of an image subregion R instead of the
whole image [4, 11]. In more detail, let y = φcnv (x) be the output of the convolutional
layers applied to image x. The feature field y is
an H × W × D tensor of height H and width W , proportional to the height and width of the
input image x, and D feature channels. Let z = SP(y; R) be the result of applying the spatial
pooling (SP) operator to the feature in y contained in region R. This operator is defined as:
$$z_d = \max_{(i,j)\,:\,g(i,j)\in R} y_{ijd}, \qquad d = 1,\dots,D, \tag{1}$$
where the function g maps the feature coordinates (i, j) back to image coordinates g(i, j).
The SP operator is extended to spatial pyramid pooling (SPP; [15]) by dividing the region
R into subregions R = R1 ∪ R2 ∪ · · · ∪ RK , applying the SP operator to each, and then stacking
the resulting features. In practice, SPP-CNN uses K × K subdivisions, where K is chosen
to match the size of the convolutional feature field in the original CNN. In this manner, the
output can be concatenated with the existing FC layers: φSPP (x; R) = φfc ◦SPP(·; R)◦φcnv (x).
Note that, compared to R-CNN, the first part of the computation is shared among all regions
R.
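For concreteness, the following Python sketch implements a single-level version of this pooling over a precomputed feature field; all names are illustrative, and the actual implementation of [11] differs (for instance, it pools several pyramid levels):

```python
import numpy as np

def spatial_pool(y, mask):
    """Equation (1): channel-wise max over the selected feature sites."""
    sel = y[mask]                              # (n_sites, D)
    return sel.max(axis=0) if sel.size else np.zeros(y.shape[2])

def spp(y, region_mask, K=6):
    """Single-level K x K spatial pyramid pooling over a feature field.

    y:           (H, W, D) convolutional feature field phi_cnv(x).
    region_mask: (H, W) boolean mask, True where the receptive-field
                 centre g(i, j) falls inside the image region R;
                 assumed to be non-empty.
    """
    ii, jj = np.nonzero(region_mask)
    i0, i1 = ii.min(), ii.max() + 1            # bounding extent of the region
    j0, j1 = jj.min(), jj.max() + 1            # in feature-map coordinates
    cells = []
    for a in range(K):
        for b in range(K):
            cell = np.zeros_like(region_mask)
            cell[i0 + (i1 - i0) * a // K: i0 + (i1 - i0) * (a + 1) // K,
                 j0 + (j1 - j0) * b // K: j0 + (j1 - j0) * (b + 1) // K] = True
            cells.append(spatial_pool(y, cell & region_mask))
    return np.concatenate(cells)               # K*K*D vector for the FC layers
```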
Next, we derive the map g that transforms feature coordinates back to image coordinates
as required by (1) (this correspondence was approximated in [11]). It suffices to consider
one spatial dimension. The question is which pixel x0 (i0 ) corresponds to feature xL (iL ) in
the L-th layer of a CNN. While there is no unique definition, a useful one is to let i0 be the
centre of the receptive field of feature xL (iL ), defined as the set of pixels ΩL (iL ) that can
affect xL (iL ) as a function of the image (i.e. the support of the feature seen as a function). A
short calculation leads to
$$i_0 = g_L(i_L) = \alpha_L(i_L - 1) + \beta_L, \qquad \alpha_L = \prod_{p=1}^{L} S_p, \qquad \beta_L = 1 + \sum_{p=1}^{L}\left(\prod_{q=1}^{p-1} S_q\right)\left(\frac{F_p - 1}{2} - P_p\right),$$
where the CNN layers are described geometrically by: padding Pl , downsampling factor Sl ,
and filter width Fl . The meaning of these parameters is obvious for linear convolution and
spatial pooling layers; most other layers can also be thought of as “convolutional” (e.g.
ReLU), but with null padding, unit subsampling, and unit filter width.
Given this definition of g, and similarly to [11], equation (1) pools the features whose
receptive field centre is contained in the image region R.
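This recursion is straightforward to evaluate numerically. A short sketch, where the function name and the five-layer geometry shown are purely illustrative rather than the exact ZF configuration:

```python
def receptive_field_centre(layers):
    """Accumulate alpha_L and beta_L for i0 = alpha_L * (i_L - 1) + beta_L.

    layers: list of (F, S, P) = (filter width, downsampling factor, padding),
    one triple per layer; ReLU-like layers contribute (1, 1, 0).
    """
    alpha, beta = 1.0, 1.0
    for F, S, P in layers:
        beta += alpha * ((F - 1) / 2.0 - P)  # uses the stride product so far
        alpha *= S                           # alpha_L = prod_p S_p
    return alpha, beta

# Illustrative layer geometry, one spatial dimension:
alpha, beta = receptive_field_centre([(7, 2, 3), (3, 2, 1), (5, 2, 2),
                                      (3, 1, 1), (3, 1, 1)])
print(alpha, beta)  # feature i_L sits at pixel alpha * (i_L - 1) + beta
```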
Figure 2: Bounding box distributions using the normalised coordinates of Section 3.1. Rows
show the histograms for the bounding box centre (x, y), size (w, h), and scale vs distance
from centre (s, |c|). Column shows the statistics for ground-truth, selective search, restricted
selective search, sliding window, and cluster bounding boxes (for n = 3000 clusters).
The distribution of the sliding-window boxes, shown in the fourth column of Figure 2, is
similar to SS and dissimilar from GT. This is a much denser sampling than in the OverFeat
framework [18], which evaluates approximately 1.6K boxes per image with a single aspect
ratio only.
A simple modification of sliding window is to bias sampling to match the statistics of
the GT bounding boxes. We do so by computing n K-means clusters from the collection of
vectors (rs , cs , re , ce ) obtained from the GT boxes in the PASCAL VOC training data. We
call this set of boxes R0 (n); the fifth column of Figure 2 shows that, as expected, the corre-
sponding distribution matches that of the GT nicely, even for a small set of n = 3000 cluster
centres. Section 4 shows empirically that, when combined with a CNN-based bounding box
regressor, this proposal set results in a very competitive (and very fast) detector.
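A sketch of this construction; the function name, the box normalisation, and the clustering settings are assumptions:

```python
import numpy as np
from sklearn.cluster import KMeans

def fixed_proposals(gt_boxes, n=3000):
    """Image-agnostic proposal set R0(n): cluster the (rs, cs, re, ce)
    vectors of the ground-truth boxes and keep the n cluster centres.

    gt_boxes: (M, 4) array of start/end rows and columns, assumed to be
    normalised by the image size so boxes are comparable across images.
    """
    km = KMeans(n_clusters=n, n_init=1).fit(gt_boxes)
    return km.cluster_centers_  # (n, 4): the same boxes for every image
```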
Table 1: Evaluation of SPP-CNN with and without the SVM classifier. The table reports
mAP on the PASCAL VOC 2007 test set for the single-scale and multi-scale detectors, with
or without bounding box regression. Different rows compare the different bounding box
scoring mechanisms of Section 3.2: the SVM scores Sc , the softmax posterior probability
scores Pc , and the modified softmax scores Pc /P0 .
method mAP aero bike bird boat bottle bus car cat chair cow table dog horse mbike person plant sheep sofa train tv
SVM MS 59.68 66.8 75.8 55.5 43.1 38.1 66.6 73.8 70.9 29.2 71.4 58.6 65.5 76.2 73.6 57.4 29.9 60.1 48.4 66.0 66.8
SVM SS 58.60 66.1 76.0 54.9 38.6 32.4 66.3 72.8 69.3 30.2 67.7 63.7 66.2 72.5 71.2 56.4 27.3 59.5 50.4 65.3 65.2
FC8 MS 58.38 69.2 75.2 53.7 40.0 33.0 67.2 71.3 71.6 26.9 69.6 60.3 64.5 74.0 73.4 55.6 25.3 60.4 47.0 64.9 64.4
FC8 SS 57.99 67.0 75.0 53.3 37.7 28.3 69.2 71.1 69.7 29.7 69.1 62.9 64.0 72.7 71.0 56.1 25.6 57.7 50.7 66.5 62.3
FC8 C3k MS 53.41 55.8 73.1 47.5 36.5 17.8 69.1 55.2 73.1 24.4 49.3 63.9 67.8 76.8 71.1 48.7 27.6 42.6 43.4 70.1 54.5
FC8 C3k SS 53.52 55.8 73.3 47.3 37.3 17.6 69.3 55.3 73.2 24.0 49.0 63.3 68.2 76.5 71.3 48.2 27.1 43.8 45.1 70.2 54.6
Table 2: Comparison of different variants of the SPP-CNN detector. First group of rows:
original SPP-CNN using Multi Scale (MS) or Single Scale (SS) detection. Second group:
the same experiment, but dropping the SVM and using the modified softmax scores of Sec-
tion 3.2. Third group: SPP-CNN without region proposal generation, but using a fixed set
of 3K candidate bounding boxes as explained in Section 3.1.
Here, we maintain the idea of evaluating the detector at test time by processing each image
at a single scale. However, this requires the CNN to explicitly learn scale invariance, which
is achieved by fine-tuning the CNN using randomly rescaled versions of the training data.
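To illustrate, a minimal augmentation sketch using PIL; the function name and the scale range are illustrative assumptions, not the settings of our pipeline:

```python
import random
from PIL import Image

def random_rescale(img, smin=224, smax=1120):
    """Rescale a training image by a random factor so the fine-tuned CNN
    sees objects at many sizes; (smin, smax) bound the longer image side."""
    s = random.uniform(smin, smax) / max(img.size)
    w, h = img.size
    return img.resize((max(1, int(w * s)), max(1, int(h * s))),
                      Image.BILINEAR)
```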
4 Experiments
This section evaluates the changes to R-CNN and SPP-CNN proposed in Section 3. All
experiments use the Zeiler and Fergus (ZF) small CNN [25], as this is the same network used
by [11] to introduce SPP-CNN. While more recent networks such as the very deep models
of Simonyan and Zisserman [19] are likely to perform better, this choice allows a direct
comparison with [11]. The detector itself is trained and evaluated on the PASCAL VOC
2007 data [7], as this is a standard benchmark for object detection and is used in [11] as well.
Dropping the SVM. The first experiment evaluates the performance of the SPP-CNN detec-
tor with or without the linear SVM classifier, comparing the bounding box scores Sc (SVM),
Pc (softmax), and Sc′ (modified softmax) of Section 3.2. As can be seen in Table 1 and
Table 2, the best performing method is SPP-CNN evaluated at multiple scales, resulting in
59.7% mAP on the PASCAL VOC 2007 test data (this number matches the one reported
in [11], validating our implementation). Removing the SVM and using the CNN softmax
scores directly performs very poorly, with a drop of 21.6 mAP points. However, adjusting
the softmax scores using the simple formula Pc /P0 restores the performance almost entirely,
back to 58.4% mAP. While there is still a small 1.3% drop in mAP compared to
using the SVM, removing the latter dramatically simplifies the detector pipeline, resulting
in particular in significantly faster training, as it removes the need to prepare and cache
data for the SVM (as well as to learn it).
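For concreteness, the modified scores can be computed from the raw class scores as follows (a sketch assuming the background class occupies column 0):

```python
import numpy as np

def modified_softmax_scores(logits):
    """Detection scores S'_c = P_c / P_0 from the fine-tuned softmax head.

    logits: (N, C+1) raw class scores for N regions; background in column 0.
    Dividing each class posterior by the background posterior (equivalently,
    subtracting the background logit) largely restores SVM-level mAP.
    """
    e = np.exp(logits - logits.max(axis=1, keepdims=True))  # stable softmax
    P = e / e.sum(axis=1, keepdims=True)
    return P[:, 1:] / P[:, :1]
```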
[Figure 3 plot: mAP vs. number of candidate boxes per image (×10³) for SS, Cx (clusters), and SW (sliding window), each with and without BBR.]
Figure 3: mAP on the PASCAL VOC 2007 test data as a function of the number of candidate
boxes per image, proposal generation method, and using or not bounding box regression. In
all cases, the CNN is fine-tuned for the particular bounding-box generation algorithm.
Table 3: Timing (in ms) of the original SPP-CNN and of our streamlined full-GPU imple-
mentation, broken down into selective search (SS), image loading and scaling (Prep),
CPU/GPU data transfer (Move), convolutional layers (Conv), spatial pyramid pooling
(SPP), fully connected layers and SVM evaluation (FC), and bounding box regression
(BBR). The performance of the tested classifiers is reported in the first two rows of Table 2.
Multi-scale evaluation. The second set of experiments assesses the importance of performing
multi-scale evaluation of the detector. Results are reported once more in Tables 1 and 2.
Multi-scale detection is again the best performing method, reaching 59.7% mAP. However,
single-scale testing is very close to this level of performance, at 58.6% mAP, a drop of just
1.1 points. Just like when removing the SVM, the resulting simplification, and in this case
detection speedup, makes this drop in accuracy more than tolerable. In particular, testing at
a single scale accelerates detection roughly five-fold.
Dropping region proposal generation. The next experiment evaluates replacing the SS
region proposals RSS (x) with the fixed proposals R0 (n) as suggested in Section 3.1 (fine-
tuning the CNN and retraining the bounding-box regression algorithm for the different region
proposals in the training set). Table 2 shows the detection performance for n = 3,000, a
number of candidates comparable with the 2,000 extracted by selective search. While there
is a drop in performance compared to using SS, this drop is small (59.68% vs 53.41% mAP,
i.e. about 6.3 points), which is surprising since the bounding box proposals are now oblivious
to the image content.
Figure 3 looks at these results in greater detail. Three bounding box generation meth-
ods are compared: selective search, sliding windows, and clustering (see also Section 3.1),
with or without bounding box regression. Neither clustering nor sliding windows result in
an accurate detector: even if the number of candidate boxes is increased substantially (up to
n = 7K), performance saturates at around 46% mAP. This is much poorer than the ∼56%
achieved by selective search. Bounding box regression improves selective search by about
3% mAP, up to ∼59%, but it has a much more significant effect on the other two meth-
ods, improving performance by about 10% mAP. Note that clustering with 3K candidates
performs as well as sliding window with 7K.
We can draw several interesting conclusions. First, for the same low number of candidate
boxes, selective search is much better than any fixed proposal set; less expected is that per-
formance does not increase even with 2× more candidates, indicating that the CNN is unable
to tell which bounding boxes fit objects better, even when tight boxes are contained in the
shortlist of proposals. This can be explained by the high degree of geometric invariance in
the CNN. At the same time, the CNN-based bounding box regressor can make loose bound-
ing boxes significantly tighter, which requires geometric information to be preserved by the
CNN. This apparent contradiction can be explained by noting that bounding box classifica-
tion is built on top of the FC layers of the CNN, whereas bounding box regression is built on
the convolutional ones. Evidently, geometric information is removed in the FC layers, but is
still contained in the convolutional layers (see also Figure 1).
Detection speed. The last experiment (Table 3) evaluates the detection speed of SPP-CNN
(which is already orders of magnitude faster than R-CNN) and of our streamlined implementa-
tion using MatConvNet [22] (the original SPP detector uses Caffe [13] with identical GPU
kernels). Not counting SS proposal generation, the streamlined implementation is between
1.7× (multi-scale) and 2.6× (single-scale) faster than the original SPP, with the most sig-
nificant gain coming from the integrated SPP and bounding box regression implementation
on the GPU and the consequent reduction of data-transfer cost between CPU and GPU.
As suggested before, however, the bottleneck is selective search. Compared to the slow-
est MS SPP-CNN implementation of [11], using all the simplifications of Section 3, includ-
ing the removal of selective search, results in an overall detection speedup of more than 16×,
from about 2.5 s per image down to 160 ms (at a cost of about 6 mAP points).
5 Conclusions
Our most significant finding is that current CNNs do not need to be supplemented with ac-
curate geometric information from segmentation-based methods to achieve accurate
object detection. The necessary geometric information is in fact contained in the CNN, albeit
in the intermediate convolutional layers instead of the deeper fully-connected ones (this find-
ing is independently corroborated by visualisations such as the ones in [16]). This does not
mean that proposal generation is not useful; in particular, in datasets such as MS COCO
that contain many small objects, a fixed list of proposals might not work as well as it does
for PASCAL VOC. However, our findings mean that fairly coarse proposals are sufficient, as
the geometric information can be extracted from the CNN.
These findings open the possibility of building state-of-the-art object detectors that rely
exclusively on CNNs, removing region proposal generation schemes such as selective search,
and resulting in integrated, simpler, and faster detectors. Our current implementation of a
proposal-free detector is already much faster than SPP-CNN and very close, but not quite as
good, in terms of mAP. However, we have only begun exploring the design possibilities, and
we believe that it is a matter of time before the gap closes entirely. In fact, papers that recently
appeared on arXiv, such as [17], appear to be heading in this direction.
6 Acknowledgements
Financial support for Karel Lenc was provided by a BP industrial grant and by a DTA of the
Engineering Science Department of the University of Oxford.
References
[1] B. Alexe, T. Deselaers, and V. Ferrari. What is an object? In Proc. CVPR, 2010.
[2] P. Arbeláez, J. Pont-Tuset, J. Barron, F. Marques, and J. Malik. Multiscale combinatorial grouping. In Proc. CVPR, 2014.
[3] J. Carreira and C. Sminchisescu. CPMC: Automatic object segmentation using constrained parametric min-cuts. PAMI, 2012.
[4] M. Cimpoi, S. Maji, and A. Vedaldi. Deep convolutional filter banks for texture recog-
nition and segmentation. In Proc. CVPR, 2015.
[5] N. Dalal and B. Triggs. Histograms of oriented gradients for human detection. In Proc.
CVPR, 2005.
[6] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. ImageNet: A Large-Scale
Hierarchical Image Database. In Proc. CVPR, 2009.
[7] M. Everingham, A. Zisserman, C. Williams, and L. Van Gool. The PASCAL visual object classes challenge 2007 (VOC2007) results. Technical report, Pascal Challenge, 2007.
[8] P. F. Felzenszwalb, R. B. Girshick, D. McAllester, and D. Ramanan. Object detection with discriminatively trained part based models. PAMI, 2010.
[9] R. B. Girshick. Fast R-CNN. arXiv:1504.08083, 2015.
[10] R. B. Girshick, J. Donahue, T. Darrell, and J. Malik. Rich feature hierarchies for
accurate object detection and semantic segmentation. In Proc. CVPR, 2014.
[11] K. He, X. Zhang, S. Ren, and J. Sun. Spatial pyramid pooling in deep convolutional
networks for visual recognition. In Proc. ECCV, 2014.
[12] J. Hosang, R. Benenson, P. Dollár, and B. Schiele. What makes for effective detection proposals? arXiv:1502.05082, 2015.
[13] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell. Caffe: Convolutional architecture for fast feature embedding. arXiv:1408.5093, 2014.
[14] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In Proc. NIPS, 2012.
[15] S. Lazebnik, C. Schmid, and J. Ponce. Beyond bag of features: Spatial pyramid match-
ing for recognizing natural scene categories. In Proc. CVPR, 2006.
[16] A. Mahendran and A. Vedaldi. Understanding deep image representations by inverting them. In Proc. CVPR, 2015.
[17] S. Ren, K. He, R. Girshick, and J. Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. arXiv:1506.01497, 2015.
[18] P. Sermanet, D. Eigen, X. Zhang, M. Mathieu, R. Fergus, and Y. LeCun. OverFeat: Integrated recognition, localization and detection using convolutional networks. CoRR, abs/1312.6229, 2013.
[19] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale
image recognition. CoRR, abs/1409.1556, 2014.
[20] J. Sivic, B. C. Russell, A. A. Efros, A. Zisserman, and W. T. Freeman. Discovering objects and their location in images. In Proc. ICCV, 2005.
[21] J. Uijlings, K. van de Sande, T. Gevers, and A. Smeulders. Selective search for object
recognition. IJCV, 2013.
[22] A. Vedaldi and K. Lenc. MatConvNet: Convolutional neural networks for MATLAB. CoRR, abs/1412.4564, 2014.
[23] A. Vedaldi, V. Gulshan, M. Varma, and A. Zisserman. Multiple kernels for object
detection. In Proc. ICCV, 2009.
[24] P. Viola and M. Jones. Rapid object detection using a boosted cascade of simple fea-
tures. In Proc. CVPR, 2001.
[25] M. D. Zeiler and R. Fergus. Visualizing and understanding convolutional networks. In
Proc. ECCV, 2014.
[26] Q. Zhao, Z. Liu, and B. Yin. Cracking BING and beyond. In Proc. BMVC, 2014.
[27] C. Zitnick and P. Dollár. Edge boxes: Locating object proposals from edges. In Proc.
ECCV, 2014.