
Active Learning for Deep Object Detection

Clemens-Alexander Brust¹, Christoph Käding¹,² and Joachim Denzler¹,²


1 Computer Vision Group, Friedrich Schiller University Jena, Germany
2 Michael Stifel Center Jena, Germany

Keywords: Active Learning, Deep Learning, Object Detection, YOLO, Continuous Learning, Incremental Learning.

Abstract: The great success that deep models have achieved in the past is mainly owed to large amounts of labeled training data. However, the acquisition of labeled data for new tasks aside from existing benchmarks is both challenging and costly. Active learning can make the process of labeling new data more efficient by selecting unlabeled samples which, when labeled, are expected to improve the model the most. In this paper, we combine a novel method of active learning for object detection with an incremental learning scheme (Käding et al., 2016b) to enable continuous exploration of new unlabeled datasets. We propose a set of uncertainty-based active learning metrics suitable for most object detectors. Furthermore, we present an approach to leverage class imbalances during sample selection. All methods are evaluated systematically in a continuous exploration context on the PASCAL VOC 2012 dataset (Everingham et al., 2010).

1 INTRODUCTION

Labeled training data is highly valuable and the basic requirement of supervised learning. Active learning aims to expedite the process of acquiring new labeled data, ordering unlabeled samples by the expected value from annotating them. In this paper, we propose novel active learning methods for object detection. Our main contributions are (i) an incremental learning scheme for deep object detectors without catastrophic forgetting based on (Käding et al., 2016b), (ii) active learning metrics for detection derived from uncertainty estimates and (iii) an approach to leverage selection imbalances for active learning.

While active learning is widely studied in classification tasks (Kovashka et al., 2016; Settles, 2009), it has received much less attention in the domain of deep object detection. In this work, we propose methods that can be used with any object detector that predicts a class probability distribution per object proposal. Scores from individual detections are aggregated into a score for the whole image (see Fig. 1). All methods rely on the intuition that model uncertainty and valuable samples are likely to co-occur (Settles, 2009). Furthermore, we show how the balanced selection of new samples can improve the resulting performance of an incrementally learned system.

In continuous exploration application scenarios, e.g., in camera streams, new data becomes available over time or the distribution underlying the problem changes itself. We simulate such an environment using splits of the PASCAL VOC 2012 (Everingham et al., 2010) dataset. With our proposed framework, a deep object detection system can be trained in an incremental manner while the proposed aggregation schemes enable selection of valuable data for annotation. In consequence, a deep object detector can explore unknown data and adapt itself involving minimal human supervision. This combination results in a complete system enabling continuously changing scenarios.

1.1 Related Work

Object Detection using CNNs. An important contribution to object detection based on deep learning is R-CNN (Girshick et al., 2014). It delivers a considerable improvement over previously published sliding window-based approaches. R-CNN employs selective search (Uijlings et al., 2013), an unsupervised method to generate region proposals. A pre-trained CNN performs feature extraction. Linear SVMs (one per class) are used to score the extracted features and a threshold is applied to filter the large number of proposed regions. Fast R-CNN (Girshick, 2015) and Faster R-CNN (Ren et al., 2015) offer further improvements in speed and accuracy.


Figure 1: Our proposed system for continuous exploration scenarios. Unlabeled images are evaluated by a deep object detection method. The margins of predictions (i.e., absolute difference of highest and second-highest class score) are aggregated to identify valuable instances by combining scores of individual detections.

Later on, R-CNN is combined with feature pyramids to enable efficient multi-scale detections (Lin et al., 2017). YOLO (Redmon et al., 2016) is a more recent deep learning-based object detector. Instead of using a CNN as a black box feature extractor, it is trained in an end-to-end fashion. All detections are inferred in a single pass (hence the name "You Only Look Once") while detection and classification are capable of independent operation. YOLOv2 (Redmon and Farhadi, 2017) and YOLOv3 (Redmon and Farhadi, 2018) improve upon the original YOLO in several aspects. These include among others different network architectures, different priors for bounding boxes and considering multiple scales during training and detection. SSD (Liu et al., 2016) is a single-pass approach comparable to YOLO introducing improvements like assumptions about the aspect ratio distribution of bounding boxes as well as predictions on different scales. As a result of a series of improvements, it is both faster and more accurate than the original YOLO. DSSD (Fu et al., 2017) further improves upon SSD in focusing more on context with the help of deconvolutional layers.

Active Learning for Object Detection. The authors of (Abramson and Freund, 2006) propose an active learning system for pedestrian detection in videos taken by a camera mounted on the front of a moving car. Their detection method is based on AdaBoost while sampling of unlabeled instances is realized by hand-tuned thresholding of detections. Object detection using generalized Hough transform in combination with randomized decision trees, called Hough forests, is presented in (Yao et al., 2012). Here, costs are estimated for annotations, and instances with highest costs are selected for labeling. This follows the intuition that those examples are most likely to be difficult and therefore considered most valuable. Another active learning approach for satellite images using sliding windows in combination with an SVM classifier and margin sampling is proposed in (Bietti, 2012). The combination of active learning for object detection with crowd sourcing is presented in (Vijayanarasimhan and Grauman, 2014). A part-based detector for SVM classifiers in combination with hashing is proposed for use in large-scale settings. Active learning is realized by selecting the most uncertain instances for labeling. In (Roy et al., 2016), object detection is interpreted as a structured prediction problem using a version space approach in the so called "difference of features" space. The authors propose different margin sampling approaches estimating the future margin of an SVM classifier.

Like our proposed approach, most related methods presented above rely on uncertainty indicators like least confidence or 1-vs-2. However, they are designed for a specific type of object detection and therefore can not be applied to deep object detection methods in general whereas our method can. Additionally, our method does not propose single objects to the human annotator. It presents whole images at once and requests labels for every object.

Active Learning for Deep Architectures. In (Wang and Shang, 2014) and (Wang et al., 2016), uncertainty-based active learning criteria for deep models are proposed. The authors offer several metrics to estimate model uncertainty, including least confidence, margin or entropy sampling. Wang et al. additionally describe a self-taught learning scheme, where the model's prediction is used as a label for further training if uncertainty is below a threshold. Another type of margin sampling is presented in (Stark et al., 2015). The authors propose querying samples according to the quotient of the highest and second-highest class probability. The visual detection of defects using a ResNet is presented in (Feng et al., 2017). The authors propose two methods: uncertainty sampling (i.e., defect probability of 0.5) and positive sampling (i.e., selecting every positive sample since they are very rare) for querying unlabeled instances as model update after labeling. Another work which presents uncertainty sampling is (Liu et al., 2017). In addition, a query by committee strategy as well as active learning involving weighted incremental dictionary learning for active learning are proposed.


In the work of (Gal et al., 2017), several uncertainty-related measures for active learning are presented. Since they use Bayesian CNNs, they can make use of the probabilistic output and employ methods like variance sampling, entropy sampling or maximizing mutual information. Finally, the authors of (Beluch et al., 2018) show that ensemble-based uncertainty measures are able to perform best in their evaluation. All of the works introduced above are tailored to active learning in classification scenarios. Most of them rely on model uncertainty, similar to our applied selection criteria.

Besides estimating the uncertainty of the model, further retraining-based approaches maximize the expected model change (Huang et al., 2016) or the expected model output change (Käding et al., 2016a) that unlabeled samples would cause after labeling. Since each bounding box inside an image has to be evaluated according to its active learning score, both measures would be impractical in terms of runtime without further modifications. A more complete overview of general active learning strategies can be found in (Kovashka et al., 2016; Settles, 2009).

2 PREREQUISITE: ACTIVE LEARNING

In active learning, a value or metric v(x) is assigned to any unlabeled example x to determine its possible contribution to model improvement. The current model's output can be used to estimate a value, as can statistical properties of the example itself. A high v(x) means that the example should be preferred during selection because of its estimated value for the current model.

In the following section, we propose a method to adapt an active learning metric for classification to object detection using an aggregation process. This method is applicable to any object detector whose output contains class scores for each detected object.

Classification. For classification, the model output for a given example x is an estimated distribution of class scores p̂(c|x) over classes K. This distribution can be analyzed to determine whether the model made an uncertain prediction, a good indicator of a valuable example. Different measures of uncertainty are a common choice for selection, e.g., (Ertekin et al., 2007; Fu and Yang, 2015; Hoi et al., 2006; Jain and Kapoor, 2009; Kapoor et al., 2010; Käding et al., 2016c; Tong and Koller, 2001; Beluch et al., 2018).

For example, if the difference between the two highest class scores is very low, the example may be located close to a decision boundary. In this case, it can be used to refine the decision boundary and is therefore valuable. The metric is defined using the highest scoring classes c1 and c2:

    v_1vs2(x) = 1 − (max_{c1 ∈ K} p̂(c1|x) − max_{c2 ∈ K\c1} p̂(c2|x))² .    (1)

This procedure is known as 1-vs-2 or margin sampling (Settles, 2009). We use 1-vs-2 as part of our methods since its operation is intuitive and it can produce better estimates than, e.g., least confidence approaches (Käding et al., 2016a). However, our proposed aggregation method could be applied to any other active learning measure.

3 ACTIVE LEARNING FOR DEEP OBJECT DETECTION

Using a classification metric on a single detection is straightforward, if class scores are available. However, aggregating the metrics of individual detections into a score for a complete image can be done in many different ways. In the section below, we propose simple and efficient aggregation strategies. Afterwards, we discuss the problem of class imbalance in datasets.

3.1 Aggregation of Detection Metrics

Possible aggregations include calculating the sum, the average or the maximum over all detections. However, for some aggregations, it is not clear how an image without any detections should be handled.

Sum. A straightforward method of aggregation is the sum. Intuitively, this method prefers images with lots of uncertain detections in them. When aggregating detections using a sum, empty examples should be valued zero. It is the neutral element of addition, making it a reasonable value for an empty sum. A low valuation effectively delays the selection of empty examples until there are either no better examples left or the model has improved enough to actually produce detections on them. The value of a single example x can be calculated from the detections D in the following way:

    v_Sum(x) = Σ_{i ∈ D} v_1vs2(x_i) .    (2)
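To make Eqs. (1) and (2) concrete, the following minimal Python sketch scores a single unlabeled image from its per-detection class-score vectors. It is an illustrative sketch only, not the authors' released implementation; the function names and the assumption that each detection yields a normalized class-score vector are editorial.

    # Illustrative sketch (not the released implementation):
    # 1-vs-2 margin score per detection, Eq. (1), and Sum aggregation, Eq. (2).
    import numpy as np

    def v_1vs2(class_scores):
        """Margin-based uncertainty of one detection from its class-score vector."""
        top2 = np.sort(np.asarray(class_scores, dtype=float))[-2:]
        margin = top2[1] - top2[0]          # highest minus second-highest score
        return 1.0 - margin ** 2

    def v_sum(detections):
        """Whole-image value via Sum aggregation; empty images score 0."""
        return sum(v_1vs2(scores) for scores in detections)

    # Example: one uncertain and one confident detection in the same image.
    print(v_sum([[0.5, 0.4, 0.1], [0.9, 0.05, 0.05]]))

Averaging or taking the maximum of the same per-detection scores yields the alternatives discussed next.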


Average. Another possibility is averaging each detection's scores. The average is not sensitive to the number of detections, which may make scores more comparable between images. If a sample does not contain any detections, it will be assigned a zero score. This is an arbitrary rule because there is no true neutral element w.r.t. averages. However, we choose zero to keep the behavior in line with the other metrics:

    v_Avg(x) = (1 / |D|) Σ_{i ∈ D} v_1vs2(x_i) .    (3)

Maximum. Finally, individual detection scores can be aggregated by calculating the maximum. This can result in a substantial information loss. However, it may also prove beneficial because of increased robustness to noise from many detections. For the maximum aggregation, a zero score for empty examples is valid. The maximum is not affected by zero valued detections, because no single detection's score can be lower than zero:

    v_Max(x) = max_{i ∈ D} v_1vs2(x_i) .    (4)

3.2 Handling Selection Imbalances

Class imbalances can lead to worse results for classes underrepresented in the training set. In a continuous learning scenario, this imbalance can be countered during selection by preferring instances where the predicted class is underrepresented in the training set. An instance is weighted by the following rule:

    w_c = (#instances + #classes) / (#instances_c + 1) ,    (5)

where c denotes the predicted class. We assume a symmetric Dirichlet prior with α = 1, meaning that we have no prior knowledge of the class distribution, and estimate the posterior after observing the total number of training instances as well as the number of instances of class c in the training set. The weight w_c is then defined as the inverse of the posterior to prefer underrepresented classes. It is multiplied with v_1vs2(x) before aggregation to obtain a final score.

4 EXPERIMENTS

In the following, we present our evaluation. First we show how the proposed aggregation metrics are able to enhance recognition performance while selecting new data for annotation. After this, we will analyze the gained improvements when our proposed weighting scheme is applied. The code for our experiments is available¹.

¹ https://github.com/cvjena/cn24-active

Data. We use the PASCAL VOC 2012 dataset (Everingham et al., 2010) to assess the effects of our methods on learning. To specifically measure incremental and active learning performance, both training and validation set are split into parts A and B in two different random ways to obtain more general experimental results. Part B is considered "new" and is comprised of images with the object classes bird, cow and sheep (first way) or tvmonitor, cat and boat (second way). Part A contains all other 17 classes and is used for initial training. The training set for part B contains 605 and 638 images for the first and second way, respectively. The decision towards VOC in favor of recently published datasets was motivated by the conditions of the dataset itself. Since it mainly contains images showing fewer objects, it is possible to split the data into a known and unknown part without having images containing classes from both parts of the splits.

Active Exploration Protocol. Before an experimental run, the part B datasets are divided randomly into unlabeled batches of ten samples each. This fixed assignment decreases the probability of very similar images being selected for the same batch compared to always selecting the highest valued samples, which would lead to less diverse batches. This is valuable while dealing with data streams, e.g., from camera traps, or data with low intra-class variance. The construction of diverse unlabeled data batches is a well known topic in batch-mode active learning (Settles, 2009). However, the construction of diverse batches could lead to unintended side-effects and an evaluation of those is beyond the scope of the current study. The unlabeled batch size is a trade-off between a tight feedback loop (smaller batches) and computational efficiency (larger batches). As a side-effect of the fixed batch assignment, there are some samples left over during selection (i.e., five for the first way and eight for the second way of splitting).

The unlabeled batches are assigned a value using the sum of the active learning metric over all images in the corresponding batch as a meta-aggregation. Other functions such as average or maximum could be considered too, but are also beyond the scope of this paper.

The highest valued batch is selected for an incremental training step (Käding et al., 2016b). The network is updated using the annotations from the dataset in lieu of a human annotator. Please note, annotations are not needed for update batch selection but for the update itself. This process is repeated from the point of batch valuation until there are no unlabeled batches left. The assignment of samples to unlabeled batches is not changed during an experimental run.

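Continuing the earlier sketch (and reusing its hypothetical v_1vs2 helper), the snippet below illustrates how the class-balancing weight of Eq. (5) could enter the image score and how batches might be ranked by the summed image values described above. Class counts, data layout and function names are editorial assumptions, not the released code.

    # Illustrative sketch: weighting of Eq. (5) and batch-level meta-aggregation.
    def class_weight(train_counts, c):
        """w_c = (#instances + #classes) / (#instances_c + 1), cf. Eq. (5)."""
        total_instances = sum(train_counts.values())
        num_classes = len(train_counts)
        return (total_instances + num_classes) / (train_counts.get(c, 0) + 1)

    def weighted_image_score(detections, predicted_classes, train_counts):
        """Sum aggregation of weighted 1-vs-2 scores for a single image."""
        return sum(class_weight(train_counts, c) * v_1vs2(scores)
                   for scores, c in zip(detections, predicted_classes))

    def select_batch(unlabeled_batches, train_counts):
        """Pick the unlabeled batch with the highest summed image value."""
        return max(unlabeled_batches,
                   key=lambda batch: sum(weighted_image_score(d, p, train_counts)
                                         for d, p in batch))

Here each batch is assumed to be a list of (detections, predicted_classes) pairs, one pair per image.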

Algorithm 1: Detailed description of the experimental protocol. Please note that in an actual continuous learning scenario, new examples are always added to U. The loop is never left because U is never exhausted. The described splitting process would have to be applied regularly.

    Require: known labeled samples L, unknown samples U, initial model f0, active learning metric v
      U = {U1, U2, ...} ← split of U into random batches
      f ← f0
      while U is not empty do
        calculate scores for all batches in U using f
        Ubest ← highest scoring batch in U according to v
        Ybest ← annotations for Ubest          (human-machine interaction)
        f ← incrementally train f using L and (Ubest, Ybest)
        U ← U \ Ubest
        L ← L ∪ (Ubest, Ybest)
      end while
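Procedurally, Algorithm 1 can be read as the following Python sketch; the scoring, annotation and incremental-training routines are passed in as placeholders, since the concrete system behind them is the YOLO-based setup described in Section 4.

    # Sketch of the loop in Algorithm 1 (all callables are placeholders).
    def explore(L, unlabeled_batches, model, score_batch, annotate, train_incrementally):
        while unlabeled_batches:
            # value every remaining batch with the current model f
            scores = [score_batch(model, batch) for batch in unlabeled_batches]
            best_idx = scores.index(max(scores))
            U_best = unlabeled_batches.pop(best_idx)
            Y_best = annotate(U_best)                     # human-machine interaction
            model = train_incrementally(model, L, (U_best, Y_best))
            L = L + list(zip(U_best, Y_best))             # selected batch becomes known data
        return model, L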
Evaluation. We report mean average precision (mAP) as described in (Everingham et al., 2010) and validate after each five new batches (i.e., 50 new samples). The result is averaged over five runs for each active learning metric and way of splitting for a total of ten runs. As a baseline for comparison, we evaluate the performance of random selection, since there is no other work suitable for direct comparison without any adjustments as of yet.

Setup – Object Detector. We use YOLO as our deep object detection framework (Redmon et al., 2016). More precisely, we use the YOLO-Small architecture as an alternative to larger object detection networks, because it allows for much faster training. Our initial model is obtained by fine-tuning the Extraction model² on part A of the VOC dataset for 24,000 iterations using the Adam optimizer (Kingma and Ba, 2014), for each way of splitting the dataset into parts A and B, resulting in two initial models. The first half of initial training is completed with a learning rate of 0.0001. The second half and all incremental experiments use a lower learning rate of 0.00001 to prevent divergence. Other hyperparameters match (Redmon et al., 2016), including the augmentation of training data using random crops, exposure or saturation adjustments.

² http://pjreddie.com/media/files/extraction.weights

Setup – Incremental Learning. Extending an existing CNN without sacrificing performance on known data is not a trivial task. Fine-tuning exclusively on new data leads to a severe degradation of performance on previously learned examples (Kirkpatrick et al., 2016; Shmelkov et al., 2017). We use a straightforward, but effective fine-tuning method (Käding et al., 2016b) to implement incremental learning. With each gradient step, the mini-batch is constructed by randomly selecting from old and new examples with a certain probability of λ or 1 − λ, respectively. After completing the learning step, the new data is simply considered old data for the next step. This method can balance known and unknown data performance successfully. We use a value of 0.5 for λ to make as few assumptions as possible and perform 100 iterations per update. Algorithm 1 describes the protocol in more detail. The method can be applied to YOLO object detection with some adjustments. Mainly, the architecture needs to be changed when new classes are added. Because of the design of YOLO's output layer, we rearrange the weights to fit new classes, adding 49 zero-initialized weights per class.
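The mixing of old and new examples in this fine-tuning method can be sketched as follows; only the λ-based selection is taken from the text, the data structures are editorial assumptions.

    # Sketch of the incremental fine-tuning mini-batch (the experiments use lambda = 0.5):
    # every slot is drawn from the old data with probability lambda, otherwise from the new data.
    import random

    def build_minibatch(old_data, new_data, batch_size, lam=0.5, rng=random):
        batch = []
        for _ in range(batch_size):
            pool = old_data if rng.random() < lam else new_data
            batch.append(rng.choice(pool))
        return batch

    # After the update iterations, the new samples are merged into the old data:
    # old_data.extend(new_data)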
4.1 Results

We focus our analysis on the new, unknown data. However, not losing performance on known data is also important. We analyze the performance on the known part of the data (i.e., part A of the VOC dataset) to validate our method. In the worst case, the mAP decreases from 36.7% initially to 32.1% averaged across all experimental runs and methods although three new classes were introduced. We can see that the incremental learning method from (Käding et al., 2016b) causes only minimal losses on known data. These losses in performance are also referred to as "catastrophic forgetting" in literature (Kirkpatrick et al., 2016). The method from (Käding et al., 2016b) does not require additional model parameters or adjusted loss terms for added samples like comparable approaches such as (Shmelkov et al., 2017) do, which is important for learning indefinitely.


Table 1: Validation results on part B of the VOC data (i.e., new classes only). Bold face indicates block-wise best results, i.e., best results with and without additional weighting (· + w). Underlined face highlights overall best results.

                 50 samples   100 samples   150 samples   200 samples   250 samples   All samples
                 mAP / AULC   mAP / AULC    mAP / AULC    mAP / AULC    mAP / AULC    mAP / AULC
    Baseline
    Random       8.7 / 4.3    12.4 / 14.9   15.5 / 28.8   18.7 / 45.9   21.9 / 66.2   32.4 / 264.0
    Our Methods
    Max          9.2 / 4.6    12.9 / 15.7   15.7 / 30.0   19.8 / 47.8   22.6 / 69.0   32.0 / 269.3
    Avg          9.0 / 4.5    12.4 / 15.2   15.8 / 29.2   19.3 / 46.8   22.7 / 67.8   33.3 / 266.4
    Sum          8.5 / 4.2    14.3 / 15.6   17.3 / 31.4   19.8 / 49.9   22.7 / 71.2   32.4 / 268.2
    Max + w      9.2 / 4.6    13.0 / 15.7   17.0 / 30.7   20.6 / 49.5   23.2 / 71.4   33.0 / 271.0
    Avg + w      8.7 / 4.3    12.5 / 14.9   16.6 / 29.4   19.9 / 47.7   22.4 / 68.8   32.7 / 267.1
    Sum + w      8.7 / 4.4    13.7 / 15.6   17.5 / 31.2   20.9 / 50.4   24.3 / 72.9   32.7 / 273.6

Figure 2: Value of examples of cow, sheep and bird as determined by the Sum, Avg and Max metrics using the initial model (rows: most valuable, highest-scoring examples per metric with and without weighting; bottom row: least valuable, zero-scoring examples). The top seven selection is not affected by using our weighting method to counter training set class imbalances.

Performance of active learning methods is usually evaluated by observing points on a learning curve (i.e., performance over number of added samples). In Table 1, we show the mAP for the new classes from part B of VOC at several intermediate learning steps as well as after exhausting the unlabeled pool. In addition we show the area under the learning curve (AULC) to further improve comparability among the methods. In our experiments, the number of samples added equals the number of images.

Quantitative Results – Fast Exploration. Gaining accuracy as fast as possible while minimizing the human supervision is one of the main goals of active learning. Moreover, in continuous exploration scenarios, like live camera feeds or other continuous automatic measurements, it is assumed that new data is always available. Hence, the pool of valuable examples will rarely be exhausted. To assess the performance of our methods in this fast exploration context, we evaluate the models after learning small amounts of samples. At this point there is still a large number of diverse samples for the methods to choose from, which makes the following results much more relevant for practical applications than results on the full dataset.

In general, we can see that incremental learning works in the context of the new classes in part B of the data, meaning that we observe an improving performance for all methods. After adding only 50 samples, Max and Avg are performing better than passive selection while the Sum metric is outperformed marginally. When more and more samples are added (i.e., 100 to 250 samples), we observe a superior performance of the Sum aggregation. But also the two other aggregation techniques are able to reach better rates than mere random selection. We attribute the fast increase of performance for the Sum metric to its tendency to select samples with many objects inside, which leads to more annotated bounding boxes. However, the target application is a scenario where the amount of unlabeled data is huge or new data is approaching continuously and hence a complete evaluation by humans is infeasible. Here, we consider the amount of images to be evaluated more critical than the time needed to draw single bounding boxes.
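As a side note on the AULC column of Table 1: the reported values are consistent with a simple trapezoidal integration of the mAP learning curve over the 50-sample evaluation steps, starting from zero; the exact convention is not spelled out in the text, so the sketch below is an assumption.

    # Sketch (assumed convention): AULC as cumulative trapezoids over 50-sample steps.
    def aulc(map_checkpoints, start=0.0):
        """map_checkpoints: mAP after 50, 100, 150, ... added samples."""
        area, prev = 0.0, start
        for m in map_checkpoints:
            area += (prev + m) / 2.0
            prev = m
        return area

    print(aulc([8.7, 12.4, 15.5]))   # ~28.9, close to the 28.8 reported for Random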


Figure 3: Evolution of detections on examples from the validation set (columns: new classes bird, cow, sheep from part B and known classes aeroplane, car from part A; rows: initial prediction, after 50 samples, after 150 samples).

Another interesting fact is the almost equal performance of Max and Avg, which can be explained as follows: the VOC dataset consists mostly of images with only one object in them. Therefore, both metrics lead to a similar score if objects are identified correctly.

We can also see that the proposed balance handling (i.e., · + w) causes slight losses in performance at very early stages up to 100 samples. At subsequent stages, it helps to gain noticeable improvements. Especially the Sum method benefits from the sample weighting scheme. A possible explanation for this behavior would be the following: At early stages, the classifier has not seen many samples of each class and therefore suffers more from misclassification errors. Hence, the weighting scheme is not able to encourage the selection of rare class samples since the classifier decisions are still too unstable. At later stages, this problem becomes less severe and the weighting scheme is much more helpful than in the beginning. This could also explain the performance of Sum in general. Further details on learning pace are given later in a qualitative study on model evolution. Additionally, the Sum aggregation tends to select batches with many detections in them. Hence, it is natural that the improvement is most noticeable with this aggregation technique since it helps to find batches with many rare objects in them.

Quantitative Results – All Available Samples. In our case, active learning only affects the sequence of unlabeled batches if we train until there is no new data available. Therefore, there are only very small differences between each method's results for mAP after training has completed. The small differences indicate that the chosen incremental learning technique (Käding et al., 2016b) is suitable for the faced scenario. In continuous exploration, it is usually assumed that there will be more new unlabeled data available than can be processed. Nevertheless, evaluating the long term performance of our metrics is important to detect possible deterioration over time compared to random selection. In contrast to this, the differences in AULC arise from the improvements of the different methods during the experimental run and therefore should be considered as a distinctive feature implying the performance over the whole experiment. Having this in mind, we can still see that Sum performs best while the weighting generally leads to improvements.

Quantitative Results – Class-wise Analysis. To validate the efficacy of our sample weighting strategy as discussed in Section 3.2, it is important to measure not only overall performance, but to look at metrics for individual classes. Fig. 4 shows the performance over time on the validation set for each individual class. For reference, we also provide the class distribution over the relevant part of the VOC dataset, indicated by the number of object instances in total as well as the number of images with at least one instance in it.

In the first row, we observe an advantage for the weighted method when looking at the performance of cow. Out of the three classes in this way of splitting, cow has the fewest instances in the dataset. The performance of tvmonitor in the second row shows a similar pattern, where it is also the class with the lowest number of object instances in the dataset. Analyzing bird and cat, the top classes by number of instances in each way of splitting, we observe only small differences in performance. Thus, we can show evidence that our balancing scheme is able to improve performance on rare classes while it does not affect performance on frequent classes.


Figure 4: Class-wise validation results on part B of the VOC dataset (i.e., unknown classes). The first row details the first way of splitting (bird, cow, sheep) while the second row shows the second way (boat, cat, tvmonitor), each plotting AP (%) over the number of added samples for Sum and Sum + w. For reference, the distribution of samples (object instances as well as images with at least one instance) over the VOC dataset is provided in the third row.

Intuitively, these observations are in line with our expectations regarding our handling of class imbalances, where examples of rare classes should be preferred during selection. We start to observe the advantages after around 100 training examples, because, for the selection to happen correctly, the prediction of the rare class needs to be correct in the first place.

Qualitative Results – Sample Valuation. We calculate whole image scores over bird, cow and sheep samples using our corresponding initial model trained on the remaining classes for the first way of splitting. Figure 2 shows those images that the three aggregation metrics consider the most valuable. Additionally, common zero scoring images are shown. The least valuable images shown here are representative of all proposed metrics because they do not lead to any detections using the initial model. Note that there are more than seven images with zero score in the training dataset. The images shown in the figure have been selected randomly.

Intuitively, the Sum metric should prefer images with many objects in them over single objects, even if individual detection values are low. Although VOC consists mostly of images with a single object, all seven of the highest scoring images contain at least three objects. The Average and Maximum metrics prefer almost identical images since the average and maximum tend to be nearly equal for few detections. With few exceptions, the most valuable images contain pristine examples of each object. They are well lit and isolated. The objects in the zero scoring images are more noisy and hard to identify even for the human viewer, resulting in fewer reliable detections.

Qualitative Results – Model Evolution. Observing the change in model output as new data is learned can help estimate the number of samples needed to learn new classes and identify possible confusions. Fig. 3 shows the evolution from initial guesses to correct detections after learning 150 samples, corresponding to a fast exploration scenario. For selection, the Sum metric is used.

The class confusions shown in the figure are typical for this scenario. cow and sheep are recognized as the visually similar dog, horse and cat. bird is often classified as aeroplane. After selecting and learning 150 samples, the objects are detected and classified correctly and reliably.

During the learning process, there are also unknown objects. Please note, being able to mark objects as unknown is a direct consequence of using YOLO. Those objects have a detection confidence above the required threshold, but no classification is certain enough. This property of YOLO is important for the discovery of objects of new classes. Nevertheless, if similar information is available from other detection methods, our techniques could easily be applied.


5 CONCLUSIONS

In this paper, we propose several uncertainty-based active learning metrics for object detection. They only require a distribution of classification scores per detection. Depending on the specific task, an object detector that will report objects of unknown classes is also important. Additionally, we propose a sample weighting scheme to balance selections among classes.

We evaluate the proposed metrics on the PASCAL VOC 2012 dataset (Everingham et al., 2010) and offer quantitative and qualitative results and analysis. We show that the proposed metrics are able to guide the annotation process efficiently, which leads to superior performance in comparison to a random selection baseline. In our experimental evaluation, the Sum metric is able to achieve the best results overall, which can be attributed to the fact that it tends to select batches with many individual objects in them. However, the targeted scenario is an application with huge amounts of unlabeled data where we consider the amount of images to be evaluated as more critical than the time needed to draw single bounding boxes. Examples would be camera streams or camera trap data. To expedite annotation, our approach could be combined with a weakly supervised learning approach as presented in (Papadopoulos et al., 2016). We also showed that our weighting scheme leads to even increased accuracies.

All presented metrics could be applied to other deep object detectors, such as the variants of SSD (Liu et al., 2016), the improved R-CNNs, e.g., (Ren et al., 2015), or the newer versions of YOLO (Redmon and Farhadi, 2017). Moreover, our proposed metrics are not restricted to deep object detection and could be applied to arbitrary object detection methods if they fulfill the requirements. It only requires a complete distribution of classification scores per detection. Also the underlying uncertainty measure could be replaced with arbitrary active learning metrics to be aggregated afterwards. Depending on the specific task, an object detector that will report objects of unknown classes is also important.

The proposed aggregation strategies also generalize to selection of images based on segmentation results or any other type of image partition. The resulting scores could also be applied in a novelty detection scenario.

REFERENCES

Abramson, Y. and Freund, Y. (2006). Active learning for visual object detection. Technical report, University of California, San Diego.

Beluch, W. H., Genewein, T., Nürnberger, A., and Köhler, J. M. (2018). The power of ensembles for active learning in image classification. In Computer Vision and Pattern Recognition (CVPR).

Bietti, A. (2012). Active learning for object detection on satellite images. Technical report, California Institute of Technology, Pasadena.

Ertekin, S., Huang, J., Bottou, L., and Giles, L. (2007). Learning on the border: active learning in imbalanced data classification. In Conference on Information and Knowledge Management.

Everingham, M., Van Gool, L., Williams, C. K. I., Winn, J., and Zisserman, A. (2010). The pascal visual object classes (voc) challenge. International Journal of Computer Vision (IJCV).

Feng, C., Liu, M.-Y., Kao, C.-C., and Lee, T.-Y. (2017). Deep active learning for civil infrastructure defect detection and classification. In International Workshop on Computing in Civil Engineering (IWCCE).

Fu, C.-J. and Yang, Y.-P. (2015). A batch-mode active learning svm method based on semi-supervised clustering. Intelligent Data Analysis.

Fu, C.-Y., Liu, W., Ranga, A., Tyagi, A., and Berg, A. C. (2017). Dssd: Deconvolutional single shot detector. arXiv preprint arXiv:1701.06659.

Gal, Y., Islam, R., and Ghahramani, Z. (2017). Deep bayesian active learning with image data. arXiv preprint arXiv:1703.02910.

Girshick, R. (2015). Fast R-CNN. In International Conference on Computer Vision (ICCV).

Girshick, R., Donahue, J., Darrell, T., and Malik, J. (2014). Rich feature hierarchies for accurate object detection and semantic segmentation. In Computer Vision and Pattern Recognition (CVPR).

Hoi, S. C., Jin, R., and Lyu, M. R. (2006). Large-scale text categorization by batch mode active learning. In International Conference on World Wide Web (WWW).

Huang, J., Child, R., Rao, V., Liu, H., Satheesh, S., and Coates, A. (2016). Active learning for speech recognition: the power of gradients. arXiv preprint arXiv:1612.03226.

Jain, P. and Kapoor, A. (2009). Active learning for large multi-class problems. In Computer Vision and Pattern Recognition (CVPR).

Käding, C., Freytag, A., Rodner, E., Perino, A., and Denzler, J. (2016a). Large-scale active learning with approximated expected model output changes. In German Conference on Pattern Recognition (GCPR).

Käding, C., Rodner, E., Freytag, A., and Denzler, J. (2016b). Fine-tuning deep neural networks in continuous learning scenarios. In ACCV Workshop on Interpretation and Visualization of Deep Neural Nets (ACCV-WS).

Käding, C., Rodner, E., Freytag, A., and Denzler, J. (2016c). Watch, ask, learn, and improve: A lifelong learning cycle for visual recognition. In European Symposium on Artificial Neural Networks (ESANN).


Kapoor, A., Grauman, K., Urtasun, R., and Darrell, T. (2010). Gaussian processes for object categorization. International Journal of Computer Vision (IJCV).

Kingma, D. P. and Ba, J. (2014). Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.

Kirkpatrick, J., Pascanu, R., Rabinowitz, N. C., Veness, J., Desjardins, G., Rusu, A. A., Milan, K., Quan, J., Ramalho, T., Grabska-Barwinska, A., Hassabis, D., Clopath, C., Kumaran, D., and Hadsell, R. (2016). Overcoming catastrophic forgetting in neural networks. arXiv preprint arXiv:1612.00796.

Kovashka, A., Russakovsky, O., Fei-Fei, L., and Grauman, K. (2016). Crowdsourcing in computer vision. Foundations and Trends in Computer Graphics and Vision.

Lin, T.-Y., Dollár, P., Girshick, R., He, K., Hariharan, B., and Belongie, S. (2017). Feature pyramid networks for object detection. In CVPR.

Liu, P., Zhang, H., and Eom, K. B. (2017). Active deep learning for classification of hyperspectral images. Selected Topics in Applied Earth Observations and Remote Sensing.

Liu, W., Anguelov, D., Erhan, D., Szegedy, C., Reed, S., Fu, C.-Y., and Berg, A. C. (2016). SSD: Single shot multibox detector. In European Conference on Computer Vision (ECCV).

Papadopoulos, D. P., Uijlings, J. R. R., Keller, F., and Ferrari, V. (2016). We don't need no bounding-boxes: Training object class detectors using only human verification. In Computer Vision and Pattern Recognition (CVPR).

Redmon, J., Divvala, S., Girshick, R., and Farhadi, A. (2016). You only look once: Unified, real-time object detection. In Computer Vision and Pattern Recognition (CVPR).

Redmon, J. and Farhadi, A. (2017). Yolo9000: Better, faster, stronger. In Computer Vision and Pattern Recognition (CVPR).

Redmon, J. and Farhadi, A. (2018). Yolov3: An incremental improvement. arXiv preprint arXiv:1804.02767.

Ren, S., He, K., Girshick, R., and Sun, J. (2015). Faster R-CNN: Towards real-time object detection with region proposal networks. In Neural Information Processing Systems (NIPS).

Roy, S., Namboodiri, V. P., and Biswas, A. (2016). Active learning with version spaces for object detection. arXiv preprint arXiv:1611.07285.

Settles, B. (2009). Active learning literature survey. Technical report, University of Wisconsin–Madison.

Shmelkov, K., Schmid, C., and Alahari, K. (2017). Incremental learning of object detectors without catastrophic forgetting. In International Conference on Computer Vision (ICCV).

Stark, F., Hazırbas, C., Triebel, R., and Cremers, D. (2015). Captcha recognition with active deep learning. In Workshop New Challenges in Neural Computation.

Tong, S. and Koller, D. (2001). Support vector machine active learning with applications to text classification. Journal of Machine Learning Research (JMLR).

Uijlings, J. R., Van De Sande, K. E., Gevers, T., and Smeulders, A. W. (2013). Selective search for object recognition. International Journal of Computer Vision (IJCV), 104(2):154–171.

Vijayanarasimhan, S. and Grauman, K. (2014). Large-scale live active learning: Training object detectors with crawled data and crowds. International Journal of Computer Vision (IJCV).

Wang, D. and Shang, Y. (2014). A new active labeling method for deep learning. In International Joint Conference on Neural Networks (IJCNN).

Wang, K., Zhang, D., Li, Y., Zhang, R., and Lin, L. (2016). Cost-effective active learning for deep image classification. Circuits and Systems for Video Technology.

Yao, A., Gall, J., Leistner, C., and Van Gool, L. (2012). Interactive object detection. In Computer Vision and Pattern Recognition (CVPR).

