CornerNet: Detecting Objects as Paired Keypoints

Hei Law · Jia Deng

Abstract We propose CornerNet, a new approach to object detection where we detect an object bounding box as a pair of keypoints, the top-left corner and the bottom-right corner, using a single convolutional neural network. By detecting objects as paired keypoints, we eliminate the need for designing a set of anchor boxes commonly used in prior single-stage detectors. In addition to our novel formulation, we introduce corner pooling, a new type of pooling layer that helps the network better localize corners. Experiments show that CornerNet achieves a 42.2% AP on MS COCO, outperforming all existing one-stage detectors.

Keywords Object Detection

H. Law
Princeton University, Princeton, NJ, USA
E-mail: [email protected]

J. Deng
Princeton University, Princeton, NJ, USA

1 Introduction

Object detectors based on convolutional neural networks (ConvNets) (Krizhevsky et al., 2012; Simonyan and Zisserman, 2014; He et al., 2016) have achieved state-of-the-art results on various challenging benchmarks (Lin et al., 2014; Deng et al., 2009; Everingham et al., 2015). A common component of state-of-the-art approaches is anchor boxes (Ren et al., 2015; Liu et al., 2016), which are boxes of various sizes and aspect ratios that serve as detection candidates. Anchor boxes are extensively used in one-stage detectors (Liu et al., 2016; Fu et al., 2017; Redmon and Farhadi, 2016; Lin et al., 2017), which can achieve results highly competitive with two-stage detectors (Ren et al., 2015; Girshick et al., 2014; Girshick, 2015; He et al., 2017) while being more efficient. One-stage detectors place anchor boxes densely over an image and generate final box predictions by scoring anchor boxes and refining their coordinates through regression.

But the use of anchor boxes has two drawbacks. First, we typically need a very large set of anchor boxes, e.g. more than 40k in DSSD (Fu et al., 2017) and more than 100k in RetinaNet (Lin et al., 2017). This is because the detector is trained to classify whether each anchor box sufficiently overlaps with a ground truth box, and a large number of anchor boxes is needed to ensure sufficient overlap with most ground truth boxes. As a result, only a tiny fraction of anchor boxes will overlap with ground truth; this creates a huge imbalance between positive and negative anchor boxes and slows down training (Lin et al., 2017).

Second, the use of anchor boxes introduces many hyperparameters and design choices: how many boxes, what sizes, and what aspect ratios. Such choices have largely been made via ad-hoc heuristics, and can become even more complicated when combined with multiscale architectures where a single network makes separate predictions at multiple resolutions, with each scale using different features and its own set of anchor boxes (Liu et al., 2016; Fu et al., 2017; Lin et al., 2017).

In this paper we introduce CornerNet, a new one-stage approach to object detection that does away with anchor boxes. We detect an object as a pair of keypoints: the top-left corner and bottom-right corner of the bounding box. We use a single convolutional network to predict a heatmap for the top-left corners of all instances of the same object category, a heatmap for all bottom-right corners, and an embedding vector for each detected corner. The embeddings serve to group a pair of corners that belong to the same object; the network is trained to predict similar embeddings for them.
Fig. 1 We detect an object as a pair of bounding box corners grouped together. A convolutional network outputs a heatmap
for all top-left corners, a heatmap for all bottom-right corners, and an embedding vector for each detected corner. The network
is trained to predict similar embeddings for corners that belong to the same object.
Our approach greatly simplifies the output of the network and eliminates the need for designing anchor boxes. Our approach is inspired by the associative embedding method proposed by Newell et al. (2017), who detect and group keypoints in the context of multi-person human-pose estimation. Fig. 1 illustrates the overall pipeline of our approach.

Another novel component of CornerNet is corner pooling, a new type of pooling layer that helps a convolutional network better localize corners of bounding boxes. A corner of a bounding box is often outside the object; consider the case of a circle as well as the examples in Fig. 2. In such cases a corner cannot be localized based on local evidence. Instead, to determine whether there is a top-left corner at a pixel location, we need to look horizontally towards the right for the topmost boundary of the object, and look vertically towards the bottom for the leftmost boundary. This motivates our corner pooling layer: it takes in two feature maps; at each pixel location it max-pools all feature vectors to the right from the first feature map, max-pools all feature vectors directly below from the second feature map, and then adds the two pooled results together. An example is shown in Fig. 3.

We hypothesize two reasons why detecting corners would work better than bounding box centers or proposals. First, the center of a box can be harder to localize because it depends on all 4 sides of the object, whereas locating a corner depends on 2 sides and is thus easier, and even more so with corner pooling, which encodes some explicit prior knowledge about the definition of corners. Second, corners provide a more efficient way of densely discretizing the space of boxes: we just need O(wh) corners to represent O(w²h²) possible anchor boxes.

We demonstrate the effectiveness of CornerNet on MS COCO (Lin et al., 2014). CornerNet achieves a 42.2% AP, outperforming all existing one-stage detectors. In addition, through ablation studies we show that corner pooling is critical to the superior performance of CornerNet. Code is available at https://github.com/princeton-vl/CornerNet.

2 Related Works

2.1 Two-stage object detectors

The two-stage approach was first introduced and popularized by R-CNN (Girshick et al., 2014). Two-stage detectors generate a sparse set of regions of interest (RoIs) and classify each of them by a network. R-CNN generates RoIs using a low-level vision algorithm (Uijlings et al., 2013; Zitnick and Dollár, 2014). Each region is then extracted from the image and processed by a ConvNet independently, which creates lots of redundant computations. Later, SPP (He et al., 2014) and Fast-RCNN (Girshick, 2015) improve R-CNN by designing a special pooling layer that instead pools each region from feature maps. However, both still rely on separate proposal algorithms and cannot be trained end-to-end. Faster-RCNN (Ren et al., 2015) does away with low-level proposal algorithms by introducing a region proposal network (RPN), which generates proposals from a set of pre-determined candidate boxes, usually known as anchor boxes.
Fig. 2 Often there is no local evidence to determine the location of a bounding box corner. We address this issue by proposing
a new type of pooling layer.
This not only makes the detectors more efficient but also allows the detectors to be trained end-to-end. R-FCN (Dai et al., 2016) further improves the efficiency of Faster-RCNN by replacing the fully connected sub-detection network with a fully convolutional sub-detection network. Other works focus on incorporating sub-category information (Xiang et al., 2016), generating object proposals at multiple scales with more contextual information (Bell et al., 2016; Cai et al., 2016; Shrivastava et al., 2016; Lin et al., 2016), selecting better features (Zhai et al., 2017), improving speed (Li et al., 2017), cascade procedures (Cai and Vasconcelos, 2017) and better training procedures (Singh and Davis, 2017).

2.2 One-stage object detectors

On the other hand, YOLO (Redmon et al., 2016) and SSD (Liu et al., 2016) have popularized the one-stage approach, which removes the RoI pooling step and detects objects in a single network. One-stage detectors are usually more computationally efficient than two-stage detectors while maintaining competitive performance on different challenging benchmarks.

SSD places anchor boxes densely over feature maps from multiple scales, and directly classifies and refines each anchor box. YOLO predicts bounding box coordinates directly from an image, and is later improved in YOLO9000 (Redmon and Farhadi, 2016) by switching to anchor boxes. DSSD (Fu et al., 2017) and RON (Kong et al., 2017) adopt networks similar to the hourglass network (Newell et al., 2016), enabling them to combine low-level and high-level features via skip connections to predict bounding boxes more accurately. However, these one-stage detectors were still outperformed by the two-stage detectors until the introduction of RetinaNet (Lin et al., 2017). In (Lin et al., 2017), the authors suggest that the dense anchor boxes create a huge imbalance between positive and negative anchor boxes during training. This imbalance causes the training to be inefficient and hence the performance to be suboptimal. They propose a new loss, Focal Loss, to dynamically adjust the weights of each anchor box and show that their one-stage detector can outperform the two-stage detectors. RefineDet (Zhang et al., 2017) proposes to filter the anchor boxes.
DeNet (Tychsen-Smith and Petersson, 2017a) is a two-stage detector which generates RoIs without using anchor boxes. It first determines how likely each location belongs to either the top-left, top-right, bottom-left or bottom-right corner of a bounding box. It then generates RoIs by enumerating all possible corner combinations, and follows the standard two-stage approach to classify each RoI. Our approach is very different from DeNet. First, DeNet does not identify whether two corners are from the same object and relies on a sub-detection network to reject poor RoIs. In contrast, our approach is a one-stage approach which detects and groups the corners using a single ConvNet. Second, DeNet selects features at manually determined locations relative to a region for classification, while our approach does not require any feature selection step. Third, we introduce corner pooling, a novel type of layer to enhance corner detection.

Point Linking Network (PLN) (Wang et al., 2017) is a one-stage detector without anchor boxes. It first predicts the locations of the four corners and the center of a bounding box. Then, at each corner location, it predicts how likely each pixel location in the image is the center. Similarly, at the center location, it predicts how likely each pixel location belongs to either the top-left, top-right, bottom-left or bottom-right corner. It combines the predictions from each corner and center pair to generate a bounding box. Finally, it merges the four bounding boxes to give a single bounding box. CornerNet is very different from PLN. First, CornerNet groups the corners by predicting embedding vectors, while PLN groups the corner and center by predicting pixel locations. Second, CornerNet uses corner pooling to better localize the corners.

Our approach is inspired by Newell et al. (2017) on Associative Embedding in the context of multi-person pose estimation. Newell et al. propose an approach that detects and groups human joints in a single network. In their approach each detected human joint has an embedding vector. The joints are grouped based on the distances between their embeddings. To the best of our knowledge, we are the first to formulate the task of object detection as a task of detecting and grouping corners with embeddings. Another novelty of ours is the corner pooling layers that help better localize the corners. We also significantly modify the hourglass architecture and add our novel variant of focal loss (Lin et al., 2017) to help better train the network.

3 CornerNet

3.1 Overview

In CornerNet, we detect an object as a pair of keypoints: the top-left corner and bottom-right corner of the bounding box. A convolutional network predicts two sets of heatmaps to represent the locations of corners of different object categories, one set for the top-left corners and the other for the bottom-right corners. The network also predicts an embedding vector for each detected corner (Newell et al., 2017) such that the distance between the embeddings of two corners from the same object is small. To produce tighter bounding boxes, the network also predicts offsets to slightly adjust the locations of the corners. With the predicted heatmaps, embeddings and offsets, we apply a simple post-processing algorithm to obtain the final bounding boxes.

Fig. 4 provides an overview of CornerNet. We use the hourglass network (Newell et al., 2016) as the backbone network of CornerNet. The hourglass network is followed by two prediction modules. One module is for the top-left corners, while the other one is for the bottom-right corners. Each module has its own corner pooling module to pool features from the hourglass network before predicting the heatmaps, embeddings and offsets. Unlike many other object detectors, we do not use features from different scales to detect objects of different sizes. We only apply both modules to the output of the hourglass network.

3.2 Detecting Corners

We predict two sets of heatmaps, one for top-left corners and one for bottom-right corners. Each set of heatmaps has C channels, where C is the number of categories, and is of size H × W. There is no background channel. Each channel is a binary mask indicating the locations of the corners for a class.

For each corner, there is one ground-truth positive location, and all other locations are negative. During training, instead of equally penalizing negative locations, we reduce the penalty given to negative locations within a radius of the positive location. This is because a pair of false corner detections, if they are close to their respective ground truth locations, can still produce a box that sufficiently overlaps the ground-truth box (Fig. 5). We determine the radius based on the size of an object, ensuring that a pair of points within the radius would generate a bounding box with at least t IoU with the ground-truth annotation (we set t to 0.3 in all experiments). Given the radius, the amount of penalty reduction is given by an unnormalized 2D Gaussian, $e^{-\frac{x^2+y^2}{2\sigma^2}}$, whose center is at the positive location and whose $\sigma$ is 1/3 of the radius.
Fig. 4 Overview of CornerNet. The backbone network is followed by two prediction modules, one for the top-left corners and
the other for the bottom-right corners. Using the predictions from both modules, we locate and group the corners.
Fig. 5 "Ground-truth" heatmaps for training. Boxes (green dotted rectangles) whose corners are within the radii of the positive locations (orange circles) still have large overlaps with the ground-truth annotations (red solid rectangles).

Many networks downsample the input, so a location $(x, y)$ in the image is mapped to the location $(\lfloor x/n \rfloor, \lfloor y/n \rfloor)$ in the heatmaps, where $n$ is the downsampling factor. To produce tighter boxes, we predict location offsets to slightly adjust the corner locations before remapping them to the input resolution:

$$\boldsymbol{o}_k = \left( \frac{x_k}{n} - \left\lfloor \frac{x_k}{n} \right\rfloor,\ \frac{y_k}{n} - \left\lfloor \frac{y_k}{n} \right\rfloor \right) \qquad (2)$$

where $\boldsymbol{o}_k$ is the offset, and $x_k$ and $y_k$ are the x and y coordinates of corner $k$. In particular, we predict one set of offsets shared by the top-left corners of all categories, and another set shared by the bottom-right corners. For training, we apply the smooth L1 loss (Girshick, 2015) at ground-truth corner locations:

$$L_{\text{off}} = \frac{1}{N} \sum_{k=1}^{N} \text{SmoothL1Loss}\left(\boldsymbol{o}_k, \hat{\boldsymbol{o}}_k\right) \qquad (3)$$
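To make the corner targets concrete, the following is a minimal NumPy sketch of how the ground-truth heatmaps with Gaussian penalty reduction and the offset targets of Eq. (2) could be built for a single category. The helper names and the single shared `radius` are our simplifications (the paper derives the radius per object from its size so that boxes within the radius keep at least t = 0.3 IoU); this is not the authors' implementation.

```python
import numpy as np

def gaussian_penalty(height, width, cx, cy, sigma):
    """Unnormalized 2D Gaussian exp(-(x^2 + y^2) / (2 sigma^2)) centered at (cx, cy)."""
    ys, xs = np.ogrid[:height, :width]
    return np.exp(-((xs - cx) ** 2 + (ys - cy) ** 2) / (2.0 * sigma ** 2))

def corner_targets(boxes, out_h, out_w, n, radius):
    """boxes: list of (x1, y1, x2, y2) in image coordinates; n: downsampling factor.
    radius: penalty-reduction radius (shared here for brevity; the paper computes
    it per object)."""
    tl_heat = np.zeros((out_h, out_w), dtype=np.float32)
    br_heat = np.zeros((out_h, out_w), dtype=np.float32)
    tl_off, br_off = [], []
    for x1, y1, x2, y2 in boxes:
        for heat, offs, (x, y) in ((tl_heat, tl_off, (x1, y1)),
                                   (br_heat, br_off, (x2, y2))):
            fx, fy = x / n, y / n            # exact corner in heatmap coordinates
            ix, iy = int(fx), int(fy)        # floored location, as in Eq. (2)
            offs.append((fx - ix, fy - iy))  # offset target o_k
            # Reduce the penalty around the positive with sigma = radius / 3;
            # the positive location itself receives the peak value 1.
            penalty = gaussian_penalty(out_h, out_w, ix, iy, radius / 3.0)
            np.maximum(heat, penalty, out=heat)
    return tl_heat, br_heat, np.array(tl_off), np.array(br_off)
```

Where the Gaussians of two objects overlap we keep the elementwise maximum, and the predicted offsets are trained against these targets with the smooth L1 loss of Eq. (3).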
3.3 Grouping Corners

The network predicts an embedding vector for each detected corner such that, if a top-left corner and a bottom-right corner belong to the same bounding box, the distance between their embeddings should be small. We can then group the corners based on the distances between the embeddings of the top-left and bottom-right corners. The actual values of the embeddings are unimportant. Only the distances between the embeddings are used to group the corners.

We follow Newell et al. (2017) and use embeddings of 1 dimension. Let $e_{t_k}$ be the embedding for the top-left corner of object $k$ and $e_{b_k}$ for the bottom-right corner. As in Newell and Deng (2017), we use the "pull" loss to train the network to group the corners and the "push" loss to separate the corners:

$$L_{\text{pull}} = \frac{1}{N} \sum_{k=1}^{N} \left[ \left(e_{t_k} - e_k\right)^2 + \left(e_{b_k} - e_k\right)^2 \right] \qquad (4)$$

$$L_{\text{push}} = \frac{1}{N(N-1)} \sum_{k=1}^{N} \sum_{\substack{j=1 \\ j \neq k}}^{N} \max\left(0,\ \Delta - \left|e_k - e_j\right|\right) \qquad (5)$$

where $e_k$ is the average of $e_{t_k}$ and $e_{b_k}$ and we set $\Delta$ to be 1 in all our experiments. Similar to the offset loss, we only apply the losses at the ground-truth corner locations.

3.4 Corner Pooling

For a top-left corner at location $(i, j)$, the top-left corner pooling layer max-pools all feature vectors to the right of $(i, j)$ in the first feature map into a vector $t_{ij}$, and all feature vectors directly below $(i, j)$ in the second feature map into a vector $l_{ij}$, where we apply an elementwise max operation; the two pooled vectors are then added. Both $t_{ij}$ and $l_{ij}$ can be computed efficiently by dynamic programming, as shown in Fig. 6.

We define the bottom-right corner pooling layer in a similar way. It max-pools all feature vectors between $(0, j)$ and $(i, j)$, and all feature vectors between $(i, 0)$ and $(i, j)$ before adding the pooled results. The corner pooling layers are used in the prediction modules to predict heatmaps, embeddings and offsets.

The architecture of the prediction module is shown in Fig. 7. The first part of the module is a modified version of the residual block (He et al., 2016). In this modified residual block, we replace the first 3 × 3 convolution module with a corner pooling module, which first processes the features from the backbone network by two 3 × 3 convolution modules with 128 channels and then applies a corner pooling layer. Following the design of a residual block, we then feed the pooled features into a 3 × 3 Conv-BN layer with 256 channels and add back the projection shortcut. The modified residual block is followed by a 3 × 3 convolution module with 256 channels, and 3 Conv-ReLU-Conv layers to produce the heatmaps, embeddings and offsets.
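As a concrete reading of Eqs. (4) and (5) above, here is a minimal PyTorch sketch of the pull and push losses over the N ≥ 2 annotated objects of one image, assuming 1-dimensional embeddings as described. The function name and tensor layout are ours, not the released code.

```python
import torch

def pull_push_losses(e_tl: torch.Tensor, e_br: torch.Tensor, delta: float = 1.0):
    """e_tl, e_br: shape (N,), embeddings of the N ground-truth corner pairs."""
    n = e_tl.shape[0]
    e_k = (e_tl + e_br) / 2                      # per-object mean embedding
    # Eq. (4): pull the two corners of each object towards their mean.
    pull = ((e_tl - e_k) ** 2 + (e_br - e_k) ** 2).mean()
    # Eq. (5): push mean embeddings of different objects at least delta apart.
    dist = (e_k[:, None] - e_k[None, :]).abs()   # |e_k - e_j| for all pairs
    margin = torch.relu(delta - dist)
    off_diag = ~torch.eye(n, dtype=torch.bool, device=e_k.device)
    push = margin[off_diag].sum() / (n * (n - 1))
    return pull, push
```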
Fig. 6 The top-left corner pooling layer can be implemented very efficiently. We scan from right to left for the horizontal max-pooling and from bottom to top for the vertical max-pooling. We then add the two max-pooled feature maps.
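The scans in Fig. 6 are just running maxima along rows and columns, so top-left corner pooling can be written densely in a few lines. A minimal PyTorch sketch (the released code uses custom pooling layers; this equivalent formulation and the names `f_t`, `f_l` for the two input feature maps are ours):

```python
import torch

def top_left_pool(f_t: torch.Tensor, f_l: torch.Tensor) -> torch.Tensor:
    """f_t, f_l: feature maps of shape (B, C, H, W). At each location, add the
    max over all features to the right in f_t and the max over all features
    below in f_l."""
    # Horizontal max-pooling, scanned right to left: running max over columns >= j.
    t = f_t.flip(-1).cummax(dim=-1).values.flip(-1)
    # Vertical max-pooling, scanned bottom to top: running max over rows >= i.
    l = f_l.flip(-2).cummax(dim=-2).values.flip(-2)
    return t + l
```

Bottom-right corner pooling mirrors this with left-to-right and top-to-bottom scans, i.e. plain `cummax` calls without the flips.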
Fig. 7 The prediction module starts with a modified residual block, in which we replace the first convolution module with
our corner pooling module. The modified residual block is then followed by a convolution module. We have multiple branches
for predicting the heatmaps, embeddings and offsets.
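Given the heatmaps, embeddings and offsets from the two prediction modules, the simple post-processing mentioned in Sec. 3.1 can be sketched as follows for a single category. This brute-force version, the score threshold `thr`, the embedding-distance threshold `max_dist`, and the averaged corner score are illustrative assumptions on our part, not the paper's exact procedure.

```python
import numpy as np

def decode(tl_heat, br_heat, tl_emb, br_emb, tl_off, br_off, n,
           thr=0.5, max_dist=0.5):
    """Heatmaps/embeddings: (H, W); offsets: (2, H, W); n: downsampling factor."""
    boxes = []
    tl_ys, tl_xs = np.where(tl_heat > thr)   # candidate top-left corners
    br_ys, br_xs = np.where(br_heat > thr)   # candidate bottom-right corners
    for ty, tx in zip(tl_ys, tl_xs):
        for by, bx in zip(br_ys, br_xs):
            if bx <= tx or by <= ty:         # bottom-right must lie below and right
                continue
            if abs(tl_emb[ty, tx] - br_emb[by, bx]) > max_dist:
                continue                      # distant embeddings: different objects
            # Remap to image coordinates using the predicted offsets (cf. Eq. (2)).
            x1 = (tx + tl_off[0, ty, tx]) * n
            y1 = (ty + tl_off[1, ty, tx]) * n
            x2 = (bx + br_off[0, by, bx]) * n
            y2 = (by + br_off[1, by, bx]) * n
            score = (tl_heat[ty, tx] + br_heat[by, bx]) / 2
            boxes.append((x1, y1, x2, y2, score))
    return boxes
```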
Table 2 Reducing the penalty given to the negative locations near positive locations helps significantly improve the perfor-
mance of the network
AP AP50 AP75 APs APm APl
w/o reducing penalty 32.9 49.1 34.8 19.0 37.0 40.7
fixed radius 35.6 52.5 37.7 18.7 38.5 46.0
object-dependent radius 38.4 53.8 40.9 18.6 40.5 51.8
Table 3 Corner pooling consistently improves the network performance on detecting corners in different image quadrants,
showing that corner pooling is effective and stable over both small and large areas.
mAP w/o pooling mAP w/ pooling improvement
Top-Left Corners
Top-Left Quad. 66.1 69.2 +3.1
Bottom-Right Quad. 60.8 63.5 +2.7
Bottom-Right Corners
Top-Left Quad. 53.4 56.2 +2.8
Bottom-Right Quad. 65.0 67.6 +2.6
We set both α and β to 0.1 and γ to 1. We find that values of α and β of 1 or larger lead to poor performance. We use a batch size of 49 and train the network on 10 Titan X (PASCAL) GPUs (4 images on the master GPU, 5 images per GPU for the rest of the GPUs). To conserve GPU resources, in our ablation experiments, we train the networks for 250k iterations with a learning rate of 2.5 × 10⁻⁴. When we compare our results with other detectors, we train the networks for an extra 250k iterations and reduce the learning rate to 2.5 × 10⁻⁵ for the last 50k iterations.

During testing, we maintain the original resolution of the image and pad it with zeros before feeding it to CornerNet. Both the original and flipped images are used for testing. We combine the detections from the original and flipped images, and apply soft-nms (Bodla et al., 2017) to suppress redundant detections. Only the top 100 detections are reported. The average inference time is 244ms per image on a Titan X (PASCAL) GPU.
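For illustration, here is a minimal NumPy sketch of the suppression step just described, using the linear-decay variant of soft-NMS (Bodla et al., 2017). The thresholds and the decay rule are generic choices rather than the paper's exact settings, and detections from the flipped image are assumed to have already been mirrored back into the original coordinate frame.

```python
import numpy as np

def iou_one_vs_many(a, b):
    """IoU of box `a` (x1, y1, x2, y2) against an (N, 4) array `b`."""
    x1 = np.maximum(a[0], b[:, 0]); y1 = np.maximum(a[1], b[:, 1])
    x2 = np.minimum(a[2], b[:, 2]); y2 = np.minimum(a[3], b[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[:, 2] - b[:, 0]) * (b[:, 3] - b[:, 1])
    return inter / (area_a + area_b - inter)

def soft_nms(boxes, scores, iou_thr=0.5, top_k=100):
    """boxes: (N, 4) array pooled from the original and flipped images."""
    scores = scores.copy()
    keep = []
    while len(keep) < top_k and scores.size and scores.max() > 0:
        i = int(scores.argmax())
        keep.append((boxes[i], float(scores[i])))
        scores[i] = 0.0                          # remove the selected box from the pool
        overlaps = iou_one_vs_many(boxes[i], boxes)
        decay = overlaps > iou_thr
        scores[decay] *= 1.0 - overlaps[decay]   # decay scores instead of hard removal
    return keep                                  # at most the top 100 detections
```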
Fig. 8 Qualitative examples showing that corner pooling helps better localize the corners.
Table 5 CornerNet performs much better at high IoUs than other state-of-the-art detectors.
AP AP50 AP60 AP70 AP80 AP90
RetinaNet (Lin et al., 2017) 39.8 59.5 55.6 48.2 36.4 15.1
Cascade R-CNN (Cai and Vasconcelos, 2017) 38.9 57.8 53.4 46.9 35.8 15.8
Cascade R-CNN + IoU Net (Jiang et al., 2018) 41.4 59.3 55.3 49.6 39.4 19.5
CornerNet 40.6 56.1 52.0 46.8 38.8 23.4
Table 6 Error analysis. We replace the predicted heatmaps and offsets with the ground-truth values. Using the ground-truth
heatmaps alone improves the AP from 38.4% to 73.1%, suggesting that the main bottleneck of CornerNet is detecting corners.
AP AP50 AP75 APs APm APl
38.4 53.8 40.9 18.6 40.5 51.8
w/ gt heatmaps 73.1 87.7 78.4 60.9 81.2 81.8
w/ gt heatmaps + offsets 86.1 88.9 85.5 84.8 87.2 82.0
Fig. 9 Qualitative examples showing errors in predicting corners and embeddings. The first row shows images where CornerNet mistakenly combines boundary evidence from different objects. The second row shows images where CornerNet predicts similar embeddings for corners from different objects.
Using an object-dependent radius instead of a fixed radius further improves AP by 2.8%, APm by 2.0% and APl by 5.8%. In addition, we see that the penalty reduction especially benefits medium and large objects.

4.4.4 Hourglass Network

CornerNet uses the hourglass network (Newell et al., 2016) as its backbone network. Since the hourglass network is not commonly used in other state-of-the-art detectors, we perform an experiment to study the contribution of the hourglass network in CornerNet. We train a CornerNet in which we replace the hourglass network with FPN (w/ ResNet-101) (Lin et al., 2017), which is more commonly used in state-of-the-art object detectors. We only use the final output of FPN for predictions. Meanwhile, we train an anchor box based detector which uses the hourglass network as its backbone. Each hourglass module predicts anchor boxes at multiple resolutions by using features at multiple scales during the upsampling stage. We follow the anchor box design in RetinaNet (Lin et al., 2017) and add intermediate supervisions during training. In both experiments, we initialize the networks from scratch and follow the same training procedure as we train CornerNet (Sec. 4.1).

Tab. 4 shows that CornerNet with the hourglass network outperforms CornerNet with FPN by 8.2% AP, and the anchor box based detector with the hourglass network by 5.5% AP. The results suggest that the choice of the backbone network is important and the hourglass network is crucial to the performance of CornerNet.
Table 7 CornerNet versus others on MS COCO test-dev. CornerNet outperforms all one-stage detectors and achieves results competitive with two-stage detectors
Method Backbone AP AP50 AP75 APs APm APl AR1 AR10 AR100 ARs ARm ARl
Two-stage detectors
DeNet (Tychsen-Smith and Petersson, 2017a) ResNet-101 33.8 53.4 36.1 12.3 36.1 50.8 29.6 42.6 43.5 19.2 46.9 64.3
CoupleNet (Zhu et al., 2017) ResNet-101 34.4 54.8 37.2 13.4 38.1 50.8 30.0 45.0 46.4 20.7 53.1 68.5
Faster R-CNN by G-RMI (Huang et al., 2017) Inception-ResNet-v2 (Szegedy et al., 2017) 34.7 55.5 36.7 13.5 38.1 52.0 - - - - - -
Faster R-CNN+++ (He et al., 2016) ResNet-101 34.9 55.7 37.4 15.6 38.7 50.9 - - - - - -
Faster R-CNN w/ FPN (Lin et al., 2016) ResNet-101 36.2 59.1 39.0 18.2 39.0 48.2 - - - - - -
Faster R-CNN w/ TDM (Shrivastava et al., 2016) Inception-ResNet-v2 36.8 57.7 39.2 16.2 39.8 52.1 31.6 49.3 51.9 28.1 56.6 71.1
D-FCN (Dai et al., 2017) Aligned-Inception-ResNet 37.5 58.0 - 19.4 40.1 52.5 - - - - - -
Regionlets (Xu et al., 2017) ResNet-101 39.3 59.8 - 21.7 43.7 50.9 - - - - - -
Mask R-CNN (He et al., 2017) ResNeXt-101 39.8 62.3 43.4 22.1 43.2 51.2 - - - - - -
Soft-NMS (Bodla et al., 2017) Aligned-Inception-ResNet 40.9 62.8 - 23.3 43.6 53.3 - - - - - -
LH R-CNN (Li et al., 2017) ResNet-101 41.5 - - 25.2 45.3 53.1 - - - - - -
Fitness-NMS (Tychsen-Smith and Petersson, 2017b) ResNet-101 41.8 60.9 44.9 21.5 45.0 57.5 - - - - - -
Cascade R-CNN (Cai and Vasconcelos, 2017) ResNet-101 42.8 62.1 46.3 23.7 45.5 55.2 - - - - - -
D-RFCN + SNIP (Singh and Davis, 2017) DPN-98 (Chen et al., 2017) 45.7 67.3 51.1 29.3 48.8 57.1 - - - - - -
One-stage detectors
YOLOv2 (Redmon and Farhadi, 2016) DarkNet-19 21.6 44.0 19.2 5.0 22.4 35.5 20.7 31.6 33.3 9.8 36.5 54.4
DSOD300 (Shen et al., 2017a) DS/64-192-48-1 29.3 47.3 30.6 9.4 31.5 47.0 27.3 40.7 43.0 16.7 47.1 65.0
GRP-DSOD320 (Shen et al., 2017b) DS/64-192-48-1 30.0 47.9 31.8 10.9 33.6 46.3 28.0 42.1 44.5 18.8 49.1 65.0
SSD513 (Liu et al., 2016) ResNet-101 31.2 50.4 33.3 10.2 34.5 49.8 28.3 42.1 44.4 17.6 49.2 65.8
DSSD513 (Fu et al., 2017) ResNet-101 33.2 53.3 35.2 13.0 35.4 51.1 28.9 43.5 46.2 21.8 49.1 66.4
RefineDet512 (single scale) (Zhang et al., 2017) ResNet-101 36.4 57.5 39.5 16.6 39.9 51.4 - - - - - -
RetinaNet800 (Lin et al., 2017) ResNet-101 39.1 59.1 42.3 21.8 42.7 50.2 - - - - - -
RefineDet512 (multi scale) (Zhang et al., 2017) ResNet-101 41.8 62.9 45.7 25.6 45.1 54.1 - - - - - -
CornerNet511 (single scale) Hourglass-104 40.6 56.4 43.2 19.1 42.8 54.3 35.3 54.7 59.4 37.4 62.4 77.2
CornerNet511 (multi scale) Hourglass-104 42.2 57.8 45.2 20.7 44.8 56.6 36.6 55.9 60.3 39.5 63.2 77.3
4.4.5 Quality of the Bounding Boxes

A good detector should predict high quality bounding boxes that cover objects tightly. To understand the quality of the bounding boxes predicted by CornerNet, we evaluate the performance of CornerNet at multiple IoU thresholds, and compare the results with other state-of-the-art detectors, including RetinaNet (Lin et al., 2017), Cascade R-CNN (Cai and Vasconcelos, 2017) and IoU-Net (Jiang et al., 2018).

Tab. 5 shows that CornerNet achieves a much higher AP at 0.9 IoU than other detectors, outperforming Cascade R-CNN + IoU-Net by 3.9%, Cascade R-CNN by 7.6% and RetinaNet² by 7.3%. This suggests that CornerNet is able to generate bounding boxes of higher quality compared to other state-of-the-art detectors.

4.4.6 Error Analysis

CornerNet simultaneously outputs heatmaps, offsets, and embeddings, all of which affect detection performance. An object will be missed if either corner is missed; precise offsets are needed to generate tight bounding boxes; incorrect embeddings will result in many false bounding boxes. To understand how each part contributes to the final error, we perform an error analysis by replacing the predicted heatmaps and offsets with the ground-truth values and evaluating performance on the validation set.

Tab. 6 shows that using the ground-truth corner heatmaps alone improves the AP from 38.4% to 73.1%.

² We use the best model publicly available on https://github.com/facebookresearch/Detectron/blob/master/MODEL_ZOO.md
APs, APm and APl also increase by 42.3%, 40.7% and 30.0% respectively. If we replace the predicted offsets with the ground-truth offsets, the AP further increases by 13.0% to 86.1%. This suggests that although there is still ample room for improvement in both detecting and grouping corners, the main bottleneck is detecting corners. Fig. 9 shows some qualitative examples where the corner locations or embeddings are incorrect.

With multi-scale evaluation, CornerNet achieves an AP of 42.2%, the state of the art among existing one-stage methods and competitive with two-stage methods.

5 Conclusion

We have presented CornerNet, a new approach to object detection that detects bounding boxes as pairs of corners. We evaluate CornerNet on MS COCO and demonstrate competitive results.
Acknowledgements This work is partially supported by a grant from Toyota Research Institute and a DARPA grant FA8750-18-2-0019. This article solely reflects the opinions and conclusions of its authors.

References

Bell, S., Lawrence Zitnick, C., Bala, K., and Girshick, R. (2016). Inside-outside net: Detecting objects in context with skip pooling and recurrent neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2874–2883.
Bodla, N., Singh, B., Chellappa, R., and Davis, L. S. (2017). Soft-NMS: Improving object detection with one line of code. In 2017 IEEE International Conference on Computer Vision (ICCV), pages 5562–5570. IEEE.
Cai, Z., Fan, Q., Feris, R. S., and Vasconcelos, N. (2016). A unified multi-scale deep convolutional neural network for fast object detection. In European Conference on Computer Vision, pages 354–370. Springer.
Cai, Z. and Vasconcelos, N. (2017). Cascade R-CNN: Delving into high quality object detection. arXiv preprint arXiv:1712.00726.
Chen, Y., Li, J., Xiao, H., Jin, X., Yan, S., and Feng, J. (2017). Dual path networks. In Advances in Neural Information Processing Systems, pages 4470–4478.
Dai, J., Li, Y., He, K., and Sun, J. (2016). R-FCN: Object detection via region-based fully convolutional networks. arXiv preprint arXiv:1605.06409.
Dai, J., Qi, H., Xiong, Y., Li, Y., Zhang, G., Hu, H., and Wei, Y. (2017). Deformable convolutional networks. arXiv preprint arXiv:1703.06211.
Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., and Fei-Fei, L. (2009). ImageNet: A large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pages 248–255. IEEE.
Everingham, M., Eslami, S. A., Van Gool, L., Williams, C. K., Winn, J., and Zisserman, A. (2015). The Pascal visual object classes challenge: A retrospective. International Journal of Computer Vision, 111(1):98–136.
Fu, C.-Y., Liu, W., Ranga, A., Tyagi, A., and Berg, A. C. (2017). DSSD: Deconvolutional single shot detector. arXiv preprint arXiv:1701.06659.
Girshick, R. (2015). Fast R-CNN. arXiv preprint arXiv:1504.08083.
Girshick, R., Donahue, J., Darrell, T., and Malik, J. (2014). Rich feature hierarchies for accurate object detection and semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 580–587.
He, K., Gkioxari, G., Dollár, P., and Girshick, R. (2017). Mask R-CNN. arXiv preprint arXiv:1703.06870.
He, K., Zhang, X., Ren, S., and Sun, J. (2014). Spatial pyramid pooling in deep convolutional networks for visual recognition. In European Conference on Computer Vision, pages 346–361. Springer.
He, K., Zhang, X., Ren, S., and Sun, J. (2016). Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778.
Huang, J., Rathod, V., Sun, C., Zhu, M., Korattikara, A., Fathi, A., Fischer, I., Wojna, Z., Song, Y., Guadarrama, S., et al. (2017). Speed/accuracy trade-offs for modern convolutional object detectors. In IEEE CVPR.
Ioffe, S. and Szegedy, C. (2015). Batch normalization: Accelerating deep network training by reducing internal covariate shift. In International Conference on Machine Learning, pages 448–456.
Jiang, B., Luo, R., Mao, J., Xiao, T., and Jiang, Y. (2018). Acquisition of localization confidence for accurate object detection. In Computer Vision – ECCV 2018, pages 816–832. Springer.
Kingma, D. P. and Ba, J. (2014). Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.
Kong, T., Sun, F., Yao, A., Liu, H., Lu, M., and Chen, Y. (2017). RON: Reverse connection with objectness prior networks for object detection. arXiv preprint arXiv:1707.01691.
Krizhevsky, A., Sutskever, I., and Hinton, G. E. (2012). ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pages 1097–1105.
Li, Z., Peng, C., Yu, G., Zhang, X., Deng, Y., and Sun, J. (2017). Light-head R-CNN: In defense of two-stage object detector. arXiv preprint arXiv:1711.07264.
Lin, T.-Y., Dollár, P., Girshick, R., He, K., Hariharan, B., and Belongie, S. (2016). Feature pyramid networks for object detection. arXiv preprint arXiv:1612.03144.
Lin, T.-Y., Goyal, P., Girshick, R., He, K., and Dollár, P. (2017). Focal loss for dense object detection. arXiv preprint arXiv:1708.02002.
Lin, T.-Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Dollár, P., and Zitnick, C. L. (2014). Microsoft COCO: Common objects in context. In European Conference on Computer Vision, pages 740–755. Springer.
Liu, W., Anguelov, D., Erhan, D., Szegedy, C., Reed, S., Fu, C.-Y., and Berg, A. C. (2016). SSD: Single shot multibox detector. In European Conference on Computer Vision, pages 21–37. Springer.
Newell, A. and Deng, J. (2017). Pixels to graphs by associative embedding. In Advances in Neural Information Processing Systems, pages 2168–2177.
Newell, A., Huang, Z., and Deng, J. (2017). Associative embedding: End-to-end learning for joint detection and grouping. In Advances in Neural Information Processing Systems, pages 2274–2284.
Newell, A., Yang, K., and Deng, J. (2016). Stacked hourglass networks for human pose estimation. In European Conference on Computer Vision, pages 483–499. Springer.
Paszke, A., Gross, S., Chintala, S., Chanan, G., Yang, E., DeVito, Z., Lin, Z., Desmaison, A., Antiga, L., and Lerer, A. (2017). Automatic differentiation in PyTorch.
Redmon, J., Divvala, S., Girshick, R., and Farhadi, A. (2016). You only look once: Unified, real-time object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 779–788.
Redmon, J. and Farhadi, A. (2016). YOLO9000: Better, faster, stronger. arXiv preprint arXiv:1612.08242.
Ren, S., He, K., Girshick, R., and Sun, J. (2015). Faster R-CNN: Towards real-time object detection with region proposal networks. In Advances in Neural Information Processing Systems, pages 91–99.
Shen, Z., Liu, Z., Li, J., Jiang, Y.-G., Chen, Y., and Xue, X. (2017a). DSOD: Learning deeply supervised object detectors from scratch. In The IEEE International Conference on Computer Vision (ICCV), volume 3, page 7.
Shen, Z., Shi, H., Feris, R., Cao, L., Yan, S., Liu, D., Wang, X., Xue, X., and Huang, T. S. (2017b). Learning object detectors from scratch with gated recurrent feature pyramids. arXiv preprint arXiv:1712.00886.
Shrivastava, A., Sukthankar, R., Malik, J., and Gupta, A. (2016). Beyond skip connections: Top-down modulation for object detection. arXiv preprint arXiv:1612.06851.
Simonyan, K. and Zisserman, A. (2014). Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556.
Singh, B. and Davis, L. S. (2017). An analysis of scale invariance in object detection – SNIP. arXiv preprint arXiv:1711.08189.
Szegedy, C., Ioffe, S., Vanhoucke, V., and Alemi, A. A. (2017). Inception-v4, Inception-ResNet and the impact of residual connections on learning. In AAAI, volume 4, page 12.
Tychsen-Smith, L. and Petersson, L. (2017a). DeNet: Scalable real-time object detection with directed sparse sampling. arXiv preprint arXiv:1703.10295.
Tychsen-Smith, L. and Petersson, L. (2017b). Improving object localization with fitness NMS and bounded IoU loss. arXiv preprint arXiv:1711.00164.
Uijlings, J. R., van de Sande, K. E., Gevers, T., and Smeulders, A. W. (2013). Selective search for object recognition. International Journal of Computer Vision, 104(2):154–171.
Wang, X., Chen, K., Huang, Z., Yao, C., and Liu, W. (2017). Point linking network for object detection. arXiv preprint arXiv:1706.03646.
Xiang, Y., Choi, W., Lin, Y., and Savarese, S. (2016). Subcategory-aware convolutional neural networks for object proposals and detection. arXiv preprint arXiv:1604.04693.
Xu, H., Lv, X., Wang, X., Ren, Z., and Chellappa, R. (2017). Deep regionlets for object detection. arXiv preprint arXiv:1712.02408.
Zhai, Y., Fu, J., Lu, Y., and Li, H. (2017). Feature selective networks for object detection. arXiv preprint arXiv:1711.08879.
Zhang, S., Wen, L., Bian, X., Lei, Z., and Li, S. Z. (2017). Single-shot refinement neural network for object detection. arXiv preprint arXiv:1711.06897.
Zhu, Y., Zhao, C., Wang, J., Zhao, X., Wu, Y., and Lu, H. (2017). CoupleNet: Coupling global structure with local parts for object detection. In Proceedings of the International Conference on Computer Vision (ICCV).
Zitnick, C. L. and Dollár, P. (2014). Edge boxes: Locating object proposals from edges. In European Conference on Computer Vision, pages 391–405. Springer.