SuperPoint

Abstract

This paper presents a self-supervised framework for training interest point detectors and descriptors suitable for a large number of multiple-view geometry problems in computer vision. As opposed to patch-based neural networks, our fully-convolutional model operates on full-sized images and jointly computes pixel-level interest point locations and associated descriptors in one forward pass. We introduce Homographic Adaptation, a multi-scale, multi-homography approach for boosting interest point detection repeatability and performing cross-domain adaptation (e.g., synthetic-to-real). Our model, when trained on the MS-COCO generic image dataset using Homographic Adaptation, is able to repeatedly detect a much richer set of interest points than the initial pre-adapted deep model and any other traditional corner detector. The final system gives rise to state-of-the-art homography estimation results on HPatches when compared to LIFT, SIFT and ORB.

Figure 1. SuperPoint for Geometric Correspondences. We present a fully-convolutional neural network that computes SIFT-like 2D interest point locations and descriptors in a single forward pass and runs at 70 FPS on 480 × 640 images with a Titan X GPU.

1. Introduction
Figure 2. Overview of the self-supervised training pipeline: (a) Interest Point Pre-Training, (b) Interest Point Self-Labeling, (c) Joint Training.
of examples from a synthetic dataset we created called Synthetic Shapes (see Figure 2a). The synthetic dataset consists of simple geometric shapes with no ambiguity in the interest point locations. We call the resulting trained detector MagicPoint; it significantly outperforms traditional interest point detectors on the synthetic dataset (see Section 4). MagicPoint performs surprisingly well on real images despite domain adaptation difficulties [7]. However, when compared to classical interest point detectors on a diverse set of image textures and patterns, MagicPoint misses many potential interest point locations. To bridge this gap in performance on real images, we developed a multi-scale, multi-transform technique called Homographic Adaptation.

Homographic Adaptation is designed to enable self-supervised training of interest point detectors. It warps the input image multiple times to help an interest point detector see the scene from many different viewpoints and scales (see Section 5). We use Homographic Adaptation in conjunction with the MagicPoint detector to boost the performance of the detector and generate the pseudo-ground truth interest points (see Figure 2b). The resulting detections are more repeatable and fire on a larger set of stimuli; thus we named the resulting detector SuperPoint.

The most common step after detecting robust and repeatable interest points is to attach a fixed-dimensional descriptor vector to each point for higher-level semantic tasks, e.g., image matching. Thus we lastly combine SuperPoint with a descriptor subnetwork (see Figure 2c). Since the SuperPoint architecture consists of a deep stack of convolutional layers which extract multi-scale features, it is straightforward to then combine the interest point network with an additional subnetwork that computes interest point descriptors (see Section 3). The resulting system is shown in Figure 1.

2. Related Work

Traditional interest point detectors have been thoroughly evaluated [24, 16]. The FAST corner detector [21] was the first system to cast high-speed corner detection as a machine learning problem, and the Scale-Invariant Feature Transform, or SIFT [15], is still probably the most well-known traditional local feature descriptor in computer vision.

Our SuperPoint architecture is inspired by recent advances in applying deep learning to interest point detection and descriptor learning. In the ability to match image sub-structures, we are similar to UCN [3] and, to a lesser extent, DeepDesc [6]; however, neither performs any interest point detection. On the other end, LIFT [32], a recently introduced convolutional replacement for SIFT, stays close to the traditional patch-based detect-then-describe recipe. The LIFT pipeline contains interest point detection, orientation estimation and descriptor computation, but additionally requires supervision from a classical SfM system. These differences are summarized in Table 1.

                    Interest    Descriptors?   Full Image   Single      Real
                    Points?                    Input?       Network?    Time?
SuperPoint (ours)   ✓           ✓              ✓            ✓           ✓
LIFT [32]           ✓           ✓
UCN [3]                         ✓              ✓            ✓
TILDE [29]          ✓                          ✓
DeepDesc [6]                    ✓
SIFT                ✓           ✓
ORB                 ✓           ✓                                       ✓

Table 1. Qualitative Comparison to Relevant Methods. Our SuperPoint method is the only one to compute both interest points and descriptors in a single network in real-time.

On the other extreme of the supervision spectrum, Quad-Networks [23] tackles the interest point detection problem with an unsupervised approach; however, their system is patch-based (its inputs are small image patches) and a relatively shallow two-layer network. The TILDE [29] interest point detection system used a principle similar to Homographic Adaptation; however, their approach does not benefit from the power of large fully-convolutional neural networks.

Our approach can also be compared to other self-supervised and synthetic-to-real domain-adaptation methods. A similar approach to Homographic Adaptation is taken by Honari et al. [10] under the name "equivariant landmark transform." Also, Geometric Matching Networks [20] and Deep Image Homography Estimation [4] use a similar self-supervision strategy to create training data for estimating global transformations. However, these methods lack interest points and point correspondences, which are typically required for doing higher-level computer vision tasks such as SLAM and SfM. Joint pose and depth estimation models also exist [33, 30, 28], but do not use interest points.
3. SuperPoint Architecture

We designed a fully-convolutional neural network architecture called SuperPoint which operates on a full-sized image and produces interest point detections accompanied by fixed-length descriptors in a single forward pass (see Figure 3). The model has a single, shared encoder to process and reduce the input image dimensionality. After the encoder, the architecture splits into two decoder "heads", which learn task-specific weights: one for interest point detection and the other for interest point description. Most of the network's parameters are shared between the two tasks, which is a departure from traditional systems which first detect interest points, then compute descriptors, and lack the ability to share computation and representation across the two tasks.

Figure 3. SuperPoint Decoders. Both decoders operate on a shared and spatially reduced representation of the input. To keep the model fast and easy to train, both decoders use non-learned upsampling to bring the representation back to R^(H×W).

3.1. Shared Encoder

Our SuperPoint architecture uses a VGG-style [27] encoder to reduce the dimensionality of the image. The encoder consists of convolutional layers, spatial downsampling via pooling, and non-linear activation functions. Our encoder uses three max-pooling layers, letting us define Hc = H/8 and Wc = W/8 for an image sized H × W. We refer to the pixels in the lower-dimensional output as "cells," where three 2 × 2 non-overlapping max-pooling operations in the encoder result in 8 × 8 pixel cells. The encoder maps the input image I ∈ R^(H×W) to an intermediate tensor B ∈ R^(Hc×Wc×F) with smaller spatial dimension and greater channel depth (i.e., Hc < H, Wc < W and F > 1).

3.2. Interest Point Decoder

For interest point detection, each pixel of the output corresponds to a probability of "point-ness" for that pixel in the input. The standard network design for dense prediction involves an encoder-decoder pair, where the spatial resolution is decreased via pooling or strided convolution, and then upsampled back to full resolution via upconvolution operations, such as done in SegNet [1]. Unfortunately, upsampling layers tend to add a high amount of computation and can introduce unwanted checkerboard artifacts [18], thus we designed the interest point detection head with an explicit decoder (this decoder has no parameters and is known as "sub-pixel convolution" [26], "depth to space" in TensorFlow, or "pixel shuffle" in PyTorch) to reduce the computation of the model.

The interest point detector head computes X ∈ R^(Hc×Wc×65) and outputs a tensor sized R^(H×W). The 65 channels correspond to local, non-overlapping 8 × 8 grid regions of pixels plus an extra "no interest point" dustbin. After a channel-wise softmax, the dustbin dimension is removed and a R^(Hc×Wc×64) to R^(H×W) reshape is performed.
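To make the decoder concrete, the following is a minimal sketch of the post-processing described above (our own illustration, not the authors' released code), assuming a PyTorch tensor holding the 65-channel head output; the channel-wise softmax, dustbin removal, and depth-to-space reshape use only standard torch operations.

```python
import torch
import torch.nn.functional as F

def decode_point_heatmap(x: torch.Tensor) -> torch.Tensor:
    """Sketch of the interest point decoder: x has shape (N, 65, Hc, Wc).

    Returns a dense point-ness heatmap of shape (N, H, W) with H = 8*Hc, W = 8*Wc.
    """
    probs = F.softmax(x, dim=1)          # channel-wise softmax over the 65 bins
    probs = probs[:, :-1, :, :]          # drop the "no interest point" dustbin channel
    heatmap = F.pixel_shuffle(probs, 8)  # depth-to-space: (N, 64, Hc, Wc) -> (N, 1, 8*Hc, 8*Wc)
    return heatmap.squeeze(1)
```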
3.3. Descriptor Decoder

The descriptor head computes D ∈ R^(Hc×Wc×D) and outputs a tensor sized R^(H×W×D). To output a dense map of L2-normalized fixed-length descriptors, we use a model similar to UCN [3] to first output a semi-dense grid of descriptors (e.g., one every 8 pixels). Learning descriptors semi-densely rather than densely reduces training memory and keeps the run-time tractable. The decoder then performs bicubic interpolation of the descriptor and then L2-normalizes the activations to be unit length. This fixed, non-learned descriptor decoder is shown in Figure 3.
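As an illustration of this fixed decoder, the sketch below (an assumption-laden re-implementation, not the paper's code) upsamples a coarse descriptor grid of shape (N, D, Hc, Wc) with bicubic interpolation and L2-normalizes each pixel's descriptor.

```python
import torch
import torch.nn.functional as F

def decode_descriptors(desc: torch.Tensor, cell: int = 8) -> torch.Tensor:
    """Sketch of the descriptor decoder: desc has shape (N, D, Hc, Wc).

    Returns dense, unit-length descriptors of shape (N, D, H, W) with H = cell*Hc, W = cell*Wc.
    """
    n, d, hc, wc = desc.shape
    dense = F.interpolate(desc, size=(hc * cell, wc * cell),
                          mode="bicubic", align_corners=False)  # non-learned upsampling
    dense = F.normalize(dense, p=2, dim=1)                      # L2-normalize along the descriptor dim
    return dense
```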
3.4. Loss Functions

The final loss is the sum of two intermediate losses: one for the interest point detector, Lp, and one for the descriptor, Ld. We use pairs of synthetically warped images which have both (a) pseudo-ground truth interest point locations and (b) the ground truth correspondence from a randomly generated homography H which relates the two images. This allows us to optimize the two losses simultaneously, given a pair of images, as shown in Figure 2c. We use λ to balance the final loss:

L(X, X', D, D'; Y, Y', S) = Lp(X, Y) + Lp(X', Y') + λ·Ld(D, D', S).   (1)

The interest point detector loss function Lp is a fully-convolutional cross-entropy loss over the cells x_hw ∈ X. We call the set of corresponding ground-truth interest point labels Y, with individual entries y_hw (if two ground truth corner positions land in the same cell, we randomly select one ground truth corner location). The loss is:

Lp(X, Y) = (1 / (Hc·Wc)) Σ_{h=1,w=1}^{Hc,Wc} lp(x_hw; y_hw).   (2)
Figure 4. Synthetic Pre-Training. We use our Synthetic Shapes dataset consisting of rendered triangles, quadrilaterals, lines, cubes, checkerboards, and stars, each with ground truth corner locations. The dataset is used to train the MagicPoint convolutional neural network, which is more robust to noise when compared to classical detectors.
The descriptor loss is applied to all pairs of descriptor cells, d_hw ∈ D from the first image and d'_{h'w'} ∈ D' from the second image. The homography-induced correspondence between the (h, w) cell and the (h', w') cell can be written as follows:

s_{hw,h'w'} = 1 if ||Ĥp_hw − p_{h'w'}|| ≤ 8, and 0 otherwise,   (4)

where p_hw denotes the location of the center pixel in the (h, w) cell, and Ĥp_hw denotes multiplying the cell location p_hw by the homography H and dividing by the last coordinate, as is usually done when transforming between Euclidean and homogeneous coordinates. We denote the entire set of correspondences for a pair of images with S.

We also add a weighting term λd to help balance the fact that there are more negative correspondences than positive ones. We use a hinge loss with positive margin mp and negative margin mn. The descriptor loss is defined as:

Ld(D, D', S) = (1 / (Hc·Wc)^2) Σ_{h=1,w=1}^{Hc,Wc} Σ_{h'=1,w'=1}^{Hc,Wc} ld(d_hw, d'_{h'w'}; s_{hw,h'w'}),   (5)

where

ld(d, d'; s) = λd · s · max(0, mp − d^T d') + (1 − s) · max(0, d^T d' − mn).   (6)
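To ground Equations (4) through (6), here is a small, hedged sketch (our own illustrative code, not the paper's implementation) that builds the correspondence mask S from cell centers and a homography, then evaluates the weighted hinge loss; tensor shapes and the default margins (taken from Section 6) are assumptions.

```python
import torch

def descriptor_loss(desc_a, desc_b, H, cell=8, mp=1.0, mn=0.2, lambda_d=250.0):
    """Sketch of the descriptor hinge loss.

    desc_a, desc_b: (D, Hc, Wc) coarse descriptor grids for the two images.
    H: (3, 3) homography mapping image A pixel coordinates to image B.
    """
    d, hc, wc = desc_a.shape
    # Centers of the 8x8 cells in pixel coordinates (x, y), flattened to (Hc*Wc, 2).
    ys, xs = torch.meshgrid(torch.arange(hc), torch.arange(wc), indexing="ij")
    centers = torch.stack([xs, ys], dim=-1).reshape(-1, 2).float() * cell + cell / 2

    # Warp the cell centers of image A by H (homogeneous divide).
    ones = torch.ones(centers.shape[0], 1)
    warped = torch.cat([centers, ones], dim=1) @ H.T
    warped = warped[:, :2] / warped[:, 2:3]

    # s = 1 iff the warped center of cell (h, w) lies within 8 px of cell (h', w').
    dist = torch.cdist(warped, centers)              # (Hc*Wc, Hc*Wc) pairwise distances
    s = (dist <= cell).float()

    # Dot products between all descriptor pairs.
    da = desc_a.reshape(d, -1).t()                   # (Hc*Wc, D)
    db = desc_b.reshape(d, -1).t()
    dot = da @ db.t()                                # (Hc*Wc, Hc*Wc)

    loss = lambda_d * s * torch.clamp(mp - dot, min=0) + (1 - s) * torch.clamp(dot - mn, min=0)
    return loss.mean()                               # mean == sum / (Hc*Wc)^2, as in Eq. (5)
```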
4. Synthetic Pre-Training

In this section, we describe our method for training a base detector (shown in Figure 2a) called MagicPoint, which is used in conjunction with Homographic Adaptation to generate pseudo-ground truth interest point labels for unlabeled images in a self-supervised fashion.

4.1. Synthetic Shapes

There is no large database of interest point labeled images that exists today. Thus, to bootstrap our deep interest point detector, we first create a large-scale synthetic dataset called Synthetic Shapes that consists of simplified 2D geometry via synthetic data rendering of quadrilaterals, triangles, lines and ellipses. Examples of these shapes are shown in Figure 4. In this dataset, we are able to remove label ambiguity by modeling interest points with simple Y-junctions, L-junctions and T-junctions, as well as centers of tiny ellipses and end points of line segments.

Once the synthetic images are rendered, we apply homographic warps to each image to augment the number of training examples. The data is generated on-the-fly and no example is seen by the network twice. While the types of interest points represented in Synthetic Shapes represent only a subset of all potential interest points found in the real world, we found it to work reasonably well in practice when used to train an interest point detector.

4.2. MagicPoint

We use the detector pathway of the SuperPoint architecture (ignoring the descriptor head) and train it on Synthetic Shapes. We call the resulting model MagicPoint.

Interestingly, when we evaluate MagicPoint against other traditional corner detection approaches such as FAST [21], Harris corners [8] and Shi-Tomasi's "Good Features To Track" [25] on the Synthetic Shapes dataset, we discovered a large performance gap in our favor. We measure the mean Average Precision (mAP) on 1000 held-out images of the Synthetic Shapes dataset, and report the results in Table 2. The classical detectors struggle in the presence of imaging noise; qualitative examples of this are shown in Figure 4. More detailed experiments can be found in Appendix B.

                MagicPoint   FAST    Harris   Shi
mAP no noise    0.979        0.405   0.678    0.686
mAP noise       0.971        0.061   0.213    0.157

Table 2. Synthetic Shapes Detector Performance. The MagicPoint model outperforms classical detectors in detecting corners of simple geometric shapes and is robust to added noise.

The MagicPoint detector performs very well on Synthetic Shapes.
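As a rough illustration of the kind of data Synthetic Shapes provides (our sketch, not the actual generator, with arbitrary parameter choices), the snippet below renders a random quadrilateral on a blank canvas and records its corner locations as ground-truth interest points.

```python
import numpy as np
import cv2

def random_quad_example(height=120, width=160, seed=None):
    """Sketch of a Synthetic-Shapes-style training example: an image and its corner labels."""
    rng = np.random.default_rng(seed)
    img = np.full((height, width), 255, dtype=np.uint8)  # white background

    # Sample four corner points around a center; sorting by angle keeps the polygon simple.
    cx, cy = rng.uniform(0.3, 0.7) * width, rng.uniform(0.3, 0.7) * height
    angles = np.sort(rng.uniform(0, 2 * np.pi, size=4))
    radii = rng.uniform(15, 45, size=4)
    corners = np.stack([cx + radii * np.cos(angles),
                        cy + radii * np.sin(angles)], axis=1)
    corners = np.clip(corners, 2, [width - 3, height - 3]).astype(np.int32)

    cv2.fillPoly(img, [corners], color=0)        # draw the filled dark quadrilateral
    img = cv2.GaussianBlur(img, (3, 3), 0)       # mild blur, one of many possible augmentations

    return img, corners.astype(np.float32)       # the corners serve as ground-truth interest points
```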
Figure 5. Homographic Adaptation. Homographic Adaptation is a form of self-supervision for boosting the geometric consistency of an interest point detector trained with convolutional neural networks. An unlabeled image is warped by sampled random homographies, the base detector is applied to each warped image, the point responses are unwarped, and the heatmaps are aggregated into an interest point superset. The entire procedure is mathematically defined in Equation 10.
To sample homographies that represent plausible camera transformations, we decompose a potential homography into more simple, less expressive transformation classes. We sample within pre-determined ranges for translation, scale, in-plane rotation, and symmetric perspective distortion using a truncated normal distribution. These transformations are composed together with an initial root center crop to help avoid bordering artifacts. This process is shown in Figure 6.
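The following is a minimal sketch of such a sampler (our own simplification; the ranges, the truncated-normal helper, and the omission of the root center crop are assumptions rather than the paper's exact parameters): it composes translation, scale, rotation, and a symmetric perspective term into a single 3x3 homography.

```python
import numpy as np

def truncnorm(rng, std, clip):
    """Truncated zero-mean normal sample (simple clip-based approximation)."""
    return float(np.clip(rng.normal(0.0, std), -clip, clip))

def sample_homography(height=240, width=320, rng=None):
    """Sketch: compose simple transformation classes into one 3x3 homography."""
    rng = rng or np.random.default_rng()

    # Translation, scale and in-plane rotation (ranges are illustrative assumptions).
    tx, ty = truncnorm(rng, 0.05, 0.1) * width, truncnorm(rng, 0.05, 0.1) * height
    s = 1.0 + truncnorm(rng, 0.1, 0.25)
    theta = truncnorm(rng, 0.15, 0.5)

    cx, cy = width / 2.0, height / 2.0
    center = np.array([[1, 0, -cx], [0, 1, -cy], [0, 0, 1]], dtype=np.float64)
    uncenter = np.array([[1, 0, cx], [0, 1, cy], [0, 0, 1]], dtype=np.float64)

    sim = np.array([[s * np.cos(theta), -s * np.sin(theta), tx],
                    [s * np.sin(theta),  s * np.cos(theta), ty],
                    [0, 0, 1]], dtype=np.float64)

    # Symmetric perspective distortion: small values in the bottom row.
    px, py = truncnorm(rng, 1e-4, 3e-4), truncnorm(rng, 1e-4, 3e-4)
    persp = np.array([[1, 0, 0], [0, 1, 0], [px, py, 1]], dtype=np.float64)

    # Rotate/scale about the image center, then apply perspective and translation.
    return persp @ uncenter @ sim @ center
```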
When applying Homographic Adaptation to an image, we use the average response across a large number of homographic warps of the input image. The number of homographic warps Nh is a hyper-parameter of our approach. We typically enforce the first homography to be equal to identity, so that Nh = 1 in our experiments corresponds to doing no adaptation. We performed an experiment to determine the best value for Nh, varying Nh from small (Nh = 10), to medium (Nh = 100), and large (Nh = 1000). Our experiments suggest that there are diminishing returns when performing more than 100 homographies. On a held-out set of images from MS-COCO, we obtain a repeatability score of .67 without any Homographic Adaptation, a repeatability boost of 21% when performing Nh = 100 transforms, and a repeatability boost of 22% when Nh = 1000; thus the added benefit of using more than 100 homographies is minimal. For a more detailed analysis and discussion of this experiment see Appendix C.
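A hedged sketch of this warp-detect-unwarp-average loop (see Figure 5) is shown below; it assumes a detector callable that maps a grayscale image to a point-ness heatmap and uses OpenCV warps, which is our simplification rather than the paper's implementation.

```python
import numpy as np
import cv2

def homographic_adaptation(image, detector, sample_homography, num_h=100):
    """Sketch: average detector responses over num_h random homographic views.

    image: (H, W) grayscale float array; detector(image) -> (H, W) heatmap in [0, 1].
    """
    h, w = image.shape
    acc = detector(image).astype(np.float64)          # the first "homography" is the identity
    counts = np.ones((h, w), dtype=np.float64)

    for _ in range(num_h - 1):
        H = sample_homography(height=h, width=w)
        warped = cv2.warpPerspective(image, H, (w, h))
        heat = detector(warped).astype(np.float64)

        # Unwarp the response and a validity mask back into the original frame.
        unwarped = cv2.warpPerspective(heat, np.linalg.inv(H), (w, h))
        mask = cv2.warpPerspective(np.ones_like(image), np.linalg.inv(H), (w, h))
        acc += unwarped
        counts += mask

    return acc / np.maximum(counts, 1e-6)             # per-pixel average over valid views
```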
5.3. Iterative Homographic Adaptation

We apply the Homographic Adaptation technique at training time to improve the generalization ability of the base MagicPoint architecture on real images. The process can be repeated iteratively to continually self-supervise and improve the interest point detector. In all of our experiments, we call the resulting model, after applying Homographic Adaptation, SuperPoint, and show the qualitative progression on images from HPatches in Figure 7.

Figure 7. Iterative Homographic Adaptation. Top row: the initial base detector (MagicPoint) struggles to find repeatable detections. Middle and bottom rows: further training with Homographic Adaptation improves detector performance.
6. Experimental Details

In this section we provide some implementation details for training the MagicPoint and SuperPoint models. The encoder has a VGG-like [27] architecture with eight 3x3 convolution layers sized 64-64-64-64-128-128-128-128. Every two layers there is a 2x2 max-pool layer. Each decoder head has a single 3x3 convolutional layer of 256 units followed by a 1x1 convolution layer with 65 units and 256 units for the interest point detector and descriptor respectively. All convolution layers in the network are followed by ReLU non-linear activation and BatchNorm normalization.

To train the fully-convolutional SuperPoint model, we start with a base MagicPoint model trained on Synthetic Shapes. The MagicPoint architecture is the SuperPoint architecture without the descriptor head. The MagicPoint model is trained for 200,000 iterations of synthetic data. Since the synthetic data is simple and fast to render, the data is rendered on-the-fly, thus no single example is seen twice by the network.

We generate pseudo-ground truth labels using the MS-COCO 2014 [13] training dataset split, which has 80,000 images, and the MagicPoint base detector. The images are sized to a resolution of 240 × 320 and converted to grayscale. The labels are generated using Homographic Adaptation with Nh = 100, as motivated by our results from Section 5.2. We repeat the Homographic Adaptation a second time, using the resulting model trained from the first round of Homographic Adaptation.

The joint training of SuperPoint is also done on 240 × 320 grayscale COCO images. For each training example, a homography is randomly sampled. It is sampled from a more restrictive set of homographies than during Homographic Adaptation to better model the target application of pairwise matching (e.g., we avoid sampling extreme in-plane rotations as they are rarely seen in HPatches). The image and corresponding pseudo-ground truth are transformed by the homography to create the needed inputs and labels. The descriptor size used in all experiments is D = 256. We use a weighting term of λd = 250 to keep the descriptor learning balanced. The descriptor hinge loss uses a positive margin mp = 1 and negative margin mn = 0.2. We use a factor of λ = 0.0001 to balance the two losses.

All training is done using PyTorch [19] with mini-batch sizes of 32 and the ADAM solver with default parameters of lr = 0.001 and β = (0.9, 0.999). We also use standard data augmentation techniques such as random Gaussian noise, motion blur, and brightness level changes to improve the network's robustness to lighting and viewpoint changes.
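Under those layer sizes, a hedged sketch of the shared encoder and the two heads might look as follows (our own PyTorch rendering of the description above; padding choices and module grouping are assumptions, not the released architecture):

```python
import torch.nn as nn

def vgg_block(c_in, c_out):
    """Two 3x3 conv layers, each followed by BatchNorm and ReLU (see Section 6)."""
    return nn.Sequential(
        nn.Conv2d(c_in, c_out, 3, padding=1), nn.BatchNorm2d(c_out), nn.ReLU(inplace=True),
        nn.Conv2d(c_out, c_out, 3, padding=1), nn.BatchNorm2d(c_out), nn.ReLU(inplace=True),
    )

class SuperPointSketch(nn.Module):
    """Shared VGG-style encoder (64-64-64-64-128-128-128-128) plus the two decoder heads."""
    def __init__(self, descriptor_dim=256):
        super().__init__()
        self.encoder = nn.Sequential(
            vgg_block(1, 64), nn.MaxPool2d(2),    # H x W     -> H/2 x W/2
            vgg_block(64, 64), nn.MaxPool2d(2),   #           -> H/4 x W/4
            vgg_block(64, 128), nn.MaxPool2d(2),  #           -> H/8 x W/8 (Hc x Wc cells)
            vgg_block(128, 128),
        )
        # Interest point head: 3x3 conv with 256 units, then 1x1 conv to 65 channels.
        self.det_head = nn.Sequential(
            nn.Conv2d(128, 256, 3, padding=1), nn.BatchNorm2d(256), nn.ReLU(inplace=True),
            nn.Conv2d(256, 65, 1),
        )
        # Descriptor head: 3x3 conv with 256 units, then 1x1 conv to D channels.
        self.desc_head = nn.Sequential(
            nn.Conv2d(128, 256, 3, padding=1), nn.BatchNorm2d(256), nn.ReLU(inplace=True),
            nn.Conv2d(256, descriptor_dim, 1),
        )

    def forward(self, x):           # x: (N, 1, H, W) grayscale image
        feats = self.encoder(x)     # (N, 128, H/8, W/8)
        return self.det_head(feats), self.desc_head(feats)
```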
             57 Illumination Scenes     59 Viewpoint Scenes
             NMS=4      NMS=8           NMS=4      NMS=8
SuperPoint   .652       .631            .503       .484
MagicPoint   .575       .507            .322       .260
FAST         .575       .472            .503       .404
Harris       .620       .533            .556       .461
Shi          .606       .511            .552       .453
Random       .101       .103            .100       .104

Table 3. HPatches Detector Repeatability. SuperPoint is the most repeatable under illumination changes, competitive on viewpoint changes, and outperforms MagicPoint in all scenarios.

             Homography Estimation        Detector Metrics     Descriptor Metrics
             ε=1      ε=3      ε=5        Rep.      MLE        NN mAP     M. Score
SuperPoint   .310     .684     .829       .581      1.158      .821       .470
LIFT         .284     .598     .717       .449      1.102      .664       .315
SIFT         .424     .676     .759       .495      0.833      .694       .313
ORB          .150     .395     .538       .641      1.157      .735       .266

Table 4. HPatches Homography Estimation. SuperPoint outperforms LIFT and ORB and performs comparably to SIFT using various thresholds of correctness. We also report related metrics which measure detector and descriptor performance individually.
Figure 8. Qualitative Results on HPatches (columns: SuperPoint, LIFT, SIFT, ORB). The green lines show correct correspondences. SuperPoint tends to produce more dense and correct matches compared to LIFT, SIFT and ORB. While ORB has the highest average repeatability, the detections cluster together and generally do not result in more matches or more accurate homography estimates (see Table 4). Row 4: failure case of SuperPoint and LIFT due to extreme in-plane rotation not seen in the training examples. See Appendix D for additional homography estimation example pairs.
References

[1] V. Badrinarayanan, A. Kendall, and R. Cipolla. SegNet: A deep convolutional encoder-decoder architecture for image segmentation. PAMI, 2017.
[2] V. Balntas, K. Lenc, A. Vedaldi, and K. Mikolajczyk. HPatches: A benchmark and evaluation of handcrafted and learned local descriptors. In CVPR, 2017.
[3] C. B. Choy, J. Gwak, S. Savarese, and M. Chandraker. Universal Correspondence Network. In NIPS, 2016.
[4] D. DeTone, T. Malisiewicz, and A. Rabinovich. Deep image homography estimation. arXiv preprint arXiv:1606.03798, 2016.
[5] D. DeTone, T. Malisiewicz, and A. Rabinovich. Toward geometric deep SLAM. arXiv preprint arXiv:1707.07410, 2017.
[6] E. Simo-Serra, E. Trulls, L. Ferraz, I. Kokkinos, P. Fua, and F. Moreno-Noguer. Discriminative learning of deep convolutional feature point descriptors. In ICCV, 2015.
[7] Y. Ganin and V. Lempitsky. Unsupervised domain adaptation by backpropagation. In ICML, 2015.
[8] C. Harris and M. Stephens. A combined corner and edge detector. In Alvey Vision Conference, Manchester, UK, 1988.
[9] R. Hartley and A. Zisserman. Multiple View Geometry in Computer Vision. 2003.
[10] S. Honari, P. Molchanov, S. Tyree, P. Vincent, C. Pal, and J. Kautz. Improving landmark localization with semi-supervised learning. arXiv preprint arXiv:1709.01591, 2017.
[11] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell. Caffe: Convolutional architecture for fast feature embedding. arXiv preprint arXiv:1408.5093, 2014.
[12] C.-Y. Lee, V. Badrinarayanan, T. Malisiewicz, and A. Rabinovich. RoomNet: End-to-end room layout estimation. In ICCV, 2017.
[13] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and L. Zitnick. Microsoft COCO: Common objects in context. In ECCV, 2014.
[14] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C.-Y. Fu, and A. C. Berg. SSD: Single shot multibox detector. In ECCV, 2016.
[15] D. G. Lowe. Distinctive image features from scale-invariant keypoints. IJCV, 2004.
[16] K. Mikolajczyk and C. Schmid. A performance evaluation of local descriptors. PAMI, 2005.
[17] R. Mur-Artal, J. Montiel, and J. D. Tardos. ORB-SLAM: A versatile and accurate monocular SLAM system. IEEE Transactions on Robotics, 2015.
[18] A. Odena, V. Dumoulin, and C. Olah. Deconvolution and checkerboard artifacts. Distill, 2016.
[19] A. Paszke, S. Gross, S. Chintala, and G. Chanan. PyTorch. https://github.com/pytorch/pytorch.
[20] I. Rocco, R. Arandjelović, and J. Sivic. Convolutional neural network architecture for geometric matching. In CVPR, 2017.
[21] E. Rosten and T. Drummond. Machine learning for high-speed corner detection. In ECCV, 2006.
[22] E. Rublee, V. Rabaud, K. Konolige, and G. Bradski. ORB: An efficient alternative to SIFT or SURF. In ICCV, 2011.
[23] N. Savinov, A. Seki, L. Ladicky, T. Sattler, and M. Pollefeys. Quad-networks: Unsupervised learning to rank for interest point detection. In CVPR, 2017.
[24] C. Schmid, R. Mohr, and C. Bauckhage. Evaluation of interest point detectors. IJCV, 2000.
[25] J. Shi and C. Tomasi. Good features to track. In CVPR, 1994.
[26] W. Shi, J. Caballero, F. Huszár, J. Totz, A. P. Aitken, R. Bishop, D. Rueckert, and Z. Wang. Real-time single image and video super-resolution using an efficient sub-pixel convolutional neural network. In CVPR, 2016.
[27] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
[28] B. Ummenhofer, H. Zhou, J. Uhrig, N. Mayer, E. Ilg, A. Dosovitskiy, and T. Brox. DeMoN: Depth and motion network for learning monocular stereo. In CVPR, 2017.
[29] Y. Verdie, K. Yi, P. Fua, and V. Lepetit. TILDE: A Temporally Invariant Learned DEtector. In CVPR, 2015.
[30] S. Vijayanarasimhan, S. Ricco, C. Schmid, R. Sukthankar, and K. Fragkiadaki. SfM-Net: Learning of structure and motion from video. arXiv preprint arXiv:1704.07804, 2017.
[31] S.-E. Wei, V. Ramakrishna, T. Kanade, and Y. Sheikh. Convolutional pose machines. In CVPR, 2016.
[32] K. M. Yi, E. Trulls, V. Lepetit, and P. Fua. LIFT: Learned Invariant Feature Transform. In ECCV, 2016.
[33] T. Zhou, M. Brown, N. Snavely, and D. G. Lowe. Unsupervised learning of depth and ego-motion from video. In CVPR, 2017.
APPENDIX

A. Evaluation Metrics

In this section we present more details on the metrics used for evaluation. In our experiments we follow the protocol of [16], with one exception. Since our fully-convolutional model does not use local patches, we instead compare detection distances by measuring the distance between the 2D detection centers, rather than measuring patch overlap. For multi-scale methods such as SIFT and ORB, we compare distances at the highest resolution scale.

Corner Detection Average Precision. We compute Precision-Recall curves and the corresponding Area-Under-Curve (also known as Average Precision), the pixel location error for correct detections, and the repeatability rate. For corner detection, we use a threshold ε to determine if a returned point location x is correct relative to a set of K ground-truth corners {x̂_1, ..., x̂_K}. We define the correctness as follows:

Corr(x) = (min_j ||x − x̂_j||) ≤ ε.   (11)

The precision-recall curve is created by varying the detection confidence and summarized with a single number, namely the Average Precision (which ranges from 0 to 1); larger AP is better.

Localization Error. To complement the AP analysis, we compute the corner localization error, but solely for the correct detections. We define the Localization Error as follows:

LE = (1/N) Σ_{i: Corr(x_i)} min_{j ∈ {1,...,K}} ||x_i − x̂_j||.   (12)

Nearest Neighbor mean Average Precision. This metric captures how discriminating the descriptor is by evaluating it at multiple descriptor distance thresholds. It is computed by measuring the Area Under Curve (AUC) of the Precision-Recall curve, using the Nearest Neighbor matching strategy. This metric is computed symmetrically across the pair of images and averaged.

Matching Score. This metric measures the overall performance of the interest point detector and descriptor combined. It measures the ratio of ground truth correspondences that can be recovered by the whole pipeline over the number of features proposed by the pipeline in the shared viewpoint region. This metric is computed symmetrically across the pair of images and averaged.

Homography Estimation. We measure the ability of an algorithm to estimate the homography relating a pair of images by comparing the estimated homography Ĥ to the ground truth homography H. It is not straightforward to compare the 3 × 3 H matrices directly, since different entries in the matrix have different scales. Instead we compare how well the homography transforms the four corners of one image onto the other. We define the four corners of the first image as c1, c2, c3, c4. We then apply the ground truth H to get the ground truth corners in the second image c'1, c'2, c'3, c'4, and the estimated homography Ĥ to get ĉ'1, ĉ'2, ĉ'3, ĉ'4. We use a threshold ε to denote a correct homography:

CorrH = (1/N) Σ_{i=1}^{N} [ (1/4) Σ_{j=1}^{4} ||c'_ij − ĉ'_ij|| ≤ ε ].   (15)

The scores range between 0 and 1; higher is better.
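The correctness checks in Equations (11), (12) and (15) are straightforward to express in code; the sketch below is illustrative only, with hypothetical helper names, and assumes NumPy arrays of 2D points and 3x3 homographies.

```python
import numpy as np

def corner_correct(x, gt_corners, eps):
    """Eq. (11): a detection x is correct if some ground-truth corner lies within eps."""
    return np.min(np.linalg.norm(gt_corners - x, axis=1)) <= eps

def localization_error(detections, gt_corners, eps):
    """Eq. (12): mean distance to the nearest ground-truth corner, over correct detections only."""
    dists = [np.min(np.linalg.norm(gt_corners - x, axis=1)) for x in detections]
    correct = [d for d in dists if d <= eps]
    return float(np.mean(correct)) if correct else float("nan")

def homography_correct(H_gt, H_est, width, height, eps):
    """Eq. (15), for one image pair: compare where the four image corners are mapped."""
    corners = np.array([[0, 0], [width - 1, 0], [width - 1, height - 1], [0, height - 1]], float)
    ones = np.ones((4, 1))

    def warp(H):
        p = np.hstack([corners, ones]) @ H.T
        return p[:, :2] / p[:, 2:3]

    err = np.linalg.norm(warp(H_gt) - warp(H_est), axis=1).mean()  # average corner error
    return err <= eps
```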
                MagicPointL   MagicPointS   FAST    Harris   Shi
mAP no noise    0.979         0.980         0.405   0.678    0.686
mAP noise       0.971         0.939         0.061   0.213    0.157
MLE no noise    0.860         0.922         1.656   1.245    1.188
MLE noise       1.012         1.078         1.766   1.409    1.383

Table 5. Mean results on the Synthetic Shapes dataset (mAP and MLE, with and without added imaging noise).

Figure 10. Per Shape Category Results. These plots report Average Precision and Corner Localization Error for each of the 10 categories in the Synthetic Shapes dataset with and without noise. The sequences with "Random" inputs are especially difficult for the classical detectors.

Figure 11. Effect of Noise Magnitude. Two versions of MagicPoint are compared to three classical point detectors on the Synthetic Shapes dataset (shown in Figure 9). The MagicPoint models outperform the classical techniques in both metrics, especially in the presence of image noise.

Mean Average Precision and Mean Localization Error. For each category, there are 1000 images sampled from the Synthetic Shapes generator. We compute Average Precision and Localization Error with and without added imaging noise. A summary of the per-category results is shown in Figure 10 and the mean results are shown in Table 5. The MagicPoint detectors outperform the classical detectors in all categories. There is a significant performance gap in mAP in all categories in the presence of noise.

Effect of Noise Magnitude. Next we study the effect of noise more carefully by varying its magnitude. We were curious if the noise we add to the images is too extreme and unreasonable for a point detector.

Figure 12. Effect of Noise Filters (noise types shown: no noise, brightness, Gaussian, speckle, shadow, motion, all, all-speckle).
We created a sequence of 96 × 96 images of a black square on a white background. We vary the square's width to range from 3 to 91 pixels and report MagicPoint's confidence for two special pixels in the output heatmap: the center pixel (location of the blob) and the square's top-left pixel (an easy-to-detect corner). The MagicPoint blob+corner confidence plot for this experiment can be seen in Figure 13. We observe that we can confidently detect the center of the blob when the square is between 11 and 43 pixels wide (red region in Figure 13), detect it with lower confidence when the square is between 43 and 71 pixels wide (yellow region in Figure 13), and are unable to detect the center blob when the square is larger than 71 pixels (blue regions in Figure 13).

Figure 13. MagicPoint: Blob Center Detection. Top: we experimented with MagicPoint's ability to detect the centers of shapes and plot detection confidences for both the top-left (TL) corner and the center blob. Bottom: point detection heatmaps (MagicPoint outputs) superimposed on the black rectangle images. Notice that our model is able to detect centers of 71 pixel rectangles, meaning that our network's receptive field is at least 71 pixels.
C. Homographic Adaptation Experiment

When combining interest point response maps, it is important to differentiate between within-scale aggregation and across-scale aggregation. Real-world images typically contain features at different scales, as some points which would be deemed interesting in a high-resolution image are often not even visible in coarser, lower resolution images. However, within a single scale, transformations of the image such as rotations and translations should not make interest points appear/disappear. This underlying multi-scale nature of images has different implications for within-scale and across-scale aggregation strategies. Within-scale aggregation should be similar to computing the intersection of a set, and across-scale aggregation should be similar to the union of a set. In other words, it is the average response within-scale that we really want, and the maximum response across-scale. We can additionally use the average response across scales as a multi-scale measure of interest point confidence. The average response across scales will be maximized when the interest point is visible across all scales, and these are likely to be the most robust interest points for tracking applications.

Within-scale aggregation. We use the average response across a large number of homographic warps of the input image. Care should be taken in choosing random homographies because not all homographies are realistic image transformations. The number of homographic warps Nh is a hyper-parameter of our approach. We typically enforce the first homography to be equal to identity, so that Nh = 1 in our experiments corresponds to doing no homographies (or equivalently, applying the identity homography). Our experiments range from "small" Nh = 10, to "medium" Nh = 100, and "large" Nh = 1000.

Across-scale aggregation. When aggregating across scales, the number of scales considered Ns is a hyper-parameter of our approach. The setting of Ns = 1 corresponds to no multi-scale aggregation (or simply aggregating across the largest possible image size only). For Ns > 1, we refer to the multi-scale set of images being processed as "the multi-scale image pyramid." We consider weighting schemes that weigh levels of the pyramid differently, giving higher-resolution images a larger weight. This is important because interest points detected at lower resolutions have poorer localization ability, and we want the final aggregated points to be localized as well as possible.

We experimented with within-scale and across-scale aggregation on a held-out set of MS-COCO images. The results are summarized in Figure 14. We find that within-scale aggregation has the biggest effect on repeatability.

Figure 14. Homographic Adaptation. Top: we vary the number of homographies applied during Homographic Adaptation and report repeatability. Bottom: we isolate the effect of scale.
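A hedged sketch of such across-scale aggregation is shown below (our illustration; the scale factors and the weighting scheme are assumptions, not the paper's settings): it runs the detector on an image pyramid, upsamples each response back to full resolution, and averages with higher weight on finer scales.

```python
import numpy as np
import cv2

def across_scale_aggregation(image, detector, scales=(1.0, 0.5, 0.25)):
    """Sketch: weighted average of detector responses over an image pyramid.

    image: (H, W) grayscale array; detector(img) -> same-sized heatmap in [0, 1].
    Finer scales receive larger weights because their detections localize better.
    """
    h, w = image.shape
    acc = np.zeros((h, w), dtype=np.float64)
    total = 0.0
    for s in scales:
        small = cv2.resize(image, (int(w * s), int(h * s)), interpolation=cv2.INTER_AREA)
        heat = detector(small)
        heat = cv2.resize(heat, (w, h), interpolation=cv2.INTER_LINEAR)  # back to full resolution
        weight = s                                # simple choice: weight proportional to scale
        acc += weight * heat
        total += weight
    return acc / total
```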
D. Extra Qualitative Examples

We show extra qualitative examples of SuperPoint, LIFT, SIFT and ORB on HPatches matching in Figure 15.
Figure 15. Extra Qualitative Results on HPatches (columns: SuperPoint, LIFT, SIFT, ORB). More examples like in Figure 8. The green lines show correct correspondences, green dots show matched points, red dots show mis-matched points, and blue dots show points outside of the shared viewpoint region.