
CroCo v2: Improved Cross-view Completion Pre-training

for Stereo Matching and Optical Flow

Philippe Weinzaepfel Thomas Lucas Vincent Leroy


Yohann Cabon Vaibhav Arora Romain Brégier Gabriela Csurka
Leonid Antsfeld Boris Chidlovskii Jérôme Revaud
arXiv:2211.10408v3 [cs.CV] 18 Aug 2023

NAVER LABS Europe


https://github.com/naver/croco

Abstract

Despite impressive performance for high-level downstream tasks, self-supervised pre-training methods have not yet fully delivered on dense geometric vision tasks such as stereo matching or optical flow. The application of self-supervised concepts, such as instance discrimination or masked image modeling, to geometric tasks is an active area of research. In this work, we build on the recent cross-view completion framework, a variation of masked image modeling that leverages a second view from the same scene which makes it well suited for binocular downstream tasks. The applicability of this concept has so far been limited in at least two ways: (a) by the difficulty of collecting real-world image pairs – in practice only synthetic data have been used – and (b) by the lack of generalization of vanilla transformers to dense downstream tasks for which relative position is more meaningful than absolute position. We explore three avenues of improvement. First, we introduce a method to collect suitable real-world image pairs at large scale. Second, we experiment with relative positional embeddings and show that they enable vision transformers to perform substantially better. Third, we scale up vision transformer based cross-completion architectures, which is made possible by the use of large amounts of data. With these improvements, we show for the first time that state-of-the-art results on stereo matching and optical flow can be reached without using any classical task-specific techniques like correlation volume, iterative estimation, image warping or multi-scale reasoning, thus paving the way towards universal vision models.

Figure 1: Pre-training for dense geometric tasks. We pre-train a generic architecture, with a monocular encoder and a binocular decoder, with cross-view completion before fine-tuning it on the stereo matching or optical flow downstream task.

1. Introduction

Self-supervised pre-training methods aim at learning rich representations from large amounts of unannotated data, which can then be finetuned on a variety of downstream tasks. This requires the design of pretext tasks, for which supervision signal can be extracted from the data itself, as well as generic architectures that can be easily transferred. We hypothesize that successfully pre-training large models for geometric tasks such as stereo matching or optical flow, see Figure 1, requires three things all together: (a) a well-designed dense pretext task inciting the understanding of 3D scene layout and geometry, (b) an architecture that processes pairs of images, suitable for different downstream tasks, and (c) large-scale real-world data.

Early self-supervised methods proceeded by discarding part of the signal (e.g. image color [97], patch ordering [57] or image orientation [25]) and trying to recover it. Later methods based on instance discrimination [12, 13, 16, 30] were first to surpass supervised pre-training on high-level tasks: they are based on the idea that output features should be invariant to well-designed classes of augmentations. Another recently successful pretext task is masked image modeling (MIM) [2, 22, 29, 83, 86, 102], where part of the input data is masked and an auto-encoder is trained to restore the full signal from the remaining visible parts. Instance discrimination and MIM methods have achieved excellent performance on semantic tasks such as image classification, in particular with limited amounts of annotated data [2, 17, 71], but have not led to breakthroughs in more geometric tasks like stereo matching and optical flow.
Adapting self-supervised pre-training to geometric vision tasks is an active area of research. Attempts have been made to design contrastive learning objectives at the pixel or patch level [82, 85, 87], but their performance gains have so far been more moderate than for global tasks. Besides, these gains are mainly demonstrated for dense semantic tasks such as semantic segmentation or object detection, rather than for geometric tasks such as depth estimation or stereo matching. Recently, [84] proposed the pretext task of cross-view completion (CroCo), a variant of MIM where a partially masked input image is reconstructed given visible patches and an additional view of the same scene. This pre-training objective is well suited to geometric downstream tasks as (a) it leverages pairs of images and (b) extracting relevant information from the second view requires geometric understanding of the scene. The CroCo architecture consists of a vision transformer (ViT) [20] encoder to extract features for the non-masked tokens of the first image, as well as for the second reference image, and a transformer to decode the features and reconstruct the masked image, as illustrated in Figure 2.

Figure 2: Overview of the improvements in CroCo v2 for cross-view completion pre-training: (a) collecting and using real-world images, (b) using rotary positional embeddings which model relative token positions, instead of absolute positions using the standard cosine embedding, (c) increasing network size both in the encoder and the decoder.

In spite of these advances, leveraging cross-view completion for geometric vision tasks remains challenging for at least two reasons. First, training with cross-view completion requires image pairs depicting the same scene; this can be hard to acquire at scale, yet scale is the cornerstone of the success of self-supervised pre-training. In practice, the CroCo model of [84] is pre-trained solely with synthetic data, which may limit its final performance. Second, most models trained with masking rely on ViTs [20], which typically use absolute positional embeddings. These do not generalize well to new image resolutions when finetuning, and are not always robust to cropping. This limits the applicability of current cross-view completion methods and may explain why the downstream tasks presented in [84] mostly use low-resolution squared images.

In this paper, we propose solutions to these limitations that enable to pre-train a large-scale cross-view completion model, see Figure 2, leading to state-of-the-art performance on stereo matching and optical flow. First, we tackle the problem of scalable pair collection, and gather millions of training pairs from different real-world datasets which cover various scenarios like indoor environments, street view data and landmarks, see Figure 3. To generate high-quality pre-training pairs, we carefully control the visual overlap for each pair of images. In fact, pairs with high overlap make the task trivial, whereas pairs with negligible overlap reduce it to standard MIM [84]. To measure this overlap, we leverage extra information available such as 3D meshes, additional sensors like LIDAR, or Structure-from-Motion (SfM) reconstructions for datasets with sufficient image coverage. From these data, we generate a set of high quality image pairs with sufficient overlap and viewpoint difference while also ensuring high diversity between pairs. Second, these large-scale datasets of pre-training pairs allow to scale up the model: (a) we use a larger encoder to extract better image features and (b) also scale up the decoder, which is responsible for combining information coming from the two views. Third, instead of the standard cosine positional embedding which encodes absolute positional information, we rely on the Rotary Positional Embedding (RoPE) [73] which efficiently injects relative positional information of token pairs in the attention mechanism.

We finetune our pre-trained model, referred to as CroCo v2, with this improved cross-view completion scheme on stereo matching and optical flow using a Dense Prediction Transformer (DPT) [61] head. Our models, termed CroCo-Stereo and CroCo-Flow, are simple and generic: we rely on a plain ViT encoder, followed by a plain transformer decoder which directly predicts the output (disparity for stereo, or optical flow) through the DPT head. We believe this is a meaningful step towards a universal vision model, i.e., that can solve numerous vision tasks with a common architecture. In contrast to state-of-the-art methods for stereo matching or optical flow, our architecture does not rely on task-specific designs such as cost volumes [31, 36, 40, 41, 92], image warping [10, 76], iterative refinement [45, 48, 79] or multi-level feature pyramids [18, 45, 76]. While task-specific structures and prior knowledge may yield more data-efficient approaches, they come at the cost of being tailored to a single task. Our proposed pre-training allows us to eschew these and still reaches state-of-the-art performance on various stereo matching and optical flow benchmarks such as KITTI 2015 [55], ETH3D [67], Spring [54] or MPI-Sintel [11].

Figure 3: Example of pre-training cropped image pairs from Habitat which was the synthetic data used by CroCo [84] on
the top row, and from real-world datasets we use in this paper (from ARKitScenes, MegaDepth, 3DStreetView and IndoorVL)
below.

2. Related work

Self-supervised learning. The success of instance discrimination [12, 13, 16, 28, 30] has drawn a lot of attention to self-supervised learning in computer vision [37]. In that paradigm, variants of an image are obtained by applying different data augmentations. Features extracted from the different variants are trained to be similar, while being pushed away from features obtained from other images. Such self-supervised models are particularly well tailored to image-level tasks, such as image classification, and have led to state-of-the-art performance on various benchmarks. Recent studies suggest that this success could be due to the object-centric [58] and the balanced [1] nature of ImageNet [63] that is used for pre-training. Recently, inspired by BERT [19] in natural language processing, different masked modeling methods have been adapted to computer vision. MIM pre-training aims at reconstructing masked information from an input image either in the pixel space [3, 4, 15, 22, 29, 86], or in the feature space [2, 5, 83], and sometimes after quantization [7, 102]. Recent works combine this framework in a teacher-student approach [44, 46] with improved masking strategy [23, 38, 46]. Overall, MIM models perform well on classification tasks. They have obtained some success on denser tasks such as object detection [29] or human pose estimation [91], and have been applied to robotic vision [59] when pre-trained on related datasets. More recently, CroCo [84] introduces the pretext task of cross-view completion, where a second view of the same scene is added to MIM. This is well suited to geometric downstream tasks: to leverage the second view and improve reconstruction accuracy, the model has to implicitly be aware of the geometry of the scene. CroCo outperforms MIM pre-training on an array of geometric tasks. However, it relies on synthetic data only, which may be sub-optimal, and does not reach the performance of the best task-specific methods.

Positional embeddings. Since a ViT treats its input as an orderless set of image patches or tokens, positional embeddings are a necessary tool to keep track of the position of each patch token from the original image. They can be either learned [13, 20] or handcrafted, such as the cosine positional embeddings from the original transformer [80]. Both learned and cosine embeddings are added explicitly to the signal and contain absolute positional information. However, models for pixel-level dense computer vision tasks should be able to process various image resolutions and be robust to cropping. Thus, relative positional embeddings, e.g. [68], that consider distances between tokens are preferable.
For instance, Bello et al. [9] achieve better object detection results using relative self-attention. Similarly, Swin Transformers [51] and Swin V2 [50] observed improved performance using relative positional embeddings, while [74] showed it to be crucial in the cross attention for optical flow. Recently, [73] introduced the Rotary Positional Embedding (RoPE): a transformation to each key and query features is applied according to their absolute position, in such a way that the pairwise similarity scores used in the attention computation only depend on the relative positions of the token pairs and on their feature similarity. RoPE thus models relative positions at any resolution.

Stereo matching and optical flow can both be seen as a dense correspondence matching problem [90]. However the priors about matching itself and the completion of unmatched regions differ. This explains why most models are dedicated to one specific task despite many similarities in the strategies [42, 95]. Dense matching is most often posed with correlation/cost volume estimation from which matches can be extracted [21, 53]. For stereo, this volume typically has three dimensions [36, 41, 92], the third dimension representing a discretization of the disparity level, or four dimensions [14, 40, 56]. For optical flow, each pixel of the first image can be associated to any pixel of the second, resulting in a 4D correlation volume. The complexity of building, storing and leveraging such volume motivated numerous methods revolving around the ideas of coarse-to-fine [6, 79, 98], warping [76], sparse formulation [33], random search [99], dimension separation [96], tokenization [31]. Interestingly, recent works [74, 89, 90] leverage cross-attention to facilitate inter-image information exchanges but still rely on a low-resolution correlation volume, followed by an iterative refinement similar to [79]. Unimatch [90] made an important step towards a unified architecture for flow and stereo, but still relies on task-dependent (a) cross-attention mechanisms, (b) correlation volume and (c) post-processing. We similarly use the same architecture for both tasks, but our standard transformer model without cost volume can be pre-trained with existing self-supervised approaches and directly finetuned as is.

Several works propose self-supervised methods for estimating depth using stereo pairs or videos [26, 27], stereo with matching priors [101], or optical flow [75, 49, 93], typically with an unsupervised reconstruction loss. The main difference between this paradigm and ours is that we aim to pre-train a task-agnostic model that can be finetuned to different tasks, while these approaches aim to remove supervision for a single task.

3. Cross-view completion pre-training at scale

Our proposed pre-training method is based on the recently introduced cross-view completion (CroCo) framework [84]. It extends MIM to pairs of images. Given two different images depicting a given scene, the two images are divided into sets of non-overlapping patches, denoted as tokens, and 90% of the tokens from the first image are masked. The remaining ones are fed to a ViT [20] encoder to extract features for the first image. Similarly, tokens from the second image are fed to the same encoder with shared weights, and a ViT decoder processes the two sets of features together to reconstruct the target. Figure 2 provides an overview of the pre-training stage. Compared to standard masked image modeling methods, this approach can leverage the information in the second view to resolve some of the ambiguities about the masked context. To leverage this information, the model has to implicitly reason about the scene geometry and the spatial relationship between the two views, which primes it well for geometric tasks.

Training data. Collecting pairs of images that are suitable for this approach is non-trivial. First, images have to be paired together without manual annotation; second, their visual overlap has to be carefully controlled for the pairs to be useful. In [84], only synthetic data generated with the Habitat simulator [64] is used, which restricts the variety of the pre-training data. In contrast, we propose an approach to this real-world image pairing problem, necessary to use cross-view completion at scale, as detailed in Section 3.1.

Positional embeddings. The architecture used in [84] adapts ViTs to process pairs of images, by using cross-attention inside the decoder. Following standard practices, in their work cosine positional embedding is added to the token features prior to the encoder and the decoder. This models absolute position while dense tasks must typically be robust to cropping or images of various resolutions. In Section 3.2, we describe how relative positional embeddings can be adapted to cross-view completion.

Large-scale models. Finally, we discuss scaling-up the model in Section 3.3. CroCo [84] uses a ViT-Base encoder (12 blocks, 768-dimensional features, 12 attention heads) and a decoder composed of 8 blocks with 512-dimensional features and 16 heads. Using our large-scale dataset of real-world image pairs, we are able to scale to larger ViT architectures and demonstrate consistent performance gain.
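To make the pre-training step concrete, the sketch below mimics the procedure described at the start of this section: 90% of the first view's tokens are masked, both views go through a shared encoder, and a decoder with cross-attention reconstructs the masked patches. This is a minimal toy stand-in, not the released CroCo code: the tiny dimensions, the learned absolute positional embedding (the paper uses RoPE, Section 3.2) and the plain MSE reconstruction target are simplifying assumptions.

```python
# Minimal cross-view completion sketch (toy stand-in, not the official CroCo code).
import torch
import torch.nn as nn

def take_tokens(t, idx):
    """Gather tokens of t (B x N x D) at indices idx (B x K)."""
    return torch.gather(t, 1, idx[..., None].expand(-1, -1, t.shape[-1]))

class ToyCrossViewCompletion(nn.Module):
    def __init__(self, img_size=224, patch=16, dim=192, enc_depth=4, dec_depth=2, heads=4):
        super().__init__()
        self.patch = patch
        self.num_patches = (img_size // patch) ** 2
        self.patch_embed = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
        # learned absolute positions as a placeholder (the paper uses RoPE instead)
        self.pos = nn.Parameter(torch.zeros(1, self.num_patches, dim))
        enc_layer = nn.TransformerEncoderLayer(dim, heads, dim * 4, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, enc_depth)        # shared between the two views
        dec_layer = nn.TransformerDecoderLayer(dim, heads, dim * 4, batch_first=True)  # self-attn + cross-attn + MLP
        self.decoder = nn.TransformerDecoder(dec_layer, dec_depth)
        self.mask_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.head = nn.Linear(dim, patch * patch * 3)                      # per-patch pixel reconstruction

    def tokens(self, img):
        x = self.patch_embed(img)                     # B x dim x H/p x W/p
        return x.flatten(2).transpose(1, 2) + self.pos

    def forward(self, img1, img2, mask_ratio=0.9):
        B = img1.shape[0]
        t1, t2 = self.tokens(img1), self.tokens(img2)
        n_keep = int(self.num_patches * (1 - mask_ratio))
        order = torch.rand(B, self.num_patches, device=img1.device).argsort(dim=1)
        keep, masked = order[:, :n_keep], order[:, n_keep:]
        enc1 = self.encoder(take_tokens(t1, keep))    # visible tokens of view 1
        enc2 = self.encoder(t2)                       # all tokens of view 2 (shared weights)
        # rebuild the full token sequence of view 1, with mask tokens at masked positions
        full = self.mask_token.expand(B, self.num_patches, -1).clone()
        full.scatter_(1, keep[..., None].expand(-1, -1, enc1.shape[-1]), enc1)
        dec = self.decoder(tgt=full, memory=enc2)     # cross-attend to the second view
        pred = self.head(dec)                         # B x N x (p*p*3)
        # reconstruction target: pixel values of the masked patches of view 1
        target = img1.unfold(2, self.patch, self.patch).unfold(3, self.patch, self.patch)
        target = target.permute(0, 2, 3, 1, 4, 5).reshape(B, self.num_patches, -1)
        return ((take_tokens(pred, masked) - take_tokens(target, masked)) ** 2).mean()

if __name__ == "__main__":
    model = ToyCrossViewCompletion()
    v1, v2 = torch.randn(2, 3, 224, 224), torch.randn(2, 3, 224, 224)
    print(model(v1, v2).item())
```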

Figure 4: Overview of our pre-training cropped image pair collection method. Given a dataset of posed images, op-
tionally with point clouds (e.g. from SfM) or meshes of the scene, we first measure the visual overlap between pairs and
the viewpoint angle difference. Based on these scores, we use a greedy algorithm to select diverse image pairs and finally
generate crops from them.

3.1. Collecting real-world image pairs

We now present our approach to automatically select image pairs from real-world datasets that are suitable for pre-training. To be useful, pairs need to depict the same scene with some partial overlap. The overlap should not be small to the point where the task boils down to auto-completion. It should not be high either to the point where the task becomes a trivial 'copy and paste', i.e., without requiring any understanding of scene geometry. On top of that, diversity should be as high as possible among pairs. We propose to use datasets that offer ways of getting information about the geometry of the scene and the camera poses. This signal can be captured using additional sensors like LIDAR, or it can be extracted using structure-from-motion (SfM) techniques if the images offer enough coverage of the scene. We use this information to obtain an image pair quality score based on overlap and difference in viewpoint angle. We then use a greedy algorithm to select a diverse set of image pairs. Finally, we generate overlapping image crops by leveraging image matching. Figure 4 gives an overview of our approach and we detail each step below.

Computing overlap scores. The first step is to compute overlap scores for candidate pairs. We develop several approaches depending on the available information.

◦ ARKitScenes [8] provides 450,000 frames from 1,667 different indoor environments. The availability of the corresponding mesh for each frame enables the computation of the overlap between every pair of images. For each image I, we retrieve the set of mesh vertices P(I) that are visible. We then measure the intersection-over-union (IoU) of the vertices (3D points) for each pair of images (I1, I2) as:

IoU(I_1, I_2) = \frac{|\mathcal{P}(I_1) \cap \mathcal{P}(I_2)|}{|\mathcal{P}(I_1) \cup \mathcal{P}(I_2)|}.   (1)

◦ MegaDepth [47] consists of around 300,000 images downloaded from the web corresponding to 200 different landmarks. From these images, a point cloud model for each landmark obtained using structure-from-motion (SfM) with COLMAP [66] is also provided. As above, it is possible to measure the vertex-based IoU between pairs of images, where each vertex is in this case a 3D point from the point cloud. Unfortunately, occlusions cannot be taken into account due to the absence of 3D mesh, which greatly degrades the overlap estimation. We propose a simple yet effective solution: we create an artificial occlusion model by attaching a ball of fixed radius to each 3D point, which occludes the vertices placed behind it. This way, we can compute a set of visible vertices for each image and evaluate the IoU as done previously.

◦ 3D Street View [94] contains 25 million street view images from 8 cities. In addition to the camera pose, the 3D location and orientation (normal vector) of the target buildings are provided. To compute the overlap score, we create a pseudo 3D point cloud and apply the same technique as for MegaDepth. We start from an empty point cloud and append, for each target building, a 10 × 6 meters grid of 7 × 11 balls oriented according to the provided annotation.

◦ Indoor Visual Localization datasets (IndoorVL) [43] contain over 135,000 images from a large shopping mall and a large metro station in Seoul, South Korea, captured regularly with several months interval with 10 cameras and 2 laser scanners. The data is provided with accurate camera poses obtained via LiDAR SLAM refined by SfM-based optimization. We directly measure the overlap between images using the intersection between the camera frustums using the accurate camera poses provided with the dataset. To encourage further diversity, we multiply this score by a factor 0.8 if both images come from the same capture session, thus favoring pairs taken with several months interval.

Greedy image pair selection. We rely on the overlap scores described above to select high quality pairs. This is however not sufficient: we also need pairs to be diverse, which would not be the case when randomly selecting good pairs, as images in the dataset can be very correlated. Therefore, we use a greedy algorithm to select non-redundant image pairs for pre-training. First, for each image pair (I1, I2) we use a quality pair score s given by:

s(I_1, I_2) = IoU(I_1, I_2) \times 4\cos(\alpha)\big(1 - \cos(\alpha)\big),   (2)

where α denotes the viewpoint angle difference between the two images (all the datasets above provide camera poses). The function 4 cos(x)(1 − cos(x)) has a maximum value of 1 for x = 60°, a value of 0 for x = 0° and x = 90°, and it is negative for angles above 90°. This score thus favors pairs with different viewpoints while still having large overlaps. Given the score for every pair, we aim at building a large number of image pairs while ensuring diversity, i.e., avoiding content redundancy. To do this, we use a greedy algorithm, where each time we select a pair of images with maximum score, we discard the two images forming the pair, as well as images that have too large IoU (above 0.75) with any of the two. We iteratively repeat this process until there is no pair with a score above a certain threshold.
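The scoring and selection steps above can be summarized in a short, plain-Python sketch. It is only an illustration of Eq. (1), Eq. (2) and the greedy procedure: the data structures (visible-vertex sets, a dictionary of viewpoint angles) and the score threshold are hypothetical stand-ins for the actual pipeline.

```python
# Sketch of the pair scoring and greedy selection described in Section 3.1.
import math
from itertools import combinations

def vertex_iou(visible_a: set, visible_b: set) -> float:
    """Eq. (1): IoU of the sets of visible 3D vertices of two images."""
    union = len(visible_a | visible_b)
    return len(visible_a & visible_b) / union if union else 0.0

def pair_score(visible_a, visible_b, angle_rad: float) -> float:
    """Eq. (2): favor pairs with large overlap AND different viewpoints
    (maximal at 60 degrees, zero at 0 and 90 degrees)."""
    return vertex_iou(visible_a, visible_b) * 4.0 * math.cos(angle_rad) * (1.0 - math.cos(angle_rad))

def greedy_pair_selection(images, angles, score_threshold=0.1, iou_threshold=0.75):
    """images: dict name -> set of visible vertex ids; angles: dict (a, b) -> viewpoint angle (rad).
    Repeatedly pick the best-scoring pair, then discard both images and any image that is
    too redundant (IoU above iou_threshold) with either of them."""
    candidates = {(a, b): pair_score(images[a], images[b], angles[(a, b)])
                  for a, b in combinations(sorted(images), 2)}
    available, selected = set(images), []
    while True:
        live = {p: s for p, s in candidates.items()
                if s > score_threshold and p[0] in available and p[1] in available}
        if not live:
            break
        (a, b), _ = max(live.items(), key=lambda kv: kv[1])
        selected.append((a, b))
        redundant = {c for c in available
                     if vertex_iou(images[c], images[a]) > iou_threshold
                     or vertex_iou(images[c], images[b]) > iou_threshold}
        available -= redundant | {a, b}
    return selected

if __name__ == "__main__":
    imgs = {"i0": set(range(0, 80)), "i1": set(range(40, 120)), "i2": set(range(400, 480))}
    angs = {("i0", "i1"): math.radians(60), ("i0", "i2"): math.radians(60), ("i1", "i2"): math.radians(60)}
    print(greedy_pair_selection(imgs, angs))
```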
Crop generation per pair. For pre-training, we use fixed-size crops of 224×224 pixels, as considering higher-resolution images would be too costly. In practice, we generate 256×256 crops and apply random cropping during pre-training. To generate crops on pairs of images while maintaining overlaps, we rely on quasi-dense keypoint matching, namely DeepMatching [62], except for pairs from ARKitScenes where we directly use matches from the mesh. Given the matches, we consider a grid of crops in the first image, estimate the corresponding matching crop in the second image and keep those with the most consistent matches and without overlap in the first image.

Overall statistics. In total, we collected about 5.3 million real-world pairs of crops with the process described above, with respectively 1,070,414 pairs from ARKitScenes [8], 2,014,789 pairs from MegaDepth [47], 655,464 from 3DStreetView [94], and 1,593,689 pairs from IndoorVL [43]. We added this to 1,821,391 synthetic pairs generated with the Habitat simulator [64], following the approach of [84]. Example pairs for each dataset are shown in Figure 3. They cover various scenarios, from indoor rooms – synthetic with Habitat or real with ARKitScenes – to larger crowded indoor environments (IndoorVL), landmarks (MegaDepth) and outdoor streets (3DStreetView).

3.2. Positional embeddings

We replace the cosine embeddings, which inject absolute positional information, by Rotary Positional Embedding (RoPE) [73]. RoPE efficiently injects information about the relative positioning of feature pairs when computing attention. Formally, let q and k represent a query and a key feature, at absolute positions m and n respectively. The main idea of RoPE is to design an efficient function f(x, p) that transforms a feature x according to its absolute position p such that the similarity between the transformed query and the transformed key ⟨f(q, m), f(k, n)⟩ is a function of q, k and m − n only. [73] showed that a simple transformation, such as applying rotations on pairs of dimensions according to a series of rotation matrices at different frequencies, satisfies this desirable property. To deal with 2D signals such as images, we split the features into 2 parts: we apply the 1D positional embedding of the x-dimension on the first part, and the embedding of the y-dimension on the second part.

3.3. Scaling up the model

The combination of information extracted from the two images only occurs in the decoder. Following MAE [29], CroCo [84] uses a small decoder of 8 blocks consisting of self-attention, cross-attention and an MLP, with 512 dimensions and 16 attention heads. As the decoder is crucial for binocular tasks such as stereo or flow, we scale up the decoder and follow the ViT-Base hyper-parameters with 12 blocks, 768-dimensional features and 12 heads. We also scale up the image encoder from ViT-Base to ViT-Large, i.e., increase the depth from 12 to 24, the feature dimension from 768 to 1024 and the number of heads from 12 to 16.

Pre-training detailed setting. We pre-train the network for 100 epochs with the AdamW optimizer [52], a weight decay of 0.05, a cosine learning rate schedule at a base learning rate of 3·10^-4 with a linear warmup in the first 10 epochs, and a batch size of 512 spread over 8 GPUs. During pre-training, we simply use random crops and color jittering as data augmentation. We mask 90% of the tokens from the first image. Examples of cross-view completion obtained with our model are shown in Appendix A.

4. Application to stereo matching and flow

We now present CroCo-Stereo and CroCo-Flow, our ViT-based correlation-free architectures for stereo matching and optical flow respectively, pre-trained with cross-view completion. This is much in contrast to current state-of-the-art methods which rely on task-specific design in the form of cost volumes [31, 36, 40, 41, 72, 76, 88, 92, 99], image warping [10, 76], iterative refinement [45, 48] and multi-level feature pyramids [18, 45, 76, 78]. Both CroCo-Stereo and CroCo-Flow share the same architecture.

Architecture. When finetuning the model for stereo or flow, both images are fed to the encoder as during pre-training (but without masking), and the decoder processes the tokens of both images. To output a pixel-wise prediction, we rely on DPT [61], which adapts the standard up-convolutions and fusions from multiple layers used in fully-convolutional approaches for dense tasks, to vision transformers. This allows to combine features from different blocks by reshaping them to different resolutions and fusing them with convolutional layers. In practice, we use the features from 4 blocks, regularly spread by an interval of a third of the decoder depth, starting from the last block, resulting in 1 block at the end of the encoder and 3 decoder blocks.

Loss. We parameterize the output of the network with a Laplacian distribution [39]: given an input pair (x1, x2), the model outputs a location parameter µ_i and a scale parameter d_i per pixel location i and is trained to minimize the negative log-likelihood of the ground-truth target disparity, denoted µ̄, under the predicted distribution:

-\log p(\bar{\mu} \mid \mu, d) = \sum_i \left[ \frac{|\mu_i - \bar{\mu}_i|}{d_i} - 2 \log d_i \right].   (3)

The scale parameter d can be interpreted as an uncertainty score for the prediction: large errors are penalized less when d is high, while good predictions are rewarded more if d is low. It is thus optimal for the network to adapt the scale parameter. The second term comes from the normalization term of the Laplacian density and avoids the degenerate solution of always predicting a low scale parameter. Empirically, we find that using a probabilistic loss improves performance, see Appendix B.4 for the ablation, and is useful for tiling strategies during inference, because it provides a per-pixel confidence estimate, as detailed below. A parameterization of d_i ensures its positiveness: for stereo matching we use d_i = e^{2α(σ(d'_i/α)−0.5)}, with σ the sigmoid function and α = 3, and for optical flow d_i = 1/β + (β − 1/β) σ(d'_i) with β = 4, unless otherwise stated.
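Below is a small PyTorch sketch of this loss, following Eq. (3) as printed together with the bounded parameterizations of d_i given above. It is not the authors' implementation; tensor shapes, the averaging over pixels and the variable names (e.g. raw_scale for d'_i) are illustrative.

```python
# Sketch of the Laplacian-style loss of Eq. (3) with the bounded scale parameterizations.
import torch

def stereo_scale(raw_scale: torch.Tensor, alpha: float = 3.0) -> torch.Tensor:
    """Map the raw output d'_i to a positive, bounded scale d_i = exp(2*alpha*(sigmoid(d'_i/alpha) - 0.5))."""
    return torch.exp(2.0 * alpha * (torch.sigmoid(raw_scale / alpha) - 0.5))

def flow_scale(raw_scale: torch.Tensor, beta: float = 4.0) -> torch.Tensor:
    """Optical-flow variant: d_i = 1/beta + (beta - 1/beta) * sigmoid(d'_i), i.e. d_i in (1/beta, beta)."""
    return 1.0 / beta + (beta - 1.0 / beta) * torch.sigmoid(raw_scale)

def laplacian_loss(pred_mu: torch.Tensor, raw_scale: torch.Tensor, gt_mu: torch.Tensor,
                   alpha: float = 3.0) -> torch.Tensor:
    """Per-pixel term |mu_i - mu_bar_i| / d_i - 2 * log d_i, as in Eq. (3), averaged over pixels."""
    d = stereo_scale(raw_scale, alpha)
    return (torch.abs(pred_mu - gt_mu) / d - 2.0 * torch.log(d)).mean()

if __name__ == "__main__":
    mu = torch.randn(2, 1, 352, 704)     # predicted disparity
    raw = torch.randn(2, 1, 352, 704)    # raw scale output d'_i
    gt = torch.randn(2, 1, 352, 704)     # ground-truth disparity
    print(laplacian_loss(mu, raw, gt).item())
```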
Figure 5: Architecture of CroCo-Stereo and CroCo-Flow. The two images (left and right views for stereo, two frames for
flow) are split into patches and encoded with a series of transformer blocks with RoPE positional embeddings. The decoder
consists in a series of transformer decoder blocks (self-attention among token features from the first image, cross-attention
with the token features from the second image, and an MLP). Token features from different intermediate blocks are fed to
the DPT module [61] to obtain the final prediction.
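The rotary embedding mentioned in the caption (Section 3.2) can be sketched as follows: the channels are split in two halves, one rotated according to the token's x coordinate and the other according to its y coordinate, so that query-key similarities depend only on relative positions. The frequency schedule and channel layout below are assumptions for illustration, not the released implementation.

```python
# Minimal 2D RoPE sketch: rotate channel pairs by angles proportional to the token's
# x position (first half of the channels) and y position (second half).
import torch

def rope_1d(x: torch.Tensor, positions: torch.Tensor, base: float = 100.0) -> torch.Tensor:
    """x: (..., N, D) with D even; positions: (N,). Rotates channel pairs (2k, 2k+1)."""
    D = x.shape[-1]
    freqs = base ** (-torch.arange(0, D, 2, dtype=torch.float32) / D)    # (D/2,) rotation frequencies
    angles = positions[:, None].float() * freqs[None, :]                  # (N, D/2)
    cos, sin = torch.cos(angles), torch.sin(angles)
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = torch.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

def rope_2d(x: torch.Tensor, xy: torch.Tensor) -> torch.Tensor:
    """x: (..., N, D) query or key features with D divisible by 4; xy: (N, 2) patch coordinates.
    The first half of the channels is rotated with the x coordinate, the second half with y."""
    D = x.shape[-1]
    return torch.cat([rope_1d(x[..., : D // 2], xy[:, 0]),
                      rope_1d(x[..., D // 2:], xy[:, 1])], dim=-1)

if __name__ == "__main__":
    # attention scores between rotated queries/keys depend only on relative patch positions
    q, k = torch.randn(1, 4, 64), torch.randn(1, 4, 64)
    xy = torch.tensor([[0, 0], [1, 0], [0, 1], [1, 1]])
    scores = rope_2d(q, xy) @ rope_2d(k, xy).transpose(-1, -2)
    print(scores.shape)
```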
pos. emb. | encoder | decoder | pre-train data | Stereo bad@1.0↓: Md / ETH / SF(c) / SF(f) | Flow EPE↓: FT(c) / FT(f) / Si.(c) / Si.(f)
cosine | ViT-B | Small | 2M habitat (CroCo [84]) | 26.3 / 1.82 / 6.7 / 7.0 | 3.89 / 3.56 / 2.07 / 2.57
RoPE | ViT-B | Small | 2M habitat | 25.3 / 0.60 / 6.0 / 6.3 | 3.73 / 3.37 / 2.13 / 2.77
RoPE | ViT-B | Small | 2M habitat + 5.3M real | 20.7 / 0.82 / 5.8 / 6.1 | 3.35 / 2.94 / 1.76 / 2.30
RoPE | ViT-B | Base | 2M habitat + 5.3M real | 17.1 / 1.14 / 5.3 / 5.6 | 3.10 / 2.73 / 1.51 / 1.99
RoPE | ViT-L | Base | 2M habitat + 5.3M real (CroCo v2) | 15.5 / 0.38 / 5.0 / 5.3 | 2.85 / 2.45 / 1.43 / 1.99

Table 1: Ablative study of each change to CroCo with the percentage of pixels with error above 1px (bad@1.0) on validation sets from Middlebury (Md), ETH3D, SceneFlow (SF) in clean (c) and final (f) renderings for stereo, and with the endpoint error (EPE) on validation sets from FlyingThings (FT) and MPI-Sintel (Si.) in both clean (c) and final (f) renderings for optical flow. A Small decoder has 8 decoder blocks with 16 attention heads on 512-dimensional features, while the Base one has 12 blocks with 12 heads on 768-dimensional features.


Training. We train CroCo-Stereo using 704×352 crops from various stereo datasets: CREStereo [45], SceneFlow [53], ETH3D [67], Booster [60], Middlebury (2005, 2006, 2014, 2021 and v3) [65]. We train CroCo-Flow using 384×320 crops from the TartanAir [81], MPI-Sintel [11], FlyingThings [53] and FlyingChairs [21] datasets. We refer to Appendix C for more details on these datasets, the splits we use for the ablations, the data augmentation strategy, as well as training hyper-parameters.

Inference. We use a tiling-based approach. We sample overlapping tiles with the same size as the training crops in the first image. For each tile, we create a pair by sampling a corresponding tile at the same position from the second image. We then predict the disparity or flow between each pair of tiles. Such tiling approach was used e.g. in [31]. To merge the predictions done at a given pixel, we use a weighted average with weights e^{−2ηα(σ(d'_i/α)−0.5)} with η = 5 for stereo matching and α = 5, η = 2 for optical flow, where d'_i is the uncertainty predicted by the model.

5. Experiments

Ablations. We perform our ablations on the validation pairs (see Appendix C for the splits we use) of Middlebury, ETH3D and SceneFlow for stereo matching, and FlyingThings and MPI-Sintel for optical flow. Table 1 reports the impact of the changes in CroCo v2 to improve CroCo [84] (pre-training data, positional embedding, larger encoder and decoder). We observe that they all lead to consistent improvements: replacing the cosine absolute positional embedding by RoPE, scaling up the decoder, using larger-scale pre-training data and a larger encoder. Altogether, this allows e.g. to improve performance as measured by the bad@1.0 metric from 26.3 to 15.5 on Middlebury (stereo matching), or the EPE from 2.07 to 1.43 on MPI-Sintel in its clean rendering (optical flow).
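The tiling-based inference described above can be sketched with NumPy as follows. The predict_tile function is a stand-in for the finetuned network, and the tile size, stride and weighting constants are illustrative (the text quotes α = 5, η = 2 for optical flow and η = 5 for stereo matching).

```python
# Sketch of the tiling-based inference: overlapping tiles are taken at the same position
# in both images, and predictions are merged with confidence-derived weights
# w_i = exp(-2*eta*alpha*(sigmoid(d'_i/alpha) - 0.5)).
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def predict_tile(img1_tile, img2_tile):
    """Stand-in for the network: returns per-pixel disparity (or flow norm) and raw uncertainty d'_i."""
    h, w = img1_tile.shape[:2]
    return np.zeros((h, w)), np.zeros((h, w))

def tiled_inference(img1, img2, tile_hw=(352, 704), stride=(176, 352), eta=2.0, alpha=5.0):
    H, W = img1.shape[:2]
    th, tw = tile_hw
    pred_sum = np.zeros((H, W))
    weight_sum = np.zeros((H, W)) + 1e-8
    ys = list(range(0, max(H - th, 0) + 1, stride[0])) or [0]
    xs = list(range(0, max(W - tw, 0) + 1, stride[1])) or [0]
    for y in ys:
        for x in xs:
            sl = (slice(y, y + th), slice(x, x + tw))
            disp, raw_unc = predict_tile(img1[sl], img2[sl])  # same crop coordinates in both images
            w = np.exp(-2.0 * eta * alpha * (sigmoid(raw_unc / alpha) - 0.5))
            pred_sum[sl] += w * disp
            weight_sum[sl] += w
    return pred_sum / weight_sum

if __name__ == "__main__":
    a = np.zeros((704, 1408, 3)); b = np.zeros((704, 1408, 3))
    print(tiled_inference(a, b).shape)
```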
Method | Bicyc2 | Compu | Austr | AustrP | Djemb | DjembL | Livgrm | Plants | Hoops | Stairs | Nkuba | Class | ClassE | Crusa | CrusaP | avg↓
LEAStereo [18] | 1.83 | 3.81 | 2.81 | 2.52 | 1.07 | 1.64 | 2.59 | 5.13 | 5.34 | 2.79 | 3.09 | 2.46 | 2.75 | 2.91 | 3.09 | 2.89
AdaStereo [72] | 2.19 | 2.29 | 4.37 | 3.08 | 1.40 | 1.64 | 3.93 | 7.58 | 4.46 | 2.67 | 3.69 | 3.29 | 3.35 | 3.78 | 2.94 | 3.39
HITNet [78] | 1.43 | 1.87 | 3.61 | 3.27 | 0.90 | 9.12 | 2.37 | 4.07 | 4.45 | 3.38 | 3.45 | 2.43 | 3.20 | 4.67 | 4.74 | 3.29
RAFT-Stereo [48] | 0.90 | 1.13 | 2.64 | 2.22 | 0.63 | 1.22 | 3.13 | 3.55 | 3.54 | 1.89 | 4.36 | 1.46 | 2.44 | 4.58 | 6.00 | 2.71
CREStereo [45] | 1.38 | 1.06 | 2.63 | 2.53 | 0.64 | 1.11 | 1.42 | 5.31 | 3.22 | 2.40 | 2.51 | 1.92 | 2.31 | 1.78 | 1.83 | 2.10
GMStereo [90] | 1.34 | 1.32 | 2.26 | 2.23 | 1.01 | 1.62 | 1.84 | 2.49 | 3.19 | 2.18 | 2.10 | 2.19 | 2.08 | 1.71 | 1.75 | 1.89
CroCo-Stereo | 0.84 | 1.45 | 1.87 | 1.83 | 0.69 | 1.19 | 2.40 | 2.28 | 8.31 | 1.44 | 1.96 | 3.99 | 4.61 | 2.48 | 2.81 | 2.36

Table 2: Evaluation on Middlebury with the average error over all pixels for each sequence and the average (last column).
Sequences are ordered according to their ‘nd’ value, which is the official threshold of maximum disparity used to clip
predictions before evaluation.

Figure 6: Three example results from the Middlebury test set (Australia, Bicycle2 and Hoops) with, from left to right: the left image, the ground truth, CREStereo [45] and CroCo-Stereo.

Method | D1-bg↓ | D1-fg↓ | D1-all↓
AdaStereo [72] | 2.59 | 5.55 | 3.08
HITNet [78] | 1.74 | 3.20 | 1.98
PCWNet [69] | 1.37 | 3.16 | 1.67
GMStereo [90] | 1.49 | 3.14 | 1.77
ACVNet [88] | 1.37 | 3.07 | 1.65
LEAStereo [18] | 1.40 | 2.91 | 1.65
CREStereo [45] | 1.45 | 2.86 | 1.69
CroCo-Stereo | 1.38 | 2.65 | 1.59

Table 3: Evaluation on the KITTI 2015 stereo benchmark with the percentage of outliers (i.e., error above 3 pixels) for background (D1-bg), foreground (D1-fg) and all (D1-all) pixels.

To further evaluate the pre-trained encoder alone, we consider monocular tasks following the protocol of [4]. For semantic segmentation on ADE20k [100], we obtain 44.7 mean Intersection over Union vs. 40.6 for CroCo [84], and for monocular depth estimation on NYU v2 [70], we obtain 93.2 delta-1 vs. 90.1 for [84].

We provide in Appendix B an ablation on the impact of pre-training (i.e., a comparison with a randomly initialized network for finetuning), an ablation on the masking ratio during pre-training, as well as a comparison between the L1 loss and the Laplacian loss during finetuning.

CroCo-Stereo vs. the state of the art. We now evaluate CroCo-Stereo on the official leaderboards of Middlebury [65], KITTI 2015 [55], ETH3D [67] and Spring [54].

On Middlebury (Table 2), CroCo-Stereo obtains the lowest average error on 6 out of 15 sequences, in spite of using a generic patch-based transformer without any of the usual apparatus for stereo matching (e.g. cost-volume, coarse-to-fine processing, iterative refinement). However, on average, we obtain a worse error due to the fact that CroCo-Stereo produces really large errors for a few sequences like Hoops or ClassE. In fact, these errors correspond to cases with large maximum disparities (based on the maximum threshold value applied before evaluation), which is harmful for our simple tiling-based inference approach. This effect is visible in the prediction of the bottom example of Figure 6 where one can observe tiling artefacts, e.g. next to the stair pillars. In general, however, our method remains accurate, especially on thin structures like the pins on the map or the spokes of the bicycle wheels in Figure 6.

For KITTI 2015 (Table 3), we finetune CroCo-Stereo on 1216×352 crops from KITTI 2012 [24] and 2015 [55] for 20 epochs. CroCo-Stereo performs the best on the main D1-all metric (outlier ratio at a 3px error threshold), with the best value also on foreground pixels, and within 0.01% of the best methods on background pixels.

For ETH3D, we use a Laplacian loss without bounds as this benchmark is limited to small disparities, i.e., with parameterization d_i = e^{d'_i} and weights e^{−3 d'_i} for tiling. CroCo-Stereo sets a new state of the art for the ratio of pixels with an error over 0.5px (bad@0.5) and performs on par with CREStereo [45] for bad@1.0 and the average error, see Table 4. It outperforms recent approaches like GMStereo [90], RAFT-Stereo [48], DIP-Stereo [99] or HITNet [78] by a large margin, e.g. the bad@0.5 for non-occluded pixels is improved by 3% or more.

Finally, we report results on the recent Spring benchmark in Table 5, where our model is finetuned for 8 epochs on its training set. CroCo-Stereo outperforms the leading methods on all metrics with a large margin, i.e., the main bad@1 metric is reduced from 15% to 7% and the absolute error from 1.5 to 0.5px.
Method | bad@0.5 (%)↓ noc / all | bad@1.0 (%)↓ noc / all | avg err (px)↓ noc / all
AdaStereo [72] | 10.22 / 10.85 | 3.09 / 3.34 | 0.24 / 0.25
HITNet [78] | 7.89 / 8.41 | 2.79 / 3.11 | 0.20 / 0.22
RAFT-Stereo [48] | 7.04 / 7.33 | 2.44 / 2.60 | 0.18 / 0.19
DIP-Stereo [99] | 6.74 / 6.99 | 1.97 / 2.12 | 0.18 / 0.20
GMStereo [90] | 5.94 / 6.44 | 1.83 / 2.07 | 0.19 / 0.21
CREStereo [45] | 3.58 / 3.75 | 0.98 / 1.09 | 0.13 / 0.14
CroCo-Stereo | 3.27 / 3.51 | 0.99 / 1.14 | 0.14 / 0.15

Table 4: Evaluation on ETH3D with the percentage of pixels with an error over 0.5px (bad@0.5), over 1px (bad@1.0) and the average error over non-occluded (noc) or all pixels.

Figure 7: Two examples from the MPI-Sintel test set with, from left to right: the first image, the ground truth, GMFlow+ [90] and CroCo-Flow.

Method | 1px↓ | 1px s0-10↓ | 1px s10-40↓ | 1px s40+↓ | Abs↓
RAFT-Stereo [48] | 15.273 | 22.588 | 10.018 | 17.086 | 3.025
ACVNet [88]‡ | 14.772 | 18.386 | 11.346 | 18.145 | 1.516
CroCo-Stereo | 7.135 | 2.934 | 7.757 | 13.247 | 0.471

Table 5: Evaluation of CroCo-Stereo on the Spring benchmark with the percentage of outliers (error over 1px) over all pixels, or over pixels with disparities in [0,10] (s0-10), in [10,40] (s10-40) and over 40 pixels (s40+), as well as the average absolute error (Abs). ‡ means methods submitted by the leaderboard's authors.

Method | clean↓ | final↓
PWC-Net+ [76] | 3.45 | 4.60
RAFT† [79] | 1.61 | 2.86
CRAFT† [74] | 1.44 | 2.42
FlowFormer [31] | 1.20 | 2.12
SKFlow [77] | 1.30 | 2.26
GMFlow+ [90] | 1.03 | 2.12
CroCo-Flow | 1.09 | 2.44

Table 6: Evaluation on the MPI-Sintel benchmark with the EPE (↓) on the clean and final renderings. † means that the flow prediction from the previous frames is used as initialization.

Method | Fl-bg↓ | Fl-fg↓ | Fl-all↓
PWC-Net+ [76] | 7.69 | 7.88 | 7.72
RAFT† [79] | 4.74 | 6.87 | 5.10
CRAFT† [74] | 4.58 | 5.85 | 4.79
FlowFormer [31] | 4.37 | 6.18 | 4.68
GMFlow+ [90] | 4.27 | 5.60 | 4.49
CroCo-Flow | 3.18 | 5.94 | 3.64

Table 7: Evaluation of CroCo-Flow on the KITTI 2015 benchmark with the percentage of outliers for background (Fl-bg), foreground (Fl-fg) and all (Fl-all) pixels. † means that the flow prediction from the previous frames is used as initialization.

CroCo-Flow vs. the state of the art. We compare CroCo-Flow to the state of the art on the official leaderboards of MPI-Sintel [11], KITTI 2015 [55] and Spring [54].

On MPI-Sintel (Table 6), CroCo-Flow performs better than RAFT [79], which includes many specialized refinement steps and uses the previous flow estimation as initialization. We rank second on the clean rendering and perform competitively on the final rendering, on par with most recent approaches such as GMFlow+ [90], SKFlow [77] or FlowFormer [31]. Figure 7 shows some visualizations of flow prediction.

For KITTI 2015 (Table 7), the model is finetuned on the training set from KITTI 2012 and 2015 for 150 epochs on crops of size 1216×352. CroCo-Flow performs best on the main Fl-all metric, i.e., the percentage of outliers, with a large margin: the Fl-all is reduced from 4.49% to 3.64% compared to GMFlow+ [90]. This gap mainly comes from the background pixels, while we perform on par with the best methods on foreground pixels.

Finally, on Spring, for which we finetune the model on its training set for 12 epochs, we obtain state-of-the-art performance, see Table 8. We obtain an EPE of 0.50, compared to 0.64 for the second best method, with an outlier ratio reduced for all flow norm ranges.

Limitations. The tiling-based inference strategy may prevent an accurate estimate in case of extremely large disparity or flow, where the corresponding pixels can be outside of the tile of the second image. A tiling strategy smarter than taking the same cropping coordinates in a pair of images could be considered.

6. Conclusion
Method | 1px↓ | 1px s0-10↓ | 1px s10-40↓ | 1px s40+↓ | EPE↓
FlowFormer [31]‡ | 6.510 | 3.381 | 5.530 | 35.344 | 0.723
MS-Raft+ [32]‡ | 5.724 | 2.055 | 5.022 | 38.315 | 0.643
CroCo-Flow | 4.565 | 1.225 | 4.332 | 33.134 | 0.498

Table 8: Evaluation of CroCo-Flow on the Spring benchmark with the number of outliers (error over 1px) over all pixels, or over pixels with flow norm in [0,10] (s0-10), in [10,40] (s10-40) and over 40 pixels (s40+), as well as the endpoint error (EPE). ‡ means methods submitted by the leaderboard's authors.

References

[1] Mahmoud Assran, Randall Balestriero, Quentin Duval, Florian Bordes, Ishan Misra, Piotr Bojanowski, Pascal Vincent, Michael Rabbat, and Nicolas Ballas. The hidden uniform cluster prior in self-supervised learning. In ICLR, 2023. 3
[2] Mahmoud Assran, Mathilde Caron, Ishan Misra, Piotr Bojanowski, Florian Bordes, Pascal Vincent, Armand Joulin, Michael Rabbat, and Nicolas Ballas. Masked siamese networks for label-efficient learning. In ECCV, 2022. 1, 2, 3
[3] Sara Atito, Muhammad Awais, and Josef Kittler. SiT: Self-supervised vIsion Transformer. arXiv preprint arXiv:2104.03602, 2021. 3
[4] Roman Bachmann, David Mizrahi, Andrei Atanov, and Amir Zamir. MultiMAE: Multi-modal Multi-task Masked Autoencoders. In ECCV, 2022. 3, 8
[5] Alexei Baevski, Wei-Ning Hsu, Qiantong Xu, Arun Babu, Jiatao Gu, and Michael Auli. data2vec: A General Framework for Self-supervised Learning in Speech, Vision and Language. arXiv preprint arXiv:2202.03555, 2022. 3
[6] Shaojie Bai, Zhengyang Geng, Yash Savani, and J. Zico Kolter. Deep equilibrium optical flow estimation. In CVPR, 2022. 4
[7] Hangbo Bao, Li Dong, Songhao Piao, and Furu Wei. BEiT: BERT Pre-Training of Image Transformers. In ICLR, 2022. 3
[8] Gilad Baruch, Zhuoyuan Chen, Afshin Dehghan, Tal Dimry, Yuri Feigin, Peter Fu, Thomas Gebauer, Brandon Joffe, Daniel Kurz, Arik Schwartz, et al. ARKitScenes – a diverse real-world dataset for 3D indoor scene understanding using mobile RGB-D data. In NeurIPS, 2021. 5, 6
[9] Irwan Bello, Barret Zoph, Ashish Vaswani, Jonathon Shlens, and Quoc V. Le. Attention Augmented Convolutional Networks. In ICCV, 2019. 3
[10] Thomas Brox, Andrés Bruhn, Nils Papenberg, and Joachim Weickert. High accuracy optical flow estimation based on a theory for warping. In ECCV, 2004. 2, 6
[11] Daniel J Butler, Jonas Wulff, Garrett B Stanley, and Michael J Black. A naturalistic open source movie for optical flow evaluation. In ECCV, 2012. 2, 7, 9, 19
[12] Mathilde Caron, Ishan Misra, Julien Mairal, Priya Goyal, Piotr Bojanowski, and Armand Joulin. Unsupervised Learning of Visual Features by Contrasting Cluster Assignments. In NeurIPS, 2020. 1, 3
[13] Mathilde Caron, Hugo Touvron, Ishan Misra, Hervé Jégou, Julien Mairal, Piotr Bojanowski, and Armand Joulin. Emerging Properties in Self-Supervised Vision Transformers. In ICCV, 2021. 1, 3
[14] J. Chang and Y. Chen. Pyramid stereo matching network. In CVPR, 2018. 4
[15] Mark Chen, Alec Radford, Rewon Child, Jeffrey Wu, Heewoo Jun, David Luan, and Ilya Sutskever. Generative Pretraining From Pixels. In ICML, 2020. 3
[16] Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoffrey Hinton. A simple framework for contrastive learning of visual representations. In ICML, 2020. 1, 3
[17] Ting Chen, Simon Kornblith, Kevin Swersky, Mohammad Norouzi, and Geoffrey E. Hinton. Big self-supervised models are strong semi-supervised learners. In NeurIPS, 2020. 2
[18] Xuelian Cheng, Yiran Zhong, Mehrtash Harandi, Yuchao Dai, Xiaojun Chang, Hongdong Li, Tom Drummond, and Zongyuan Ge. Hierarchical neural architecture search for deep stereo matching. In NeurIPS, 2020. 2, 6, 8
[19] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In NAACL HLT, 2019. 3
[20] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. In ICLR, 2021. 2, 3, 4
[21] Alexey Dosovitskiy, Philipp Fischer, Eddy Ilg, Philip Hausser, Caner Hazirbas, Vladimir Golkov, Patrick Van Der Smagt, Daniel Cremers, and Thomas Brox. FlowNet: Learning optical flow with convolutional networks. In ICCV, 2015. 4, 7, 19
[22] Alaaeldin El-Nouby, Gautier Izacard, Hugo Touvron, Ivan Laptev, Hervé Jegou, and Edouard Grave. Are Large-scale Datasets Necessary for Self-Supervised Pre-training? arXiv preprint arXiv:2112.10740, 2021. 1, 3
[23] Yuxin Fang, Li Dong, Hangbo Bao, Xinggang Wang, and Furu Wei. Corrupted Image Modeling for Self-Supervised Visual Pre-Training. In ICLR, 2023. 3
[24] Andreas Geiger, Philip Lenz, and Raquel Urtasun. Are we ready for autonomous driving? The KITTI vision benchmark suite. In CVPR, 2012. 8, 15
[25] Spyros Gidaris, Praveer Singh, and Nikos Komodakis. Unsupervised Representation Learning by Predicting Image Rotations. In ICLR, 2018. 1
[26] Clément Godard, Oisin Mac Aodha, and Gabriel J Brostow. Unsupervised monocular depth estimation with left-right consistency. In CVPR, 2017. 4
[27] Clément Godard, Oisin Mac Aodha, Michael Firman, and Gabriel J Brostow. Digging into self-supervised monocular depth estimation. In ICCV, 2019. 4
[28] Jean-Bastien Grill, Florian Strub, Florent Altché, Corentin Tallec, Pierre H. Richemond, Elena Buchatskaya, Carl Doersch, Bernardo Ávila Pires, Zhaohan Guo, Mohammad Gheshlaghi Azar, Bilal Piot, Koray Kavukcuoglu, Rémi Munos, and Michal Valko. Bootstrap Your Own Latent - A New Approach to Self-Supervised Learning. In NeurIPS, 2020. 3
[29] Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Dollár, and Ross Girshick. Masked Autoencoders are Scalable Vision Learners. In CVPR, 2022. 1, 3, 6, 14, 15, 16
[30] Kaiming He, Haoqi Fan, Yuxin Wu, Saining Xie, and Ross Girshick. Momentum Contrast for Unsupervised Visual Representation Learning. In CVPR, 2020. 1, 3
[31] Zhaoyang Huang, Xiaoyu Shi, Chao Zhang, Qiang Wang, Ka Chun Cheung, Hongwei Qin, Jifeng Dai, and Hongsheng Li. FlowFormer: A transformer architecture for optical flow. In ECCV, 2022. 2, 4, 6, 7, 9, 10, 16, 18
[32] Azin Jahedi, Maximilian Luz, Lukas Mehl, Marc Rivinius, and Andrés Bruhn. High resolution multi-scale RAFT (robust vision challenge 2022). arXiv preprint arXiv:2210.16900, 2022. 10
[33] Shihao Jiang, Yao Lu, Hongdong Li, and Richard Hartley. Learning optical flow from a few matches. In CVPR, 2021. 4
[34] Shihao Jiang, Yao Lu, Hongdong Li, and Richard Hartley. Learning optical flow from a few matches. In CVPR, 2021. 16
[35] Shihao Jiang, Yao Lu, Hongdong Li, and Richard Hartley. Learning optical flow from a few matches. In CVPR, 2021. 16
[36] Zequn Jie, Pengfei Wang, Yonggen Ling, Bo Zhao, Yunchao Wei, Jiashi Feng, and Wei Liu. Left-right comparative recurrent model for stereo matching. In CVPR, 2018. 2, 4, 6
[37] Longlong Jing and Yingli Tian. Self-supervised Visual Feature Learning with Deep Neural Networks: A Survey. IEEE Trans. PAMI, 2021. 3
[38] Ioannis Kakogeorgiou, Spyros Gidaris, Bill Psomas, Yannis Avrithis, Andrei Bursuc, Konstantinos Karantzalos, and Nikos Komodakis. What to Hide from Your Students: Attention-Guided Masked Image Modeling. In ECCV, 2022. 3
[39] Alex Kendall, Yarin Gal, and Roberto Cipolla. Multi-task learning using uncertainty to weigh losses for scene geometry and semantics. In CVPR, 2018. 6
[40] A. Kendall, H. Martirosyan, S. Dasgupta, P. Henry, R. Kennedy, A. Bachrach, and A. Bry. End-to-end learning of geometry and context for deep stereo regression. In ICCV, 2017. 2, 4, 6
[41] Sameh Khamis, Sean Fanello, Christoph Rhemann, Adarsh Kowdle, Julien Valentin, and Shahram Izadi. StereoNet: Guided hierarchical refinement for real-time edge-aware depth prediction. In ECCV, 2018. 2, 4, 6
[42] Hamid Laga, Laurent Valentin Jospin, Farid Boussaid, and Mohammed Bennamoun. A survey on deep learning techniques for stereo-based depth estimation. IEEE Trans. PAMI, 2020. 4
[43] Donghwan Lee, Soohyun Ryu, Suyong Yeon, Yonghan Lee, Deokhwa Kim, Cheolho Han, Yohann Cabon, Philippe Weinzaepfel, Nicolas Guérin, Gabriela Csurka, et al. Large-scale localization datasets in crowded indoor spaces. In CVPR, 2021. 5, 6
[44] Youngwan Lee, Jeffrey Willette, Jonghee Kim, Juho Lee, and Sung Ju Hwang. Exploring the role of mean teachers in self-supervised masked auto-encoders. In ICLR, 2023. 3
[45] Jiankun Li, Peisen Wang, Pengfei Xiong, Tao Cai, Ziwei Yan, Lei Yang, Jiangyu Liu, Haoqiang Fan, and Shuaicheng Liu. Practical stereo matching via cascaded recurrent network with adaptive correlation. In CVPR, 2022. 2, 6, 7, 8, 9, 19
[46] Zhaowen Li, Zhiyang Chen, Fan Yang, Wei Li, Yousong Zhu, Chaoyang Zhao, Rui Deng, Liwei Wu, Rui Zhao, Ming Tang, and Jinqiao Wang. MST: Masked Self-Supervised Transformer for Visual Representation. In NeurIPS, 2021. 3
[47] Zhengqi Li and Noah Snavely. MegaDepth: Learning single-view depth prediction from internet photos. In CVPR, 2018. 5, 6
[48] Lahav Lipson, Zachary Teed, and Jia Deng. RAFT-Stereo: Multilevel recurrent field transforms for stereo matching. In 3DV, 2021. 2, 6, 8, 9
[49] Pengpeng Liu, Michael Lyu, Irwin King, and Jia Xu. SelFlow: Self-supervised learning of optical flow. In CVPR, 2019. 4
[50] Ze Liu, Han Hu, Yutong Lin, Zhuliang Yao, Zhenda Xie, Yixuan Wei, Jia Ning, Yue Cao, Zheng Zhang, Li Dong, et al. Swin transformer v2: Scaling up capacity and resolution. In CVPR, 2022. 4
[51] Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, and Baining Guo. Swin transformer: Hierarchical vision transformer using shifted windows. In ICCV, 2021. 4
[52] Ilya Loshchilov and Frank Hutter. Decoupled Weight Decay Regularization. In ICLR, 2019. 6, 18
[53] Nikolaus Mayer, Eddy Ilg, Philip Hausser, Philipp Fischer, Daniel Cremers, Alexey Dosovitskiy, and Thomas Brox. A large dataset to train convolutional networks for disparity, optical flow, and scene flow estimation. In CVPR, 2016. 4, 7, 19
[54] Lukas Mehl, Jenny Schmalfuss, Azin Jahedi, Yaroslava Nalivayko, and Andrés Bruhn. Spring: A high-resolution high-detail dataset and benchmark for scene flow, optical flow and stereo. In CVPR, 2023. 2, 8, 9
[55] Moritz Menze and Andreas Geiger. Object scene flow for autonomous vehicles. In CVPR, 2015. 2, 8, 9, 14
[56] Guang-Yu Nie, Ming-Ming Cheng, Yun Liu, Zhengfa Liang, Deng-Ping Fan, Yue Liu, and Yongtian Wang. Multi-level context ultra-aggregation for stereo matching. In CVPR, 2019. 4
[57] Mehdi Noroozi and Paolo Favaro. Unsupervised Learning of Visual Representations by Solving Jigsaw Puzzles. In ECCV, 2016. 1
[58] Senthil Purushwalkam and Abhinav Gupta. Demystifying contrastive self-supervised learning: Invariances, augmentations and dataset biases. In NeurIPS, 2020. 3
[59] Ilija Radosavovic, Tete Xiao, Stephen James, Pieter Abbeel, Jitendra Malik, and Trevor Darrell. Real-world robot learning with masked visual pre-training. CoRL, 2022. 3
[60] Pierluigi Zama Ramirez, Fabio Tosi, Matteo Poggi, Samuele Salti, Stefano Mattoccia, and Luigi Di Stefano. Open challenges in deep stereo: the booster dataset. In CVPR, 2022. 7, 19
[61] René Ranftl, Alexey Bochkovskiy, and Vladlen Koltun. Vision transformers for dense prediction. In ICCV, 2021. 2, 6, 7
[62] Jerome Revaud, Philippe Weinzaepfel, Zaid Harchaoui, and Cordelia Schmid. DeepMatching: Hierarchical deformable dense matching. IJCV, 2016. 6
[63] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, Alexander C. Berg, and Li Fei-Fei. ImageNet Large Scale Visual Recognition Challenge. IJCV, 2015. 3, 15, 16
[64] Manolis Savva, Abhishek Kadian, Oleksandr Maksymets, Yili Zhao, Erik Wijmans, Bhavana Jain, Julian Straub, Jia Liu, Vladlen Koltun, Jitendra Malik, Devi Parikh, and Dhruv Batra. Habitat: A Platform for Embodied AI Research. In ICCV, 2019. 4, 6
[65] Daniel Scharstein, Heiko Hirschmüller, York Kitajima, Greg Krathwohl, Nera Nešić, Xi Wang, and Porter Westling. High-resolution stereo datasets with subpixel-accurate ground truth. In GCPR, 2014. 7, 8, 14, 19
[66] Johannes Lutz Schönberger and Jan-Michael Frahm. Structure-from-motion revisited. In CVPR, 2016. 5
[67] Thomas Schops, Johannes L Schonberger, Silvano Galliani, Torsten Sattler, Konrad Schindler, Marc Pollefeys, and Andreas Geiger. A multi-view stereo benchmark with high-resolution images and multi-camera videos. In CVPR, 2017. 2, 7, 8, 19
[68] Peter Shaw, Jakob Uszkoreit, and Ashish Vaswani. Self-attention with Relative Position Representations. In NAACL HLT, 2018. 3
[69] Zhelun Shen, Yuchao Dai, Xibin Song, Zhibo Rao, Dingfu Zhou, and Liangjun Zhang. PCW-Net: Pyramid combination and warping cost volume for stereo matching. In ECCV, 2022. 8
[70] Nathan Silberman, Derek Hoiem, Pushmeet Kohli, and Rob Fergus. Indoor segmentation and support inference from RGBD images. In ECCV, 2012. 8
[71] Kihyuk Sohn, David Berthelot, Chun-Liang Li, Zizhao Zhang, Nicholas Carlini, Ekin D. Cubuk, Alex Kurakin, Han Zhang, and Colin Raffel. FixMatch: Simplifying semi-supervised learning with consistency and confidence. In NeurIPS, 2020. 2
[72] Xiao Song, Guorun Yang, Xinge Zhu, Hui Zhou, Zhe Wang, and Jianping Shi. AdaStereo: a simple and efficient approach for adaptive stereo matching. In CVPR, 2021. 6, 8, 9
[73] Jianlin Su, Yu Lu, Shengfeng Pan, Bo Wen, and Yunfeng Liu. RoFormer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:2104.09864, 2021. 2, 4, 6
[74] Xiuchao Sui, Shaohua Li, Xue Geng, Yan Wu, Xinxing Xu, Yong Liu, Rick Goh, and Hongyuan Zhu. CRAFT: Cross-attentional flow transformer for robust optical flow. In CVPR, 2022. 4, 9
[75] Deqing Sun, Stefan Roth, and Michael J Black. A quantitative analysis of current practices in optical flow estimation and the principles behind them. IJCV, 2014. 4
[76] Deqing Sun, Xiaodong Yang, Ming-Yu Liu, and Jan Kautz. Models matter, so does training: An empirical study of CNNs for optical flow estimation. IEEE Trans. PAMI, 2019. 2, 4, 6, 9
[77] Shangkun Sun, Yuanqi Chen, Yu Zhu, Guodong Guo, and Ge Li. SKFlow: Learning optical flow with super kernels. In NeurIPS, 2022. 9
[78] Vladimir Tankovich, Christian Hane, Yinda Zhang, Adarsh Kowdle, Sean Fanello, and Sofien Bouaziz. HITNet: Hierarchical iterative tile refinement network for real-time stereo matching. In CVPR, 2021. 6, 8, 9
[79] Zachary Teed and Jia Deng. RAFT: Recurrent all-pairs field transforms for optical flow. In ECCV, 2020. 2, 4, 9, 16
[80] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In NeurIPS, 2017. 3
[81] Wenshan Wang, Delong Zhu, Xiangwei Wang, Yaoyu Hu, Yuheng Qiu, Chen Wang, Yafei Hu, Ashish Kapoor, and Sebastian Scherer. TartanAir: A dataset to push the limits of visual SLAM. In IROS, 2020. 7, 19
[82] Xinlong Wang, Rufeng Zhang, Chunhua Shen, Tao Kong, and Lei Li. Dense Contrastive Learning for Self-Supervised Visual Pre-Training. In CVPR, 2021. 2
[83] Chen Wei, Haoqi Fan, Saining Xie, Chao-Yuan Wu, Alan Yuille, and Christoph Feichtenhofer. Masked Feature Prediction for Self-Supervised Visual Pre-Training. In CVPR, 2022. 1, 3
[84] Philippe Weinzaepfel, Vincent Leroy, Thomas Lucas, Romain Brégier, Yohann Cabon, Vaibhav Arora, Leonid Antsfeld, Boris Chidlovskii, Gabriela Csurka, and Jérôme Revaud. CroCo: Self-Supervised Pre-training for 3D Vision Tasks by Cross-View Completion. In NeurIPS, 2022. 2, 3, 4, 6, 7, 8, 14, 15, 16
[85] Zhenda Xie, Yutong Lin, Zheng Zhang, Yue Cao, Stephen Lin, and Han Hu. Propagate Yourself: Exploring Pixel-Level Consistency for Unsupervised Visual Representation Learning. In CVPR, 2021. 2
[86] Zhenda Xie, Zheng Zhang, Yue Cao, Yutong Lin, Jianmin Bao, Zhuliang Yao, Qi Dai, and Han Hu. SimMIM: A Simple Framework for Masked Image Modeling. In CVPR, 2022. 1, 3
[87] Yuwen Xiong, Mengye Ren, and Raquel Urtasun. LoCo: Local contrastive representation learning. In NeurIPS, 2020. 2
[88] Gangwei Xu, Junda Cheng, Peng Guo, and Xin Yang. Attention concatenation volume for accurate and efficient stereo matching. In CVPR, 2022. 6, 8, 9
[89] Haofei Xu, Jing Zhang, Jianfei Cai, Hamid Rezatofighi, and Dacheng Tao. GMFlow: Learning optical flow via global matching. In CVPR, 2022. 4, 16
[90] Haofei Xu, Jing Zhang, Jianfei Cai, Hamid Rezatofighi, Fisher Yu, Dacheng Tao, and Andreas Geiger. Unifying
flow, stereo and depth estimation. IEEE Trans. PAMI, 2023.
4, 8, 9, 16, 18, 19
[91] Yufei Xu, Jing Zhang, Qiming Zhang, and Dacheng Tao. Vitpose: Simple vision transformer baselines for human pose estimation. In NeurIPS, 2022. 3
[92] Zhichao Yin, Trevor Darrell, and Fisher Yu. Hierarchical discrete distribution decomposition for match density estimation. In CVPR, 2019. 2, 4, 6
[93] Jason J Yu, Adam W Harley, and Konstantinos G Derpanis. Back to basics: Unsupervised learning of optical flow via brightness constancy and motion smoothness. In ECCV workshop, 2016. 4
[94] Amir R Zamir, Tilman Wekel, Pulkit Agrawal, Colin Wei, Jitendra Malik, and Silvio Savarese. Generic 3D representation via pose estimation and matching. In ECCV, 2016. 5, 6
[95] Mingliang Zhai, Xuezhi Xiang, Ning Lv, and Xiangdong Kong. Optical flow and scene flow estimation: A survey. Pattern Recognition, 2021. 4
[96] Feihu Zhang, Oliver J. Woodford, Victor Prisacariu, and Philip H. S. Torr. Separable flow: Learning motion cost volumes for optical flow estimation. In ICCV, 2021. 4
[97] Richard Zhang, Phillip Isola, and Alexei A. Efros. Colorful Image Colorization. In ECCV, 2016. 1
[98] Shiyu Zhao, Long Zhao, Zhixing Zhang, Enyu Zhou, and Dimitris N. Metaxas. Global matching with overlapping attention for optical flow estimation. In CVPR, 2022. 4
[99] Zihua Zheng, Ni Nie, Zhi Ling, Pengfei Xiong, Jiangyu Liu, Hao Wang, and Jiankun Li. DIP: deep inverse patchmatch for high-resolution optical flow. In CVPR, 2022. 4, 6, 8, 9
[100] Bolei Zhou, Hang Zhao, Xavier Puig, Sanja Fidler, Adela Barriuso, and Antonio Torralba. Scene parsing through ADE20K Dataset. In CVPR, 2017. 8
[101] Chao Zhou, Hong Zhang, Xiaoyong Shen, and Jiaya Jia. Unsupervised learning of stereo matching. In ICCV, 2017. 4
[102] Jinghao Zhou, Chen Wei, Huiyu Wang, Wei Shen, Cihang Xie, Alan Yuille, and Tao Kong. iBOT: Image BERT Pre-training with Online Tokenizer. In ICLR, 2022. 1, 3
Figure 8: Cross-view reconstruction examples (pre-training pretext task) on scenes unseen during pre-training, for the original CroCo [84] and with our improvements. The images come from the Middlebury stereo benchmark [65]. Columns, from left to right: reference image, masked target, CroCo [84] reconstruction, CroCo v2 reconstruction, target image.

Appendix

In this appendix, we first provide visualizations of the capabilities of CroCo v2 on the pretext task of cross-view completion (Section A). We then present additional experimental results in Section B, including in particular (a) the impact of pre-training, (b) the runtime of our model and (c) an analysis of the probabilistic distributions regressed by our CroCo-Stereo model for the stereo matching task. We finally detail our training setup and the dataset splits (Section C).

A. Cross-view completion examples

To qualitatively evaluate the impact of CroCo v2, i.e., of the improvements that we propose on top of the CroCo [84] pre-training, we show several examples of cross-view completions on real-world scenes, coming either from Middlebury v3 [65] in Figure 8 or from KITTI [55] in Figure 9. Note that these methods, like MAE [29], regress pixel values that are normalized by the mean and standard deviation inside each patch; we thus apply the inverse transform for display, which means that the overall color of each patch will be correct, as it comes from the ground-truth values. While the most important measure of performance of these models is their transfer to downstream tasks, as explored in the main paper, it is noteworthy that our improved method is qualitatively better at solving the pretext task. We clearly observe that the reconstructions from the original CroCo [84] tend to be quite blurry in many areas, which might come from the fact that it relies on a smaller model and was pre-trained only on synthetic data from indoor environments, while details are impressively preserved thanks to our improvements. In Figure 8, note how the lines and the eyes are well reconstructed in the first row, or the roads on the maps of the third row, despite the high masking ratio (90%) applied to the masked image. Similarly, the text is clearly readable in the first row of Figure 9. Some predictions by our model still show some blur (e.g., on the left of the first and third rows of Figure 8), which makes sense because these parts are not visible in the reference image.
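To make the de-normalization step concrete, the following is a minimal sketch of how per-patch predictions can be mapped back to displayable pixel values using the ground-truth patch statistics, in the spirit of MAE-style normalized-pixel targets; tensor shapes and function names are illustrative, not those of the released code.

```python
# Minimal sketch (PyTorch): undo the per-patch normalization of predicted pixel
# values using the ground-truth patch statistics so that reconstructions can be
# displayed; shapes and names are illustrative assumptions.
import torch

def denormalize_patches(pred, target_patches, eps=1e-6):
    """pred, target_patches: (B, N, P*P*3) flattened patches; `pred` is normalized
    per patch, `target_patches` holds the raw ground-truth pixels."""
    mean = target_patches.mean(dim=-1, keepdim=True)
    var = target_patches.var(dim=-1, keepdim=True)
    # invert (x - mean) / sqrt(var + eps) using the ground-truth statistics
    return pred * (var + eps).sqrt() + mean
```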
Figure 9: Cross-view reconstruction examples (pre-training pretext task) on scenes unseen during pre-training, for the original CroCo [84] and with our improvements. The images come from the stereo benchmark of KITTI [24]. Columns, from left to right: reference image, masked target, CroCo [84] reconstruction, CroCo v2 reconstruction, target image.

                                                | Stereo (bad@2.0↓)      | Flow (EPE↓)
Network initialization                          | Md   ETH  SF(c) SF(f)  | FT(c)  FT(f)  Si.(c) Si.(f)
RoPE positional embedding, ViT-L encoder, Base decoder, 2M Habitat + 5.3M real pre-training pairs:
CroCo v2 pre-training                           | 15.5 0.38  5.0   5.3   |  2.85   2.45  1.43   1.99
random init.                                    | 43.4 1.06 11.0  11.2   | 10.53  10.57  4.84   5.49
cosine positional embedding, ViT-B encoder, Small decoder, 2M Habitat (synthetic only) pre-training pairs:
CroCo [84] pre-training                         | 26.3 1.82  6.7   7.0   |  3.89   3.56  2.07   2.57
MAE [29] (ImageNet) pre-training (encoder only) | 35.8 1.68  8.6   8.8   |  5.13   4.83  2.92   3.82
random init.                                    | 87.5 5.42 24.6  24.6   | 14.28  14.31  8.99   9.76

Table 9: Impact of pre-training. We compare the performance of our final model (first row), with our improved cross-view completion pre-training, to a randomly initialized version (second row). To compare to MAE [29], which is pre-trained on ImageNet [63] and based on cosine positional embeddings, we make the comparison with the original CroCo in the bottom rows.

B. Further experimental results

B.1. Impact of pre-training

In Table 9, we measure the impact of the pre-training on the downstream performance when the model is finetuned for stereo matching or optical flow. The first two rows compare our model, with our improved cross-view completion pre-training, to a random initialization. We observe a clear gain in performance, e.g. on the FlyingThings flow test set in the final rendering with an EPE of 2.45 pixels with pre-training vs. 10.57 without it, or on the Middlebury v3 stereo validation set with a bad@2.0 of 15.5% with pre-training vs. 43.4% without it.
Masking | Stereo (bad@2.0↓)      | Flow (EPE↓)
ratio   | Md   ETH  SF(c) SF(f)  | FT(c) FT(f) Si.(c) Si.(f)
80%     | 32.5 1.96  7.3   7.5   | 4.29  4.06  2.06   2.71
85%     | 59.2 1.15  8.7   9.0   | 3.48  3.08  1.99   2.41
90%     | 20.7 0.82  5.8   6.1   | 3.35  2.94  1.76   2.30

Table 10: Impact of the pre-training masking ratio for a model with RoPE positional embeddings, a ViT-B encoder and a Small decoder, pre-trained on 2M Habitat + 5.3M real pairs.
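For illustration, a random mask at the ratios studied in Table 10 can be sampled over the target-view patch tokens as in the sketch below; this is a generic example of ratio-based patch masking, not the exact sampling code used for pre-training.

```python
# Minimal sketch (PyTorch): sample a random binary mask over N patch tokens,
# masking a fraction `ratio` of them (True = masked), e.g. ratio=0.9.
import torch

def sample_patch_mask(batch_size, num_patches, ratio=0.9, device="cpu"):
    num_masked = int(round(ratio * num_patches))
    noise = torch.rand(batch_size, num_patches, device=device)
    ids = noise.argsort(dim=1)                    # random permutation per sample
    mask = torch.zeros(batch_size, num_patches, dtype=torch.bool, device=device)
    mask.scatter_(1, ids[:, :num_masked], True)
    return mask
```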

We are not aware of any other pre-training strategy, other than cross-view completion, that readily includes a binocular decoder or architecture. While it is still possible to initialize part of the layers using other pre-training strategies, this means that some important parts of the network are still initialized at random. Nevertheless, to compare to other pre-training strategies, we consider MAE [29] pre-trained on ImageNet [63], thus with a cosine positional embedding, a ViT-Base encoder, and with a Small decoder that is randomly initialized. We compare it to the original CroCo [84] pre-trained on synthetic data only. We observe that CroCo pre-training obtains the lowest errors, significantly outperforming the MAE pre-training and the random initialization.

Interestingly, the performance of this smaller model is also significantly better than that of the large one without pre-training. This again highlights the importance of pre-training with such a generic architecture.

Masking ratio. CroCo [84] finds that a masking ratio of 90% performs best for cross-view completion on their synthetic data. This is higher than the 75% masking ratio of MAE [29], as the unmasked reference view of the same scene adds redundancy. A question is whether this masking ratio of 90%, found optimal on synthetic data, generalizes to real data. Table 10 reports the performance on the stereo and flow downstream tasks for masking ratios of 80%, 85% and 90%. We find that a masking ratio of 90% also performs best when using real data.

B.2. Smaller training data

Most optical flow methods also report the performance on the MPI-Sintel training set when training on FlyingChairs and FlyingThings only. We report these values in Table 11. For RAFT [79] and GMFlow [90], we report the numbers before and after using iterative refinement procedures. Interestingly, CroCo-Flow performs better than these two methods before their refinement. Overall, the ranking is similar to the one on the MPI-Sintel test set where we use more training data. This indicates that our finetuning on geometric downstream tasks does not necessarily need large-scale training data, despite the size of our architecture.

Method                        | MPI-Sintel clean (EPE↓) | MPI-Sintel final (EPE↓)
LiteFlowNet2 [34]             | 2.24 | 3.78
FM-RAFT [35]                  | 1.29 | 2.95
FlowFormer [31]               | 1.01 | 2.40
RAFT [79] before refinement   | 4.04 | 5.45
RAFT [79]                     | 1.41 | 2.69
GMFlow [90] before refinement | 1.31 | 2.96
GMFlow [90]                   | 1.08 | 2.48
CroCo-Flow                    | 1.28 | 2.58

Table 11: Optical flow results when training on FlyingChairs and FlyingThings only. We report the EPE on the MPI-Sintel training set (clean and final renderings). Numbers for the first three rows come from [31], numbers for RAFT and GMFlow (before and after refinement) from [90].

B.3. Runtime and tiling

Runtime. In Table 12, we report the runtime of different sizes of our model, measured on a single tile of the same size as used for stereo training, i.e., 704×352, on an NVIDIA A100 GPU. Our method remains relatively fast on one tile, in the order of a few tens of milliseconds.

Pos.   | Encoder | Decoder | Runtime | #Parameters
cosine | ViT-B   | Small   | 25ms    | 139.4M (85.6M + 34.0M + 19.7M)
RoPE   | ViT-B   | Small   | 26ms    | 139.4M (85.6M + 34.0M + 19.7M)
RoPE   | ViT-B   | Base    | 29ms    | 219.7M (85.6M + 114.0M + 20.1M)
RoPE   | ViT-L   | Base    | 53ms    | 437.4M (303.1M + 114.2M + 20.1M)

Table 12: Runtime and number of parameters. Runtime is measured for a single tile of size 704×352 on an NVIDIA A100 GPU. For the number of parameters, we report in parentheses the numbers for the encoder, the decoder and the DPT head separately.

Number of parameters. We also report the number of trainable parameters in Table 12. This number is one order of magnitude higher than that of most existing stereo and flow methods. We did not study how this number of parameters could be reduced, and we also do not claim that our models are better than existing work for a fixed computational budget. Indeed, task-specific approaches have the advantage of being more sample efficient, i.e., requiring less data, by leveraging prior knowledge about the task. However, they also have the drawback of not being readily compatible with large-scale training on unlabeled data, because of task-dependent components, which limits the use of large generic models. Existing methods cannot easily be scaled up to a larger number of parameters, as training large models requires a lot of data. In the case of stereo and optical flow, for which labeled data is limited, this means using self-supervised learning, which cannot be straightforwardly applied to models that involve task-specific designs like cost volumes, image warping, etc. Thus, our contribution and our aim in this work is to show that pre-training large, generic architectures and finetuning them for stereo matching and optical flow is a valid path forward.
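The per-component counts reported in Table 12 can be obtained with a simple parameter count over the encoder, decoder and prediction head; the attribute names below are assumptions for illustration and may differ from the released implementation.

```python
# Minimal sketch: count trainable parameters per sub-module, as in Table 12.
def count_parameters(model):
    def numel(module):
        return sum(p.numel() for p in module.parameters() if p.requires_grad)
    # `encoder`, `decoder` and `head` are hypothetical attribute names.
    parts = {"encoder": numel(model.encoder),
             "decoder": numel(model.decoder),
             "head": numel(model.head)}
    parts["total"] = sum(parts.values())
    return parts
```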
Figure 10: Impact of the overlap ratio between tiles during inference. We plot the stereo performance (bad@2.0 in %) on Middlebury v3 (left) and the flow performance on MPI-Sintel in its final rendering (middle) when varying the overlap ratio during inference with tiling. We also plot the number of tiles this represents for a 1920×1080 image (right), which is proportional to the total runtime, for a crop size of 704×352 as used by CroCo-Stereo.
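As a rough illustration of how the number of tiles grows with the overlap ratio (right plot of Figure 10), the sketch below enumerates overlapping tiles covering an image for a given tile size; it is a simplified scheme, not necessarily the exact tiling strategy of CroCo-Stereo.

```python
# Minimal sketch: number of overlapping 704x352 tiles needed to cover a 1920x1080
# image for a given overlap ratio.
def tile_starts(length, tile, overlap):
    stride = max(1, int(round(tile * (1.0 - overlap))))
    last = max(length - tile, 0)
    starts = list(range(0, last + 1, stride))
    if starts[-1] != last:          # make sure the image border is covered
        starts.append(last)
    return starts

def num_tiles(width, height, tile_w=704, tile_h=352, overlap=0.5):
    return len(tile_starts(width, tile_w, overlap)) * len(tile_starts(height, tile_h, overlap))

print(num_tiles(1920, 1080, overlap=0.5))   # a few tens of tiles
print(num_tiles(1920, 1080, overlap=0.9))   # several hundred tiles
```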

Impact of the overlap ratio during tiling. In Figure 10, we report the performance and the number of tiles for a Full HD image (1920×1080) when varying the overlap ratio during inference with the tiling strategy. While the performance improves with a higher overlap ratio, the number of tiles can rapidly explode. With an overlap around 0.5 or 0.7, performance is quite close to the one obtained with 0.9 while the number of tiles remains reasonable. This may thus be the best trade-off in practical scenarios where the inference time has to stay small.

B.4. Laplacian-based loss

For flow and stereo, we regress a Laplacian distribution: the location parameter corresponds to the disparity or flow prediction, while the scale parameter can be seen as a measure of uncertainty. We thus denote here by 'uncertainty' the logarithm of the predicted scale of the Laplacian distribution that our downstream model outputs, i.e., log(d_i) from Equation 3.

Visualization of the uncertainty. We visualize this uncertainty in Figure 11 for a few examples. We observe that it is highly linked with the error of the predicted disparity, as red areas in the error maps correspond to blue areas in the uncertainty maps.

Figure 11: Visualization of the uncertainty predicted by CroCo-Stereo on a few examples from the SceneFlow test set. The first column shows the first image, the second column shows the error of the prediction clamped to the range [0, 10], and the third column shows the logarithm of the predicted scale of the Laplacian distribution output by the model: green colors denote confident areas while blue colors denote uncertain areas.

Statistics on the uncertainty. To better measure the correlation of our predicted uncertainty with the error of the disparity prediction, we plot a few statistics in Figure 12. On the left plot, we show some percentiles of the error when varying the predicted scale of the Laplacian distribution. We observe that a lower uncertainty clearly corresponds to pixels with the lowest errors, while a higher uncertainty corresponds to pixels with higher errors. On the right plot, we order pixels from the least uncertain to the most uncertain and show the percentiles of the errors when increasing the fraction of pixels considered. We observe the same behavior, showing the correlation of our uncertainty with the error of the prediction. Note that 95% of the pixels have an error below 1, hence the scale of the y-axis of the plot.
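The right plot of Figure 12 can be reproduced with a simple sparsification-style analysis: sort pixels by predicted uncertainty and compute error percentiles over growing fractions of pixels. The sketch below assumes flattened per-pixel arrays of absolute errors and predicted log-scales.

```python
# Minimal sketch (NumPy): error percentiles over pixels sorted from least to most
# uncertain, as a function of the fraction of pixels kept.
import numpy as np

def error_vs_uncertainty(errors, log_scales, fractions=np.linspace(0.05, 1.0, 20)):
    order = np.argsort(log_scales)          # least uncertain first
    sorted_err = np.asarray(errors)[order]
    stats = []
    for f in fractions:
        k = max(1, int(f * len(sorted_err)))
        p25, p50, p75 = np.percentile(sorted_err[:k], [25, 50, 75])
        stats.append((f, p25, p50, p75))
    return stats
```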
Figure 12: Statistics on the uncertainty predicted by CroCo-Stereo. We subsample 1,000 points per test image from SceneFlow in its clean rendering and compute the error of the prediction and the logarithm of the predicted scale of the Laplacian, i.e., the pixelwise uncertainty. On the left plot, we show the median of the error for a given predicted uncertainty (orange line), the 25- and 75-percentiles in dark blue, and the 10- and 90-percentiles in light blue. On the right plot, we sort pixels according to their predicted uncertainty, from the least to the most uncertain, and show the median of the error over the fraction of pixels considered (orange line), the 25- and 75-percentiles in dark blue, and the 10- and 90-percentiles in light blue.

Comparison with an L1 loss. In Table 13, we quantitatively evaluate the effect of using a loss on a Laplacian distribution (Equation 3) compared to using only an L1 loss. In the latter case, we cannot leverage the predicted scale of a Laplacian distribution for merging overlapping tiles; we thus follow [31] and use weights that decrease with the distance to the center of the image. We observe that the Laplacian loss outperforms the L1 loss on all stereo and flow benchmarks. A Laplacian loss can be interpreted as an L1 term weighted for each pixel according to an uncertainty measure, thus allowing uncertain pixels to be downweighted in practice. In addition, having access to the scale of the Laplacian allows a more elegant merging strategy for the overlapping tiles.

     | Stereo (bad@2.0↓)      | Flow (EPE↓)
Loss | Md   ETH  SF(c) SF(f)  | FT(c) FT(f) Si.(c) Si.(f)
L1   | 23.0 0.95  6.1   6.3   | 3.02  2.69  1.51   2.13
Lap. | 15.5 0.38  5.0   5.3   | 2.85  2.45  1.43   1.99

Table 13: Impact of the loss. We compare a standard L1 loss vs. the Laplacian (Lap.) loss.
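For reference, the two losses compared in Table 13 can be sketched as follows. The Laplacian term is written in the generic negative log-likelihood form implied by the discussion above; the exact formulation of Equation 3 (constants, clamping or parameterization of the scale) may differ in the released code.

```python
# Minimal sketch (PyTorch): plain L1 loss vs. a Laplacian negative log-likelihood,
# i.e., an L1 term weighted by a per-pixel scale plus a log-scale penalty.
import torch

def l1_loss(pred, gt):
    return (pred - gt).abs().mean()

def laplacian_nll(pred, scale, gt, eps=1e-6):
    # `pred` is the predicted disparity/flow (location), `scale` the per-pixel scale.
    scale = scale.clamp(min=eps)
    return ((pred - gt).abs() / scale + torch.log(scale)).mean()
```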
B.5. Towards smarter tiling

One limitation of our approach mentioned in the main paper is the tiling-based inference. For instance, CroCo-Stereo is based on crops with a width of 704 pixels, which means that for large disparity values the matching pixels fall outside the corresponding tile in the second image. As an alternative, we have tried a strategy where a second tile in the second image is also considered, shifted by 150 pixels, thus reducing the disparity value by the same amount. With the model with a ViT-Base encoder and a Base decoder, such a strategy reduces the bad@2.0 from 17.1% to 12.0% on the Middlebury v3 validation set, when replacing the predictions over 200px from the original tile with the ones from the secondary tile. While this strategy seems promising, it is not fully satisfactory as it multiplies the number of tiles to process by 2. We hope to find better strategies in the future.

C. Training details

CroCo-Stereo training. We train CroCo-Stereo for 32 epochs using batches of 6 pairs of 704×352 crops. We detail the training/validation pairs used for our ablations in Table 14. We use the AdamW optimizer [52] with a weight decay of 0.05, a cosine learning rate schedule with a single warm-up epoch, and a learning rate of 3·10^-5. During training, we apply standard data augmentations: color jittering (asymmetrically with probability 0.2), random vertical flipping with probability 0.1, random scaling with probability 0.8 in the range [2^-0.2, 2^0.4], stretching (resizing differently along the x and y axes) with probability 0.8 in the range [2^-0.2, 2^0.2], and slight jittering of the right image with probability 0.5. When submitting to the official leaderboards, we include the pairs that were kept apart from the training sets for validation into the training epochs.

CroCo-Flow training. We train CroCo-Flow for 240 epochs of 30,000 pairs each, randomly sampled from all available data, using batches of 8 pairs of crops of size 384×320. We detail the training/validation pairs used for our ablations in Table 15. To better balance the datasets, we set the probability of choosing a random pair from each of them, see Table 15. We use the AdamW optimizer, a weight decay of 0.05, a cosine learning rate schedule with linear warm-up over 1 epoch, and a base learning rate of 2·10^-5. During training, we apply standard augmentations [90]: random color jittering (asymmetrically with probability 0.2), random scaling with probability 0.8 with a scale sampled in [2^-0.2, 2^0.5], and stretching with probability 0.8 in the range [2^-0.2, 2^0.2].
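The optimization setup described above (AdamW, weight decay 0.05, cosine schedule with linear warm-up) can be sketched as below; the helper names are illustrative and the released training code may organize this differently.

```python
# Minimal sketch (PyTorch): AdamW optimizer and a cosine learning-rate schedule
# with linear warm-up, using the CroCo-Stereo hyper-parameters listed above.
import math
import torch

def build_optimizer(model, base_lr=3e-5, weight_decay=0.05):
    return torch.optim.AdamW(model.parameters(), lr=base_lr, weight_decay=weight_decay)

def lr_at(step, total_steps, warmup_steps, base_lr=3e-5, min_lr=0.0):
    if step < warmup_steps:                                  # linear warm-up
        return base_lr * step / max(1, warmup_steps)
    t = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * t))
```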
Stereo dataset     | # pairs  | Comment
CREStereo [45]     | 200,000  | all training pairs
SceneFlow [53]     | 70,908   | Driving, Monkaa and FlyingThings in both clean and final renderings; 4,370 validation pairs from the FlyingThings test set for each rendering (clean and final)
ETH3D Low Res [67] | 30 × 24  | 'delivery area 3s', 'electro 3l' and 'playground 3l' (3 pairs) are kept apart for validation
Middlebury v3 [65] | 50 × 14  | 'Vintage' (1 pair) is kept apart for validation; we use the 'full' resolution
Middlebury 21      | 50 × 335 | 'traproom1' and 'traproom2' are kept apart for validation (20 pairs)
Middlebury 14      | 50 × 132 | 'Umbrella-imperfect' and 'Vintage-perfect' are kept apart for validation (6 pairs)
Middlebury 06      | 50 × 171 | 'Rocks1' and 'Wood2' are kept apart for validation (18 pairs)
Middlebury 05      | 50 × 45  | 'Reindeer' is kept apart for validation (9 pairs)
Booster [60]       | 213      | only the 'balanced' subset; 'Vodka' and 'Washer' sequences (15 pairs) are kept apart for validation
total              | 306,691  |

Table 14: Overview of our stereo training data. We indicate the train/val split used for the ablations, as well as the number of training pairs. For ETH3D and Middlebury, each pair is also considered multiple times in each epoch.

Flow dataset      | # pairs | Prob. | Comment
FlyingChairs [21] | 22,232  | 12%   | -
FlyingThings [53] | 80,604  | 40%   | 40,302 pairs for both 'clean' and 'final' renderings; we use the same 1,024 validation pairs from the test set as [90]
MPI-Sintel [11]   | 943     | 10%   | sequences 'temple 2' and 'temple 3' (98 pairs) are kept apart for validation
TartanAir [81]    | 306,268 | 38%   | -
total             | 410,047 | 100%  |

Table 15: Overview of our flow training data. We indicate the train/val split used for the ablations, as well as the number of remaining training pairs. During training, we set a number of images per epoch and randomly sample them among the available datasets with the percentages shown in the 'Prob.' column.
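The dataset-balancing scheme for CroCo-Flow (a fixed number of pairs per epoch, each drawn from a dataset chosen according to the 'Prob.' column) can be sketched as follows; the pair loading itself is omitted and the structure is only illustrative.

```python
# Minimal sketch: sample one epoch of (dataset, pair index) tuples according to the
# sampling probabilities of Table 15.
import random

DATASETS = {  # name: (number of training pairs, sampling probability)
    "FlyingChairs": (22_232, 0.12),
    "FlyingThings": (80_604, 0.40),
    "MPI-Sintel":   (943,    0.10),
    "TartanAir":    (306_268, 0.38),
}

def sample_epoch(pairs_per_epoch=30_000, seed=0):
    rng = random.Random(seed)
    names = list(DATASETS)
    weights = [DATASETS[n][1] for n in names]
    epoch = []
    for _ in range(pairs_per_epoch):
        name = rng.choices(names, weights=weights, k=1)[0]
        epoch.append((name, rng.randrange(DATASETS[name][0])))
    return epoch
```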
