School of Computer Science and Technology, Ocean University of China, Qingdao 266101, China
{liushanglong, gwc1323, xly3385}@stu.ouc.edu.cn
{qilin,dongjunyu}@ouc.edu.cn

Superpixel Cost Volume Excitation for Stereo Matching

Supported by National Natural Science Foundation of China (Grant No. 41927805).

Shanglong Liu    Lin Qi(✉)    Junyu Dong(✉)    Wenxiang Gu    Liyi Xu
Abstract

In this work, we concentrate on exciting the intrinsic local consistency of stereo matching through the incorporation of superpixel soft constraints, with the objective of mitigating inaccuracies at the boundaries of predicted disparity maps. Our approach capitalizes on the observation that neighboring pixels are predisposed to belong to the same object and exhibit closely similar intensities within the probability volume of superpixels. By incorporating this insight, our method encourages the network to generate consistent probability distributions of disparity within each superpixel, aiming to improve the overall accuracy and coherence of predicted disparity maps. Experimental evaluations on widely-used datasets validate the efficacy of our proposed approach, demonstrating its ability to assist cost volume-based matching networks in restoring competitive performance.

Keywords: Stereo Matching · Superpixel · Cross-Entropy

Figure 1: Visualization of the real output distribution at boundaries on the Scene Flow dataset. (a) is the input image and its partial enlargement. (b) represents the disparity probability distribution of the superpixel belonging to the brown region. (c) and (d) show the output probability distributions of a given pixel from GwcNet and GwcNet+Ours. Our proposed methods rectify the incorrect distributions and avoid smoothness bias. Please zoom in to see the details.

1 Introduction

Stereo matching endeavors to establish dense correspondences between rectified stereo pairs, enabling the recovery of scene depth through triangulation[1]. This technique finds broad applications in diverse fields, including robot navigation, augmented reality, and autonomous driving.

Recently, stereo models have demonstrated exceptional performance through the utilization of a cost volume-based architecture[12, 13, 14], typically comprising four key steps: feature extraction, cost volume construction, cost aggregation, and disparity regression. Among these steps, cost aggregation stands out as the most crucial module, responsible for selecting the optimal match from numerous potential pairs and generating probability representations for the cost volume. However, state-of-the-art models face challenges in effectively addressing local ambiguities at boundaries, where definitively determining the pixel’s belonging region is complex. This frequently leads to a multi-peaked distribution in the aggregated probability volume, giving rise to the problem of over-smoothing[16, 17].

In this study, we endeavor to rectify this mismatch and eliminate redundant information by incorporating a pixel relationship prior. Drawing inspiration from the premise that depth transitions smoothly within homologous regions[19, 3, 9], we posit that depth discontinuities manifest solely between distinct regions. Hence, we introduce the concept of superpixels[5], defined as clusters of contiguous and perceptually coherent pixels, which offer a more coarse-grained representation of the image. Several recent superpixel segmentation methods have been successfully integrated into various low-level tasks, including optical flow estimation, monocular depth estimation[11], and depth completion. They play a crucial role in decreasing the number of primitives in image processing, extracting similar features, and capturing image structure information.

Capitalizing on their inherent clustering and boundary properties, we integrate superpixel segmentation to produce a superpixel-level probability volume. However, the effectiveness of a strong-constraint disparity filtering strategy is limited by the coarse-grained nature of the superpixel representation, which cannot be refined to each disparity level. To address this limitation, we model the ground truth at the superpixel level using a Laplace distribution[4] and apply a cross-entropy loss to this representation to suppress the multi-peaked issue when the cost is aggregated into probabilities. This superpixel training head proves highly effective in aiding aggregation, generating a more accurate probability representation for the cost volume while requiring no additional computation or parameters at inference, which matters for such resource-constrained tasks. As illustrated in Figure 1, our experiments show that this approach facilitates the convergence of the probability volume within the same superpixel, rectifying outliers through the overall distribution. To maintain color and spatial consistency, we adjust the sub-network’s task orientation towards disparity reconstruction, leveraging the principle that pixels within superpixel blocks from color images share similar disparities. This enables attention weights, derived from the sub-network’s semantic features, to effectively enhance local geometric consistency within the cost volume in the channel dimension, thereby encoding meaningful relationships between pixels.

Figure 2: The proposed stereo matching framework consists of a stereo matching pipeline and a sub-network for superpixel segmentation. The superpixel branch (cyan) takes the left image as input and assists the stereo branch (black).

2 Related Work

The cost volume-based architecture is designed to enhance the accuracy of depth estimation by constructing and optimizing the cost associated with candidate disparities. This volume is formed by concatenating or correlating feature maps extracted from the left and right images at various disparity levels. GCNet[12] pioneered the integration of a 3D encoder-decoder structure, utilizing soft-argmin-based disparity regression derived from a probabilistic cost volume. Subsequently, advancements such as the group-wise correlation cost volume introduced by GwcNet[14] and the attention-based cost volume proposed by ACVNet[3] aimed to augment the representational capacity of the cost volume. These end-to-end deep learning methods primarily supervise the disparity outputs, neglecting the rationality of their distributions. AcfNet[2] addresses this issue by directly supervising the cost volume with unimodal ground truth distributions. However, due to its reliance on pixel-level operations, the network may struggle to learn scene structural information and could potentially overfit to a single dataset.

Superpixels play a crucial role in local optimization and global consistency in stereo matching. Previous studies [9, 24] demonstrate that $\alpha$-expansion, which segments images into larger regions and assumes similar 3D plane labels within each segment, effectively optimizes disparity estimation by propagating consistent plane labels. SFCN[18] effectively preserves object boundaries and fine-grained details by incorporating superpixels, which replace conventional upsampling methods in the downsampling or upsampling scheme. However, this technique does not contribute to the matching process. In contrast, our approach delves deeper into the pixel relationship information inherent in the cost volume, emphasizing the collective impact of neighboring pixels on disparity estimation.

Figure 3: Superpixel guided channel excitation module. The multi-scale short connections achieved through 2D convolutional kernels of varying sizes and strides combined with upsampling result in rich object context within the superpixel branch.

3 Methods

3.1 Superpixel Guided Channel Excitation

As shown in Figure 3, superpixel segmentation is implemented using a standard encoder-decoder architecture with skip connections[18]. We argue that object context is crucial for accurate segmentation, as its multi-scale features contain valuable information about object shape and affinity. Therefore, we use channel excitation to embed the object context into the cost volume. Unlike CoEx[22], which excites only the cost volume features at the corresponding scale, we fuse the hierarchical multi-scale features $\phi(I_l)_k \in \mathbb{R}^{N \times \frac{H}{k} \times \frac{W}{k}}$, $k \in \{4, 8, 16\}$, from the sub-network. Through simple multi-scale short connections, denoted as $g$, we obtain the superpixel semantic guidance. Before each cost aggregation, the guided cost volume excitation is calculated as:

$$C'_{cost} = \sigma(g(\phi(I_l)_k)) \odot C_{cost} \tag{1}$$

where $\sigma$ denotes the sigmoid function that converts the guidance into an attention weight map. These attention weights emphasize both local consistency and discontinuity within the cost volume along the channel dimension, and $\odot$ represents the Hadamard product after broadcasting the attention across the disparity dimension. $C_{cost}$ (of size $N \times \frac{1}{4}D \times \frac{1}{4}H \times \frac{1}{4}W$) represents the 4D cost volume constructed from the features of the left and right images. This process generates a geometrically encoded cost volume, $C'_{cost}$, which allows 3D convolutions to aggregate information from neighboring pixels and capture geometric relationships inherent in the data.
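The excitation in Equation 1 can be illustrated with a minimal pure-Python sketch. It assumes the guidance $g(\phi(I_l)_k)$ has already been computed and resampled to the spatial size of the cost volume; the function names and the nested-list layout are hypothetical and chosen only for illustration, not taken from the authors' implementation.

```python
import math


def sigmoid(x):
    # Logistic function; inputs are assumed to be moderate in magnitude.
    return 1.0 / (1.0 + math.exp(-x))


def excite_cost_volume(cost, guidance):
    """Apply sigmoid(guidance) to the cost volume, broadcast over disparity.

    cost:     nested list indexed as [N][D][H][W] (channels, disparity, rows, cols)
    guidance: nested list indexed as [N][H][W] (hypothetical output of g)
    """
    excited = []
    for cost_channel, guide_channel in zip(cost, guidance):
        # One attention map per channel, shared by every disparity level.
        weights = [[sigmoid(g) for g in row] for row in guide_channel]
        excited.append([
            [[w * c for w, c in zip(w_row, c_row)]
             for w_row, c_row in zip(weights, plane)]
            for plane in cost_channel
        ])
    return excited


# Toy volume: 1 channel, 2 disparity levels, 1 x 2 pixels.
cost = [[[[1.0, 2.0]], [[3.0, 4.0]]]]
guidance = [[[0.0, 100.0]]]
out = excite_cost_volume(cost, guidance)
```

In this toy case the first pixel is scaled by 0.5 at every disparity level, while the second pixel passes through almost unchanged, which is the broadcasting behaviour described for $\odot$.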

Figure 4: The main components of the joint learning training head, which combines the output results from two branches, consist of a variance estimator for predicting matchability and a superpixel pooling module, all driven by the cross-entropy loss function.

3.2 Superpixel Pooling of Probability Volume

Laplace Distribution. In an ideal scenario, the disparity probability distribution manifests itself in a unimodal form, where the probability values diminish with the distance from the true matching pixel, peaking at the ground truth disparity value. To more accurately depict the variance in disparity probability distributions across different matching regions, we adaptively model a unimodal distribution akin to AcfNet[2] as follows:

$$P^{gt}(d) = \text{softmax}\left(-\frac{|d - d^{gt}|}{v}\right) \tag{2}$$

As depicted in Figure 4, the variance $v$ is computed based on the aggregated cost volume (i.e., the probability volume). Challenging pixels often exhibit multi-modal probability distributions, with their variance typically being large. This parameter controls the sharpness of the peak around the true disparity, adjusting it according to the matchability. Specifically, when a point struggles to distinctly delineate the pixel’s region of belonging or resides within a region characterized by weak textural attributes during stereo matching, it exhibits a comparatively smoother peak. This adaptive modeling enhances the precision of disparity estimation across various matching scenarios.
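As a small illustration of Equation 2, the sketch below builds the unimodal target for a single pixel. The variance `v` is passed in as a plain number here, whereas in the paper it is predicted from the probability volume; the function name and arguments are hypothetical.

```python
import math


def laplace_ground_truth(d_gt, v, num_disparities):
    """Softmax of -|d - d_gt| / v over the integer disparities 0..num_disparities-1."""
    logits = [-abs(d - d_gt) / v for d in range(num_disparities)]
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


sharp = laplace_ground_truth(d_gt=3.0, v=0.5, num_disparities=8)
smooth = laplace_ground_truth(d_gt=3.0, v=2.0, num_disparities=8)
```

Both targets peak at disparity 3, and the larger `v` spreads more mass to neighboring disparities, matching the description of a smoother peak for hard-to-match pixels.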

Superpixel Pooling. Given the predicted superpixel association probability map $\mathbf{Q} \in \mathbb{R}^{|N_s| \times H \times W}$ for image $I_l$, where $N_s$ denotes the 9 surrounding initial grid cells associated with each pixel $p$, we obtain the superpixel label map $m$ by assigning each pixel to its most likely superpixel using $m = \arg\max Q(p)$. Its inverse mapping $\tilde{m}$ gives the pixel indices belonging to each superpixel label.

To capture the disparity probability distribution within each superpixel, we leverage $m$ and the aggregated volume $C_{prob}$ (of size $D \times H \times W$) to generate a superpixel probability volume $P_s$, defined as:

$$P_s = \left(\prod_{p \in \tilde{m}_s} C_{prob}(p)\right)^{\frac{1}{n}} \tag{3}$$

where $n$ denotes the number of pixels within the specific superpixel $s$. To ensure numerical stability, circumvent underflow issues arising from probabilistic multiplication, and simplify computational complexity, we conduct the pooling process in logarithmic space:

$$\ln(P_s) = \frac{1}{n}\sum_{p \in \tilde{m}_s} \ln(C_{prob}(p)) \tag{4}$$

To recover the original superpixel probability volume from logarithmic space, we apply exponential operations. We can perform superpixel geometric mean pooling over the modeled ground truth from the preceding section or probability volume, to obtain a superpixel-level probability representation. The probability distributions of pixels exhibit a collective influence, where their interactions shape the overall probability distribution within superpixels. Notably, this superpixel probability volume is generated solely for supervision during training, ensuring no added computational or memory demands during inference.
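The log-space pooling of Equation 4 can be sketched as follows for a flattened image. The flat list layout, the function name, and the small `eps` clamp that guards against `log(0)` are assumptions made for this example and are not specified in the paper.

```python
import math


def superpixel_log_pool(prob, labels, eps=1e-12):
    """Average per-pixel log-probabilities within each superpixel.

    prob:   list over pixels, each entry a list of D probabilities
    labels: list over pixels, the superpixel id of each pixel (the label map m)
    Returns a dict mapping superpixel id -> list of ln(P_s(d)) for d = 0..D-1.
    """
    sums, counts = {}, {}
    for dist, s in zip(prob, labels):
        acc = sums.setdefault(s, [0.0] * len(dist))
        for d, p in enumerate(dist):
            acc[d] += math.log(max(p, eps))  # eps clamp is an added safeguard
        counts[s] = counts.get(s, 0) + 1
    return {s: [v / counts[s] for v in acc] for s, acc in sums.items()}


# Two pixels in superpixel 0, one pixel in superpixel 1, D = 2.
prob = [[0.5, 0.5], [0.125, 0.875], [0.9, 0.1]]
labels = [0, 0, 1]
log_ps = superpixel_log_pool(prob, labels)
ps0 = [math.exp(v) for v in log_ps[0]]  # back to the geometric mean of Equation 3
```

For superpixel 0 the first entry is the geometric mean of 0.5 and 0.125, i.e. 0.25. Note that a geometric mean of distributions does not in general sum to one; the sketch follows Equations 3 and 4 literally and applies no renormalization.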

3.3 Training Head

Loss for Single Tasks. After getting the final probability volume, the soft-argmin operation is used to compute disparity for each pixel by taking the expected value[12]. To ensure regression focuses on the most probable mode, we utilize the top $k$ values from the probability volume:

$$\hat{d} = \sum_{d \in \{d_1, d_2, \ldots, d_k\}} d \times \text{Softmax}(C_{prob}(d)) \tag{5}$$
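A minimal per-pixel sketch of the top-$k$ regression in Equation 5 is given below. It assumes that the softmax is re-applied over only the $k$ largest entries of the aggregated volume for that pixel; the function name and the list-based input are hypothetical.

```python
import math


def topk_soft_argmin(scores, k):
    """Expected disparity over the k highest-scoring disparity candidates."""
    # Indices of the k largest scores (ties keep the lower disparity first).
    top = sorted(range(len(scores)), key=lambda d: scores[d], reverse=True)[:k]
    peak = max(scores[d] for d in top)
    exps = {d: math.exp(scores[d] - peak) for d in top}
    norm = sum(exps.values())
    return sum(d * e / norm for d, e in exps.items())


# Main mode at disparities 1 and 2, with a weaker secondary mode at 4.
scores = [0.0, 10.0, 10.0, 0.0, 9.0]
d_top2 = topk_soft_argmin(scores, k=2)
d_full = topk_soft_argmin(scores, k=len(scores))
```

With `k=2` the estimate stays inside the dominant mode, whereas regressing over all candidates is pulled towards the secondary peak, which is the over-smoothing effect the restriction is meant to limit.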

The output results of the two branches, as shown in Figure 2, are fed into the training head for final supervision. For the disparity estimation task, we mainly use the Smooth $L_1$ loss, which has been widely used in various regression tasks:

$$\mathcal{L}_{regression} = \frac{1}{N}\sum_{p} \text{Smooth}_{L_1}(d_p, \hat{d}_p) \tag{6}$$

In Equation 6, $\hat{d}_p$ and $d_p$ are the predicted disparity and the corresponding ground truth, respectively, and $N$ is the number of valid pixels. During training, we supervise the estimation of each regression stage.

As for the superpixel segmentation auxiliary task, to encourage the segmentation network to generate superpixels that effectively represent disparity, we further define a disparity reconstruction loss [18, 8]:

$$\mathcal{L}_{recon} = \frac{1}{N}\sum_{p} \left\|d_p - d_p'\right\|_1 + w \cdot \left\|p - p'\right\|_2 \tag{7}$$

where $d'$ and $p'$ represent the superpixel reconstruction results obtained by left-multiplying with $\tilde{Q}\hat{Q}^{T}$, in which $\tilde{Q}$ and $\hat{Q}$ are the row- and column-normalized versions of the association map $Q$, respectively, and $w$ controls the compactness of the superpixels.

Superpixel Cross-Entropy Loss. The probability after adaptive unimodal distribution modeling and superpixel pooling incorporates the contributions of neighboring pixels, emphasizing similar distributions that reflect the dominant trend within a superpixel. We supervise it with the superpixel cross-entropy loss

$$\mathcal{L}_{sce} = -\frac{1}{N_s}\sum_{d=0}^{D-1} P_s^{gt}(d) \cdot \log P_s(d) \tag{8}$$

which measures the similarity between the prediction $P_s$ and the constructed ground truth $P_s^{gt}$. The total loss function is the sum of these three components:

$$\mathcal{L}_{total} = \mathcal{L}_{regression} + \lambda \mathcal{L}_{sce} + \mu \mathcal{L}_{recon} \tag{9}$$

During training, we heuristically set $\lambda = 1$ and $\mu = 0.1$ in our experiments.
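Equations 8 and 9 can be sketched together as follows. The sketch assumes that the loss is averaged over the superpixels present in the image and that the pooled prediction is supplied as log-probabilities (as produced by Equation 4); both function names and the dictionary layout are hypothetical.

```python
import math


def superpixel_cross_entropy(log_pred, target):
    """Cross-entropy between pooled predictions and pooled targets.

    log_pred: dict superpixel id -> list of ln(P_s(d))
    target:   dict superpixel id -> list of P_s^gt(d)
    The average over superpixels is an assumption of this sketch.
    """
    total = 0.0
    for s, gt in target.items():
        total -= sum(t * lp for t, lp in zip(gt, log_pred[s]))
    return total / len(target)


def total_loss(l_regression, l_sce, l_recon, lam=1.0, mu=0.1):
    # Weighted sum of Equation 9 with the weights reported in the text.
    return l_regression + lam * l_sce + mu * l_recon


log_pred = {0: [math.log(0.25), math.log(0.75)]}
target = {0: [0.0, 1.0]}
l_sce = superpixel_cross_entropy(log_pred, target)
loss = total_loss(l_regression=1.0, l_sce=l_sce, l_recon=3.0)
```

Because the prediction enters only through its logarithm, the pooled log-probabilities can be used directly without exponentiating them first.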

4 Experiments

4.1 Implementation Details

We implemented the proposed method using PyTorch and conducted experiments on NVIDIA RTX 3090 GPUs, employing the Adam optimizer with $\beta_1 = 0.9$ and $\beta_2 = 0.999$. To facilitate model generalization, we augmented input images during the training phase by employing random cropping to a size of $H = 256$ and $W = 512$.

For the Scene Flow dataset, we trained GwcNet integrated with the proposed techniques for a total of 16 epochs. An initial learning rate of 0.001 was applied and reduced by a factor of 2 after epochs 10, 12, and 14 to ensure model convergence. A batch size of 4 was used to optimize memory utilization. The weighting factor $w$ was set to $5 \times 10^{-3}$ for an appropriate contribution of the disparity reconstruction loss, and $k$ was set to 6 based on the superior performance observed in our prior work. To further enhance model performance, we fine-tuned the models pre-trained on Scene Flow using the KITTI and Middlebury datasets. This fine-tuning process involved 300 additional epochs with an initial learning rate of 0.001, reduced by a factor of 10 after 200 epochs to facilitate fine-grained adjustments in the later training stages.
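The Scene Flow learning-rate schedule described above can be written as a small helper. The function name is hypothetical, and epochs are assumed to be counted from 1, so that the rate changes only once a milestone epoch has finished.

```python
def scene_flow_learning_rate(epoch, base_lr=1e-3, milestones=(10, 12, 14), factor=2.0):
    """Learning rate used during the given 1-based epoch.

    The rate is divided by `factor` once for every milestone epoch already completed.
    """
    completed = sum(1 for m in milestones if epoch > m)
    return base_lr / (factor ** completed)


schedule = [scene_flow_learning_rate(e) for e in range(1, 17)]
```

Under these assumptions, epochs 1 to 10 use 0.001, epochs 11 and 12 use 0.0005, epochs 13 and 14 use 0.00025, and the last two epochs use 0.000125.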

To ensure consistency and focus within the defined disparity range, we excluded ground truth disparities falling outside the interval $[0, D_{max}]$ during experiments, where $D_{max}$ was set to 192.

Table 1: Ablation study on Scene Flow finalpass dataset.
Method $\mathcal{L}_{ce}$ $\mathcal{L}_{sce}$ $\mathcal{L}_{reconC}$ $\mathcal{L}_{reconD}$ EPE (px) 1 px (%) 2 px (%) 3 px (%)
GwcNet - - - - 0.765 8.03 4.47 3.30
- - 0.670 6.50 3.78 2.86
+ SGCE - - - 0.645 6.60 3.71 2.74
- - - 0.626 6.44 3.63 2.70
- - 0.622 6.49 3.65 2.71
- - 0.596 6.00 3.41 2.54

4.2 Modules Designed

To evaluate the contributions of individual components within our proposed methodology, we conducted a comprehensive ablation study on the Scene Flow dataset. GwcNet [14] served as the baseline, and we systematically examined the effectiveness of Superpixel Guided Channel Excitation (SGCE), $\mathcal{L}_{sce}$, and $\mathcal{L}_{recon}$ by employing various experimental settings.

Initially, we focused on assessing the efficacy of the proposed loss function without introducing any structural modifications to the baseline network. Figure 5 illustrates the improved performance, particularly highlighting the enhancement in object boundary detailing, attributed to $\mathcal{L}_{sce}$. Subsequently, we performed comparisons between superpixel cross-entropy loss and regular cross-entropy loss[17], as well as investigations into the influence of depth- and color-based reconstruction losses. The results, as presented in Table 1, offer compelling insights: regular loss functions, when employed in isolation, can potentially exert detrimental effects on performance. Our proposed loss components, in contrast, demonstrate consistent improvements across all evaluated stereo matching error metrics, surpassing the baseline results.

Figure 5: Qualitative comparisons of ablation study on Scene Flow test set.

Table 2: Universality study on Scene Flow finalpass dataset. (* denotes the finalpass reproduced result)
Method EPE (px) D1 (%) SEE (px)
PSMNet[13] 1.11 2.47 4.42
PSMNet-TH 1.06 2.49 3.07
MobileStereo[23] 1.14 4.40 4.41
MobileStereo-TH 0.92 3.21 3.63
PCWNet[10] 0.84 2.80 3.83
PCWNet-TH 0.74 2.52 3.66

4.3 Universality of the Training Head

To demonstrate the universality of our proposed training head, we integrate it into three state-of-the-art models, namely PSMNet[13], MobileStereo[23] and PCWNet[10]. We then compare the performance of the original models with the integrated versions, denoted as PSMNet-TH, MobileStereo-TH, and PCWNet-TH, respectively. The evaluation, as presented in Table 2, includes a dedicated metric for quantifying the quality of disparities at boundaries, referred to as SEE (Soft Edge Error). It is important to note that we have not validated the universality and effectiveness of our approach on iterative refinement architectures, such as RAFT-Stereo[21], because our loss function is tailored to optimize the probabilistic form of the cost volume.

Table 3: Quantitative evaluation on Scene Flow test set with the popular approaches.
Method PSMNet[13] GwcNet[14] SSPCV-Net[20] EdgeStereo[6] AcfNet[2] ACVNet[3] GwcNet+Ours
EPE (px) 1.09 0.76 0.87 1.11 0.86 0.48 0.59

Bold: Best, Underline: Secondary

4.4 Performance Evaluation

Scene Flow Dataset. Table 3 compares our approach with popular methods on the Scene Flow test set. Integrated with GwcNet, it ranks second among all competing algorithms and reduces EPE by 22% relative to the GwcNet baseline. These results demonstrate the effectiveness of our methodology in enhancing disparity estimation accuracy.

Figure 6: Qualitative results on the Middlebury test set compared to the top end-to-end deep learning approach ACVNet[3].

Middlebury Dataset. To assess model performance in real-world indoor scenes, we utilized the Middlebury dataset, consisting of 15 training image pairs and 15 test pairs. Experiments were conducted using half-resolution images to align with dataset conventions. Figure 6 visually compares the disparity quality of our approach against the other leading method on the test dense leaderboard. The results reveal two distinct advantages: sharper transitions at object boundaries, indicating enhanced edge preservation and detail capture, and consistent disparity predictions within individual objects, demonstrating robust depth estimation.

Figure 7: Qualitative results on the KITTI 2012 (top) and KITTI 2015 (bottom) test sets. White boxes highlight the improvement in details.

KITTI. To evaluate model performance in real-world driving scenarios, we employed the KITTI 2015 and KITTI 2012 datasets, both capturing challenging outdoor scenes. KITTI 2015 offers 200 training stereo image pairs with sparse ground-truth disparities and 200 testing pairs without ground truth, while KITTI 2012 provides 194 training pairs and 195 testing pairs. As presented in Tables 4 and 5, our approach demonstrates competitive performance, aligning with the results of leading networks in the field. Due to the sparse ground truth in the dataset, performance degradation occurs during fine-tuning of superpixel branches. Additionally, in large scenes, segmentation areas may slightly deviate from our principle of disparity consistency. These challenges indicate the potential of our proposed methods for further enhancement when dealing with complex scenes. AcfNet [2], while effective, relies on pixel-level uncertainty supervision and unimodal distribution modeling, potentially limiting its ability to fully leverage contextual information from neighboring pixels. Our approach, in contrast, explicitly addresses this limitation through superpixel-based guidance, resulting in superior performance. Furthermore, comparisons with SSPCV-Net[20] and EdgeStereo[6] highlight the advantages of superpixels. Unlike these methods, which introduce subnetworks for segmentation or edge detection, our superpixel-based approach implicitly considers both semantic classes and boundary information, leading to more comprehensive guidance for stereo matching.

Table 4: Quantitative evaluation on KITTI 2012 test set.
Method 3px (%) 5px (%) EPE (px)
noc all noc all noc all
SSPCV-Net[20] 1.47 1.90 0.87 1.14 0.5 0.6
EdgeStereo-V2[6] 1.46 1.83 0.83 1.04 0.4 0.5
CoEx[22] 1.55 1.93 0.91 1.13 0.5 0.5
AcfNet[2] 1.17 1.54 0.77 1.01 0.5 0.5
RAFT-Stereo[21] 1.30 1.66 0.86 1.11 0.4 0.5
ACVNet[3] 1.13 1.47 0.71 0.91 0.4 0.5
IGEV-Stereo[7] 1.12 1.44 0.73 0.94 0.4 0.4
GwcNet-gc[14] 1.32 1.70 0.80 1.03 0.5 0.5
GwcNet+Ours 1.18 1.50 0.72 0.93 0.4 0.5

Table 5: Quantitative evaluation on KITTI 2015 test set.
Method NOC (%) ALL (%)
bg fg all bg fg all
SSPCV-Net[20] 1.61 3.40 1.91 1.75 3.89 2.11
DeepPruner-Best[15] 1.71 3.18 1.95 1.87 3.56 2.15
EdgeStereo[6] 1.72 3.41 2.00 1.87 3.61 2.16
CoEx[22] 1.62 3.09 1.86 1.74 3.41 2.02
ACVNet[3] 1.37 3.07 1.65 1.26 2.84 1.52
GwcNet-g[14] 1.61 3.49 1.92 1.74 3.93 2.11
GwcNet+Ours 1.48 3.20 1.76 1.60 3.59 1.93

5 Conclusion

In this paper, we propose a novel stereo matching approach that combines superpixels and cross-entropy loss, resulting in enhanced accuracy and robustness. Our method utilizes a superpixel probability volume to enable effective learning of regional features and outlier correction. Through seamless integration with classical stereo matching networks, our approach demonstrates significant improvements across various datasets. We anticipate its potential benefits for downstream tasks, such as stereo-based 3D reconstruction.

References

  • [1] Hirschmuller, H.: Stereo Processing by Semiglobal Matching and Mutual Information. IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 30, no. 2, pp. 328–341 (2008)
  • [2] Zhang, Y., Chen, Y., Bai, X., Yu, S., Yu, K., Li, Z., Yang, K.: Adaptive Unimodal Cost Volume Filtering for Deep Stereo Matching. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 34, no. 7, pp. 12926–12934 (2020). http://dx.doi.org/10.1609/aaai.v34i07.6991
  • [3] Xu, G., Cheng, J., Guo, P., Yang, X.: Attention Concatenation Volume for Accurate and Efficient Stereo Matching. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 12981–12990 (2022)
  • [4] Chen, L., Wang, W., Mordohai, P.: Learning the Distribution of Errors in Stereo Matching for Joint Disparity and Uncertainty Estimation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 17235–17244 (2023)
  • [5] Achanta, R., Shaji, A., Smith, K., Lucchi, A., Fua, P., Süsstrunk, S.: SLIC superpixels compared to state-of-the-art superpixel methods. IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 34, no. 11, pp. 2274–2282 (2012). http://dx.doi.org/10.1109/tpami.2012.120
  • [6] Song, X., Zhao, X., Hu, H., Fang, L.: EdgeStereo: A context integrated residual pyramid network for stereo matching. In: Computer Vision–ACCV 2018: 14th Asian Conference on Computer Vision, Perth, Australia, December 2–6, 2018, Revised Selected Papers, Part V, pp. 20–35. Springer (2019)
  • [7] Xu, G., Wang, X., Ding, X., Yang, X.: Iterative geometry encoding volume for stereo matching. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 21919–21928 (2023)
  • [8] Jampani, V., Sun, D., Liu, M.-Y., Yang, M.-H., Kautz, J.: Superpixel sampling networks. In: Proceedings of the European Conference on Computer Vision (ECCV), pp. 352–368 (2018)
  • [9] Li, L., Zhang, S., Yu, X., Zhang, L.: PMSC: PatchMatch-based superpixel cut for accurate stereo matching. IEEE Transactions on Circuits and Systems for Video Technology, vol. 28, no. 3, pp. 679–692 (2016). http://dx.doi.org/10.1109/tcsvt.2016.2628782
  • [10] Shen, Z., Dai, Y., Song, X., Rao, Z., Zhou, D., Zhang, L.: PCW-Net: Pyramid combination and warping cost volume for stereo matching. In: European Conference on Computer Vision, pp. 280–297. Springer (2022)
  • [11] Chen, J., Hou, J., Ni, Y., Chau, L.-P.: Accurate light field depth estimation with superpixel regularization over partially occluded regions. IEEE Transactions on Image Processing, vol. 27, no. 10, pp. 4889–4900 (2018)
  • [12] Kendall, A., Martirosyan, H., Dasgupta, S., Henry, P., Kennedy, R., Bachrach, A., Bry, A.: End-to-end learning of geometry and context for deep stereo regression. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 66–75 (2017). http://dx.doi.org/10.1109/iccv.2017.17
  • [13] Chang, J.-R., Chen, Y.-S.: Pyramid stereo matching network. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5410–5418 (2018). http://dx.doi.org/10.1109/cvpr.2018.00567
  • [14] Guo, X., Yang, K., Yang, W., Wang, X., Li, H.: Group-wise correlation stereo network. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3273–3282 (2019). http://dx.doi.org/10.1109/cvpr.2019.00339
  • [15] Duggal, S., Wang, S., Ma, W.-C., Hu, R., Urtasun, R.: DeepPruner: Learning efficient stereo matching via differentiable patchmatch. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4384–4393 (2019). http://dx.doi.org/10.1109/iccv.2019.00448
  • [16] Tosi, F., Liao, Y., Schmitt, C., Geiger, A.: SMD-Nets: Stereo mixture density networks. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 8942–8952 (2021). http://dx.doi.org/10.1109/cvpr46437.2021.00883
  • [17] Chen, C., Chen, X., Cheng, H.: On the over-smoothing problem of CNN based disparity estimation. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 8997–9005 (2019). http://dx.doi.org/10.1109/iccv.2019.00909
  • [18] Yang, F., Sun, Q., Jin, H., Zhou, Z.: Superpixel segmentation with fully convolutional networks. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 13964–13973 (2020). http://dx.doi.org/10.1109/cvpr42600.2020.01398
  • [19] Klaus, A., Sormann, M., Karner, K.: Segment-based stereo matching using belief propagation and a self-adapting dissimilarity measure. In: 18th International Conference on Pattern Recognition (ICPR’06), vol. 3, pp. 15–18. IEEE (2006). http://dx.doi.org/10.1109/icpr.2006.1033
  • [20] Wu, Z., Wu, X., Zhang, X., Wang, S., Ju, L.: Semantic stereo matching with pyramid cost volumes. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 7484–7493 (2019). http://dx.doi.org/10.1109/iccv.2019.00758
  • [21] Lipson, L., Teed, Z., Deng, J.: RAFT-Stereo: Multilevel recurrent field transforms for stereo matching. In: 2021 International Conference on 3D Vision (3DV), pp. 218–227. IEEE (2021). http://dx.doi.org/10.1109/3dv53792.2021.00032
  • [22] Bangunharcana, A., Cho, J. W., Lee, S., Kweon, I. S., Kim, K.-S., Kim, S.: Correlate-and-excite: Real-time stereo matching via guided cost volume excitation. In: 2021 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 3542–3548. IEEE (2021). http://dx.doi.org/10.1109/iros51168.2021.9635909
  • [23] Shamsafar, F., Woerz, S., Rahim, R., Zell, A.: MobileStereoNet: Towards lightweight deep networks for stereo matching. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2417–2426 (2022). http://dx.doi.org/10.1109/wacv51458.2022.00075
  • [24] Ji, P., Li, J., Li, H., Liu, X.: Superpixel alpha-expansion and normal adjustment for stereo matching. Journal of Visual Communication and Image Representation, vol. 79, 103238 (2021)