School of Computer Science and Technology, Ocean University of China, Qingdao 266101, China
Email: {liushanglong, gwc1323, xly3385}@stu.ouc.edu.cn
Email: {qilin,dongjunyu}@ouc.edu.cn

Superpixel Cost Volume Excitation for Stereo Matching†

†Supported by National Natural Science Foundation of China (Grant No. 41927805).

Shanglong Liu Lin Qi(✉) Junyu Dong(✉) Wenxiang Gu Liyi Xu

Abstract

In this work, we concentrate on exciting the intrinsic local consistency of stereo matching through the incorporation of superpixel soft constraints, with the objective of mitigating inaccuracies at the boundaries of predicted disparity maps. Our approach capitalizes on the observation that neighboring pixels are predisposed to belong to the same object and exhibit closely similar intensities within the probability volume of superpixels. By incorporating this insight, our method encourages the network to generate consistent probability distributions of disparity within each superpixel, aiming to improve the overall accuracy and coherence of predicted disparity maps. Experimental evaluations on widely-used datasets validate the efficacy of our proposed approach, demonstrating its ability to assist cost volume-based matching networks in restoring competitive performance.

Keywords: Stereo Matching · Superpixel · Cross-Entropy.

Figure 1: Visualization of the real output distribution at boundaries on Scene Flow dataset. (a) is the input image, and its partial enlargement. (b) represents the disparity probability distribution of the superpixel belonging to the brown region. (c) and (d) show the output probability distributions of a given pixel from GwcNet and GwcNet $+$ Ours. Our proposed methods rectify the incorrect distributions and avoid smoothness bias. Please zoom in to see the details.

1 Introduction

Stereo matching endeavors to establish dense correspondences between rectified stereo pairs, enabling the recovery of scene depth through triangulation[1]. This technique finds broad applications in diverse fields, including robot navigation, augmented reality, and autonomous driving.

Recently, stereo models have demonstrated exceptional performance through the utilization of a cost volume-based architecture[12, 13, 14], typically comprising four key steps: feature extraction, cost volume construction, cost aggregation, and disparity regression. Among these steps, cost aggregation stands out as the most crucial module, responsible for selecting the optimal match from numerous potential pairs and generating probability representations for the cost volume. However, state-of-the-art models face challenges in effectively addressing local ambiguities at boundaries, where definitively determining the pixel’s belonging region is complex. This frequently leads to a multi-peaked distribution in the aggregated probability volume, giving rise to the problem of over-smoothing[16, 17].

In this study, we endeavor to rectify this mismatch and eliminate redundant information by incorporating a pixel relationship prior. Drawing inspiration from the premise that depth transitions smoothly within homologous regions[19, 3, 9], we posit that depth discontinuities manifest only between distinct regions. Hence, we introduce the concept of superpixels[5], defined as clusters of contiguous and perceptually coherent pixels, offering a more coarse-grained representation of the image. Several recent superpixel segmentation methods have been successfully integrated into various low-level tasks, including optical flow estimation, monocular depth estimation[11], and depth completion. They play a crucial role in decreasing the number of primitives in image processing, extracting similar features, and capturing image structure information.

Capitalizing on their inherent clustering and boundary properties, we integrate superpixel segmentation to produce a superpixel-level probability volume. However, the effectiveness of a strong-constraint disparity filtering strategy is limited by the coarse-grained nature of the superpixel representation, which cannot be refined to each disparity level. To address this limitation, we model the ground truth at the superpixel level using a Laplace distribution[4] and apply a cross-entropy loss to this representation, suppressing the multi-peaked issue when the cost is aggregated into probabilities. This superpixel training head proves highly effective in aiding aggregation, generating a more accurate probability representation for the cost volume without requiring additional computation or parameters at inference time, which matters for such resource-constrained tasks. We conducted experiments to explore its efficacy. As illustrated in Figure 1, this approach facilitates the convergence of the probability volume within the same superpixel, rectifying outliers through the overall distribution. To maintain color and spatial consistency, we adjust the sub-network’s task orientation towards disparity reconstruction, leveraging the principle that pixels within superpixel blocks from color images share similar disparities. This enables attention weights, derived from the sub-network’s semantic features, to effectively enhance local geometric consistency within the cost volume in the channel dimension, thereby encoding meaningful relationships between pixels.

2 Related Work

The cost volume-based architecture is designed to enhance the accuracy of depth estimation by constructing and optimizing the cost associated with candidate disparities. This volume is formed by concatenating or correlating feature maps extracted from the left and right images at various disparity levels. GCNet[12] pioneered the integration of a 3D encoder-decoder structure, utilizing soft-argmin-based disparity regression derived from a probabilistic cost volume. Subsequently, advancements such as the group-wise correlation cost volume introduced by GwcNet[14] and the attention-based cost volume proposed by ACVNet[3] aimed to augment the representational capacity of the cost volume. These end-to-end deep learning methods primarily supervise the disparity outputs, neglecting the rationality of their distributions. AcfNet[2] addresses this issue by directly supervising the cost volume with unimodal ground truth distributions. However, due to its reliance on pixel-level operations, the network may struggle to learn scene structural information and could potentially overfit to a single dataset.

Superpixels play a crucial role in local optimization and global consistency in stereo matching. Previous studies [9, 24] demonstrate that $\alpha$ -expansion, which segments images into larger regions and assumes similar 3D plane labels within each segment, effectively optimizes disparity estimation by propagating consistent plane labels. SFCN[18] effectively preserves object boundaries and fine-grained details by using superpixels in place of conventional operations in the downsampling and upsampling scheme. However, this technique does not contribute to the matching process. In contrast, our approach delves deeper into the pixel relationship information inherent in the cost volume, emphasizing the collective impact of neighboring pixels on disparity estimation.

3 Methods

3.1 Superpixel Guided Channel Excitation

As shown in Figure 3, superpixel segmentation is implemented using a standard encoder-decoder architecture with skip connections[18]. We argue that object context is crucial for accurate segmentation, as its multi-scale features contain valuable information about object shape and affinity. Therefore, we use channel excitation to embed the object context into the cost volume. Different from the CoEx[22] method, which only excites the cost volume features at the corresponding scale, we instead fuse hierarchical multi-scale features $\phi(I_{l})_{k}\in\mathbb{R}^{N\times\frac{H}{k}\times\frac{W}{k}},k\in\{4,8,16\}$ from the sub-network. Through simple multi-scale short connections, denoted as $g$ , we obtain the superpixel semantic guidance. Before each cost aggregation, the guided cost volume excitation is calculated as:

\begin{split}C^{\prime}_{cost}&=\sigma(g(\phi(I_{l})_{k}))\odot C_{cost}\end{split}

(1)

where $\sigma$ denotes the sigmoid function that converts the guidance into an attention weight map. These attention weights W emphasize both local consistency and discontinuity within the cost volume along the channel dimension. And $\odot$ represents the Hadamard product after broadcasting the attention across the disparity dimension. $\textbf{C}_{\textbf{cost}}$ (with the size of $N\times\frac{1}{4}D\times\frac{1}{4}H\times\frac{1}{4}W$ ) represents the 4D cost volume constructed from the features of the left and right images. This process generates a geometrically encoded cost volume, $\textbf{C}^{\prime}_{\textbf{cost}}$ , which allows 3D convolutions to aggregate information from neighboring pixels and capture geometric relationships inherent in the data.
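As a rough illustration of Equation 1, the sketch below applies a sigmoid-gated guidance map to a toy cost volume in plain Python. The nested-list layout `cost[c][d][h][w]`, the guidance shape `guidance[c][h][w]`, and the function names are assumptions made for readability; the fusion function $g$ is not modeled, and a practical implementation would use broadcasting tensor operations instead.

```python
import math

def sigmoid(x):
    # Numerically stable logistic function.
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)

def excite_cost_volume(cost, guidance):
    """Hadamard product of sigmoid(guidance) with the cost volume.

    cost[c][d][h][w]   : 4D cost volume (channel, disparity, height, width)
    guidance[c][h][w]  : fused guidance, reused for every disparity level d
    """
    return [
        [
            [
                [sigmoid(guidance[c][h][w]) * cost[c][d][h][w]
                 for w in range(len(cost[c][d][h]))]
                for h in range(len(cost[c][d]))
            ]
            for d in range(len(cost[c]))
        ]
        for c in range(len(cost))
    ]

# Toy volume: 1 channel, 2 disparity levels, 1x2 spatial grid.
cost = [[[[2.0, 4.0]], [[6.0, 8.0]]]]
guidance = [[[0.0, 100.0]]]  # sigmoid gives 0.5 and (numerically) 1.0
excited = excite_cost_volume(cost, guidance)
```

In this toy case the first pixel is attenuated by half at every disparity level while the second pixel passes through almost unchanged, which is the per-channel, per-pixel gating that the text describes.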

3.2 Superpixel Pooling of Probability Volume

Laplace Distribution. In an ideal scenario, the disparity probability distribution manifests itself in a unimodal form, where the probability values diminish with the distance from the true matching pixel, peaking at the ground truth disparity value. To more accurately depict the variance in disparity probability distributions across different matching regions, we adaptively model a unimodal distribution akin to AcfNet[2] as follows:

P^{gt}(d)=\text{softmax}\left(-\frac{|d-d^{gt}|}{v}\right)

(2)

As depicted in Figure 4, the variance $v$ is computed based on the aggregated cost volume (i.e., the probability volume). Challenging pixels often exhibit multi-modal probability distributions, with their variance typically being large. This parameter controls the sharpness of the peak around the true disparity, adjusting it according to the matchability. Specifically, when a point struggles to distinctly delineate the pixel’s region of belonging or resides within a region characterized by weak textural attributes during stereo matching, it exhibits a comparatively smoother peak. This adaptive modeling enhances the precision of disparity estimation across various matching scenarios.
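To make Equation 2 concrete, the following minimal sketch builds the unimodal ground-truth distribution for one pixel. The variance `v` is simply passed in as a number here, since how it is derived from the probability volume is not reproduced in this sketch; the function name and the toy values are assumptions.

```python
import math

def laplace_ground_truth(d_gt, v, num_disp):
    """Softmax of -|d - d_gt| / v over the disparity levels 0..num_disp-1."""
    logits = [-abs(d - d_gt) / v for d in range(num_disp)]
    m = max(logits)                       # subtract the max for stability
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

# A confident pixel (small v) versus an ambiguous one (large v).
sharp = laplace_ground_truth(d_gt=3.0, v=0.5, num_disp=8)
flat = laplace_ground_truth(d_gt=3.0, v=4.0, num_disp=8)
```

Both distributions peak at the ground-truth disparity, but the larger variance spreads probability mass over neighboring levels, giving the smoother peak described above for hard-to-match pixels.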

Superpixel Pooling. Given the predicted superpixel association probability map $\textbf{Q}\in\mathbb{R}^{{\left|N_{s}\right|}\times H\times W}$ for image $I_{l}$ , where $N_{s}$ represents the 9 sets of surrounding initial grid cells associated with each pixel $p$ , we obtain the superpixel label map $m$ by assigning each pixel to its most likely superpixel using $m={{\arg\max}\,Q(p)}$ . We also define its inverse mapping $\tilde{m}$ , which gives the pixel indices belonging to each superpixel label.

To capture the disparity probability distribution within each superpixel, we leverage $m$ and the aggregated volume $\textbf{C}_{\textbf{prob}}$ (with the size of $D\times H\times W$ ) to generate a superpixel probability volume $P_{s}$ , defined as:

P_{s}=\left(\prod_{p\in\tilde{m}_{s}}C_{prob}(p)\right)^{\frac{1}{n}}

(3)

where $n$ denotes the number of pixels within the specific superpixel $s$ . To ensure numerical stability, circumvent underflow issues arising from probabilistic multiplication, and simplify computational complexity, we conduct the pooling process in logarithmic space:

\ln(P_{s})=\frac{1}{n}\sum_{p\in\tilde{m}_{s}}\ln(C_{prob}(p))

(4)

To recover the original superpixel probability volume from logarithmic space, we apply exponential operations. We can perform superpixel geometric mean pooling over the modeled ground truth from the preceding section or probability volume, to obtain a superpixel-level probability representation. The probability distributions of pixels exhibit a collective influence, where their interactions shape the overall probability distribution within superpixels. Notably, this superpixel probability volume is generated solely for supervision during training, ensuring no added computational or memory demands during inference.
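The log-space pooling of Equations 3 and 4 can be sketched as follows. Pixels are flattened into a single index, `labels[p]` plays the role of the label map $m$, and a small `eps` is added before the logarithm; the flattening, the `eps` guard, and the function name are assumptions for illustration only.

```python
import math

def superpixel_pool(prob, labels, eps=1e-12):
    """Geometric-mean pooling of per-pixel disparity distributions.

    prob[p][d] : probability of disparity d at (flattened) pixel p
    labels[p]  : superpixel id of pixel p
    Returns {superpixel id: pooled values over disparities}.
    """
    num_disp = len(prob[0])
    log_sum, count = {}, {}
    for p, s in enumerate(labels):
        acc = log_sum.setdefault(s, [0.0] * num_disp)
        for d in range(num_disp):
            acc[d] += math.log(prob[p][d] + eps)   # sum of logs (Eq. 4)
        count[s] = count.get(s, 0) + 1
    # Divide by n and exponentiate to leave log space.
    return {s: [math.exp(v / count[s]) for v in acc]
            for s, acc in log_sum.items()}

prob = [[0.9, 0.1], [0.4, 0.6], [0.5, 0.5]]
labels = [0, 0, 1]
pooled = superpixel_pool(prob, labels)
```

Following Equation 3 as written, the pooled values are not renormalized, so they need not sum to one. A disparity level that any pixel in the superpixel considers very unlikely is strongly suppressed in the pooled result, which is how outlier pixels are pulled towards the dominant distribution.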

3.3 Training Head

Loss for Single Tasks. After getting the final probability volume, the soft-argmin operation is used to compute disparity for each pixel by taking the expected value[12]. To ensure regression focuses on the most probable mode, we utilize the top $k$ values from the probability volume:

\hat{d}=\sum_{d\in\{d_{1},d_{2},...,d_{k}\}}d\times Softmax(C_{prob}(d))

(5)
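A minimal sketch of the top-$k$ regression in Equation 5 for a single pixel is given below. It assumes that the softmax is recomputed over only the $k$ selected values, as the equation suggests; the function name and the toy scores are hypothetical.

```python
import math

def topk_soft_argmin(scores, k):
    """Expected disparity over the k highest-scoring disparity levels."""
    # Indices of the k largest entries (ties keep their original order).
    top = sorted(range(len(scores)), key=lambda d: scores[d], reverse=True)[:k]
    m = max(scores[d] for d in top)
    exps = {d: math.exp(scores[d] - m) for d in top}
    total = sum(exps.values())
    return sum(d * e / total for d, e in exps.items())

# Dominant mode at disparities 1-2, weaker secondary mode at disparity 4.
scores = [0.0, 2.0, 2.0, 0.0, 1.5]
d_topk = topk_soft_argmin(scores, k=2)             # uses only levels 1 and 2
d_full = topk_soft_argmin(scores, k=len(scores))   # ordinary soft-argmin
```

With `k=2` the estimate stays inside the dominant mode, whereas the full expectation is dragged towards the secondary peak, which is the over-smoothing behavior that restricting the regression is meant to limit.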

The output results of the two branches as shown in Figure 2 are fed into the training head for final supervision. For the disparity estimation task, we mainly use Smooth $L_{1}$ Loss, which has been widely used in various regression tasks:

\mathcal{L}_{regression}=\frac{1}{N}\sum_{p}Smooth_{L_{1}}(d_{p},\hat{d}_{p})

(6)

In Equation 6, $\hat{d}_{p}$ and $d_{p}$ are the predicted disparity and the corresponding ground truth, respectively, and $N$ is the number of valid pixels. During training, we supervise the estimation of each regression stage.
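Equation 6 can be illustrated with the short sketch below. It assumes the commonly used Smooth $L_{1}$ definition with a transition point at 1 and treats a pixel as valid when its ground truth lies in $[0, D_{max}]$ with $D_{max}=192$; both choices, as well as the function names, are assumptions rather than details confirmed by the text.

```python
def smooth_l1(x):
    """Quadratic near zero, linear elsewhere (transition at |x| = 1)."""
    ax = abs(x)
    return 0.5 * ax * ax if ax < 1.0 else ax - 0.5

def regression_loss(d_gt, d_pred, max_disp=192.0):
    """Mean Smooth L1 error over pixels with valid ground truth."""
    terms = [smooth_l1(g - p)
             for g, p in zip(d_gt, d_pred)
             if 0.0 <= g <= max_disp]
    return sum(terms) / len(terms) if terms else 0.0

# The third pixel is ignored because its ground truth exceeds max_disp.
loss = regression_loss([10.0, 20.0, 300.0], [10.5, 23.0, 0.0])
```

Here the two valid pixels contribute 0.125 and 2.5, so the averaged loss is 1.3125.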

As for the superpixel segmentation auxiliary task, to encourage the segmentation network to generate superpixels that effectively represent disparity, we further define a disparity reconstruction loss [18, 8]:

\mathcal{L}_{recon}=\frac{1}{N}\sum_{p}\left\|d_{p}-d_{p}^{\prime}\right\|_{1}+w\cdot\left\|p-p^{\prime}\right\|_{2}

(7)

where $d^{\prime}$ and $p^{\prime}$ represent the superpixel reconstruction results obtained by left-multiplying with $\tilde{Q}\hat{Q}^{T}$ , formed from the row- and column-normalized association map $Q$ , and $w$ controls the compactness of the superpixels.
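The reconstruction in Equation 7 can be sketched as a soft pooling of pixel properties into superpixels followed by a scatter back to pixels. The sketch assumes that each row of `assoc` already sums to one (so the row normalization is a no-op), that the positional term sits inside the per-pixel average, and that positions are 2D; all names are hypothetical.

```python
import math

def reconstruct(values, assoc):
    """Pool per-pixel vectors into superpixels, then map them back.

    values[p] : property vector of pixel p (disparity or position)
    assoc[p][s]: soft association of pixel p with superpixel s
    """
    num_sp, dim = len(assoc[0]), len(values[0])
    recon = [[0.0] * dim for _ in values]
    for s in range(num_sp):
        mass = sum(row[s] for row in assoc)        # column normalizer
        if mass == 0.0:
            continue                               # empty superpixel
        center = [sum(row[s] * v[i] for row, v in zip(assoc, values)) / mass
                  for i in range(dim)]
        for p, row in enumerate(assoc):
            for i in range(dim):
                recon[p][i] += row[s] * center[i]
    return recon

def reconstruction_loss(disp, pos, assoc, w=5e-3):
    """Mean of |d - d'| + w * ||p - p'||_2 over pixels."""
    disp_rec = reconstruct([[d] for d in disp], assoc)
    pos_rec = reconstruct(pos, assoc)
    total = 0.0
    for p in range(len(disp)):
        l1 = abs(disp[p] - disp_rec[p][0])
        l2 = math.hypot(pos[p][0] - pos_rec[p][0], pos[p][1] - pos_rec[p][1])
        total += l1 + w * l2
    return total / len(disp)

assoc = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]   # hard assignment for clarity
disp = [2.0, 4.0, 10.0]
pos = [[0.0, 0.0], [0.0, 2.0], [5.0, 5.0]]
loss = reconstruction_loss(disp, pos, assoc)
```

In the toy case the first two pixels share a superpixel whose mean disparity is 3 and whose center is at (0, 1), so each incurs a disparity error of 1 and a positional error of 1, while the third pixel is reconstructed exactly.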

Superpixel Cross-Entropy Loss. The probability after adaptive unimodal distribution modeling and superpixel pooling incorporates the contributions of neighboring pixels, emphasizing similar distributions that reflect the dominant trend within a superpixel. The superpixel cross-entropy loss is defined on these pooled distributions as:

\mathcal{L}_{sce}=-\frac{1}{N_{s}}\sum^{D-1}_{d=0}P_{s}^{gt}(d)\cdot\log P_{s}(d)

(8)

which measures the similarity between the prediction $P_{s}$ and the constructed ground truth $P_{s}^{gt}$ . The total loss function is the sum of these three components:

\mathcal{L}_{total}=\mathcal{L}_{regression}+\lambda\mathcal{L}_{sce}+\mu\mathcal{L}_{recon}

(9)

During training, we heuristically set $\lambda=1$ and $\mu=0.1$ in our experiments.
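Equations 8 and 9 can be summarized in a few lines. The sketch assumes that the $\frac{1}{N_{s}}$ factor averages the per-superpixel cross-entropy over all superpixels (the sum over $s$ is implicit in Equation 8) and adds a small `eps` inside the logarithm; the function names are hypothetical.

```python
import math

def superpixel_cross_entropy(p_gt, p_pred, eps=1e-12):
    """Cross-entropy between pooled distributions, averaged over superpixels.

    p_gt[s][d]   : pooled ground-truth distribution of superpixel s
    p_pred[s][d] : pooled predicted distribution of superpixel s
    """
    total = 0.0
    for gt_row, pred_row in zip(p_gt, p_pred):
        total -= sum(g * math.log(q + eps) for g, q in zip(gt_row, pred_row))
    return total / len(p_gt)

def total_loss(l_regression, l_sce, l_recon, lam=1.0, mu=0.1):
    """Weighted sum of the three terms with the weights stated in the text."""
    return l_regression + lam * l_sce + mu * l_recon

p_gt = [[1.0, 0.0], [0.0, 1.0]]
confident = superpixel_cross_entropy(p_gt, [[0.9, 0.1], [0.1, 0.9]])
ambiguous = superpixel_cross_entropy(p_gt, [[0.5, 0.5], [0.5, 0.5]])
combined = total_loss(1.0, 0.5, 2.0)
```

A prediction that concentrates its mass on the ground-truth mode yields a lower loss than a flat one, so minimizing this term discourages the multi-peaked distributions discussed earlier.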

4 Experiments

4.1 Implementation Details

We implemented the proposed method using PyTorch and conducted experiments on NVIDIA RTX 3090 GPUs, employing the Adam optimizer with $\beta_{1}=0.9$ and $\beta_{2}=0.999$ . To facilitate model generalization, we augmented input images during the training phase by employing random cropping to a size of $H=256$ and $W=512$ .

For the Scene Flow dataset, we trained GwcNet integrated with the proposed techniques for a total of 16 epochs. An initial learning rate of 0.001 was applied, strategically reduced by a factor of 2 after epochs 10, 12, and 14 to ensure model convergence. A batch size of 4 was used to optimize memory utilization. The weighting factor $w$ was set to $5\times 10^{-3}$ for appropriate disparity reconstruction loss contribution and $k$ was set to 6 for the superior performance observed in our prior work. To further enhance model performance, we fine-tuned the models pre-trained on Scene Flow using the KITTI and Middlebury datasets. This fine-tuning process involved 300 additional epochs with an initial learning rate of 0.001, reduced by a factor of 10 after 200 epochs to facilitate fine-grained adjustments in the later training stages.
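The two step schedules described above can be written as one small helper. The sketch assumes 1-based epoch numbering and that "after epoch $m$" means the reduced rate first applies in epoch $m+1$; the function name is hypothetical.

```python
def step_lr(epoch, base_lr=1e-3, milestones=(10, 12, 14), factor=2.0):
    """Learning rate used while training the given (1-based) epoch."""
    drops = sum(1 for m in milestones if epoch > m)
    return base_lr / factor ** drops

# Scene Flow pre-training: 16 epochs, halved after epochs 10, 12, and 14.
scene_flow = [step_lr(e) for e in range(1, 17)]

# Fine-tuning: 300 epochs, divided by 10 after epoch 200.
finetune_early = step_lr(200, milestones=(200,), factor=10.0)
finetune_late = step_lr(201, milestones=(200,), factor=10.0)
```

Under these assumptions the Scene Flow run ends at one eighth of the initial rate, and fine-tuning spends its last 100 epochs at one tenth of it.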

To ensure consistency and focus within the defined disparity range, we excluded ground truth disparities falling outside the interval $[0,D_{max}]$ during experiments, where $D_{max}$ was set to 192.

Table 1: Ablation study on Scene Flow finalpass dataset.

Method	$\mathcal{L}_{ce}$	$\mathcal{L}_{sce}$	$\mathcal{L}_{reconC}$	$\mathcal{L}_{reconD}$	EPE (px)	1 px (%)	2 px (%)	3 px (%)
GwcNet	-	-	-	-	0.765	8.03	4.47	3.30
GwcNet	-	✓	-	✓	0.670	6.50	3.78	2.86
$+$ SGCE	-	-	✓	-	0.645	6.60	3.71	2.74
	-	-	-	✓	0.626	6.44	3.63	2.70
	✓	-	-	✓	0.622	6.49	3.65	2.71
	-	✓	-	✓	0.596	6.00	3.41	2.54

4.2 Modules Designed

To meticulously evaluate the contributions of individual components within our proposed methodology, we conducted a comprehensive ablation study on the Scene Flow dataset. GwcNet [14] served as the baseline, and we systematically examined the effectiveness of Superpixel Guided Channel Excitation (SGCE), $\mathcal{L}_{sce}$ , and $\mathcal{L}_{recon}$ by employing various experimental settings.

Initially, we focused on assessing the efficacy of the proposed loss function without introducing any structural modifications to the baseline network. Figure 5 illustrates the improved performance, particularly highlighting the enhancement in object boundary detailing, attributed to $\mathcal{L}_{sce}$ . Subsequently, we performed comparisons between superpixel cross entropy loss and regular cross entropy loss[17], as well as investigations into the influence of depth- and color-based reconstruction losses. The results, as presented in Table 1, offer compelling insights: Regular loss functions, when employed in isolation, can potentially exert detrimental effects on performance. Our proposed loss components, in contrast, demonstrate consistent improvements across all evaluated stereo matching error metrics, surpassing the baseline results.
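For readers unfamiliar with the columns of Table 1, the sketch below computes them under their usual definitions, which are assumed here rather than taken from the text: EPE is the mean absolute disparity error, and the $k$ px error is the percentage of pixels whose error exceeds $k$ pixels. A helper for relative improvement is included to relate two table rows; all names are hypothetical.

```python
def end_point_error(d_gt, d_pred):
    """Mean absolute disparity error in pixels."""
    errors = [abs(g - p) for g, p in zip(d_gt, d_pred)]
    return sum(errors) / len(errors)

def threshold_error(d_gt, d_pred, k):
    """Percentage of pixels whose absolute error exceeds k pixels."""
    bad = sum(1 for g, p in zip(d_gt, d_pred) if abs(g - p) > k)
    return 100.0 * bad / len(d_gt)

def relative_reduction(baseline, improved):
    """Relative improvement over the baseline, in percent."""
    return 100.0 * (baseline - improved) / baseline

d_gt = [10.0, 20.0, 30.0, 40.0]
d_pred = [10.5, 20.0, 33.5, 40.0]
epe = end_point_error(d_gt, d_pred)
one_px = threshold_error(d_gt, d_pred, k=1)

# EPE of the GwcNet baseline versus the full configuration in Table 1.
gain = relative_reduction(0.765, 0.596)
```

Applied to the first and last rows of Table 1, the helper gives a relative EPE reduction of roughly 22%.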

Table 2: Universality study on Scene Flow finalpass dataset (* denotes the finalpass reproduced result).

Method	EPE (px)	D1 (%)	SEE (px)
PSMNet^∗[13]	1.11	2.47	4.42
PSMNet-TH	1.06	2.49	3.07
MobileStereo[23]	1.14	4.40	4.41
MobileStereo-TH	0.92	3.21	3.63
PCWNet^∗[10]	0.84	2.80	3.83
PCWNet-TH	0.74	2.52	3.66

4.3 Universality of the Training Head

To demonstrate the universality of our proposed training head, we integrate it into three state-of-the-art models, namely PSMNet[13], MobileStereo[23] and PCWNet[10]. We then compare the performance of the original models with the integrated versions, denoted as PSMNet-TH, MobileStereo-TH, and PCWNet-TH, respectively. The evaluation, as presented in Table 2, includes a dedicated metric for quantifying the quality of disparities at boundaries, referred to as SEE (Soft Edge Error). It is important to note that we have not validated the universality and effectiveness of our approach on iterative refinement architectures, such as RAFT-Stereo[21], because our loss function is tailored to optimize the probabilistic form of the cost volume.

Table 3: Quantitative evaluation on Scene Flow test set with the popular approaches.

Method	PSMNet[13]	GwcNet[14]	SSPCV-Net[20]	EdgeStereo[6]	AcfNet[2]	ACVNet[3]	GwcNet $+$ Ours
EPE (px)	1.09	0.76	0.87	1.11	0.86	0.48	0.59

Bold: Best, Underline: Secondary

4.4 Performance Evaluation

Scene Flow Dataset. Table 3 reports the performance of our approach on the Scene Flow test set. It ranks second among all competing algorithms, achieving a 22% reduction in EPE when integrated with GwcNet. These results demonstrate the effectiveness of our methodology in enhancing disparity estimation accuracy.

Middlebury Dataset. To assess model performance in real-world indoor scenes, we utilized the Middlebury dataset, consisting of 15 training image pairs and 15 test pairs. Experiments were conducted using half-resolution images to align with dataset conventions. Figure 6 visually compares the disparity quality of our approach against other leading methods on the test dense leaderboard. The results reveal several distinct advantages: sharper transitions at object boundaries, indicating enhanced edge preservation and detail capture; and consistent disparity predictions within individual objects, demonstrating robust depth estimation.

KITTI. To evaluate model performance in real-world driving scenarios, we employed the KITTI 2015 and KITTI 2012 datasets, both capturing challenging outdoor scenes. KITTI 2015 offers 200 training stereo image pairs with sparse ground-truth disparities and 200 testing pairs without ground truth, while KITTI 2012 provides 194 training pairs and 195 testing pairs. As presented in Tables 4 and 5, our approach demonstrates competitive performance, aligning with the results of leading networks in the field. Due to the sparse ground truth in the dataset, performance degradation occurs during fine-tuning of superpixel branches. Additionally, in large scenes, segmentation areas may slightly deviate from our principle of disparity consistency. These challenges indicate the potential of our proposed methods for further enhancement when dealing with complex scenes. AcfNet [2], while effective, relies on pixel-level uncertainty supervision and unimodal distribution modeling, potentially limiting its ability to fully leverage contextual information from neighboring pixels. Our approach, in contrast, explicitly addresses this limitation through superpixel-based guidance, resulting in superior performance. Furthermore, comparisons with SSPCV-Net[20] and EdgeStereo[6] highlight the advantages of superpixels. Unlike these methods, which introduce subnetworks for segmentation or edge detection, our superpixel-based approach implicitly considers both semantic classes and boundary information, leading to more comprehensive guidance for stereo matching.

Table 4: Quantitative evaluation on KITTI 2012 test set.

Method	3px noc (%)	3px all (%)	5px noc (%)	5px all (%)	EPE noc (px)	EPE all (px)
SSPCV-Net[20]	1.47	1.90	0.87	1.14	0.5	0.6
EdgeStereo-V2[6]	1.46	1.83	0.83	1.04	0.4	0.5
CoEx[22]	1.55	1.93	0.91	1.13	0.5	0.5
AcfNet[2]	1.17	1.54	0.77	1.01	0.5	0.5
RAFT-Stereo[21]	1.30	1.66	0.86	1.11	0.4	0.5
ACVNet[3]	1.13	1.47	0.71	0.91	0.4	0.5
IGEV-Stereo[7]	1.12	1.44	0.73	0.94	0.4	0.4
GwcNet-gc[14]	1.32	1.70	0.80	1.03	0.5	0.5
GwcNet $+$ Ours	1.18	1.50	0.72	0.93	0.4	0.5

Table 5: Quantitative evaluation on KITTI 2015 test set.

Method	NOC bg (%)	NOC fg (%)	NOC all (%)	ALL bg (%)	ALL fg (%)	ALL all (%)
SSPCV-Net[20]	1.61	3.40	1.91	1.75	3.89	2.11
DeepPruner-Best[15]	1.71	3.18	1.95	1.87	3.56	2.15
EdgeStereo[6]	1.72	3.41	2.00	1.87	3.61	2.16
CoEx[22]	1.62	3.09	1.86	1.74	3.41	2.02
ACVNet[3]	1.37	3.07	1.65	1.26	2.84	1.52
GwcNet-g[14]	1.61	3.49	1.92	1.74	3.93	2.11
GwcNet $+$ Ours	1.48	3.20	1.76	1.60	3.59	1.93

5 Conclusion

In this paper, we propose a novel stereo matching approach that combines superpixels and cross-entropy loss, resulting in enhanced accuracy and robustness. Our method utilizes a superpixel probability volume to enable effective learning of regional features and outlier correction. Through seamless integration with classical stereo matching networks, our approach demonstrates significant improvements across various datasets. We anticipate its potential benefits for downstream tasks, such as stereo-based 3D reconstruction.

References

[1] Hirschmuller, H.: Stereo Processing by Semiglobal Matching and Mutual Information. IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 30, no. 2, pp. 328–341 (2008)
[2] Zhang, Y., Chen, Y., Bai, X., Yu, S., Yu, K., Li, Z., Yang, K.: Adaptive Unimodal Cost Volume Filtering for Deep Stereo Matching. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 34, no. 7, pp. 12926–12934 (2020). http://dx.doi.org/10.1609/aaai.v34i07.6991
[3] Xu, G., Cheng, J., Guo, P., Yang, X.: Attention Concatenation Volume for Accurate and Efficient Stereo Matching. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 12981–12990 (2022)
[4] Chen, L., Wang, W., Mordohai, P.: Learning the Distribution of Errors in Stereo Matching for Joint Disparity and Uncertainty Estimation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 17235–17244 (2023)
[5] Achanta, R., Shaji, A., Smith, K., Lucchi, A., Fua, P., Süsstrunk, S.: SLIC superpixels compared to state-of-the-art superpixel methods. IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 34, no. 11, pp. 2274–2282 (2012). http://dx.doi.org/10.1109/tpami.2012.120
[6] Song, X., Zhao, X., Hu, H., Fang, L.: Edgestereo: A context integrated residual pyramid network for stereo matching. In: Computer Vision–ACCV 2018: 14th Asian Conference on Computer Vision, Perth, Australia, December 2–6, 2018, Revised Selected Papers, Part V, pp. 20–35. Springer (2019)
[7] Xu, G., Wang, X., Ding, X., Yang, X.: Iterative geometry encoding volume for stereo matching. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 21919–21928 (2023)
[8] Jampani, V., Sun, D., Liu, M.-Y., Yang, M.-H., Kautz, J.: Superpixel sampling networks. In: Proceedings of the European Conference on Computer Vision (ECCV), pp. 352–368 (2018)
[9] Li, L., Zhang, S., Yu, X., Zhang, L.: PMSC: PatchMatch-based superpixel cut for accurate stereo matching. IEEE Transactions on Circuits and Systems for Video Technology, vol. 28, no. 3, pp. 679–692 (2016). http://dx.doi.org/10.1109/tcsvt.2016.2628782
[10] Shen, Z., Dai, Y., Song, X., Rao, Z., Zhou, D., Zhang, L.: PCW-Net: Pyramid combination and warping cost volume for stereo matching. In: European Conference on Computer Vision, pp. 280–297. Springer (2022)
[11] Chen, J., Hou, J., Ni, Y., Chau, L.-P.: Accurate light field depth estimation with superpixel regularization over partially occluded regions. IEEE Transactions on Image Processing vol. 27, no. 10, pp. 4889–4900 (2018)
[12] Kendall, A., Martirosyan, H., Dasgupta, S., Henry, P., Kennedy, R., Bachrach, A., Bry, A.: End-to-end learning of geometry and context for deep stereo regression. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 66–75 (2017). http://dx.doi.org/10.1109/iccv.2017.17
[13] Chang, J.-R., Chen, Y.-S.: Pyramid stereo matching network. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5410–5418 (2018). http://dx.doi.org/10.1109/cvpr.2018.00567
[14] Guo, X., Yang, K., Yang, W., Wang, X., Li, H.: Group-wise correlation stereo network. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3273–3282 (2019). http://dx.doi.org/10.1109/cvpr.2019.00339
[15] Duggal, S., Wang, S., Ma, W.-C., Hu, R., Urtasun, R.: Deeppruner: Learning efficient stereo matching via differentiable patchmatch. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4384–4393 (2019). http://dx.doi.org/10.1109/iccv.2019.00448
[16] Tosi, F., Liao, Y., Schmitt, C., Geiger, A.: Smd-nets: Stereo mixture density networks. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 8942–8952 (2021). http://dx.doi.org/10.1109/cvpr46437.2021.00883
[17] Chen, C., Chen, X., Cheng, H.: On the over-smoothing problem of CNN based disparity estimation. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 8997–9005 (2019). http://dx.doi.org/10.1109/iccv.2019.00909
[18] Yang, F., Sun, Q., Jin, H., Zhou, Z.: Superpixel segmentation with fully convolutional networks. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 13964–13973 (2020). http://dx.doi.org/10.1109/cvpr42600.2020.01398
[19] Klaus, A., Sormann, M., Karner, K.: Segment-based stereo matching using belief propagation and a self-adapting dissimilarity measure. In: 18th International Conference on Pattern Recognition (ICPR’06), vol. 3, pp. 15–18. IEEE (2006). http://dx.doi.org/10.1109/icpr.2006.1033
[20] Wu, Z., Wu, X., Zhang, X., Wang, S., Ju, L.: Semantic stereo matching with pyramid cost volumes. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 7484–7493 (2019). http://dx.doi.org/10.1109/iccv.2019.00758
[21] Lipson, L., Teed, Z., Deng, J.: Raft-stereo: Multilevel recurrent field transforms for stereo matching. In: 2021 International Conference on 3D Vision (3DV), pp. 218–227. IEEE (2021). http://dx.doi.org/10.1109/3dv53792.2021.00032
[22] Bangunharcana, A., Cho, J. W., Lee, S., Kweon, I. S., Kim, K.-S., Kim, S.: Correlate-and-excite: Real-time stereo matching via guided cost volume excitation. In: 2021 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 3542–3548. IEEE (2021). http://dx.doi.org/10.1109/iros51168.2021.9635909
[23] Shamsafar, F., Woerz, S., Rahim, R., Zell, A.: Mobilestereonet: Towards lightweight deep networks for stereo matching. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2417–2426 (2022). http://dx.doi.org/10.1109/wacv51458.2022.00075
[24] Ji, P., Li, J., Li, H., Liu, X.: Superpixel alpha-expansion and normal adjustment for stereo matching. Journal of Visual Communication and Image Representation, vol. 79, 103238 (2021)
