
Learning to Upsample by Learning to Sample

Wenze Liu Hao Lu* Hongtao Fu Zhiguo Cao

School of Artificial Intelligence and Automation,


Huazhong University of Science and Technology, China
arXiv:2308.15085v1 [cs.CV] 29 Aug 2023

{wzliu,hlu}@hust.edu.cn

* Corresponding author

Abstract

We present DySample, an ultra-lightweight and effective dynamic upsampler. While impressive performance gains have been witnessed from recent kernel-based dynamic upsamplers such as CARAFE, FADE, and SAPA, they introduce much workload, mostly due to the time-consuming dynamic convolution and the additional sub-network used to generate dynamic kernels. Further, the need for high-res feature guidance of FADE and SAPA somehow limits their application scenarios. To address these concerns, we bypass dynamic convolution and formulate upsampling from the perspective of point sampling, which is more resource-efficient and can be easily implemented with the standard built-in function in PyTorch. We first showcase a naive design, and then demonstrate how to strengthen its upsampling behavior step by step towards our new upsampler, DySample. Compared with former kernel-based dynamic upsamplers, DySample requires no customized CUDA package and has much fewer parameters, FLOPs, GPU memory, and latency. Besides the light-weight characteristics, DySample outperforms other upsamplers across five dense prediction tasks, including semantic segmentation, object detection, instance segmentation, panoptic segmentation, and monocular depth estimation. Code is available at https://github.com/tiny-smart/dysample.

Figure 1. Comparison of performance, inference speed, and GFLOPs of different upsamplers. The circle size indicates the GFLOPs cost. The inference time is tested by ×2 upsampling a feature map of size 256 × 120 × 120. The mIoU performance and additional GFLOPs are tested with SegFormer-B1 [40] on the ADE20K data set [42].

1. Introduction

Feature upsampling is a crucial ingredient in dense prediction models for gradually recovering the feature resolution. The most commonly used upsamplers are nearest neighbor (NN) and bilinear interpolation, which follow fixed rules to interpolate upsampled values. To increase flexibility, learnable upsamplers have been introduced in some specific tasks, e.g., deconvolution in instance segmentation [13] and pixel shuffle [34] in image super-resolution [31, 12, 22]. However, they either suffer from checkerboard artifacts [32] or seem unfriendly to high-level tasks. With the popularity of dynamic networks [14], some dynamic upsamplers have shown great potential on several tasks. CARAFE [37] generates content-aware upsampling kernels to upsample the feature by dynamic convolution. The follow-up works FADE [29] and SAPA [30] propose to combine both the high-res guiding feature and the low-res input feature to generate dynamic kernels, such that the upsampling process can be guided by the higher-res structure. These dynamic upsamplers are often of complicated structure, require customized CUDA implementations, and cost much more inference time than bilinear interpolation. Particularly for FADE and SAPA, the higher-res guiding feature introduces even more computational workload and narrows their application scenarios (higher-res features must be available). Different from the early plain network [27], multi-scale features are often used in modern architectures; therefore the higher-res feature as an input to upsamplers may not be necessary. For example, in the Feature Pyramid Network (FPN) [23], the higher-res feature is added to the low-res feature after upsampling. As a result, we believe that a well-designed single-input dynamic upsampler would be sufficient.

Considering the heavy workload introduced by dynamic convolution, we bypass the kernel-based paradigm and return to the essence of upsampling, i.e., point sampling, to reformulate the upsampling process. Specifically, we hypothesize that the input feature is interpolated to a continuous one with bilinear interpolation, and content-aware sampling points are generated to re-sample the continuous map. From this perspective, we first present a simple design, where point-wise offsets are generated by linear projection and used to re-sample point values with the grid sample function in PyTorch. Then we showcase how to improve it with step-by-step tweaks by i) controlling the initial sampling position, ii) adjusting the moving scope of the offsets, and iii) dividing the upsampling process into several independent groups, and obtain our new upsampler, DySample. At each step, we explain why the tweak is required and conduct experiments to verify the performance gain.

Compared with other dynamic upsamplers, DySample i) does not need high-res guiding features as input nor ii) any extra CUDA packages other than PyTorch, and particularly, iii) has much less inference latency, memory footprint, FLOPs, and number of parameters, as shown in Fig. 1 and Fig. 8. For example, on semantic segmentation with MaskFormer-SwinB [8] as the baseline, DySample invites 46% more performance improvement than CARAFE, but requires only 3% of the parameters and 20% of the FLOPs of CARAFE. Thanks to the highly optimized PyTorch built-in function, the inference time of DySample also approaches that of bilinear interpolation (6.2 ms vs. 1.6 ms when upsampling a 256 × 120 × 120 feature map). Besides these appealing light-weight characteristics, DySample reports better performance compared with other upsamplers across five dense prediction tasks, including semantic segmentation, object detection, instance segmentation, panoptic segmentation, and monocular depth estimation.

In a nutshell, we think DySample can safely replace NN/bilinear interpolation in existing dense prediction models, in light of not only effectiveness but also efficiency.

2. Related Work

We review dense prediction tasks, feature upsampling operators, and dynamic sampling in deep learning.

Dense Prediction Tasks. Dense prediction refers to a branch of tasks that require point-wise label prediction, such as semantic/instance/panoptic segmentation [2, 39, 40, 8, 7, 13, 11, 16, 19], object detection [33, 4, 24, 36], and monocular depth estimation [38, 18, 3, 21]. Different tasks often exhibit distinct characteristics and difficulties. For example, it is hard to predict both smooth interior regions and sharp edges in semantic segmentation, and it is also difficult to distinguish different objects in instance-aware tasks. In depth estimation, pixels with the same semantic meaning may have rather different depths, and vice versa. One often has to customize different architectures for different tasks. Though the model structure varies, upsampling operators are essential ingredients in dense prediction models. Since a backbone typically outputs multi-scale features, the low-res ones need to be upsampled to higher resolution. Therefore a light-weight, effective upsampler would benefit many dense prediction models. We will show that our new upsampler design brings a consistent performance boost on SegFormer [40] and MaskFormer [8] for semantic segmentation, on Faster R-CNN [33] for object detection, on Mask R-CNN [13] for instance segmentation, on Panoptic FPN [16] for panoptic segmentation, and on DepthFormer [21] for monocular depth estimation, while introducing negligible workload.

Feature Upsampling. The commonly used feature upsamplers are NN and bilinear interpolation. They apply fixed rules to interpolate the low-res feature, ignoring the semantic meaning in the feature map. Max unpooling has been adopted in semantic segmentation by SegNet [2] to preserve the edge information, but the introduction of noise and zero filling destroys the semantic consistency in smooth areas. Similar to convolution, some learnable upsamplers introduce learnable parameters into upsampling. For example, deconvolution upsamples features in a reverse fashion of convolution. Pixel shuffle [34] uses convolution to increase the channel number beforehand and then reshapes the feature map to increase the resolution.

Recently, some dynamic upsampling operators conduct content-aware upsampling. CARAFE [37] uses a sub-network to generate content-aware dynamic convolution kernels to reassemble the input feature. FADE [29] proposes to combine the high-res and low-res features to generate dynamic kernels, for the sake of using the high-res structure. SAPA [30] further introduces the concept of point affiliation and computes similarity-aware kernels between high-res and low-res features. Being model plugins, these dynamic upsamplers increase complexity more than expected, especially FADE and SAPA, which require high-res feature input. Hence, our goal is to contribute a simple, fast, low-cost, and universal upsampler, while preserving the effectiveness of dynamic upsampling.
Dynamic Sampling. Upsampling is about modeling geometric information. A stream of work also models geometric information by dynamically sampling an image or a feature map, as a substitution of the standard grid sampling. Dai et al. [9] and Zhu et al. [43] propose deformable convolutional networks, where the rectangular window sampling in standard convolution is replaced with shifted point sampling. Deformable DETR [44] follows this manner and samples key points relative to a certain query to conduct deformable attention. Similar practices also take place when images are downsampled to low-res ones for content-aware image resizing, a.k.a. seam carving [1]. E.g., Zhang et al. [41] propose to learn to downsample an image with saliency guidance, in order to preserve more information of the original image, and Jin et al. [15] also set a learnable deformation module to downsample the images.

Different from recent kernel-based upsamplers, we interpret the essence of upsampling as point re-sampling. Therefore, in feature upsampling, we follow the same spirit as the work above and use simple designs to achieve a strong and efficient dynamic upsampler.

Figure 2. Sampling-based dynamic upsampling and module designs in DySample. The input feature, upsampled feature, generated offset, and original grid are denoted by X, X′, O, and G, respectively. (a) The sampling set is generated by the sampling point generator, with which the input feature is re-sampled by the grid sample function. (b) In the generator, the sampling set is the sum of the generated offset and the original grid position. The upper box shows the version with the 'static scope factor', where the offset is generated with a linear layer; the bottom one describes the version with the 'dynamic scope factor', where the scope factor is first generated and then used to modulate the offset. 'σ' denotes the sigmoid function.

3. Learning to Sample and Upsample

In this section we elaborate the designs of DySample and its variants. We first present a naive implementation and then show how to improve it step by step.

3.1. Preliminary

We return to the essence of upsampling, i.e., point sampling, in the light of modeling geometric information. With the built-in function in PyTorch, we first provide a naive implementation to demonstrate the feasibility of sampling-based dynamic upsampling (Fig. 2(a)).

Grid Sampling. Given a feature map X of size C × H1 × W1 and a sampling set S of size 2 × H2 × W2, where the 2 in the first dimension denotes the x and y coordinates, the grid sample function uses the positions in S to re-sample the hypothetically bilinear-interpolated X into X′ of size C × H2 × W2. This process is defined by

\mathcal{X}' = \mathrm{grid\_sample}(\mathcal{X}, \mathcal{S})\,. \quad (1)

Naive Implementation. Given an upsampling scale factor s and a feature map X of size C × H × W, a linear layer, whose input and output channel numbers are C and 2s², is used to generate the offset O of size 2s² × H × W, which is then reshaped to 2 × sH × sW by pixel shuffling [34]. The sampling set S is then the sum of the offset O and the original sampling grid G, i.e.,

\mathcal{O} = \mathrm{linear}(\mathcal{X})\,, \quad (2)
\mathcal{S} = \mathcal{G} + \mathcal{O}\,, \quad (3)

where the reshaping operation is omitted. Finally, the upsampled feature map X′ of size C × sH × sW can be generated with the sampling set by grid sample as in Eq. (1).

This preliminary design obtains 37.9 AP with Faster R-CNN [33] on object detection [25] and 41.9 mIoU with SegFormer-B1 [40] on semantic segmentation [42] (cf. CARAFE: 38.6 AP and 42.8 mIoU). Next we present DySample upon this naive implementation.
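To make the naive design concrete, a minimal PyTorch-style sketch is given below. It is an illustrative reading of the description above, not the authors' released code: the module name, the pixel-unit offset convention, and the nearest-style base grid are our assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class NaiveSamplingUpsampler(nn.Module):
    """Illustrative sketch of the naive point-sampling upsampler (Sec. 3.1)."""

    def __init__(self, channels, scale=2):
        super().__init__()
        self.scale = scale
        # "Linear layer" realized as a 1x1 conv: C -> 2 * s^2 offset channels.
        self.offset = nn.Conv2d(channels, 2 * scale ** 2, kernel_size=1)

    def forward(self, x):                                   # x: (B, C, H, W)
        s = self.scale
        b, _, h, w = x.shape
        # O = linear(X), then reshape to 2 x sH x sW by pixel shuffling, Eq. (2).
        offset = F.pixel_shuffle(self.offset(x), s)          # (B, 2, sH, sW)

        # Original sampling grid G ("nearest initialization"): each input pixel
        # centre repeated s x s times, so all s^2 upsampled points start from
        # the same position (revisited in Sec. 3.2).
        ys = torch.linspace(-1, 1, h, device=x.device, dtype=x.dtype)
        xs = torch.linspace(-1, 1, w, device=x.device, dtype=x.dtype)
        gy, gx = torch.meshgrid(ys, xs, indexing="ij")
        grid = torch.stack((gx, gy))                          # (2, H, W), (x, y)
        grid = grid.repeat_interleave(s, 1).repeat_interleave(s, 2)  # (2, sH, sW)

        # S = G + O, Eq. (3); offsets are assumed to be in input-pixel units,
        # so convert them to grid_sample's normalized [-1, 1] coordinates.
        norm = torch.tensor([w, h], dtype=x.dtype, device=x.device).view(1, 2, 1, 1)
        coords = (grid.unsqueeze(0) + 2 * offset / norm).permute(0, 2, 3, 1)

        # X' = grid_sample(X, S), Eq. (1).
        return F.grid_sample(x, coords, mode="bilinear", align_corners=True)
```

As a usage example, `up = NaiveSamplingUpsampler(256)` applied to a tensor of shape 2 × 256 × 32 × 32 yields a 2 × 256 × 64 × 64 output.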
3.2. DySample: Upsampling by Dynamic Sampling

By studying the naive implementation, we observe that the shared initial offset position among the s² upsampled points neglects the position relation, and that the unconstrained walking scope of the offsets can cause disordered point sampling. We first discuss these two issues. We will also study implementation details such as feature groups and the dynamic offset scope.

Initial Sampling Position. In the preliminary version, the s² sampling positions w.r.t. one point in X are all fixed at the same initial position (the standard grid points in X), as shown in Fig. 3(a). This practice ignores the position relation among the s² neighboring points, such that the initial sampling positions distribute unevenly. If the generated offsets are all zeros, the upsampled feature is equivalent to the NN-interpolated one. Hence, this preliminary initialization can be called 'nearest initialization'. Targeting this problem, we alter the initial positions to 'bilinear initialization' as in Fig. 3(b), where zero offsets would produce the bilinearly interpolated feature map. After changing the initial sampling position, the performance improves to 38.1 (+0.2) AP and 42.1 (+0.2) mIoU, as shown in Table 1.

Sampling Initialization mIoU AP
Nearest Initialization 41.9 37.9
Bilinear Initialization 42.1 38.1

Table 1. Ablation study on the initial sampling position.

Figure 3. Initial sampling positions and offset scopes. The points and the colored masks represent the initial sampling positions and the offset scopes, respectively. Considering sampling four points (s = 2): (a) in the case of nearest initialization, the four offsets share the same initial position but ignore the position relation; (b) in bilinear initialization, we separate the initial positions such that they distribute evenly; without offset modulation the offset scopes would typically overlap, so in (c) we locally constrain the offset scope to reduce the overlap.

Offset Scope. Due to the existence of normalization layers, the values of a certain output feature are typically in the range of [−1, 1], centered at 0. Therefore, the walking scopes of the local s² sampling positions could overlap significantly, as shown in Fig. 4(a). The overlap would easily influence the prediction near boundaries (Fig. 4(b)), and such errors would propagate stage by stage and cause output artifacts (Fig. 4(c)). To alleviate this, we multiply the offset by a factor of 0.25, which just meets the theoretical marginal condition between overlap and non-overlap (for s = 2, neighboring initial positions are 0.5 pixel apart, so a ±0.25 range makes adjacent scopes just touch). This factor is called the 'static scope factor', such that the walking scope of the sampling positions is locally constrained, as shown in Fig. 3(c). Here we rewrite Eq. (2) as

\mathcal{O} = 0.25\,\mathrm{linear}(\mathcal{X})\,. \quad (4)

By setting the scope factor to 0.25, the performance improves to 38.3 (+0.2) AP and 42.4 (+0.3) mIoU. We also test other possible factors, as shown in Table 2.

Factor mIoU AP
0.1 42.2 38.1
0.25 42.4 38.3
0.5 42.2 38.1
1 42.1 38.1

Table 2. Ablation study on the effect of the static scope factor.

Groups Dynamic mIoU AP
1 – 42.4 38.3
1 ✓ 42.6 38.4
4 – 43.2 38.6
4 ✓ 43.3 38.7

Table 3. Ablation study on the effect of the dynamic scope factor.

Figure 4. Prediction artifacts due to offset overlap: (a) overlapped sampling, (b) boundary disorder, (c) semantic artifacts. If the offsets overlap (a), the point values near boundaries may be in disorder (b), and the error would propagate layer by layer and finally cause prediction artifacts (c).

Remark. Multiplying by the factor is a soft solution of the problem; it cannot completely solve it. We have also tried to strictly constrain the offset scope in [−0.25, 0.25] with the tanh function, but it works worse. Perhaps the explicit constraint limits the representation power, e.g., the explicitly constrained version cannot handle the situation where a certain position expects a shift larger than 0.25.
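A hedged sketch of these two tweaks, applied to the offset branch of the naive design above, is given below. The 0.25 value follows Eq. (4); the grid construction and the pixel-unit offset convention are again our assumptions.

```python
import torch
import torch.nn.functional as F

def bilinear_init_grid(h, w, s, device, dtype=torch.float32):
    """Bilinear initialization: the s^2 sub-points of each input pixel are
    spread evenly, i.e. the regular grid of the sH x sW output, so that zero
    offsets reproduce plain bilinear upsampling."""
    ys = torch.linspace(-1, 1, s * h, device=device, dtype=dtype)
    xs = torch.linspace(-1, 1, s * w, device=device, dtype=dtype)
    gy, gx = torch.meshgrid(ys, xs, indexing="ij")
    return torch.stack((gx, gy))                      # (2, sH, sW), (x, y) order

def resample(x, offset, scope=0.25):
    """Re-sample x with bilinear-initialized positions and a static scope factor.
    `offset` is the raw pixel-shuffled offset map, (B, 2, sH, sW), in input-pixel
    units; Eq. (4) scales it by 0.25 before it is added to the grid."""
    b, _, h, w = x.shape
    s = offset.shape[-1] // w
    grid = bilinear_init_grid(h, w, s, x.device, x.dtype)
    norm = torch.tensor([w, h], dtype=x.dtype, device=x.device).view(1, 2, 1, 1)
    coords = (grid.unsqueeze(0) + scope * 2 * offset / norm).permute(0, 2, 3, 1)
    return F.grid_sample(x, coords, mode="bilinear", align_corners=True)
```

With zero offsets this reproduces bilinear interpolation exactly, and with s = 2 the ±0.25 range lets the scopes of neighboring sub-points just touch rather than overlap.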
Grouping. Here we study group-wise upsampling, where features share the same sampling set in each group. Specifically, one can divide the feature map into g groups along the channel dimension and generate g groups of offsets. According to Fig. 5, grouping works. When g = 4, the performance reaches 38.6 (+0.3) AP and 43.2 (+0.8) mIoU.

Figure 5. Ablation study on the number of feature groups (1, 2, 4, 8, or 16), in mIoU and AP, for (a) the LP style and (b) the PL style.

Dynamic Scope Factor. To increase the flexibility of the offsets, we further generate point-wise 'dynamic scope factors' by linearly projecting the input feature. By using the sigmoid function and a 0.5 static factor, the dynamic scope takes values in the range of [0, 0.5], centered at 0.25 as in the static case. The dynamic scope operation is illustrated in Fig. 2(b). Here we rewrite Eq. (4) as

\mathcal{O} = 0.5\,\mathrm{sigmoid}(\mathrm{linear}_1(\mathcal{X})) \cdot \mathrm{linear}_2(\mathcal{X})\,. \quad (5)

Per Table 3, the dynamic scope factor further boosts the performance to 38.7 (+0.1) AP and 43.3 (+0.1) mIoU.

Offset Generation Styles. In the design above, linear projection is first used to produce s² offset sets, which are then reshaped to the target spatial size. We call this process 'linear + pixel shuffle' (LP). To save parameters and GFLOPs, we can execute the reshaping operation first, i.e., reshape the feature X to the size of C/s² × sH × sW and then linearly project it to 2g × sH × sW. Similarly, we call this procedure 'pixel shuffle + linear' (PL). With other hyper-parameters fixed, the number of parameters can be reduced to 1/s⁴ under the PL setting. Through experiments, we empirically set the group number to 4 and 8 for the LP and PL versions, respectively, according to Fig. 5. Further, we find that the PL version works better than the LP version on SegFormer (Table 4) and MaskFormer (Table 5), but slightly worse on the other tested models.

Figure 6. Offset generation styles: (a) 'linear + pixel shuffle' (LP) and (b) 'pixel shuffle + linear' (PL). While the LP version requires more parameters than the PL version, it is more flexible, consumes a smaller memory footprint, and has faster inference speed.

DySample Series. According to the form of the scope factor (static/dynamic) and the offset generation style (LP/PL), we investigate four variants:

i) DySample: LP-style with the static scope factor;
ii) DySample+: LP-style with the dynamic scope factor;
iii) DySample-S: PL-style with the static scope factor;
iv) DySample-S+: PL-style with the dynamic scope factor.
iv) DySample-S+: PL-style with dynamic scope factor. mantic cluster into feature upsampling and views the up-
𝑣
𝑢

3.3. How DySample works

The sampling process of DySample is visualized in Fig. 7. We highlight a (red boxed) local region to show how DySample divides one point on the edge into four to make the edge clearer. For the yellow boxed point, it generates four offsets pointing to the four upsampled points in the sense of bilinear interpolation. In this example, the top left point is assigned to the 'sky' (lighter), while the other three are assigned to the 'house' (darker). The rightmost subplot indicates how the bottom right upsampled point is formed.

Figure 7. Visualization of the upsampling process in DySample (panels: image, input feature, predicted offsets, upsampled feature; input local region, upsampled local region, bilinear interpolation). A part of the boundary in the red box is highlighted for a close view. We generate content-aware offsets to construct new sampling points that re-sample the input feature map with bilinear interpolation; the new sampling positions are indicated by the arrowheads. The yellow boxed point in the low-res feature is selected to illustrate the bilinear interpolation process, where the sampled value combines its four neighbors with weights (1 − u)(1 − v), u(1 − v), (1 − u)v, and uv.

3.4. Complexity Analysis

We use a random feature map of size 256 × 120 × 120 (and a guidance map of size 256 × 240 × 240 if required) as the input to test the inference latency. We use SegFormer-B1 to compare the performance, training memory, training time, GFLOPs, and number of parameters when bilinear interpolation (the default) is replaced by other upsamplers.

The quantitative results are shown in Fig. 8. Besides the best performance, the DySample series cost the least inference latency, training memory, training time, GFLOPs, and number of parameters among all previous strong dynamic upsamplers. For the inference time, the DySample series cost 6.2∼7.6 ms to upsample a 256 × 120 × 120 feature map, which approaches that of bilinear interpolation (1.6 ms). Particularly, due to the use of the highly optimized PyTorch built-in function, the backward propagation of DySample is rather fast; the increased training time is negligible. Among the DySample series, the '-S' versions cost fewer parameters and GFLOPs, but more memory footprint and latency, because PL needs extra storage of X. The '+' versions also introduce a bit more computation.

Figure 8. Complexity analysis. The DySample series achieve the overall best performance on SegFormer-B1 [40], and cost the least latency, memory footprint, training time, GFLOPs, and number of parameters among the recent strong dynamic upsamplers. The inference time is tested by upsampling a 256 × 120 × 120 feature map (and a 256 × 240 × 240 guidance feature if needed) with a single Nvidia GTX 3090 GPU on a server. '+' means the additional amount compared with bilinear interpolation. Upsamplers compared: CARAFE, FADE, SAPA-B, DySample-S, DySample-S+, DySample, and DySample+.
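In spirit, the latency protocol can be reproduced with a simple CUDA-synchronized timing loop. The harness below is an assumption (warm-up and repeat counts are ours), and it reuses the illustrative DySampleLP sketch from Sec. 3.2 rather than the released implementation.

```python
import time
import torch
import torch.nn.functional as F

def measure_ms(fn, x, warmup=20, iters=100):
    """Average forward latency in milliseconds with CUDA synchronization."""
    for _ in range(warmup):
        fn(x)
    torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(iters):
        fn(x)
    torch.cuda.synchronize()
    return 1000.0 * (time.perf_counter() - start) / iters

x = torch.randn(1, 256, 120, 120, device="cuda")      # the 256 x 120 x 120 test input
upsampler = DySampleLP(256, scale=2).cuda().eval()    # sketch from Sec. 3.2

with torch.no_grad():
    t_bilinear = measure_ms(
        lambda t: F.interpolate(t, scale_factor=2, mode="bilinear", align_corners=False), x)
    t_dysample = measure_ms(upsampler, x)
print(f"bilinear: {t_bilinear:.2f} ms | DySample (sketch): {t_dysample:.2f} ms")
```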

3.5. Discussion on Related Work

Here we compare DySample with CARAFE [37], SAPA [30], and deformable attention [44].

Relation to CARAFE. CARAFE generates content-aware upsampling kernels to reassemble the input feature. In DySample, we generate upsampling positions instead of kernels. Under the kernel-based view, DySample uses 2 × 2 bilinear kernels, while CARAFE uses 5 × 5 ones. In CARAFE, if a kernel is placed centered at a point, the kernel size must be at least 3 × 3, so the GFLOPs are at least 2.25 times larger than in DySample. Besides, the upsampling kernel weights in CARAFE are learned, but in DySample they are conditioned on the x and y positions. Therefore, to maintain a single kernel, DySample only needs a 2-channel feature map (given that the group number g = 1), whereas CARAFE requires a K × K-channel one, which explains why DySample is more efficient.

Relation to SAPA. SAPA introduces the concept of semantic clusters into feature upsampling and views the upsampling process as finding a correct semantic cluster for each upsampling point. In DySample, offset generation can also be seen as seeking a semantically similar region for each point. However, DySample does not need the guidance map and is thus more efficient and easier to use.

Relation to Deformable Attention. Deformable attention [44] mainly enhances features; it samples many points at each position and aggregates them to form a new point. In contrast, DySample is tailored for upsampling; it samples a single point for each upsampled position to divide one point into s² upsampled points. DySample reveals that sampling a single point for each upsampled position is enough as long as the upsampled s² points can be dynamically divided.
4. Applications

Here we apply DySample to five dense prediction tasks, including semantic segmentation, object detection, instance segmentation, panoptic segmentation, and depth estimation. Among the upsampler competitors, for bilinear interpolation we set the scale factor to 2 and 'align corners' to False. For deconvolution, we set the kernel size to 3, the stride to 2, the padding to 1, and the output padding to 1. For pixel shuffle [34], we first use a convolution with kernel size 3 to increase the channel number to 4 times the original one, and then apply the 'pixel shuffle' function. For CARAFE [37], we adopt its default setting. The 'HIN' version of IndexNet [28] and the 'dynamic-cs-d†' version of A2U [10] are used. FADE [29] without the gating mechanism and SAPA-B [30] are used because they are more stable across all the dense prediction tasks.
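For reference, the fixed and learnable baseline upsamplers described above correspond roughly to the following standard PyTorch layers (a sketch of the stated settings with an assumed channel width, not the exact experiment code):

```python
import torch.nn as nn

channels = 256  # assumed channel width for illustration

# Bilinear interpolation: scale factor 2, align_corners=False.
bilinear = nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False)

# Deconvolution: kernel size 3, stride 2, padding 1, output padding 1.
deconv = nn.ConvTranspose2d(channels, channels, kernel_size=3, stride=2,
                            padding=1, output_padding=1)

# Pixel shuffle: a 3x3 conv to 4x channels, then the 'pixel shuffle' function.
pixel_shuffle = nn.Sequential(
    nn.Conv2d(channels, 4 * channels, kernel_size=3, padding=1),
    nn.PixelShuffle(2),
)
```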
Figure 9. Qualitative visualizations. Columns, left to right: image, ground truth, CARAFE, IndexNet, A2U, FADE, SAPA, DySample. From top to bottom: semantic segmentation, object detection, instance segmentation, panoptic segmentation, and monocular depth estimation.

4.1. Semantic Segmentation

Semantic segmentation infers per-pixel class labels. Upsamplers are often applied several times to obtain the high-res output in typical models, so the precise per-pixel prediction is largely dependent on the upsampling quality.

Experimental Protocols. We use the ADE20K [42] data set. Besides the commonly used mIoU metric, we also report the bIoU [6] metric to evaluate the boundary quality. We first use a light-weight baseline, SegFormer-B1 [40], where 3 + 2 + 1 = 6 upsampling stages are involved, and then test DySample on a stronger baseline, MaskFormer [8], with Swin-B [26] and Swin-L as the backbone, where 3 upsampling stages are involved in the FPN. We use the official codebases provided by the authors and follow all the training settings, modifying only the upsampling stages.

SegFormer-B1 FLOPs Params mIoU bIoU
Bilinear 15.9 13.7M 41.68 27.80
Deconv +34.4 +3.5M 40.71 25.94
PixelShuffle [34] +34.4 +14.2M 41.50 26.58
CARAFE [37] +1.5 +0.4M 42.82 29.84
IndexNet [28] +30.7 +12.6M 41.50 28.27
A2U [10] +0.4 +0.1M 41.45 27.31
FADE [29] +2.7 +0.3M 43.06 31.68
SAPA-B [30] +1.0 +0.1M 43.20 30.96
DySample-S +0.2 +6.1K 43.23 29.53
DySample-S+ +0.3 +12.3K 43.58 29.93
DySample +0.3 +49.2K 43.21 29.12
DySample+ +0.4 +0.1M 43.28 29.23

Table 4. Semantic segmentation results with SegFormer-B1 on ADE20K. Best performance is in boldface and second best is underlined.

Semantic Segmentation Results. Quantitative results are shown in Tables 4 and 5. DySample achieves the best mIoU of 43.58 on SegFormer-B1, but its bIoU is lower than those of guided upsamplers such as FADE and SAPA. We can therefore infer that DySample improves the performance mainly in interior regions, whereas the guided upsamplers mainly improve boundary quality. As shown in Fig. 9, row 1, the output of DySample is similar to that of CARAFE, but more distinctive near boundaries; the guided upsamplers predict sharper boundaries, but have wrong predictions in interior regions. For the stronger baseline MaskFormer, DySample also improves the mIoU from 52.70 to 53.91 (+1.21) with Swin-B and from 54.10 to 54.90 (+0.80) with Swin-L.

4.2. Object Detection and Instance Segmentation

Being instance-level tasks, object detection aims to localize and classify objects, while instance segmentation needs to further segment the objects. The quality of the upsampled features can have a large effect on the classification, localization, and segmentation accuracy.
Backbone Upsampler mIoU
Swin-B Nearest 52.70
Swin-B CARAFE 53.53
Swin-B DySample-S+ 53.91
Swin-L Nearest 54.10
Swin-L CARAFE 54.61
Swin-L DySample-S+ 54.90

Table 5. Semantic segmentation results with MaskFormer on ADE20K. Best performance is in boldface and second best is underlined.

Experimental Protocols. We use the MS COCO [25] data set. The AP series metrics are reported. Faster R-CNN [33] and Mask R-CNN [13] are chosen as the baselines. We modify the upsamplers in the FPN architecture for the performance comparison. There are four and three upsampling stages in the FPN of Faster R-CNN and of Mask R-CNN, respectively. We use the code provided by mmdetection [5] and follow the 1× training settings.

Object Detection and Instance Segmentation Results. Quantitative results are shown in Tables 6 and 7. Results show that DySample outperforms all compared upsamplers. With R50, DySample achieves the best performance among all tested upsamplers. When a stronger backbone is used, notable improvements can also be witnessed (R50 +1.2 vs. R101 +1.1 box AP on Faster R-CNN, and R50 +1.0 vs. R101 +0.8 mask AP on Mask R-CNN).

Faster R-CNN Backbone Params AP AP50 AP75 APS APM APL
Nearest R50 46.8M 37.5 58.2 40.8 21.3 41.1 48.9
Deconv R50 +2.4M 37.3 57.8 40.3 21.3 41.1 48.0
PixelShuffle [34] R50 +9.4M 37.5 58.5 40.4 21.5 41.5 48.3
CARAFE [37] R50 +0.3M 38.6 59.9 42.2 23.3 42.2 49.7
IndexNet [28] R50 +8.4M 37.6 58.4 40.9 21.5 41.3 49.2
A2U [10] R50 +38.9K 37.3 58.7 40.0 21.7 41.1 48.5
FADE [29] R50 +0.2M 38.5 59.6 41.8 23.1 42.2 49.3
SAPA-B [30] R50 +0.1M 37.8 59.2 40.6 22.4 41.4 49.1
DySample-S R50 +4.1K 38.5 59.5 42.1 22.7 41.9 50.2
DySample-S+ R50 +8.2K 38.6 59.8 42.1 22.5 42.1 50.0
DySample R50 +32.7K 38.6 59.9 42.0 22.9 42.1 50.2
DySample+ R50 +65.5K 38.7 60.0 42.2 22.5 42.4 50.2
Nearest R101 65.8M 39.4 60.1 43.1 22.4 43.7 51.1
DySample+ R101 +65.5K 40.5 61.6 43.8 24.2 44.5 52.3

Table 6. Object detection results with Faster R-CNN on MS COCO. Best performance is in boldface and second best is underlined.

Mask R-CNN Task Backbone AP AP50 AP75 APS APM APL
Nearest Bbox R50 38.3 58.7 42.0 21.9 41.8 50.2
Deconv Bbox R50 37.9 58.5 41.0 22.0 41.6 49.0
PixelShuffle [34] Bbox R50 38.5 59.4 41.9 22.0 42.3 49.8
CARAFE [37] Bbox R50 39.2 60.0 43.0 23.0 42.8 50.8
IndexNet [28] Bbox R50 38.4 59.2 41.7 22.1 41.7 50.3
A2U [10] Bbox R50 38.2 59.2 41.4 22.3 41.7 49.6
FADE [29] Bbox R50 39.1 60.3 42.4 23.6 42.3 51.0
SAPA-B [30] Bbox R50 38.7 59.7 42.2 23.1 41.8 49.9
DySample-S Bbox R50 39.3 60.4 43.0 23.2 42.7 51.1
DySample-S+ Bbox R50 39.3 60.3 42.8 23.2 42.7 50.8
DySample Bbox R50 39.2 60.3 43.0 23.5 42.5 51.0
DySample+ Bbox R50 39.6 60.4 43.5 23.4 42.9 51.7
Nearest Bbox R101 40.0 60.4 43.7 22.8 43.7 52.0
DySample+ Bbox R101 41.0 61.9 44.9 24.3 45.0 53.5
Nearest Segm R50 34.7 55.8 37.2 16.1 37.3 50.8
Deconv Segm R50 34.5 55.5 36.8 16.4 37.0 49.5
PixelShuffle [34] Segm R50 34.8 56.0 37.3 16.3 37.5 50.4
CARAFE [37] Segm R50 35.4 56.7 37.6 16.9 38.1 51.3
IndexNet [28] Segm R50 34.7 55.9 37.1 16.0 37.0 51.1
A2U [10] Segm R50 34.6 56.0 36.8 16.1 37.4 50.3
FADE [29] Segm R50 35.1 56.7 37.2 16.7 37.5 51.4
SAPA-B [30] Segm R50 35.1 56.5 37.4 16.7 37.6 50.6
DySample-S Segm R50 35.4 56.8 37.8 16.7 38.0 51.4
DySample-S+ Segm R50 35.5 56.8 37.8 17.0 37.9 51.9
DySample Segm R50 35.4 56.9 37.8 17.1 37.7 51.1
DySample+ Segm R50 35.7 57.3 38.2 17.3 38.2 51.8
Nearest Segm R101 36.0 57.6 38.5 16.5 39.3 52.2
DySample+ Segm R101 36.8 58.7 39.5 17.5 40.0 53.8

Table 7. Instance segmentation results with Mask R-CNN on MS COCO. The parameter increment is identical to that in Faster R-CNN. Best performance is in boldface and second best is underlined.

Panoptic FPN Backbone Params PQ PQth PQst SQ RQ
Nearest R50 46.0M 40.2 47.8 28.9 77.8 49.3
Deconv R50 +1.8M 39.6 47.0 28.4 77.1 48.5
PixelShuffle [34] R50 +7.1M 40.0 47.4 28.8 77.1 49.1
CARAFE [37] R50 +0.2M 40.8 47.7 30.4 78.2 50.0
IndexNet [28] R50 +6.3M 40.2 47.6 28.9 77.1 49.3
A2U [10] R50 +29.2K 40.1 47.6 28.7 77.3 48.0
FADE [29] R50 +0.1M 40.9 48.0 30.3 78.1 50.1
SAPA-B [30] R50 +0.1M 40.6 47.7 29.8 78.0 49.6
DySample-S R50 +3.1K 40.6 48.0 29.6 78.0 49.8
DySample-S+ R50 +6.2K 41.1 48.1 30.5 78.2 50.2
DySample R50 +24.6K 41.4 48.5 30.7 78.6 50.7
DySample+ R50 +49.2K 41.5 48.5 30.8 78.3 50.7
Nearest R101 65.0M 42.2 50.1 30.3 78.3 51.4
DySample+ R101 +49.2K 43.0 50.2 32.1 78.6 52.4

Table 8. Panoptic segmentation results with Panoptic FPN on MS COCO. Best performance is in boldface and second best is underlined.
DepthFormer Params δ < 1.25 δ < 1.25² δ < 1.25³ Abs Rel RMS log10 RMS(log) Sq Rel
Bilinear 47.6M 0.873 0.978 0.994 0.120 0.402 0.050 0.148 0.071
Deconv +7.1M 0.872 0.980 0.995 0.117 0.401 0.050 0.147 0.067
PixelShuffle +28.2M 0.874 0.979 0.995 0.117 0.395 0.049 0.146 0.068
CARAFE [37] +0.3M 0.877 0.978 0.995 0.116 0.397 0.049 0.146 0.069
IndexNet [28] +6.3M 0.873 0.980 0.995 0.117 0.401 0.049 0.147 0.067
A2U [10] +30.0K 0.874 0.979 0.995 0.118 0.397 0.049 0.147 0.068
FADE [29] +0.2M 0.874 0.978 0.994 0.118 0.399 0.049 0.147 0.071
SAPA-B +0.1M 0.870 0.978 0.995 0.117 0.406 0.050 0.149 0.069
DySample-S +5.8K 0.871 0.979 0.995 0.118 0.402 0.050 0.148 0.069
DySample-S+ +11.5K 0.872 0.978 0.994 0.119 0.398 0.050 0.148 0.070
DySample +46.1K 0.872 0.979 0.995 0.117 0.400 0.050 0.147 0.068
DySample+ +92.2K 0.878 0.980 0.995 0.116 0.393 0.049 0.145 0.068

Table 9. Monocular depth estimation results with DepthFormer (Swin-T) on NYU Depth V2. Best performance is in boldface
and second best is underlined.

4.3. Panoptic Segmentation

Panoptic segmentation is the joint task of semantic segmentation and instance segmentation. In this context, the upsamplers face the difficulty of discriminating instance boundaries, which places high demands on the semantic perception and discriminative ability of the upsamplers.

Experimental Protocols. We also conduct experiments on the MS COCO [25] data set and report the PQ, SQ, and RQ metrics [17]. We adopt Panoptic FPN [16] as the baseline and mmdetection as our codebase. The default training setting is used to ensure a fair comparison. We only modify the three upsampling stages in the FPN.

Panoptic Segmentation Results. The quantitative results shown in Table 8 demonstrate that DySample invites consistent performance gains, i.e., 1.2 and 0.8 PQ improvement for the R50 and R101 backbones, respectively.

4.4. Monocular Depth Estimation

Monocular depth estimation requires a model to estimate a per-pixel depth map from a single image. A high-quality upsampler for depth estimation should simultaneously recover the details, maintain the consistency of depth values in plain regions, and also handle gradually changing depth values.

Experimental Protocols. We conduct the experiments on the NYU Depth V2 data set [35] and report the δ < 1.25, δ < 1.25², and δ < 1.25³ accuracy, absolute relative error (Abs Rel), root mean squared error (RMS) and its log version (RMS(log)), average log10 error (log10), and squared relative error (Sq Rel). We adopt DepthFormer-SwinT [21] as the baseline, which includes four upsampling stages in the fusion module. For reproducibility, we use the codebase provided by the monocular depth estimation toolbox [20] and follow its recommended training settings, while only modifying the upsamplers.

Monocular Depth Estimation Results. Quantitative results are shown in Table 9. Among all upsamplers, DySample+ achieves the best performance, with an increase of 0.005 in δ < 1.25 accuracy, a decrease of 0.004 in Abs Rel, and a decrease of 0.009 in RMS compared with bilinear upsampling. Further, the qualitative comparison in Fig. 9, row 5, also verifies the superiority of DySample, e.g., the accurate, consistent depth map of the chair.

5. Conclusion

We propose DySample, a fast, effective, and universal dynamic upsampler. Different from common kernel-based dynamic upsampling, DySample is designed from the perspective of point sampling. We start from a naive design and show how to gradually improve its performance based on our insights into upsampling. Compared with other dynamic upsamplers, DySample not only reports the best performance but also gets rid of customized CUDA packages and consumes the least computational resources, showing superiority in latency, training memory, training time, GFLOPs, and number of parameters. For future work, we plan to apply DySample to low-level tasks and study joint modeling of upsampling and downsampling.

Acknowledgement. This work is supported by the National Natural Science Foundation of China under Grant No. 62106080.
References

[1] Shai Avidan and Ariel Shamir. Seam carving for content-aware image resizing. In ACM SIGGRAPH 2007 Papers, pages 10–es, 2007.
[2] Vijay Badrinarayanan, Alex Kendall, and Roberto Cipolla. SegNet: A deep convolutional encoder-decoder architecture for image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39(12):2481–2495, 2017.
[3] Shariq Farooq Bhat, Ibraheem Alhashim, and Peter Wonka. AdaBins: Depth estimation using adaptive bins. In Proceedings of IEEE Conference on Computer Vision Pattern Recognition (CVPR), pages 4009–4018, 2021.
[4] Zhaowei Cai and Nuno Vasconcelos. Cascade R-CNN: Delving into high quality object detection. In Proceedings of IEEE Conference on Computer Vision Pattern Recognition (CVPR), pages 6154–6162, 2018.
[5] Kai Chen, Jiaqi Wang, Jiangmiao Pang, Yuhang Cao, Yu Xiong, Xiaoxiao Li, Shuyang Sun, Wansen Feng, Ziwei Liu, Jiarui Xu, et al. MMDetection: Open MMLab detection toolbox and benchmark. arXiv Computer Research Repository, 2019.
[6] Bowen Cheng, Ross Girshick, Piotr Dollár, Alexander C Berg, and Alexander Kirillov. Boundary IoU: Improving object-centric image segmentation evaluation. In Proceedings of IEEE Conference on Computer Vision Pattern Recognition (CVPR), pages 15334–15342, 2021.
[7] Bowen Cheng, Ishan Misra, Alexander G. Schwing, Alexander Kirillov, and Rohit Girdhar. Masked-attention mask transformer for universal image segmentation. In Proceedings of IEEE Conference on Computer Vision Pattern Recognition (CVPR), 2022.
[8] Bowen Cheng, Alex Schwing, and Alexander Kirillov. Per-pixel classification is not all you need for semantic segmentation. In Proceedings of Annual Conference on Neural Information Processing Systems (NeurIPS), volume 34, 2021.
[9] Jifeng Dai, Haozhi Qi, Yuwen Xiong, Yi Li, Guodong Zhang, Han Hu, and Yichen Wei. Deformable convolutional networks. In Proceedings of IEEE International Conference on Computer Vision (ICCV), pages 764–773, 2017.
[10] Yutong Dai, Hao Lu, and Chunhua Shen. Learning affinity-aware upsampling for deep image matting. In Proceedings of IEEE Conference on Computer Vision Pattern Recognition (CVPR), pages 6841–6850, 2021.
[11] Yuxin Fang, Shusheng Yang, Xinggang Wang, Yu Li, Chen Fang, Ying Shan, Bin Feng, and Wenyu Liu. Instances as queries. In Proceedings of IEEE International Conference on Computer Vision (ICCV), pages 6910–6919, 2021.
[12] Muhammad Haris, Gregory Shakhnarovich, and Norimichi Ukita. Deep back-projection networks for super-resolution. In Proceedings of IEEE Conference on Computer Vision Pattern Recognition (CVPR), pages 1664–1673, 2018.
[13] Kaiming He, Georgia Gkioxari, Piotr Dollár, and Ross Girshick. Mask R-CNN. In Proceedings of IEEE International Conference on Computer Vision (ICCV), pages 2961–2969, 2017.
[14] Xu Jia, Bert De Brabandere, Tinne Tuytelaars, and Luc V Gool. Dynamic filter networks. In Proceedings of Annual Conference on Neural Information Processing Systems (NeurIPS), pages 667–675, 2016.
[15] Chen Jin, Ryutaro Tanno, Thomy Mertzanidou, Eleftheria Panagiotaki, and Daniel C Alexander. Learning to downsample for segmentation of ultra-high resolution images. In Proceedings of International Conference on Learning Representations, 2022.
[16] Alexander Kirillov, Ross Girshick, Kaiming He, and Piotr Dollár. Panoptic feature pyramid networks. In Proceedings of IEEE Conference on Computer Vision Pattern Recognition (CVPR), pages 6399–6408, 2019.
[17] Alexander Kirillov, Kaiming He, Ross Girshick, Carsten Rother, and Piotr Dollár. Panoptic segmentation. In Proceedings of IEEE Conference on Computer Vision Pattern Recognition (CVPR), pages 9404–9413, 2019.
[18] Jin Han Lee, Myung-Kyu Han, Dong Wook Ko, and Il Hong Suh. From big to small: Multi-scale local planar guidance for monocular depth estimation. arXiv Computer Research Repository, 2019.
[19] Yanwei Li, Hengshuang Zhao, Xiaojuan Qi, Liwei Wang, Zeming Li, Jian Sun, and Jiaya Jia. Fully convolutional networks for panoptic segmentation. In Proceedings of IEEE Conference on Computer Vision Pattern Recognition (CVPR), pages 214–223, 2021.
[20] Zhenyu Li. Monocular depth estimation toolbox. https://github.com/zhyever/Monocular-Depth-Estimation-Toolbox, 2022.
[21] Zhenyu Li, Zehui Chen, Xianming Liu, and Junjun Jiang. DepthFormer: Exploiting long-range correlation and local information for accurate monocular depth estimation. arXiv Computer Research Repository, 2022.
[22] Jingyun Liang, Jiezhang Cao, Guolei Sun, Kai Zhang, Luc Van Gool, and Radu Timofte. SwinIR: Image restoration using Swin Transformer. In Proceedings of IEEE International Conference on Computer Vision (ICCV), pages 1833–1844, 2021.
[23] Tsung-Yi Lin, Piotr Dollár, Ross Girshick, Kaiming He, Bharath Hariharan, and Serge Belongie. Feature pyramid networks for object detection. In Proceedings of IEEE Conference on Computer Vision Pattern Recognition (CVPR), pages 2117–2125, 2017.
[24] Tsung-Yi Lin, Priya Goyal, Ross Girshick, Kaiming He, and Piotr Dollár. Focal loss for dense object detection. In Proceedings of IEEE International Conference on Computer Vision (ICCV), pages 2980–2988, 2017.
[25] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft COCO: Common objects in context. In Proceedings of European Conference on Computer Vision (ECCV), pages 740–755, 2014.
[26] Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, and Baining Guo. Swin Transformer: Hierarchical vision transformer using shifted windows. In Proceedings of IEEE International Conference on Computer Vision (ICCV), pages 10012–10022, 2021.
[27] J. Long, E. Shelhamer, and T. Darrell. Fully convolutional networks for semantic segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39(4):640–651, 2015.
[28] Hao Lu, Yutong Dai, Chunhua Shen, and Songcen Xu. Indices matter: Learning to index for deep image matting. In Proceedings of IEEE International Conference on Computer Vision (ICCV), pages 3266–3275, 2019.
[29] Hao Lu, Wenze Liu, Hongtao Fu, and Zhiguo Cao. FADE: Fusing the assets of decoder and encoder for task-agnostic upsampling. In Proceedings of European Conference on Computer Vision (ECCV), 2022.
[30] Hao Lu, Wenze Liu, Zixuan Ye, Hongtao Fu, Yuliang Liu, and Zhiguo Cao. SAPA: Similarity-aware point affiliation for feature upsampling. In Proceedings of Annual Conference on Neural Information Processing Systems (NeurIPS), 2022.
[31] Yiqun Mei, Yuchen Fan, and Yuqian Zhou. Image super-resolution with non-local sparse attention. In Proceedings of IEEE Conference on Computer Vision Pattern Recognition (CVPR), pages 3517–3526, 2021.
[32] Augustus Odena, Vincent Dumoulin, and Chris Olah. Deconvolution and checkerboard artifacts. Distill, 1(10), 2016.
[33] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. In Proceedings of Annual Conference on Neural Information Processing Systems (NeurIPS), volume 28, 2015.
[34] Wenzhe Shi, Jose Caballero, Ferenc Huszár, Johannes Totz, Andrew P Aitken, Rob Bishop, Daniel Rueckert, and Zehan Wang. Real-time single image and video super-resolution using an efficient sub-pixel convolutional neural network. In Proceedings of IEEE Conference on Computer Vision Pattern Recognition (CVPR), pages 1874–1883, 2016.
[35] Nathan Silberman, Derek Hoiem, Pushmeet Kohli, and Rob Fergus. Indoor segmentation and support inference from RGBD images. In Proceedings of European Conference on Computer Vision (ECCV), pages 746–760, 2012.
[36] Zhi Tian, Chunhua Shen, Hao Chen, and Tong He. FCOS: Fully convolutional one-stage object detection. In Proceedings of IEEE International Conference on Computer Vision (ICCV), pages 9627–9636, 2019.
[37] Jiaqi Wang, Kai Chen, Rui Xu, Ziwei Liu, Chen Change Loy, and Dahua Lin. CARAFE: Content-aware reassembly of features. In Proceedings of IEEE International Conference on Computer Vision (ICCV), pages 3007–3016, 2019.
[38] Diana Wofk, Fangchang Ma, Tien-Ju Yang, Sertac Karaman, and Vivienne Sze. FastDepth: Fast monocular depth estimation on embedded systems. In Proceedings of IEEE International Conference on Robotics and Automation (ICRA), 2019.
[39] Tete Xiao, Yingcheng Liu, Bolei Zhou, Yuning Jiang, and Jian Sun. Unified perceptual parsing for scene understanding. In Proceedings of European Conference on Computer Vision (ECCV), pages 418–434, 2018.
[40] Enze Xie, Wenhai Wang, Zhiding Yu, Anima Anandkumar, Jose M Alvarez, and Ping Luo. SegFormer: Simple and efficient design for semantic segmentation with transformers. In Proceedings of Annual Conference on Neural Information Processing Systems (NeurIPS), 2021.
[41] Xuaner Zhang, Qifeng Chen, Ren Ng, and Vladlen Koltun. Zoom to learn, learn to zoom. In Proceedings of IEEE Conference on Computer Vision Pattern Recognition (CVPR), pages 3762–3770, 2019.
[42] Bolei Zhou, Hang Zhao, Xavier Puig, Sanja Fidler, Adela Barriuso, and Antonio Torralba. Scene parsing through ADE20K dataset. In Proceedings of IEEE Conference on Computer Vision Pattern Recognition (CVPR), 2017.
[43] Xizhou Zhu, Han Hu, Stephen Lin, and Jifeng Dai. Deformable ConvNets v2: More deformable, better results. In Proceedings of IEEE Conference on Computer Vision Pattern Recognition (CVPR), pages 9308–9316, 2019.
[44] Xizhou Zhu, Weijie Su, Lewei Lu, Bin Li, Xiaogang Wang, and Jifeng Dai. Deformable DETR: Deformable transformers for end-to-end object detection. In Proceedings of International Conference on Learning Representations, 2021.
