DySample
{wzliu,hlu}@hust.edu.cn

Abstract

We present DySample, an ultra-lightweight and effective dynamic upsampler. While impressive performance gains have been witnessed from recent kernel-based dynamic upsamplers such as CARAFE, FADE, and SAPA, they introduce much workload, mostly due to the time-consuming dynamic convolution and the additional sub-network used to generate dynamic kernels. Further, the need for high-res feature guidance of FADE and SAPA somehow limits their application scenarios. To address these concerns, we bypass dynamic convolution and formulate upsampling from the perspective of point sampling, which is more resource-efficient and can be easily implemented with the standard built-in function in PyTorch. We first showcase a naive design, and then demonstrate how to strengthen its upsampling behavior step by step towards our new upsampler, DySample. Compared with former kernel-based dynamic upsamplers, DySample requires no customized CUDA package and has much fewer parameters, FLOPs, GPU memory, and latency. Besides the light-weight characteristics, DySample outperforms other upsamplers across five dense prediction tasks, including semantic segmentation, object detection, instance segmentation, panoptic segmentation, and monocular depth estimation. Code is available at https://github.com/tiny-smart/dysample.

* Corresponding author

[Figure 1: scatter plot of mIoU (%) vs. latency (ms) for deconv, pixel shuffle, bilinear, CARAFE, FADE, SAPA, and the DySample series.] Figure 1. Comparison of performance, inference speed, and GFLOPs of different upsamplers. The circle size indicates the GFLOPs cost. The inference time is tested by ×2 upsampling a feature map of size 256 × 120 × 120. The mIoU performance and additional GFLOPs are tested with SegFormer-B1 [40] on the ADE20K data set [42].

1. Introduction

Feature upsampling is a crucial ingredient in dense prediction models for gradually recovering the feature resolution. The most commonly used upsamplers are nearest neighbor (NN) and bilinear interpolation, which follow fixed rules to interpolate upsampled values. To increase flexibility, learnable upsamplers have been introduced in some specific tasks, e.g., deconvolution in instance segmentation [13] and pixel shuffle [34] in image super-resolution [31, 12, 22]. However, they either suffer from checkerboard artifacts [32] or seem unfriendly to high-level tasks. With the popularity of dynamic networks [14], some dynamic upsamplers have shown great potential on several tasks. CARAFE [37] generates content-aware upsampling kernels to upsample the feature by dynamic convolution. The follow-up works FADE [29] and SAPA [30] propose to combine both the high-res guiding feature and the low-res input feature to generate dynamic kernels, such that the upsampling process can be guided by the higher-res structure. These dynamic upsamplers are often of complicated structure, require customized CUDA implementation, and cost much more inference time than bilinear interpolation. Particularly for FADE and SAPA, the higher-res guiding feature introduces even more computational workload and narrows their application
scenarios (higher-res features must be available). Different from the early plain networks [27], multi-scale features are often used in modern architectures; therefore the higher-res feature as an input into upsamplers may not be necessary. For example, in Feature Pyramid Network (FPN) [23], the higher-res feature is added into the low-res feature after upsampling. As a result, we believe that a well-designed single-input dynamic upsampler would be sufficient.

Considering the heavy workload introduced by dynamic convolution, we bypass the kernel-based paradigm and return to the essence of upsampling, i.e., point sampling, to reformulate the upsampling process. Specifically, we hypothesize that the input feature is interpolated to a continuous one with bilinear interpolation, and content-aware sampling points are generated to re-sample the continuous map. From this perspective, we first present a simple design, where point-wise offsets are generated by linear projection and used to re-sample point values with the grid sample function in PyTorch. Then we showcase how to improve it with step-by-step tweaks by i) controlling the initial sampling position, ii) adjusting the moving scope of the offsets, and iii) dividing the upsampling process into several independent groups, to obtain our new upsampler, DySample. At each step, we explain why the tweak is required and conduct experiments to verify the performance gain.

Compared with other dynamic upsamplers, DySample i) does not need high-res guiding features as input, ii) requires no extra CUDA packages other than PyTorch, and, particularly, iii) has much less inference latency, memory footprint, FLOPs, and fewer parameters, as shown in Fig. 1 and Fig. 8. For example, on semantic segmentation with MaskFormer-SwinB [8] as the baseline, DySample brings 46% more performance improvement than CARAFE, but requires only 3% of the parameters and 20% of the FLOPs of CARAFE. Thanks to the highly optimized PyTorch built-in function, the inference time of DySample also approaches that of bilinear interpolation (6.2 ms vs. 1.6 ms when upsampling a 256 × 120 × 120 feature map). Besides these appealing light-weight characteristics, DySample reports better performance than other upsamplers across five dense prediction tasks, including semantic segmentation, object detection, instance segmentation, panoptic segmentation, and monocular depth estimation.

In a nutshell, we think DySample can safely replace NN/bilinear interpolation in existing dense prediction models, in light of not only effectiveness but also efficiency.

2. Related Work

We review dense prediction tasks, feature upsampling operators, and dynamic sampling in deep learning.

Dense Prediction Tasks. Dense prediction refers to a branch of tasks that require point-wise label prediction, such as semantic/instance/panoptic segmentation [2, 39, 40, 8, 7, 13, 11, 16, 19], object detection [33, 4, 24, 36], and monocular depth estimation [38, 18, 3, 21]. Different tasks often exhibit distinct characteristics and difficulties. For example, it is hard to predict both smooth interior regions and sharp edges in semantic segmentation, and also difficult to distinguish different objects in instance-aware tasks. In depth estimation, pixels with the same semantic meaning may have rather different depths, and vice versa. One often has to customize different architectures for different tasks. Though model structures vary, upsampling operators are essential ingredients in dense prediction models. Since a backbone typically outputs multi-scale features, the low-res ones need to be upsampled to higher resolution. Therefore, a light-weight, effective upsampler would benefit many dense prediction models. We will show that our new upsampler design brings a consistent performance boost on SegFormer [40] and MaskFormer [8] for semantic segmentation, on Faster R-CNN [33] for object detection, on Mask R-CNN [13] for instance segmentation, on Panoptic FPN [16] for panoptic segmentation, and on DepthFormer [21] for monocular depth estimation, while introducing negligible workload.

Feature Upsampling. The commonly used feature upsamplers are NN and bilinear interpolation. They apply fixed rules to interpolate the low-res feature, ignoring the semantic meaning in the feature map. Max unpooling has been adopted in semantic segmentation by SegNet [2] to preserve edge information, but the introduction of noise and zero filling destroys the semantic consistency in smooth areas. Similar to convolution, some learnable upsamplers introduce learnable parameters in upsampling. For example, deconvolution upsamples features in a reverse fashion of convolution. Pixel shuffle [34] uses convolution to increase the channel number first and then reshapes the feature map to increase the resolution.

Recently, some dynamic upsampling operators conduct content-aware upsampling. CARAFE [37] uses a sub-network to generate content-aware dynamic convolution kernels to reassemble the input feature. FADE [29] proposes to combine the high-res and low-res features to generate dynamic kernels, for the sake of using the high-res structure. SAPA [30] further introduces the concept of point affiliation and computes similarity-aware kernels between high-res and low-res features. Being model plugins, these dynamic upsamplers increase more complexity than expected, especially FADE and SAPA, which require high-res feature input. Hence, our goal is to contribute a simple, fast, low-cost, and universal upsampler, while preserving the effectiveness of dynamic upsampling.

Dynamic Sampling. Upsampling is about modeling geometric information. A stream of work also models geometric information by dynamically sampling an image or a
feature map, as a substitution of the standard grid sampling. Dai et al. [9] and Zhu et al. [43] propose deformable convolutional networks, where the rectangular window sampling in standard convolution is replaced with shifted point sampling. Deformable DETR [44] follows this manner and samples key points relative to a certain query to conduct deformable attention. Similar practices also take place when images are downsampled to low-res ones for content-aware image resizing, a.k.a. seam carving [1]. E.g., Zhang et al. [41] propose to learn to downsample an image with saliency guidance, in order to preserve more information of the original image, and Jin et al. [15] also use a learnable deformation module to downsample images.

Different from recent kernel-based upsamplers, we interpret the essence of upsampling as point re-sampling. Therefore, in feature upsampling, we follow the same spirit as the work above and use simple designs to achieve a strong and efficient dynamic upsampler.

[Figure 2: (a) a sampling point generator produces a sampling set with which grid sample maps X of size C × H × W to X′ of size C × sH × sW; (b) the sampling point generator, with either a static or a dynamic scope factor.] Figure 2. Sampling based dynamic upsampling and module designs in DySample. The input feature, upsampled feature, generated offset, and original grid are denoted by X, X′, O, and G, respectively. (a) The sampling set is generated by the sampling point generator, with which the input feature is re-sampled by the grid sample function. In the generator (b), the sampling set is the sum of the generated offset and the original grid position. The upper box shows the version with the 'static scope factor', where the offset is generated with a linear layer. The bottom one describes the version with the 'dynamic scope factor', where the scope factor is first generated and then used to modulate the offset. 'σ' denotes the sigmoid function.

3. Learning to Sample and Upsample

In this section we elaborate the designs of DySample and its variants. We first present a naive implementation and then show how to improve it step by step.

3.1. Preliminary

We return to the essence of upsampling, i.e., point sampling, in the light of modeling geometric information. With the built-in function in PyTorch, we first provide a naive implementation to demonstrate the feasibility of sampling-based dynamic upsampling (Fig. 2(a)).

Grid Sampling. Given a feature map X of size C × H1 × W1 and a sampling set S of size 2 × H2 × W2, where the 2 in the first dimension denotes the x and y coordinates, the grid sample function uses the positions in S to re-sample the hypothetical bilinear-interpolated X into X′ of size C × H2 × W2. This process is defined by

X′ = grid_sample(X, S).  (1)

Naive Implementation. Given an upsampling scale factor of s and a feature map X of size C × H × W, a linear layer, whose input and output channel numbers are C and 2s², is used to generate the offset O of size 2s² × H × W, which is then reshaped to 2 × sH × sW by pixel shuffling [34]. Then the sampling set S is the sum of the offset O and the original sampling grid G, i.e.,

O = linear(X),  (2)
S = G + O,  (3)

where the reshaping operation is omitted. Finally, the upsampled feature map X′ of size C × sH × sW can be generated with the sampling set by grid sample as in Eq. (1).

This preliminary design obtains 37.9 AP with Faster R-CNN [33] on object detection [25] and 41.9 mIoU with SegFormer-B1 [40] on semantic segmentation [42] (cf. CARAFE: 38.6 AP and 42.8 mIoU). Next we build DySample upon this naive implementation.

3.2. DySample: Upsampling by Dynamic Sampling

By studying the naive implementation, we observe that the shared initial offset position among the s² upsampled points neglects the position relation, and that the unconstrained walking scope of the offsets can cause disordered point sampling.
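The naive implementation of Eqs. (1)-(3) can be sketched in a few lines of PyTorch. This is a minimal sketch, not the authors' released code: the 1×1 convolution standing in for the "linear layer" and the normalization of pixel-unit offsets into grid_sample's [-1, 1] coordinate range are our own assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class NaiveSamplingUpsampler(nn.Module):
    """Sketch of the naive design: offsets from a point-wise linear
    projection, pixel-shuffled to high resolution, added to the base
    grid, then used to re-sample the input with F.grid_sample."""

    def __init__(self, channels, scale=2):
        super().__init__()
        self.scale = scale
        # Eq. (2): linear layer producing 2*s^2 offset channels
        # (a 1x1 conv is the point-wise equivalent of a linear layer)
        self.offset = nn.Conv2d(channels, 2 * scale ** 2, 1)

    def forward(self, x):
        b, _, h, w = x.shape
        s = self.scale
        # offset 2s^2 x H x W -> 2 x sH x sW via pixel shuffle
        o = F.pixel_shuffle(self.offset(x), s)  # (B, 2, sH, sW)
        # original sampling grid G, normalized to [-1, 1] (x first)
        ys = torch.linspace(-1, 1, s * h, device=x.device)
        xs = torch.linspace(-1, 1, s * w, device=x.device)
        gy, gx = torch.meshgrid(ys, xs, indexing="ij")
        grid = torch.stack((gx, gy))  # (2, sH, sW)
        # Eq. (3): S = G + O, with offsets rescaled from pixel units
        # to normalized units (an approximate convention we assume)
        coords = grid.unsqueeze(0) + o * 2 / torch.tensor(
            [w, h], device=x.device).view(1, 2, 1, 1)
        # Eq. (1): X' = grid_sample(X, S)
        return F.grid_sample(x, coords.permute(0, 2, 3, 1),
                             mode="bilinear", align_corners=True,
                             padding_mode="border")

up = NaiveSamplingUpsampler(16, scale=2)
y = up(torch.randn(1, 16, 8, 8))
print(y.shape)  # torch.Size([1, 16, 16, 16])
```

With all offsets at zero this reduces to plain bilinear upsampling, which is why the design can only improve on the bilinear baseline as the offsets learn content-aware shifts.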
Sampling Initialization    mIoU   AP
Nearest Initialization     41.9   37.9
Bilinear Initialization    42.1   38.1

[Figure 3: initial sampling positions and offset scopes under (a) nearest initialization, (b) bilinear initialization, and (c) bilinear initialization with a constrained offset scope.] Figure 3. Initial sampling positions and offset scopes. The points and the colored masks represent the initial sampling positions and the offset scopes, respectively. Considering sampling four points (s = 2), (a) in the case of nearest initialization, the four offsets share the same initial position but ignore the position relation; in bilinear initialization (b), we separate the initial positions such that they distribute evenly. Without offset modulation (b), the offset scopes would typically overlap, so in (c) we locally constrain the offset scope to reduce the overlap.

Factor   mIoU   AP
0.1      42.2   38.1
0.25     42.4   38.3
0.5      42.2   38.1
1        42.1   38.1

Table 2. Ablation study on the effect of the static scope factor.

Groups   Dynamic   mIoU   AP
1                  42.4   38.3
1        ✓         42.6   38.4
4                  43.2   38.6
4        ✓         43.3   38.7

Table 3. Ablation study on the effect of the dynamic scope factor.
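Based on Figure 2(b) and the scope-factor ablations above, the two sampling point generator variants can be sketched as follows. This is our own reading of the figure, not the released implementation: the class and argument names are hypothetical, 1×1 convolutions stand in for the linear layers, and the grouping shown in the figure (the g in 2g channels) is omitted for clarity.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SamplingPointGenerator(nn.Module):
    """Sketch of the offset branches in Fig. 2(b).

    Static scope factor:  O = 0.25 * linear(X)
    Dynamic scope factor: O = 0.5 * sigmoid(linear(X)) * linear(X)
    """

    def __init__(self, channels, scale=2, dynamic=False):
        super().__init__()
        self.scale = scale
        self.dynamic = dynamic
        self.offset = nn.Conv2d(channels, 2 * scale ** 2, 1)
        if dynamic:
            # second linear branch that generates the scope factor
            self.scope = nn.Conv2d(channels, 2 * scale ** 2, 1)

    def forward(self, x):
        if self.dynamic:
            # the generated scope factor modulates the raw offset
            o = self.offset(x) * torch.sigmoid(self.scope(x)) * 0.5
        else:
            # fixed scope factor of 0.25, the best value in Table 2
            o = self.offset(x) * 0.25
        # pixel shuffle: (B, 2s^2, H, W) -> (B, 2, sH, sW)
        return F.pixel_shuffle(o, self.scale)

gen = SamplingPointGenerator(16, scale=2, dynamic=True)
offsets = gen(torch.randn(1, 16, 10, 10))
print(offsets.shape)  # torch.Size([1, 2, 20, 20])
```

The scope factor caps how far each sampling point can wander from its initial position, which is exactly the "locally constrained offset scope" of Figure 3(c).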
Figure 7. Visualization of the upsampling process in DySample. A part of the boundary in the red box is highlighted for a close view. We generate content-aware offsets to construct new sampling points to re-sample the input feature map with bilinear interpolation. The new sampling positions are indicated by the arrowheads. The yellow boxed point in the low-res feature is selected to illustrate the bilinear interpolation process.
[Figure 8: bar charts comparing upsamplers on mIoU, inference latency (ms), additional memory (M), and additional training time (hours).]

…also be seen as seeking a semantically similar region for each point. However, DySample does not need the guidance map and is thus more efficient and easier to use.

Relation to Deformable Attention. Deformable attention [44] mainly enhances features; it samples many points at each position and aggregates them to form a new point. But DySample is tailored for upsampling; it samples a single point for each upsampled position to divide one point into s² upsampled points. DySample reveals that sampling a single point for each upsampled position is enough, as long as the upsampled s² points can be dynamically divided.
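A quick sanity check of this single-point sampling view: with zero offsets, re-sampling the "hypothetical bilinear-interpolated" feature at evenly spaced grid positions reduces to plain bilinear upsampling, which we can verify against F.interpolate. The grid construction below follows our own normalization convention (align_corners=True on both sides).

```python
import torch
import torch.nn.functional as F

x = torch.randn(1, 3, 8, 8)
s, h, w = 2, 8, 8

# evenly spaced grid in normalized [-1, 1] coordinates, x first
ys = torch.linspace(-1, 1, s * h)
xs = torch.linspace(-1, 1, s * w)
gy, gx = torch.meshgrid(ys, xs, indexing="ij")
grid = torch.stack((gx, gy), dim=-1).unsqueeze(0)  # (1, sH, sW, 2)

# zero-offset sampling vs. ordinary bilinear interpolation
out = F.grid_sample(x, grid, mode="bilinear", align_corners=True)
ref = F.interpolate(x, scale_factor=s, mode="bilinear",
                    align_corners=True)
print(torch.allclose(out, ref, atol=1e-5))  # True
```

Everything DySample adds on top of bilinear interpolation therefore lives in the learned offsets that dynamically divide each low-res point into its s² upsampled points.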
Figure 9. Qualitative visualizations. From top to bottom: semantic segmentation, object detection, instance segmentation,
panoptic segmentation, and monocular depth estimation.
Semantic segmentation infers per-pixel class labels. Upsamplers are often applied several times to obtain the high-res output in typical models, so the precise per-pixel prediction is largely dependent on the upsampling quality.

Experimental Protocols. We use the ADE20K [42] data set. Besides the commonly used mIoU metric, we also report the bIoU [6] metric to evaluate the boundary quality. We first use a light-weight baseline SegFormer-B1 [40], where 3 + 2 + 1 = 6 upsampling stages are involved, and then test DySample on a stronger baseline MaskFormer [8], with Swin-B [26] and Swin-L as the backbone, where 3 upsampling stages are involved in the FPN. We use the official codebase provided by the authors and follow all the training settings, only modifying the upsampling stages.

Upsampler          GFLOPs   Params    mIoU    bIoU
Bilinear           15.9     13.7M     41.68   27.80
Deconv             +34.4    +3.5M     40.71   25.94
PixelShuffle [34]  +34.4    +14.2M    41.50   26.58
CARAFE [37]        +1.5     +0.4M     42.82   29.84
IndexNet [28]      +30.7    +12.6M    41.50   28.27
A2U [10]           +0.4     +0.1M     41.45   27.31
FADE [29]          +2.7     +0.3M     43.06   31.68
SAPA-B [30]        +1.0     +0.1M     43.20   30.96
DySample-S         +0.2     +6.1K     43.23   29.53
DySample-S+        +0.3     +12.3K    43.58   29.93
DySample           +0.3     +49.2K    43.21   29.12
DySample+          +0.4     +0.1M     43.28   29.23

Table 4. Semantic segmentation results with SegFormer-B1 on ADE20K. Best performance is in boldface and second best is underlined.
Table 9. Monocular depth estimation results with DepthFormer (Swin-T) on NYU Depth V2. Best performance is in boldface
and second best is underlined.