Making Convolutional Networks Shift-Invariant Again
Richard Zhang
[Figure 1 plots: probability of the correct class (y-axis, 0.0–1.0) vs. diagonal shift (x-axis, 0–30 pixels), comparing the MaxPool baseline against our anti-aliased MaxBlurPool network; top panel: AlexNet on ImageNet.]
Figure 1. Classification stability for selected images. Predicted probability of the correct class changes when shifting the image. The
baseline (black) exhibits chaotic behavior, which is stabilized by our method (blue). We find this behavior across networks and datasets.
Here, we show selected examples using AlexNet on ImageNet (top) and VGG on CIFAR10 (bottom). Code and anti-aliased versions of
popular networks are available at https://richzhang.github.io/antialiased-cnns/.
[Figure 3 schematic: baseline MaxPool, drawn as max operations followed by subsampling, vs. anti-aliased MaxBlurPool, with an anti-aliasing convolution (blur) inserted between the max and the subsampling.]
Figure 3. Anti-aliased max-pooling. (Top) Pooling does not preserve shift-equivariance. It is functionally equivalent to densely-evaluated
pooling, followed by subsampling. The latter ignores the Nyquist sampling theorem and loses shift-equivariance. (Bottom) We low-pass
filter between the operations. This keeps the first operation, while anti-aliasing the appropriate signal. Anti-aliasing and subsampling can
be combined into one operation, which we refer to as BlurPool.
comes at great computation and memory cost. Our work investigates improving shift-equivariance with minimal additional computation, by blurring before subsampling.

Early networks employed average pooling (LeCun et al., 1990), which is equivalent to blurred-downsampling with a box filter. However, subsequent work (Scherer et al., 2010) found max-pooling to be more effective, and it has consequently become the predominant method for downsampling. While previous work (Scherer et al., 2010; Hénaff & Simoncelli, 2016; Azulay & Weiss, 2018) acknowledges the drawbacks of max-pooling and the benefits of blurred-downsampling, the two are viewed as separate, discrete choices, preventing their combination. Interestingly, Lee et al. (2016) do not explore low-pass filters, but do propose to softly gate between max and average pooling. However, this does not fully utilize the anti-aliasing capability of average pooling. Mairal et al. (2014) derive a network architecture motivated by translation invariance, named Convolutional Kernel Networks. While theoretically interesting (Bietti & Mairal, 2017), CKNs perform at lower accuracy than contemporaries, resulting in limited usage. Interestingly, a byproduct of the derivation is a standard Gaussian filter; however, no guidance is provided on its proper integration with existing network components. Instead, we demonstrate practical integration with any strided layer, and empirically show performance increases on a challenging benchmark – ImageNet classification – on widely-used networks.

3. Methods

3.1. Preliminaries

Deep convolutional networks as feature extractors Let an image with resolution H × W be represented by X ∈ R^{H×W×3}. An L-layer CNN can be expressed as a feature extractor F_l(X) ∈ R^{H_l×W_l×C_l}, with layer l ∈ {0, 1, ..., L}, spatial resolution H_l × W_l, and C_l channels. Each feature map can also be upsampled to the original resolution, F̃_l(X) ∈ R^{H×W×C_l}.
[Figure 4 plots: 1-D signal value (y-axis) vs. spatial position (x-axis, 0–8), left and right panels.]
Figure 4. Illustrative 1-D example of sensitivity to shifts. We illustrate how downsampling affects shift-equivariance with a toy example. (Left) The input signal is shown as a light gray line. The max-pooled (k = 2, s = 2) signal is shown as blue squares. Simply shifting the input and then max-pooling provides a completely different answer (red diamonds). (Right) The blue and red points are subsampled from a densely max-pooled (k = 2, s = 1) intermediate signal (thick black line). We instead low-pass filter this intermediate signal and then subsample from it, shown with green and magenta triangles, better preserving shift-equivariance.
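For concreteness, the numbers in this toy example can be reproduced in a few lines of NumPy. This is a minimal sketch (not the paper's released code), assuming circular boundary handling as in Eq. 3 below and the Tri-3 filter [1, 2, 1]/4:

```python
import numpy as np

x = np.array([0, 0, 1, 1, 0, 0, 1, 1], dtype=float)

def dense_max(sig):
    # Densely evaluated (k=2, stride-1) max, with circular boundary handling.
    return np.maximum(sig, np.roll(sig, -1))

def blur_tri3(sig):
    # Tri-3 low-pass filter [1, 2, 1] / 4, applied circularly.
    return (np.roll(sig, 1) + 2 * sig + np.roll(sig, -1)) / 4.0

m = dense_max(x)          # intermediate signal: [0, 1, 1, 1, 0, 1, 1, 1]
print(m[0::2], m[1::2])   # max-pool of original vs. shifted input:
                          # [0. 1. 0. 1.] vs. [1. 1. 1. 1.] -- equivariance lost

b = blur_tri3(m)          # anti-alias the intermediate signal first
print(b[0::2], b[1::2])   # [0.5 1. 0.5 1.] vs. [0.75 0.75 0.75 0.75] -- much closer
```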
Shift-equivariance and invariance A function F̃ is shift-equivariant if shifting the input equally shifts the output, meaning shifting and feature extraction are commutable:

    Shift_{Δh,Δw}(F̃(X)) = F̃(Shift_{Δh,Δw}(X))   ∀ (Δh, Δw)   (1)

A representation is shift-invariant if shifting the input results in an identical representation:

    F̃(X) = F̃(Shift_{Δh,Δw}(X))   ∀ (Δh, Δw)   (2)

Periodic-N shift-equivariance/invariance In some cases, the definitions in Eqns. 1, 2 may hold only when shifts (Δh, Δw) are integer multiples of N. We refer to such scenarios as periodic shift-equivariance/invariance. For example, periodic-2 shift-invariance means that even-pixel shifts produce an identical output, but odd-pixel shifts may not.

Circular convolution and shifting Edge artifacts are an important consideration. When shifting, information is lost on one side and has to be filled in on the other.

In our CIFAR10 classification experiments, we use circular shifting and convolution. When the convolutional kernel hits the edge, it "rolls" to the other side. Similarly, when shifting, pixels are rolled off one edge to the other:

    [Shift_{Δh,Δw}(X)]_{h,w,c} = X_{(h−Δh)%H, (w−Δw)%W, c},   (3)

where % is the modulus function.

This modification minorly affects performance and could potentially be mitigated by additional padding, at the expense of memory and computation. But importantly, it affords us a clean testbed: any loss in shift-equivariance is purely due to characteristics of the feature extractor. An alternative is to take a shifted crop from a larger image. We use this approach for ImageNet experiments, as it more closely matches standard train and test procedures.

3.2. Anti-aliasing to improve shift-equivariance

Conventional methods for reducing spatial resolution – max-pooling, average pooling, and strided convolution – all break shift-equivariance. We propose improvements, shown in Figure 2. We start by analyzing max-pooling.

MaxPool→MaxBlurPool Consider the example signal [0, 0, 1, 1, 0, 0, 1, 1] in Figure 4 (left). Max-pooling (kernel k=2, stride s=2) results in [0, 1, 0, 1]. Simply shifting the input results in a dramatically different answer, [1, 1, 1, 1]. Shift-equivariance is lost. Both results are subsampled from an intermediate signal – the input densely max-pooled (stride 1) – which we simply refer to as "max". As illustrated in Figure 3 (top), we can write max-pooling as a composition of two functions: MaxPool_{k,s} = Subsample_s ◦ Max_k.

The Max operation preserves shift-equivariance, as it is densely evaluated in a sliding-window fashion, but the subsequent subsampling does not. We simply propose to add an anti-aliasing filter with an m × m kernel, denoted Blur_m, as shown in Figure 4 (right). In implementation, blurring and subsampling are combined, as is commonplace in image processing. We call this function BlurPool_{m,s}:

    MaxPool_{k,s} → Subsample_s ◦ Blur_m ◦ Max_k = BlurPool_{m,s} ◦ Max_k   (4)

Sampling after low-pass filtering gives [.5, 1, .5, 1] and [.75, .75, .75, .75]. These are closer to each other and are better representations of the intermediate signal.

StridedConv→ConvBlurPool Strided convolutions suffer from the same issue, and the same method applies:

    Relu ◦ Conv_{k,s} → BlurPool_{m,s} ◦ Relu ◦ Conv_{k,1}   (5)
[Figure 7 layout: input label map, generated window crops, and difference from the unshifted generation, for the baseline (top) and our method (bottom), at increasing horizontal shifts Δw = 0 to 7.]
Figure 7. Selected example of generation instability. The left two images are facades generated from label maps. For the baseline method (top), input shifts cause different window patterns to emerge, due to naive downsampling and upsampling. Our method (bottom) stabilizes the output, generating the same window pattern regardless of the input shift.
Table 3. Generation stability. PSNR (higher is better) between generated facades, given two horizontally shifted inputs. More aggressive filtering in the down- and upsampling layers leads to a more shift-equivariant generator. Total variation (TV) of generated images (closer to the ground-truth value of 7.80 is better). Increased filtering decreases the frequency content of generated images.

                 Baseline   Rect-2   Tri-3   Bin-4   Bin-5
Stability [dB]     29.0      30.1     30.8    31.2    34.4
TV Norm ×100       7.48      7.07     6.25    5.84    6.28

two bars, to a single bar, and eventually oscillates back to two bars. A shift-equivariant network would provide the same resulting facade, no matter the shift.

Applying anti-aliasing We augment the strided-convolution downsampling by blurring. The U-Net also uses upsampling layers, without any smoothing. Similar to the subsampling case, this leads to aliasing, in the form of grid artifacts (Odena et al., 2016). We mirror the downsampling by applying the same filter after upsampling. Note that applying the Rect-2 and Tri-3 filters while upsampling corresponds to "nearest" and "bilinear" upsampling, respectively. By using the Tri-3 filter, the same window pattern is generated, regardless of input shift, as seen in Figure 7 (bottom).

We measure similarity using the peak signal-to-noise ratio between generated facades with shifted and non-shifted inputs: E_{X,Δw} PSNR(Shift_{0,Δw}(F(X)), F(Shift_{0,Δw}(X))). In Table 3, we show that the smoother the filter, the more shift-equivariant the output.
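A minimal sketch of this stability measurement (our own illustration; the `generator` callable, the image format, and the use of circular shifts are assumptions for simplicity):

```python
import numpy as np

def psnr(a, b, peak=1.0):
    # Peak signal-to-noise ratio between two images with values in [0, peak].
    return 10 * np.log10(peak ** 2 / np.mean((a - b) ** 2))

def generation_stability(generator, label_maps, shifts=range(1, 9)):
    # E_{X,dw} PSNR( Shift_{0,dw}(F(X)), F(Shift_{0,dw}(X)) )
    scores = []
    for x in label_maps:
        fx = generator(x)                          # F(X)
        for dw in shifts:
            a = np.roll(fx, dw, axis=1)            # Shift_{0,dw}(F(X))
            b = generator(np.roll(x, dw, axis=1))  # F(Shift_{0,dw}(X))
            scores.append(psnr(a, b))
    return float(np.mean(scores))
```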
A concern with adding low-pass filtering is the loss of the ability to generate high-frequency content, which is critical for generating high-quality imagery. Quantitatively, in Table 3, we compute the total variation (TV) norm of the generated images. Qualitatively, we observe that generation quality typically holds with the Tri-3 filter and subsequently degrades. In the supplemental material, we show examples of applying increasingly aggressive filters. We observe a boost in shift-equivariance while maintaining generation quality, and then a tradeoff between the two factors.

These experiments demonstrate that the technique can make a drastically different architecture (U-Net) for a different task (generating pixels) more shift-equivariant.

5. Conclusions and Discussion

Shift-equivariance is lost in modern deep networks, as commonly used downsampling layers ignore Nyquist sampling and alias. We integrate low-pass filtering to anti-alias, a common signal processing technique. The simple modification achieves higher consistency, across architectures and downsampling techniques. In addition, in classification, we observe surprising boosts in accuracy and robustness.

Anti-aliasing for shift-equivariance is well-understood. A future direction is to better understand how it affects and improves generalization, as we observed empirically. Other directions include the potential benefit to downstream applications, such as nearest-neighbor retrieval, improving temporal consistency in video models, robustness to adversarial examples, and high-level vision tasks such as detection. Adding the inductive bias of shift-invariance serves as "built-in" shift-based data augmentation. This is potentially applicable to online learning scenarios, where the data distribution is changing.
Isola, P., Zhu, J.-Y., Zhou, T., and Efros, A. A. Image-to-image translation with conditional adversarial networks. In CVPR, 2017.

Kanazawa, A., Sharma, A., and Jacobs, D. Locally scale-invariant convolutional neural networks. In NIPS Workshop, 2014.

Karras, T., Laine, S., and Aila, T. A style-based generator architecture for generative adversarial networks. In ICLR, 2019.

Kiros, R., Zhu, Y., Salakhutdinov, R. R., Zemel, R., Urtasun, R., Torralba, A., and Fidler, S. Skip-thought vectors. In NIPS, 2015.

Krizhevsky, A. and Hinton, G. Learning multiple layers of features from tiny images. Technical report, Citeseer, 2009.

Krizhevsky, A., Sutskever, I., and Hinton, G. E. ImageNet classification with deep convolutional neural networks. In NIPS, 2012.

LeCun, Y., Boser, B. E., Denker, J. S., Henderson, D., Howard, R. E., Hubbard, W. E., and Jackel, L. D. Handwritten digit recognition with a back-propagation network. In NIPS, 1990.

LeCun, Y., Bottou, L., Bengio, Y., and Haffner, P. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.

Lee, C.-Y., Gallagher, P. W., and Tu, Z. Generalizing pooling functions in convolutional neural networks: Mixed, gated, and tree. In AISTATS, 2016.

Lenc, K. and Vedaldi, A. Understanding image representations by measuring their equivariance and equivalence. In CVPR, 2015.

Leung, T. and Malik, J. Representing and recognizing the visual appearance of materials using three-dimensional textons. IJCV, 2001.

Lowe, D. G. Object recognition from local scale-invariant features. In ICCV, 1999.

Mahendran, A. and Vedaldi, A. Understanding deep image representations by inverting them. In CVPR, 2015.

Mairal, J., Koniusz, P., Harchaoui, Z., and Schmid, C. Convolutional kernel networks. In NIPS, 2014.

Mordvintsev, A., Olah, C., and Tyka, M. DeepDream – a code example for visualizing neural networks. Google Research, 2:5, 2015.

Nguyen, A., Clune, J., Bengio, Y., Dosovitskiy, A., and Yosinski, J. Plug & play generative networks: Conditional iterative generation of images in latent space. In CVPR, 2017.

Nyquist, H. Certain topics in telegraph transmission theory. Transactions of the American Institute of Electrical Engineers, pp. 617–644, 1928.

Odena, A., Dumoulin, V., and Olah, C. Deconvolution and checkerboard artifacts. Distill, 2016. doi: 10.23915/distill.00003. URL http://distill.pub/2016/deconv-checkerboard.

Oppenheim, A. V., Schafer, R. W., and Buck, J. R. Discrete-Time Signal Processing. Pearson, 2nd edition, 1999.

Paszke, A., Gross, S., Chintala, S., Chanan, G., Yang, E., DeVito, Z., Lin, Z., Desmaison, A., Antiga, L., and Lerer, A. Automatic differentiation in PyTorch. 2017.

Ronneberger, O., Fischer, P., and Brox, T. U-Net: Convolutional networks for biomedical image segmentation. In MICCAI, 2015.

Ruderman, A., Rabinowitz, N. C., Morcos, A. S., and Zoran, D. Pooling is neither necessary nor sufficient for appropriate deformation stability in CNNs. arXiv, 2018.

Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., Huang, Z., Karpathy, A., Khosla, A., Bernstein, M., et al. ImageNet large scale visual recognition challenge. IJCV, 115(3):211–252, 2015.

Sandler, M., Howard, A., Zhu, M., Zhmoginov, A., and Chen, L.-C. MobileNetV2: Inverted residuals and linear bottlenecks. In CVPR, 2018.

Scherer, D., Müller, A., and Behnke, S. Evaluation of pooling operations in convolutional architectures for object recognition. In ICANN, 2010.

Sifre, L. and Mallat, S. Rotation, scaling and deformation invariant scattering for texture discrimination. In CVPR, 2013.

Simoncelli, E. P., Freeman, W. T., Adelson, E. H., and Heeger, D. J. Shiftable multiscale transforms. IEEE Transactions on Information Theory, 38(2):587–607, 1992.

Simonyan, K. and Zisserman, A. Very deep convolutional networks for large-scale image recognition. In ICLR, 2015.

Su, J., Vargas, D. V., and Sakurai, K. One pixel attack for fooling deep neural networks. IEEE Transactions on Evolutionary Computation, 2019.
Tyleček, R. and Šára, R. Spatial pattern templates for recognition of objects with regular structure. In German Conference on Pattern Recognition, pp. 364–374. Springer, 2013.

Vedaldi, A. and Fulkerson, B. VLFeat: An open and portable library of computer vision algorithms. http://www.vlfeat.org/, 2008.

Worrall, D. E., Garbin, S. J., Turmukhambetov, D., and Brostow, G. J. Harmonic networks: Deep translation and rotation equivariance. In CVPR, 2017.

Xiao, C., Zhu, J.-Y., Li, B., He, W., Liu, M., and Song, D. Spatially transformed adversarial examples. In ICLR, 2018.

Yu, F. and Koltun, V. Multi-scale context aggregation by dilated convolutions. In ICLR, 2016.

Yu, F., Koltun, V., and Funkhouser, T. Dilated residual networks. In CVPR, 2017.

Zeiler, M. D. and Fergus, R. Visualizing and understanding convolutional networks. In ECCV, 2014.

Zhang, R., Isola, P., Efros, A. A., Shechtman, E., and Wang, O. The unreasonable effectiveness of deep features as a perceptual metric. In CVPR, 2018.

Zhou, B., Khosla, A., Lapedriza, A., Oliva, A., and Torralba, A. Object detectors emerge in deep scene cnns. In ICLR, 2015.

Classification (CIFAR)    VGG13-bn   DenseNet-40-12
StridedConv                   –             –
MaxPool                       5             –
AvgPool                       –             2

Table 4. Testbeds (CIFAR10 architectures). We use slightly different architectures for VGG (Simonyan & Zisserman, 2015) and DenseNet (Huang et al., 2017) than the ImageNet counterparts.

network, with the MaxBlurPool operator, increases consistency. The larger the filter, the more consistent the output classifications. This result agrees with our expectation and theory – improving shift-equivariance throughout the network should result in more consistent classifications across shifts, even when such shifts are not seen at training.

In this regime, accuracy clearly increases with consistency, as seen with the blue markers in Figure 8. Filtering does not destroy the signal or make learning harder. On the contrary, shift-equivariance serves as "built-in" augmentation, indicating more efficient data usage.

Training with data augmentation In principle, networks can learn to be shift-invariant from data. Is data augmentation all that is needed to achieve shift-invariance? By applying the Rect-2 filter, a large increase in consistency, 96.6 → 97.6, can be had at a small decrease in accuracy, 93.8 → 93.7. Even when seeing shifts at training, antialiasing increases consistency. From there, stronger filters can increase consistency, at the expense of accuracy.
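For reference, one plausible implementation of the consistency measure used here – how often two shifted copies of an image receive the same predicted class. This is a sketch under our own assumptions (circular shifts as in Eq. 3, a hypothetical `predict` function returning a class index); the paper's exact protocol may differ:

```python
import numpy as np

def consistency(predict, images, n_pairs=5, max_shift=16, seed=0):
    # Fraction of random shift pairs on which the predicted class agrees.
    rng = np.random.default_rng(seed)
    agree = []
    for x in images:
        for _ in range(n_pairs):
            (h1, w1), (h2, w2) = rng.integers(0, max_shift, size=(2, 2))
            c1 = predict(np.roll(np.roll(x, h1, axis=0), w1, axis=1))
            c2 = predict(np.roll(np.roll(x, h2, axis=0), w2, axis=1))
            agree.append(c1 == c2)
    return float(np.mean(agree))
```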
Table 5. CIFAR classification accuracy and consistency. Results across blurring filters and training scenarios (without and with data augmentation). We evaluate classification accuracy without shifts (Accuracy – None) and on random shifts (Accuracy – Random), as well as classification consistency.
[Figure 8 plots: consistency (y-axis, 0.88–0.94) vs. accuracy (x-axis, 0.91–0.95), two panels.]
Figure 8. CIFAR10 classification consistency vs. accuracy, for VGG (left) and DenseNet (right) networks. Up (more consistent) and to the right (more accurate) is better. The number of sides corresponds to the number of filter taps used (e.g., diamond for a 4-tap filter); colors correspond to networks trained without (blue) and with (pink) shift-based data augmentation, using various filters. We show accuracy for no shift when training without shifts, and for a random shift when training with shifts.
[Figure 9 plots: classification accuracy (y-axis, 0.84–0.90) vs. diagonal shift (x-axis, in pixels), for Binomial-5, Binomial-6, and Binomial-7 filters.]
Figure 9. Average accuracy as a function of shift. (Left) We show classification accuracy across the test set as a function of shift, given different filters. (Right) We plot accuracy vs. diagonal shift in the input image, across different filters. Note that accuracy degrades quickly with the baseline, but as increased filtering is added, classifications become consistent across spatial positions.
passes in an ensembling approach (1024× computation to evaluate every shift), or evaluating each layer more densely by exchanging striding for dilation (4×, 16×, 64×, 256× computation for conv2–conv5, respectively). Given com-

[Plot residue: average filter total variation; Figure 11 histograms of per-image variation in probability of correct classification (log-scale x-axis, 10⁻⁵ to 10⁰), without (left) and with (right) data augmentation at training.]
Figure 11. Distribution of per-image classification variation. We show the distribution of classification variation in the test set, (left)
without and (right) with data augmentation at training. Lower variation means more consistent classifications (and increased shift-
invariance). Training with data augmentation drastically reduces variation in classification. Adding filtering further decreases variation.
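One plausible way to compute the per-image statistic plotted above (a sketch; we take the "variation" to be the standard deviation of the correct-class probability over diagonal shifts, as in Figure 1 – the paper's exact statistic may differ – and `prob_correct` is a hypothetical helper returning the probability assigned to the true class):

```python
import numpy as np

def per_image_variation(prob_correct, image, max_shift=32):
    # Variation of p(correct class) as the image is diagonally shifted.
    probs = [prob_correct(np.roll(np.roll(image, d, axis=0), d, axis=1))
             for d in range(max_shift)]
    return float(np.std(probs))
```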
[Figure 12 plots: accuracy (y-axis, 0.6–1.0) vs. maximum adversarial shift, training without (left) and with (right) data augmentation.]
ation in both scenarios. More aggressive filtering further decreases variation.

Robustness to shift-based adversary In the main paper, we show that anti-aliasing the networks increases classification consistency, while maintaining accuracy. A logical consequence is increased accuracy in the presence of a shift-based adversary. We empirically confirm this in Figure 12 for VGG13 on CIFAR10. We compute classification accuracy as a function of maximum adversarial shift. A max shift of 2 means the adversary can choose any of the 25 positions within a 5 × 5 window. For the classifier to "win", it must classify all of them correctly. A max shift of 0 means that there is no adversary. Conversely, a max shift of 16 means the image must be correctly classified at all 32 × 32 = 1024 positions. A sketch of this evaluation protocol is given after the observations below.

Our primary observations are as follows:

• As seen in Figure 12 (left), the baseline network (gray) is very sensitive to the adversary.
• Adding larger Binomial filters (from red to purple) increases robustness to the adversary. In fact, the Bin-7 filter (purple) without augmentation outperforms the baseline (black) with augmentation.
• As seen in Figure 12 (right), adding larger Binomial filters also increases adversarial robustness, even when training with augmentation.
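A minimal sketch of this evaluation (our own illustration; `predict` is a hypothetical classifier returning a class index, and shifts are circular, matching the CIFAR10 protocol):

```python
import numpy as np

def adversarial_shift_accuracy(predict, images, labels, max_shift=2):
    # An image counts as correct only if it is classified correctly under
    # every shift the adversary may pick within the max-shift window.
    wins = []
    for x, y in zip(images, labels):
        ok = all(predict(np.roll(np.roll(x, dh, axis=0), dw, axis=1)) == y
                 for dh in range(-max_shift, max_shift + 1)
                 for dw in range(-max_shift, max_shift + 1))
        wins.append(ok)
    return float(np.mean(wins))
```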
These results corroborate the findings in the main paper, and demonstrate a use case: increased robustness to a shift-based adversarial attack.

A.2. Alternatives to MaxBlurPool

In the paper, we follow signal processing first principles to arrive at our solution of MaxBlurPool, with a fixed blurring kernel. Here, we explore possible alternatives – swapping the max and blur operations, combining max and blur in parallel through soft-gating, and learning the blur filter.

Swapping max and blur We blur after max, immediately before subsampling, which has solid theoretical backing in sampling theory. What happens when the operations are swapped? The signal before the max operator is undoubtedly related to the signal after. Thus, blurring before max provides "second-hand" anti-aliasing and still increases shift-invariance over the baseline. However, switching the order is worse than max and blurring in the correct, proposed order. For example, for Bin-7, accuracy (93.2 → 92.6) and consistency (98.8 → 98.6) both decrease. We consistently observe this across filters.
Softly gating between max-pool and average-pool Lee et al. (2016) investigate combining MaxPool and AvgPool in parallel, with a soft-gating mechanism, called "Mixed" Max-AvgPool. We instead combine them in series. We conduct additional experiments here. On CIFAR (VGG w/ aug, see Tab. 5), MixedPool can offer improvements over the MaxPool baseline (96.6 → 97.2 consistency). However, by softly weighting AvgPool, some anti-aliasing capability is left on the table. MaxBlurPool provides higher invariance (97.6). All have similar accuracy – 93.8, 93.7, and 93.7 for baseline MaxPool, MixedPool, and our MaxBlurPool, respectively. We use our Rect-2 variant here for clean comparison.

Importantly, our paper proposes a methodology, not a pooling layer. The same technique used to modify MaxPool (reduce the stride, then BlurPool) applies to the MixedPool layer, increasing its shift-invariance (97.2 → 97.8).
Learning the blur filter We have shown that adding anti-aliasing filtering improves shift-equivariance. What if the blur kernel were learned? We initialize the filters with our fixed weights, Tri-3 and Bin-5, and allow them to be adjusted during training (while constraining the kernel to be symmetric). The function space has more degrees of freedom and is strictly more general. However, we find that while accuracy holds, consistency decreases: relative to the fixed filters, we see 98.0 → 97.5 for length-3 and 98.4 → 97.3 for length-5. While shift-invariance can be learned, there is no explicit incentive to do so. Analogously, a fully connected network can learn convolution, but does not do so in practice.
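For concreteness, a hypothetical sketch of this learned-filter variant (not the released code): symmetry is enforced by mirroring a set of free taps, and the filter is re-normalized on every forward pass.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LearnedBlurPool(nn.Module):
    """Learned 3-tap blur, initialized to Tri-3 [1, 2, 1], constrained symmetric."""
    def __init__(self, channels, stride=2):
        super().__init__()
        self.stride = stride
        self.channels = channels
        self.half = nn.Parameter(torch.tensor([1., 2.]))  # free taps: [edge, center]

    def forward(self, x):
        a = torch.cat([self.half, self.half[:-1].flip(0)])  # [e, c, e]: symmetric
        filt = a[:, None] * a[None, :]                      # separable 2-D filter
        filt = (filt / filt.sum())[None, None].expand(self.channels, 1, -1, -1)
        x = F.pad(x, [1, 1, 1, 1], mode='reflect')
        return F.conv2d(x, filt, stride=self.stride, groups=self.channels)
```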
B. ImageNet Classification

We show expanded results and visualizations.

Classification and shift-invariance results In Table 6, we show expanded results. These results are plotted in Figure 6 in the main paper. All pretrained models are available at https://richzhang.github.io/antialiased-cnns/.

Robustness results In the main paper, we show aggregated results for robustness tests on the Imagenet-C/P datasets (Hendrycks et al., 2019). In Tables 8 and 7, we show expanded results, separated by each corruption and perturbation type.

Antialiasing is motivated by shift-invariance. Indeed, using the Bin-5 antialiasing filter reduces the flip rate by 22.3% for translations. Table 8 indicates increased stability to other perturbation types as well. We observe higher stability to geometric perturbations – rotation, tilting, and scaling. In addition, antialiasing also helps stability to noise. This is somewhat expected, as low-pass filtering can help average away spurious noise. Surprisingly, adding blurring within the network also increases resilience to blurred images. In total, antialiasing increases stability almost across the board – 9 of the 10 perturbations are reliably stabilized.

We also observe increased accuracy in the face of corruptions, as shown in Table 7. Again, adding low-pass filtering helps smooth away spurious noise on the input, helping better maintain performance. Other high-frequency perturbations, such as pixelation and JPEG compression, are also consistently improved with antialiasing. Overall, antialiasing increases robustness to perturbations – 13 of the 15 corruptions are reliably improved.

In total, these results indicate that adding antialiasing provides a smoother feature extractor, which is more stable and robust to out-of-distribution perturbations.

C. Qualitative examples for Labels→Facades

In the main paper, we discussed the tension between needing to generate high-frequency content and low-pass filtering for shift-invariance. Here, we show an example of applying increasingly aggressive filters. In general, generation quality is maintained with the Rect-2 and Tri-3 filters, and then degrades with additional filtering.

Table 7. Generalization to corruptions. (Top) Corruption error rate (lower is better) of ResNet50 on ImageNet-C. With antialiasing, the error rate decreases, oftentimes significantly, on most corruptions. (Bottom) The percentage reduction relative to the baseline ResNet50 (higher is better). The right two columns show the mean across corruptions. "Unnorm" is the raw average; "Norm" is normalized by the errors made by AlexNet, as proposed in Hendrycks et al. (2019).
[Figure 13 layout: an input label map and generated facades for five filters – Delta (baseline) [1], Rectangle-2 [1 1], Triangle-3 [1 2 1], Binomial-4 [1 3 3 1], Binomial-5 [1 4 6 4 1].]
Figure 13. Example generations. We show generations with U-Nets trained with 5 different filters. In general, generation quality is well-maintained up to the Tri-3 filter, but decreases noticeably with the Bin-4 and Bin-5 filters due to oversmoothing.