
Making Convolutional Networks Shift-Invariant Again

Richard Zhang¹

Abstract

Modern convolutional networks are not shift-invariant, as small input shifts or translations can cause drastic changes in the output. Commonly used downsampling methods, such as max-pooling, strided-convolution, and average-pooling, ignore the sampling theorem. The well-known signal processing fix is anti-aliasing by low-pass filtering before downsampling. However, simply inserting this module into deep networks degrades performance; as a result, it is seldom used today. We show that when integrated correctly, it is compatible with existing architectural components, such as max-pooling and strided-convolution. We observe increased accuracy in ImageNet classification, across several commonly-used architectures, such as ResNet, DenseNet, and MobileNet, indicating effective regularization. Furthermore, we observe better generalization, in terms of stability and robustness to input corruptions. Our results demonstrate that this classical signal processing technique has been undeservingly overlooked in modern deep networks.

1. Introduction

When downsampling a signal, such as an image, the textbook solution is to anti-alias by low-pass filtering the signal (Oppenheim et al., 1999; Gonzalez & Woods, 1992). Without it, high-frequency components of the signal alias into lower frequencies. This phenomenon is commonly illustrated in movies, where wheels appear to spin backwards, known as the Stroboscopic effect, due to the frame rate not meeting the classical sampling criterion (Nyquist, 1928). Interestingly, most modern convolutional networks do not worry about anti-aliasing.

Early networks did employ a form of blurred-downsampling – average pooling (LeCun et al., 1990). However, ample empirical evidence suggests max-pooling provides stronger task performance (Scherer et al., 2010), leading to its widespread adoption. Unfortunately, max-pooling does not provide the same anti-aliasing capability, and a curious, recently uncovered phenomenon emerges – small shifts in the input can drastically change the output (Engstrom et al., 2019; Azulay & Weiss, 2018). As seen in Figure 1, network outputs can oscillate depending on the input position.

Blurred-downsampling and max-pooling are commonly viewed as competing downsampling strategies (Scherer et al., 2010). However, we show that they are compatible. Our simple observation is that max-pooling is inherently composed of two operations: (1) evaluating the max operator densely and (2) naive subsampling. We propose to low-pass filter between them as a means of anti-aliasing. This viewpoint enables low-pass filtering to augment, rather than replace, max-pooling. As a result, shifts in the input leave the output relatively unaffected (shift-invariance) and more closely shift the internal feature maps (shift-equivariance).

Furthermore, this enables proper placement of the low-pass filter, directly before subsampling. With this methodology, practical anti-aliasing can be achieved with any existing strided layer, such as strided-convolution, which is used in more modern networks such as ResNet (He et al., 2016) and MobileNet (Sandler et al., 2018).

A potential concern is that overaggressive filtering can result in heavy loss of information, degrading performance. However, we actually observe increased accuracy in ImageNet classification (Russakovsky et al., 2015) across architectures, as well as increased robustness and stability to corruptions and perturbations (Hendrycks et al., 2019). In summary:

• We integrate classic anti-aliasing to improve shift-equivariance of deep networks. Critically, the method is compatible with existing downsampling strategies.
• We validate on common downsampling strategies – max-pooling, average-pooling, strided-convolution – in different architectures. We test across multiple tasks – image classification and image-to-image translation.
• For ImageNet classification, we find, surprisingly, that accuracy increases, indicating effective regularization.
• Furthermore, we observe better generalization. Performance is more robust and stable to corruptions such as rotation, scaling, blurring, and noise variants.

¹ Adobe Research, San Francisco, CA. Correspondence to: Richard Zhang <[email protected]>.

Proceedings of the 36th International Conference on Machine Learning, Long Beach, California, PMLR 97, 2019. Copyright 2019 by the author(s).

arXiv:1904.11486v2 [cs.CV] 9 Jun 2019
Figure 1. Classification stability for selected images. Predicted probability of the correct class changes when shifting the image. Each panel plots the probability of the correct class against the diagonal shift of the input, comparing MaxPool (baseline) against MaxBlurPool (anti-aliased, ours). The baseline (black) exhibits chaotic behavior, which is stabilized by our method (blue). We find this behavior across networks and datasets. Here, we show selected examples using AlexNet on ImageNet (top) and VGG on CIFAR10 (bottom). Code and anti-aliased versions of popular networks are available at https://richzhang.github.io/antialiased-cnns/.

2. Related Work

Local connectivity and weight sharing have been a central tenet of neural networks, including the Neocognitron (Fukushima & Miyake, 1982), LeNet (LeCun et al., 1998), and modern networks such as AlexNet (Krizhevsky et al., 2012), VGG (Simonyan & Zisserman, 2015), ResNet (He et al., 2016), and DenseNet (Huang et al., 2017). In biological systems, local connectivity was famously discovered in a cat's visual system (Hubel & Wiesel, 1962). Recent work has strived to add additional invariances, such as rotation, reflection, and scaling (Sifre & Mallat, 2013; Bruna & Mallat, 2013; Kanazawa et al., 2014; Cohen & Welling, 2016; Worrall et al., 2017; Esteves et al., 2018). We focus on shift-invariance, which is often taken for granted.

Though different properties have been engineered into networks, what factors and invariances does an emergent representation actually learn? Qualitative analysis of deep networks has included showing patches which activate hidden units (Girshick et al., 2014; Zhou et al., 2015), actively maximizing hidden units (Mordvintsev et al., 2015), and mapping features back into pixel space (Zeiler & Fergus, 2014; Hénaff & Simoncelli, 2016; Mahendran & Vedaldi, 2015; Dosovitskiy & Brox, 2016a;b; Nguyen et al., 2017). Our analysis is focused on a specific, low-level property and is complementary to these approaches.

A more quantitative approach for analyzing networks is measuring representation or output changes (or robustness to changes) in response to manually generated perturbations to the input, such as image transformations (Goodfellow et al., 2009; Lenc & Vedaldi, 2015; Azulay & Weiss, 2018), geometric transforms (Fawzi & Frossard, 2015; Ruderman et al., 2018), and CG renderings with various shapes, poses, and colors (Aubry & Russell, 2015). A related line of work is adversarial examples, where input perturbations are purposely directed to produce large changes in the output. These perturbations can be on pixels (Goodfellow et al., 2014a;b), a single pixel (Su et al., 2019), small deformations (Xiao et al., 2018), or even affine transformations (Engstrom et al., 2019). We aim to make the network robust to the simplest of these types of attacks and perturbations: shifts. In doing so, we also observe increased robustness across other types of corruptions and perturbations (Hendrycks et al., 2019).

Classic hand-engineered computer vision and image processing representations, such as SIFT (Lowe, 1999), wavelets, and image pyramids (Adelson et al., 1984; Burt & Adelson, 1987) also extract features in a sliding window manner, often with some subsampling factor. As discussed in Simoncelli et al. (1992), literal shift-equivariance cannot hold when subsampling. Shift-equivariance can be recovered if features are extracted densely, for example textons (Leung & Malik, 2001), the Stationary Wavelet Transform (Fowler, 2005), and DenseSIFT (Vedaldi & Fulkerson, 2008). Deep networks can also be evaluated densely, by removing striding and making appropriate changes to subsequent layers by using à trous/dilated convolutions (Chen et al., 2015; 2018; Yu & Koltun, 2016; Yu et al., 2017). This comes at great computation and memory cost. Our work investigates improving shift-equivariance with minimal additional computation, by blurring before subsampling.

Max Pooling: MaxPool (stride 2) → Max (stride 1) + BlurPool (stride 2). Strided-Convolution: Conv (stride 2) + ReLU → Conv (stride 1) + ReLU + BlurPool (stride 2). Average Pooling: AvgPool (stride 2) → BlurPool (stride 2).

Figure 2. Anti-aliasing common downsampling layers. (Top) Max-pooling, strided-convolution, and average-pooling can each be better antialiased (bottom) with our proposed architectural modification. An example on max-pooling is shown below.

Baseline (MaxPool): (1) Max (dense evaluation) preserves shift-equivariance; (2) Subsampling loses shift-equivariance, with heavy aliasing. Anti-aliased (MaxBlurPool): (1) Max (dense evaluation) preserves shift-equivariance; (2) the anti-aliasing filter (blur kernel convolution) preserves shift-equivariance; (3) Subsampling loses shift-equivariance, but with reduced aliasing.

Figure 3. Anti-aliased max-pooling. (Top) Pooling does not preserve shift-equivariance. It is functionally equivalent to densely-evaluated pooling, followed by subsampling. The latter ignores the Nyquist sampling theorem and loses shift-equivariance. (Bottom) We low-pass filter between the operations. This keeps the first operation, while anti-aliasing the appropriate signal. Anti-aliasing and subsampling can be combined into one operation, which we refer to as BlurPool.

Early networks employed average pooling (LeCun et al., 1990), which is equivalent to blurred-downsampling with a box filter. However, work (Scherer et al., 2010) has found max-pooling to be more effective, which has consequently become the predominant method for downsampling. While previous work (Scherer et al., 2010; Hénaff & Simoncelli, 2016; Azulay & Weiss, 2018) acknowledges the drawbacks of max-pooling and benefits of blurred-downsampling, they are viewed as separate, discrete choices, preventing their combination. Interestingly, Lee et al. (2016) does not explore low-pass filters, but does propose to softly gate between max and average pooling. However, this does not fully utilize the anti-aliasing capability of average pooling. Mairal et al. (2014) derive a network architecture, motivated by translation invariance, named Convolutional Kernel Networks. While theoretically interesting (Bietti & Mairal, 2017), CKNs perform at lower accuracy than contemporaries, resulting in limited usage. Interestingly, a byproduct of the derivation is a standard Gaussian filter; however, no guidance is provided on its proper integration with existing network components. Instead, we demonstrate practical integration with any strided layer, and empirically show performance increases on a challenging benchmark – ImageNet classification – on widely-used networks.

3. Methods

3.1. Preliminaries

Deep convolutional networks as feature extractors Let an image with resolution H × W be represented by X ∈ R^{H×W×3}. An L-layer CNN can be expressed as a feature extractor F_l(X) ∈ R^{H_l×W_l×C_l}, with layer l ∈ {0, 1, ..., L}, spatial resolution H_l × W_l, and C_l channels. Each feature map can also be upsampled to the original resolution, F̃_l(X) ∈ R^{H×W×C_l}.
Figure 4. Illustrative 1-D example of sensitivity to shifts. We illustrate how downsampling affects shift-equivariance with a toy example. (Left) An input signal is shown as a light gray line. The max-pooled (k = 2, s = 2) signal is in blue squares. Simply shifting the input and then max-pooling provides a completely different answer (red diamonds). (Right) The blue and red points are subsampled from a densely max-pooled (k = 2, s = 1) intermediate signal (thick black line). We low-pass filter this intermediate signal and then subsample from it, shown with green and magenta triangles, better preserving shift-equivariance.

Shift-equivariance and invariance A function F̃ is shift-equivariant if shifting the input equally shifts the output, meaning shifting and feature extraction are commutable.

Shift_{∆h,∆w}(F̃(X)) = F̃(Shift_{∆h,∆w}(X))   ∀ (∆h, ∆w)   (1)

A representation is shift-invariant if shifting the input results in an identical representation.

F̃(X) = F̃(Shift_{∆h,∆w}(X))   ∀ (∆h, ∆w)   (2)

Periodic-N shift-equivariance/invariance In some cases, the definitions in Eqns. 1, 2 may hold only when shifts (∆h, ∆w) are integer multiples of N. We refer to such scenarios as periodic shift-equivariance/invariance. For example, periodic-2 shift-invariance means that even-pixel shifts produce an identical output, but odd-pixel shifts may not.

Circular convolution and shifting Edge artifacts are an important consideration. When shifting, information is lost on one side and has to be filled in on the other. In our CIFAR10 classification experiments, we use circular shifting and convolution. When the convolutional kernel hits the edge, it "rolls" to the other side. Similarly, when shifting, pixels are rolled off one edge to the other:

[Shift_{∆h,∆w}(X)]_{h,w,c} = X_{(h−∆h)%H, (w−∆w)%W, c},   (3)

where % is the modulus function. The modification minorly affects performance and could potentially be mitigated by additional padding, at the expense of memory and computation. But importantly, this affords us a clean testbed: any loss in shift-equivariance is purely due to characteristics of the feature extractor.

An alternative is to take a shifted crop from a larger image. We use this approach for ImageNet experiments, as it more closely matches standard train and test procedures.

3.2. Anti-aliasing to improve shift-equivariance

Conventional methods for reducing spatial resolution – max-pooling, average pooling, and strided convolution – all break shift-equivariance. We propose improvements, shown in Figure 2. We start by analyzing max-pooling.

MaxPool→MaxBlurPool Consider the example signal [0, 0, 1, 1, 0, 0, 1, 1] in Figure 4 (left). Max-pooling (kernel k = 2, stride s = 2) will result in [0, 1, 0, 1]. Simply shifting the input results in a dramatically different answer of [1, 1, 1, 1]. Shift-equivariance is lost. These results are subsampled from an intermediate signal – the input densely max-pooled (stride 1), which we simply refer to as "max". As illustrated in Figure 3 (top), we can write max-pooling as a composition of two functions: MaxPool_{k,s} = Subsample_s ∘ Max_k.

The Max operation preserves shift-equivariance, as it is densely evaluated in a sliding window fashion, but the subsequent subsampling does not. We simply propose to add an anti-aliasing filter with kernel m × m, denoted as Blur_m, as shown in Figure 4 (right). During implementation, blurring and subsampling are combined, as is commonplace in image processing. We call this function BlurPool_{m,s}.

MaxPool_{k,s} → Subsample_s ∘ Blur_m ∘ Max_k = BlurPool_{m,s} ∘ Max_k   (4)

Sampling after low-pass filtering gives [.5, 1, .5, 1] and [.75, .75, .75, .75]. These are closer to each other and are better representations of the intermediate signal.

StridedConv→ConvBlurPool Strided-convolutions suffer from the same issue, and the same method applies.

Relu ∘ Conv_{k,s} → BlurPool_{m,s} ∘ Relu ∘ Conv_{k,1}   (5)
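To make Eqns. 4 and 5 concrete, here is a minimal PyTorch-style sketch of BlurPool as a fixed, normalized blur kernel applied as a depthwise strided convolution, followed by the MaxPool→MaxBlurPool swap. The names (BlurPool2d, max_blur_pool), the reflect padding, and the default Tri-3 filter are illustrative assumptions, not the exact API of the released antialiased-cnns code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BlurPool2d(nn.Module):
    """Fixed low-pass filter followed by subsampling (Blur_m then Subsample_s)."""
    def __init__(self, channels, filt_size=3, stride=2):
        super().__init__()
        self.stride = stride
        self.pad = (filt_size - 1) // 2
        # 1-D binomial taps; the outer product gives the 2-D blur kernel.
        taps = {2: [1., 1.], 3: [1., 2., 1.], 5: [1., 4., 6., 4., 1.]}[filt_size]
        a = torch.tensor(taps)
        kernel = torch.outer(a, a)
        kernel = kernel / kernel.sum()            # normalize the weights
        # One copy of the kernel per channel (depthwise convolution).
        self.register_buffer('kernel', kernel[None, None].repeat(channels, 1, 1, 1))

    def forward(self, x):
        x = F.pad(x, [self.pad] * 4, mode='reflect')
        return F.conv2d(x, self.kernel, stride=self.stride, groups=x.shape[1])

def max_blur_pool(channels, kernel_size=2, filt_size=3):
    # MaxPool(k, s=2) -> Max(k, s=1) + BlurPool(m, s=2), as in Eqn. 4.
    return nn.Sequential(nn.MaxPool2d(kernel_size, stride=1),
                         BlurPool2d(channels, filt_size=filt_size, stride=2))
```

The same BlurPool2d module can be appended after a stride-1 convolution and ReLU to antialias a strided-convolution layer (Eqn. 5), or used on its own in place of average pooling (Eqn. 6 below).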
Importantly, this analogous modification applies conceptually to any strided layer, meaning the network designer can keep their original operation of choice.

AveragePool→BlurPool Blurred downsampling with a box filter is the same as average pooling. Replacing it with a stronger filter provides better shift-equivariance. We examine such filters next.

AvgPool_{k,s} → BlurPool_{m,s}   (6)

Anti-aliasing filter selection The method allows for a choice of blur kernel. We test m × m filters ranging from size 2 to 5, with increasing smoothing. The weights are normalized. The filters are the outer product of the following vectors with themselves.
• Rectangle-2 [1, 1]: moving average or box filter; equivalent to average pooling or "nearest" downsampling
• Triangle-3 [1, 2, 1]: two box filters convolved together; equivalent to bilinear downsampling
• Binomial-5 [1, 4, 6, 4, 1]: the box filter convolved with itself repeatedly; the standard filter used in Laplacian pyramids (Burt & Adelson, 1987)

4. Experiments

4.1. Testbeds

CIFAR Classification To begin, we test classification of low-resolution 32 × 32 images. The dataset contains 50k training and 10k validation images, classified into one of 10 categories. We dissect the VGG architecture (Simonyan & Zisserman, 2015), showing that shift-equivariance is a signal-processing property, progressively lost in each downsampling layer.

ImageNet Classification We then test on large-scale classification of 224 × 224 resolution images. The dataset contains 1.2M training and 50k validation images, classified into one of 1000 categories. We test across different architecture families – AlexNet (Krizhevsky et al., 2012), VGG (Simonyan & Zisserman, 2015), ResNet (He et al., 2016), DenseNet (Huang et al., 2017), and MobileNet-v2 (Sandler et al., 2018) – with different downsampling strategies, as described in Table 1. Furthermore, we test classifier robustness using the ImageNet-C and ImageNet-P datasets (Hendrycks et al., 2019).

Conditional Image Generation Finally, we show that the same aliasing issues in classification networks are also present in conditional image generation networks. We test on the Labels→Facades (Tyleček & Šára, 2013; Isola et al., 2017) dataset, where a network is tasked to generate a 256×256 photorealistic image from a label map. There are 400 training and 100 validation images.

                    ImageNet Classification                          Generation
              AlexNet   VGG   ResNet   DenseNet   MobileNet-v2       U-Net
StridedConv     1†       –      4‡       1‡          5‡                8
MaxPool         3        5      1        1           –                 –
AvgPool         –        –      –        3           –                 –

Table 1. Testbeds. We test across tasks (ImageNet classification and Labels→Facades) and network architectures. Each architecture employs different downsampling strategies. We list how often each is used here. We can antialias each variant. † This convolution uses stride 4 (all others use 2). We only apply the antialiasing at stride 2. Evaluating the convolution at stride 1 would require large computation at full-resolution. ‡ For the same reason, we do not antialias the first strided-convolution in these networks.

4.2. Shift-Invariance/Equivariance Metrics

Ideally, a shift in the input would result in equally shifted feature maps internally.

Internal feature distance. We examine internal feature maps with d(Shift_{∆h,∆w}(F̃(X)), F̃(Shift_{∆h,∆w}(X))) (left- and right-hand sides of Eqn. 1). We use cosine distance, as is common for deep features (Kiros et al., 2015; Zhang et al., 2018).

We can also measure the stability of the output.

Classification consistency. For classification, we check how often the network outputs the same classification, given the same image with two different shifts: E_{X,h1,w1,h2,w2} 1{arg max P(Shift_{h1,w1}(X)) = arg max P(Shift_{h2,w2}(X))}.

Generation stability. For image translation, we test if a shift in the input image generates a correspondingly shifted output. For simplicity, we test horizontal shifts: E_{X,∆w} PSNR(Shift_{0,∆w}(F(X)), F(Shift_{0,∆w}(X))).

4.3. Internal shift-equivariance

We first test on the CIFAR dataset using the VGG13-bn (Simonyan & Zisserman, 2015) architecture.

We dissect the progressive loss of shift-equivariance by investigating the VGG architecture internally. The network contains 5 blocks of convolutions, each followed by max-pooling (with stride 2), followed by a linear classifier. For purposes of our understanding, MaxPool layers are broken into two components – before and after subsampling, e.g., max1 and pool1, respectively. In Figure 5 (top), we show internal feature distance as a function of all possible shift-offsets (∆h, ∆w) and layers. All layers before the first downsampling, max1, are shift-equivariant. Once downsampling occurs in pool1, shift-equivariance is lost. However, periodic-N shift-equivariance still holds, as indicated by the stippling pattern in pool1, and each subsequent subsampling doubles the factor N.
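As one concrete, hypothetical way to evaluate the internal feature distance of Section 4.2 under the circular-shift protocol of Section 3.1, the sketch below shifts with torch.roll and compares the two sides of Eqn. 1 with cosine distance; forming F̃ with nearest-neighbor upsampling is an assumption made for simplicity.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def internal_feature_distance(extract, x, dh, dw):
    """Cosine distance between the two sides of Eqn. 1 for one internal layer.

    `extract` is a hypothetical handle mapping an image batch (N, 3, H, W)
    to a feature map (N, C, H', W'). Shifts are circular (torch.roll).
    """
    def f_tilde(img):
        # F~: features upsampled back to input resolution.
        return F.interpolate(extract(img), size=img.shape[-2:], mode='nearest')

    lhs = torch.roll(f_tilde(x), shifts=(dh, dw), dims=(2, 3))   # Shift(F~(X))
    rhs = f_tilde(torch.roll(x, shifts=(dh, dw), dims=(2, 3)))   # F~(Shift(X))
    cos = F.cosine_similarity(lhs.flatten(2), rhs.flatten(2), dim=1)
    return (1.0 - cos).mean().item()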
(a) Baseline VGG13bn (using MaxPool). (b) Anti-aliased VGG13bn (using MaxBlurPool, Bin-5).

Figure 5. Deviation from perfect shift-equivariance, throughout VGG. Feature distance between left- and right-hand sides of the shift-equivariance condition (Eqn. 1). Each pixel in each heatmap is a shift (∆h, ∆w). Blue indicates perfect shift-equivariance; red indicates large deviation. Note that the dynamic ranges of distances are different per layer. For visualization, we calibrate by calculating the mean distance between two different images, and mapping red to half the value. The accumulated downsampling factor is in [brackets]; in layers pool5, classifier, and softmax, shift-equivariance and shift-invariance are equivalent, as features have no spatial extent. Layers up to max1 have perfect equivariance, as no downsampling has yet occurred. (a) On the baseline network, shift-equivariance is reduced each time downsampling takes place. Periodic-N shift-equivariance holds, with N doubling with each downsampling. (b) With our antialiased network, shift-equivariance is better maintained, and the resulting output is more shift-invariant.

In Figure 5 (bottom), we plot shift-equivariance maps with our anti-aliased network, using MaxBlurPool. Shift-equivariance is clearly better preserved. In particular, the severe drop-offs in downsampling layers do not occur. Improved shift-equivariance throughout the network cascades into more consistent classifications in the output, as shown by some selected examples in Figure 1. This study uses a Bin-5 filter, trained without data augmentation. The trend holds for other filters and when training with augmentation.

4.4. Large-scale ImageNet classification

4.4.1. Shift-Invariance and Accuracy

We next test on large-scale image classification of ImageNet (Russakovsky et al., 2015). In Figure 6, we show classification accuracy and consistency, across variants of several architectures – VGG, ResNet, DenseNet, and MobileNet-v2. The off-the-shelf networks are labeled as Baseline, and we use standard training schedules from the publicly available PyTorch (Paszke et al., 2017) repository for our anti-aliased networks. Each architecture has a different downsampling strategy, shown in Table 1. We typically refer to the popular ResNet50 as a running example; note that we see similar trends across network architectures.

Improved shift-invariance We apply progressively stronger filters – Rect-2, Tri-3, Bin-5. Doing so increases ResNet50 stability by +0.8%, +1.7%, and +2.1%, respectively. Note that doubling layers – going to ResNet101 – only increases stability by +0.6%. Even a simple, small low-pass filter, directly applied to ResNet50, outpaces this. As intended, stability increases across architectures (points move upwards in Figure 6).

Improved classification Filtering improves the shift-invariance. How does it affect absolute classification performance? We find that across the board, performance actually increases (points move to the right in Figure 6). The filters improve ResNet50 by +0.7% to +0.9%. For reference, doubling the layers to ResNet101 increases accuracy by +1.2%. A low-pass filter makes up much of this ground, without adding any learnable parameters. This is a surprising, unexpected result, as low-pass filtering removes information, and could be expected to reduce performance. On the contrary, we find that it serves as effective regularization, and these widely-used methods improve with simple anti-aliasing. As ImageNet-trained nets often serve as the backbone for downstream tuning, this improvement may be observed across other applications as well.
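A minimal sketch of the classification-consistency metric (Section 4.2) follows, assuming circular shifts as in the CIFAR protocol and a small number of random shift pairs per image; the ImageNet evaluation in the paper instead uses shifted crops from a larger image, so this is an illustration rather than the exact protocol.

```python
import torch

@torch.no_grad()
def classification_consistency(model, images, max_shift=32, n_pairs=5):
    """Fraction of random shift pairs on which the top-1 prediction agrees."""
    agree, total = 0, 0
    for x in images:                                   # x: (C, H, W) tensor
        x = x.unsqueeze(0)
        for _ in range(n_pairs):
            h1, w1, h2, w2 = torch.randint(0, max_shift, (4,)).tolist()
            p1 = model(torch.roll(x, (h1, w1), dims=(2, 3))).argmax(dim=1)
            p2 = model(torch.roll(x, (h2, w2), dims=(2, 3))).argmax(dim=1)
            agree += int((p1 == p2).item())
            total += 1
    return agree / total
```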
Networks shown include ResNet18, ResNet34, ResNet50, ResNet101, VGG16, VGG16bn, DenseNet121, and MobileNet-v2, each as a baseline and with Rect-2, Tri-3, and Bin-5 anti-aliasing.

Figure 6. ImageNet Classification consistency vs. accuracy. Up (more consistent to shifts) and to the right (more accurate) is better. Different shapes correspond to the baseline (circle) or variants of our anti-aliased networks (bar, triangle, pentagon for length 2, 3, 5 filters, respectively). We test across network architectures. As expected, low-pass filtering helps shift-invariance. Surprisingly, classification accuracy is also improved.

The best performing filter varies by architecture, but all filters improve over the baseline. We recommend using the Tri-3 or Bin-5 filter. If shift-invariance is especially desired, stronger filters can be used.

4.4.2. Out-of-Distribution Robustness

We have shown increased stability (to shifts), as well as accuracy. Next, we test the generalization capability of the classifier in these two aspects, using datasets from Hendrycks et al. (2019). We test stability to perturbations other than shifts. We then test accuracy on systematically corrupted images. Results are shown in Table 2, averaged across corruption types. We show the raw, unnormalized average, along with a weighted "normalized" average, as recommended.

            Normalized average      Unnormalized average
            ImNet-C    ImNet-P      ImNet-C    ImNet-P
            mCE        mFR          mCE        mFR
Baseline    76.4       58.0         60.6       7.92
Rect-2      75.2       56.3         59.5       7.71
Tri-3       73.7       51.9         58.4       7.05
Bin-5       73.4       51.2         58.1       6.90

Table 2. Accuracy and stability robustness. Accuracy on ImageNet-C, which contains systematically corrupted ImageNet images, is measured by mean corruption error mCE (lower is better). Stability on ImageNet-P, which contains perturbed image sequences, is measured by mean flip rate mFR (lower is better). We show raw, unnormalized scores, as well as scores normalized to AlexNet, as used in Hendrycks et al. (2019). Anti-aliasing improves both accuracy and stability over the baseline. All networks are variants of ResNet50.

Stability to perturbations The ImageNet-P dataset (Hendrycks et al., 2019) contains short video clips of a single image with small perturbations added, such as variants of noise (Gaussian and shot), blur (motion and zoom), simulated weather (snow and brightness), and geometric changes (rotation, scaling, and tilt). Stability is measured by flip rate (mFR) – how often the top-1 classification changes, on average, in consecutive frames. Baseline ResNet50 flips 7.9% of the time; adding anti-aliasing with Bin-5 reduces this by 1.0%. While antialiasing provides increased stability to shifts by design, a "free", emergent property is increased stability to other perturbation types.

Robustness to corruptions We observed increased accuracy on clean ImageNet. Here, we also observe more graceful degradation when images are corrupted. In addition to the previously explored corruptions, ImageNet-C contains impulse noise, defocus and glass blur, simulated frost and fog, and various digital alterations of contrast, elastic transformation, pixelation, and jpeg compression. The geometric perturbations are not used. ResNet50 has a mean error rate of 60.6%. Anti-aliasing with Bin-5 reduces the error rate by 2.5%. As expected, the more "high-frequency" corruptions, such as added noise and pixelation, show greater improvement. Interestingly, we see improvements even with "low-frequency" corruptions, such as defocus blur and zoom blur.

Together, these results indicate that a byproduct of antialiasing is a more robust, generalizable network. Though motivated by shift-invariance, we actually observe increased stability to other perturbation types, as well as increased accuracy, both on clean and corrupted images.

4.5. Conditional image generation (Label→Facades)

We test on image generation, outputting an image of a facade given its semantic label map (Tyleček & Šára, 2013), in a GAN setup (Goodfellow et al., 2014a; Isola et al., 2017). Our classification experiments indicate that anti-aliasing is a natural choice for the discriminator, and it is used in the recent StyleGAN method (Karras et al., 2019). Here, we explore its use in the generator, for the purposes of obtaining a shift-equivariant image-to-image translation network.

Baseline We use the pix2pix method (Isola et al., 2017). The method uses U-Net (Ronneberger et al., 2015), which contains 8 downsampling and 8 upsampling layers, with skip connections to preserve local information. No anti-aliasing filtering is applied in down- or upsampling layers in the baseline. In Figure 7, we show a qualitative example, focusing in on a specific window. In the baseline (top), as the input X shifts horizontally by ∆w, the vertical bars on the generated window also shift. The generations start with two bars, move to a single bar, and eventually oscillate back to two bars. A shift-equivariant network would provide the same resulting facade, no matter the shift.
Figure 7. Selected example of generation instability. The left two images are generated facades from label maps; to their right, crops of a generated window are shown under increasing horizontal input shift (∆w = 0 to 7), along with difference maps from the unshifted generation. For the baseline method (top), input shifts cause different window patterns to emerge (two vertical bars, then one bar shifting to the left), due to naive downsampling and upsampling. Our method (bottom) stabilizes the output, generating a consistent window pattern regardless of the input shift.

                 Baseline   Rect-2   Tri-3   Bin-4   Bin-5
Stability [dB]     29.0      30.1     30.8    31.2    34.4
TV Norm ×100       7.48      7.07     6.25    5.84    6.28

Table 3. Generation stability. PSNR (higher is better) between generated facades, given two horizontally shifted inputs. More aggressive filtering in the down- and upsampling layers leads to a more shift-equivariant generator. Total variation (TV) of generated images (closer to the ground truth value of 7.80 is better). Increased filtering decreases the frequency content of generated images.

Applying anti-aliasing We augment the strided-convolution downsampling by blurring. The U-Net also uses upsampling layers, without any smoothing. Similar to the subsampling case, this leads to aliasing, in the form of grid artifacts (Odena et al., 2016). We mirror the downsampling by applying the same filter after upsampling. Note that applying the Rect-2 and Tri-3 filters while upsampling corresponds to "nearest" and "bilinear" upsampling, respectively. By using the Tri-3 filter, the same window pattern is generated, regardless of input shift, as seen in Figure 7 (bottom).

We measure similarity using peak signal-to-noise ratio between generated facades with shifted and non-shifted inputs: E_{X,∆w} PSNR(Shift_{0,∆w}(F(X)), F(Shift_{0,∆w}(X))). In Table 3, we show that the smoother the filter, the more shift-equivariant the output.

A concern with adding low-pass filtering is the loss of ability to generate high-frequency content, which is critical for generating high-quality imagery. Quantitatively, in Table 3, we compute the total variation (TV) norm of the generated images. Qualitatively, we observe that generation quality typically holds with the Tri-3 filter and subsequently degrades. In the supplemental material, we show examples of applying increasingly aggressive filters. We observe a boost in shift-equivariance while maintaining generation quality, and then a tradeoff between the two factors.

These experiments demonstrate that the technique can make a drastically different architecture (U-Net) for a different task (generating pixels) more shift-equivariant.

5. Conclusions and Discussion

Shift-equivariance is lost in modern deep networks, as commonly used downsampling layers ignore Nyquist sampling and alias. We integrate low-pass filtering to anti-alias, a common signal processing technique. The simple modification achieves higher consistency, across architectures and downsampling techniques. In addition, in classification, we observe surprising boosts in accuracy and robustness.

Anti-aliasing for shift-equivariance is well-understood. A future direction is to better understand how it affects and improves generalization, as we observed empirically. Other directions include the potential benefit to downstream applications, such as nearest-neighbor retrieval, improving temporal consistency in video models, robustness to adversarial examples, and high-level vision tasks such as detection. Adding the inductive bias of shift-invariance serves as "built-in" shift-based data augmentation. This is potentially applicable to online learning scenarios, where the data distribution is changing.
Acknowledgments

I am especially grateful to Eli Shechtman for helpful discussion and guidance. Michaël Gharbi, Andrew Owens, and anonymous reviewers provided beneficial feedback on earlier drafts. I thank labmates and mentors, past and present – Sylvain Paris, Oliver Wang, Alexei A. Efros, Angjoo Kanazawa, Taesung Park, and Phillip Isola – for their helpful comments and encouragement. I thank Dan Hendrycks for discussion about robustness tests on ImageNet-C/P.

Changelog

v1 ArXiv preprint. Paper accepted to ICML 2019.

v2 ICML camera ready. Added additional networks. Added robustness measures. ImageNet consistency numbers and AlexNet results re-evaluated; small fluctuations but no changes in general trends. Compressed main paper to 8 pages. CIFAR results moved to supplemental. Small changes to text.

References

Adelson, E. H., Anderson, C. H., Bergen, J. R., Burt, P. J., and Ogden, J. M. Pyramid methods in image processing. RCA Engineer, 29(6):33–41, 1984.

Aubry, M. and Russell, B. C. Understanding deep features with computer-generated imagery. In ICCV, 2015.

Azulay, A. and Weiss, Y. Why do deep convolutional networks generalize so poorly to small image transformations? In arXiv, 2018.

Bietti, A. and Mairal, J. Invariance and stability of deep convolutional representations. In NIPS, 2017.

Bruna, J. and Mallat, S. Invariant scattering convolution networks. TPAMI, 2013.

Burt, P. J. and Adelson, E. H. The Laplacian pyramid as a compact image code. In Readings in Computer Vision, pp. 671–679. Elsevier, 1987.

Chen, L.-C., Papandreou, G., Kokkinos, I., Murphy, K., and Yuille, A. L. Semantic image segmentation with deep convolutional nets and fully connected CRFs. In ICLR, 2015.

Chen, L.-C., Papandreou, G., Kokkinos, I., Murphy, K., and Yuille, A. L. DeepLab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs. TPAMI, 2018.

Cohen, T. and Welling, M. Group equivariant convolutional networks. In ICML, 2016.

Dosovitskiy, A. and Brox, T. Generating images with perceptual similarity metrics based on deep networks. In NIPS, 2016a.

Dosovitskiy, A. and Brox, T. Inverting visual representations with convolutional networks. In CVPR, 2016b.

Engstrom, L., Tsipras, D., Schmidt, L., and Madry, A. A rotation and a translation suffice: Fooling CNNs with simple transformations. In ICML, 2019.

Esteves, C., Allen-Blanchette, C., Zhou, X., and Daniilidis, K. Polar transformer networks. In ICLR, 2018.

Fawzi, A. and Frossard, P. Manitest: Are classifiers really invariant? In BMVC, 2015.

Fowler, J. E. The redundant discrete wavelet transform and additive noise. IEEE Signal Processing Letters, 12(9):629–632, 2005.

Fukushima, K. and Miyake, S. Neocognitron: A self-organizing neural network model for a mechanism of visual pattern recognition. In Competition and Cooperation in Neural Nets, pp. 267–285. Springer, 1982.

Girshick, R., Donahue, J., Darrell, T., and Malik, J. Rich feature hierarchies for accurate object detection and semantic segmentation. In CVPR, 2014.

Gonzalez, R. C. and Woods, R. E. Digital Image Processing. Pearson, 2nd edition, 1992.

Goodfellow, I., Lee, H., Le, Q. V., Saxe, A., and Ng, A. Y. Measuring invariances in deep networks. In NIPS, 2009.

Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., and Bengio, Y. Generative adversarial nets. In NIPS, 2014a.

Goodfellow, I. J., Shlens, J., and Szegedy, C. Explaining and harnessing adversarial examples. ICLR, 2014b.

He, K., Zhang, X., Ren, S., and Sun, J. Deep residual learning for image recognition. In CVPR, 2016.

Hénaff, O. J. and Simoncelli, E. P. Geodesics of learned representations. In ICLR, 2016.

Hendrycks, D., Lee, K., and Mazeika, M. Using pre-training can improve model robustness and uncertainty. In ICLR, 2019.

Huang, G., Liu, Z., Van Der Maaten, L., and Weinberger, K. Q. Densely connected convolutional networks. In CVPR, 2017.

Hubel, D. H. and Wiesel, T. N. Receptive fields, binocular interaction and functional architecture in the cat's visual cortex. The Journal of Physiology, 160(1):106–154, 1962.
Isola, P., Zhu, J.-Y., Zhou, T., and Efros, A. A. Image-to-image translation with conditional adversarial networks. In CVPR, 2017.

Kanazawa, A., Sharma, A., and Jacobs, D. Locally scale-invariant convolutional neural networks. In NIPS Workshop, 2014.

Karras, T., Laine, S., and Aila, T. A style-based generator architecture for generative adversarial networks. ICLR, 2019.

Kiros, R., Zhu, Y., Salakhutdinov, R. R., Zemel, R., Urtasun, R., Torralba, A., and Fidler, S. Skip-thought vectors. In NIPS, 2015.

Krizhevsky, A. and Hinton, G. Learning multiple layers of features from tiny images. Technical report, Citeseer, 2009.

Krizhevsky, A., Sutskever, I., and Hinton, G. E. ImageNet classification with deep convolutional neural networks. In NIPS, 2012.

LeCun, Y., Boser, B. E., Denker, J. S., Henderson, D., Howard, R. E., Hubbard, W. E., and Jackel, L. D. Handwritten digit recognition with a back-propagation network. In NIPS, 1990.

LeCun, Y., Bottou, L., Bengio, Y., and Haffner, P. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.

Lee, C.-Y., Gallagher, P. W., and Tu, Z. Generalizing pooling functions in convolutional neural networks: Mixed, gated, and tree. In AISTATS, 2016.

Lenc, K. and Vedaldi, A. Understanding image representations by measuring their equivariance and equivalence. In CVPR, 2015.

Leung, T. and Malik, J. Representing and recognizing the visual appearance of materials using three-dimensional textons. IJCV, 2001.

Lowe, D. G. Object recognition from local scale-invariant features. In ICCV, 1999.

Mahendran, A. and Vedaldi, A. Understanding deep image representations by inverting them. In CVPR, 2015.

Mairal, J., Koniusz, P., Harchaoui, Z., and Schmid, C. Convolutional kernel networks. In NIPS, 2014.

Mordvintsev, A., Olah, C., and Tyka, M. DeepDream – a code example for visualizing neural networks. Google Research, 2:5, 2015.

Nguyen, A., Clune, J., Bengio, Y., Dosovitskiy, A., and Yosinski, J. Plug & play generative networks: Conditional iterative generation of images in latent space. In CVPR, 2017.

Nyquist, H. Certain topics in telegraph transmission theory. Transactions of the American Institute of Electrical Engineers, pp. 617–644, 1928.

Odena, A., Dumoulin, V., and Olah, C. Deconvolution and checkerboard artifacts. Distill, 2016. doi: 10.23915/distill.00003. URL http://distill.pub/2016/deconv-checkerboard.

Oppenheim, A. V., Schafer, R. W., and Buck, J. R. Discrete-Time Signal Processing. Pearson, 2nd edition, 1999.

Paszke, A., Gross, S., Chintala, S., Chanan, G., Yang, E., DeVito, Z., Lin, Z., Desmaison, A., Antiga, L., and Lerer, A. Automatic differentiation in PyTorch. 2017.

Ronneberger, O., Fischer, P., and Brox, T. U-Net: Convolutional networks for biomedical image segmentation. In MICCAI, 2015.

Ruderman, A., Rabinowitz, N. C., Morcos, A. S., and Zoran, D. Pooling is neither necessary nor sufficient for appropriate deformation stability in CNNs. In arXiv, 2018.

Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., Huang, Z., Karpathy, A., Khosla, A., Bernstein, M., et al. ImageNet large scale visual recognition challenge. IJCV, 115(3):211–252, 2015.

Sandler, M., Howard, A., Zhu, M., Zhmoginov, A., and Chen, L.-C. MobileNetV2: Inverted residuals and linear bottlenecks. In CVPR, 2018.

Scherer, D., Muller, A., and Behnke, S. Evaluation of pooling operations in convolutional architectures for object recognition. In ICANN, 2010.

Sifre, L. and Mallat, S. Rotation, scaling and deformation invariant scattering for texture discrimination. In CVPR, 2013.

Simoncelli, E. P., Freeman, W. T., Adelson, E. H., and Heeger, D. J. Shiftable multiscale transforms. IEEE Transactions on Information Theory, 38(2):587–607, 1992.

Simonyan, K. and Zisserman, A. Very deep convolutional networks for large-scale image recognition. In ICLR, 2015.

Su, J., Vargas, D. V., and Sakurai, K. One pixel attack for fooling deep neural networks. IEEE Transactions on Evolutionary Computation, 2019.
Tyleček, R. and Šára, R. Spatial pattern templates for recognition of objects with regular structure. In German Conference on Pattern Recognition, pp. 364–374. Springer, 2013.

Vedaldi, A. and Fulkerson, B. VLFeat: An open and portable library of computer vision algorithms. http://www.vlfeat.org/, 2008.

Worrall, D. E., Garbin, S. J., Turmukhambetov, D., and Brostow, G. J. Harmonic networks: Deep translation and rotation equivariance. In CVPR, 2017.

Xiao, C., Zhu, J.-Y., Li, B., He, W., Liu, M., and Song, D. Spatially transformed adversarial examples. ICLR, 2018.

Yu, F. and Koltun, V. Multi-scale context aggregation by dilated convolutions. ICLR, 2016.

Yu, F., Koltun, V., and Funkhouser, T. Dilated residual networks. In CVPR, 2017.

Zeiler, M. D. and Fergus, R. Visualizing and understanding convolutional networks. In ECCV, 2014.

Zhang, R., Isola, P., Efros, A. A., Shechtman, E., and Wang, O. The unreasonable effectiveness of deep features as a perceptual metric. In CVPR, 2018.

Zhou, B., Khosla, A., Lapedriza, A., Oliva, A., and Torralba, A. Object detectors emerge in deep scene CNNs. In ICLR, 2015.

Supplementary Material

Here, we show additional results and experiments for CIFAR classification, ImageNet expanded results, and conditional image generation.

A. CIFAR Classification

A.1. Classification results

We train both without and with shift-based data augmentation. We evaluate on classification accuracy and consistency. The results are shown in Table 5 and Figure 8.

In the main paper, we showed internal activations on the older VGG (Simonyan & Zisserman, 2015) network. Here, we also present classification accuracy and consistency results on the output, along with the more modern DenseNet (Huang et al., 2017) architecture.

               Classification (CIFAR)
               VGG13-bn   DenseNet-40-12
StridedConv       –             –
MaxPool           5             –
AvgPool           –             2

Table 4. Testbeds (CIFAR10 Architectures). We use slightly different architectures for VGG (Simonyan & Zisserman, 2015) and DenseNet (Huang et al., 2017) than the ImageNet counterparts.

Training without data augmentation Without the benefit of seeing shifts at training time, the baseline network produces inconsistent classifications – random shifts of the same image only agree 88.1% of the time. Our anti-aliased network, with the MaxBlurPool operator, increases consistency. The larger the filter, the more consistent the output classifications. This result agrees with our expectation and theory – improving shift-equivariance throughout the network should result in more consistent classifications across shifts, even when such shifts are not seen at training.

In this regime, accuracy clearly increases with consistency, as seen with the blue markers in Figure 8. Filtering does not destroy the signal or make learning harder. On the contrary, shift-equivariance serves as "built-in" augmentation, indicating more efficient data usage.

Training with data augmentation In principle, networks can learn to be shift-invariant from data. Is data augmentation all that is needed to achieve shift-invariance? By applying the Rect-2 filter, a large increase in consistency, 96.6 → 97.6, can be had at a small decrease in accuracy, 93.8 → 93.7. Even when seeing shifts at training, antialiasing increases consistency. From there, stronger filters can increase consistency, at the expense of accuracy.

DenseNet results We show a summary of VGG and DenseNet in Table 4. DenseNet uses comparatively fewer downsampling layers – 2 average-pooling layers instead of 5 max-pooling layers. With just two downsampling layers, the baseline still loses shift-invariance. Even when training with data augmentation, replacing average-pooling with blurred-pooling increases consistency and even minorly improves accuracy. Note that the DenseNet architecture performs stronger than VGG to begin with. In this setting, the Bin-7 BlurPool operator works best for both consistency and accuracy. Again, applying the operator serves as "built-in" data augmentation, performing strongly even without shifts at train time.

How do the learned convolutional filters change? Our proposed change smooths the internal feature maps for purposes of downsampling. How does training with this layer affect the learned convolutional layers? We measure spatial smoothness using the normalized Total Variation (TV) metric proposed in Ruderman et al. (2018). A higher value indicates a filter with more high-frequency components. A lower value indicates a smoother filter. As shown in Figure 10, the anti-aliased networks (red-purple) actually learn smoother filters throughout the network, relative to the baseline (black). Adding in more aggressive low-pass filtering further decreases the TV (increasing smoothness). This indicates that our method actually induces a smoother feature extractor overall.
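For illustration, below is a rough sketch of a per-layer filter-smoothness measure in this spirit; the exact normalization of Ruderman et al. (2018) is not reproduced here, so normalizing each filter's total variation by its L2 norm is an assumption.

```python
import torch

def filter_total_variation(conv_weight):
    """Average spatial total variation of a conv layer's filters.

    conv_weight has shape (out_ch, in_ch, kH, kW). Lower values indicate
    smoother filters. The L2 normalization below is an assumed choice,
    not necessarily the normalization used in the referenced metric.
    """
    w = conv_weight
    dh = (w[..., 1:, :] - w[..., :-1, :]).abs().sum(dim=(-2, -1))  # vertical differences
    dw = (w[..., :, 1:] - w[..., :, :-1]).abs().sum(dim=(-2, -1))  # horizontal differences
    tv = dh + dw                                                   # per-filter total variation
    norm = w.flatten(2).norm(dim=-1).clamp_min(1e-8)
    return (tv / norm).mean().item()
```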
Net    Filter              # Taps  Weights                    Train w/o aug:               Train w/ aug:
                                                              Acc (None) / Acc (Rand) / Consist.   Acc (None) / Acc (Rand) / Consist.
VGG    Delta (baseline)    1       [1]                        91.6 / 87.4 / 88.1           93.4 / 93.8 / 96.6
VGG    Rectangle           2       [1, 1]                     92.8 / 89.3 / 90.5           93.9 / 93.7 / 97.6
VGG    Triangle            3       [1, 2, 1]                  93.1 / 91.4 / 93.9           93.6 / 93.6 / 98.0
VGG    Binomial            4       [1, 3, 3, 1]               93.0 / 91.1 / 93.2           93.4 / 93.2 / 98.1
VGG    Binomial            5       [1, 4, 6, 4, 1]            93.2 / 92.6 / 96.3           93.1 / 93.2 / 98.4
VGG    Binomial            6       [1, 5, 10, 10, 5, 1]       93.0 / 92.4 / 96.9           93.4 / 93.4 / 98.6
VGG    Binomial            7       [1, 6, 15, 20, 15, 6, 1]   93.0 / 93.0 / 98.1           93.2 / 93.2 / 98.8
Dense  Delta               1       [1]                        92.0 / 89.9 / 91.5           93.9 / 93.9 / 97.3
Dense  Rect (baseline)     2       [1, 1]                     93.0 / 92.3 / 94.8           94.4 / 94.4 / 97.7
Dense  Triangle            3       [1, 2, 1]                  93.9 / 93.5 / 96.7           94.5 / 94.5 / 98.3
Dense  Binomial            5       [1, 4, 6, 4, 1]            94.4 / 94.0 / 98.1           94.5 / 94.5 / 98.8
Dense  Binomial            7       [1, 6, 15, 20, 15, 6, 1]   94.5 / 94.3 / 98.8           94.5 / 94.6 / 98.9

Table 5. CIFAR classification accuracy and consistency. Results across blurring filters and training scenarios (without and with data augmentation). We evaluate classification accuracy without shifts (Accuracy – None) and on random shifts (Accuracy – Random), as well as classification consistency.

Figure 8. CIFAR10 Classification consistency vs. accuracy. VGG-13 (Simonyan & Zisserman, 2015) (left) and DenseNet-40-12 (Huang et al., 2017) (right) networks. Up (more consistent) and to the right (more accurate) is better. The number of sides corresponds to the number of filter taps used (e.g., diamond for a 4-tap filter); colors correspond to filters trained without (blue) and with (pink) shift-based data augmentation, using various filters. We show accuracy for no shift when training without shifts, and for a random shift when training with shifts.
Figure 9. Average accuracy as a function of shift. (Left) We show classification accuracy across the test set as a function of shift, given different filters (Delta baseline, Rect-2, Triangle-3, and Binomial-4 through Binomial-7). (Right) We plot accuracy vs. diagonal shift in the input image, across different filters. Note that accuracy degrades quickly with the baseline, but as increased filtering is added, classifications become consistent across spatial positions.

Figure 10. Total Variation (TV) by layer. We compute the average smoothness of learned conv filters per layer, conv1_1 through conv5_2 (lower is smoother). Baseline MaxPool is in black, and adding additional blurring (MaxBlurPool with Rect-2 through Binomial-7 filters) is shown in colors. Note that the learned convolutional layers become smoother, indicating that a smoother feature extractor is induced.

Timing analysis The average speed of a forward pass of VGG13bn using batch size 100 CIFAR images on a GTX1080Ti GPU is 10.19ms. Evaluating Max at stride 1 instead of 2 adds 3.0%. From there, low-pass filtering with kernel sizes 3, 5, 7 adds an additional 5.5%, 7.6%, and 9.3% time, respectively, relative to the baseline. The method can be implemented more efficiently by separating the low-pass filter into horizontal and vertical components, allowing added time to scale linearly with filter size, rather than quadratically. In total, the largest filter adds 12.3% per forward pass. This is significantly cheaper than evaluating multiple forward passes in an ensembling approach (1024× computation to evaluate every shift), or evaluating each layer more densely by exchanging striding for dilation (4×, 16×, 64×, 256× computation for conv2-conv5, respectively). Given computational resources, brute-force computation solves shift-invariance.

Average accuracy across spatial positions In Figure 9, we train without augmentation, and show how accuracy systematically degrades as a function of spatial shift. We observe the following:
• On the left, the baseline heatmap shows that classification accuracy holds when testing with no shift, but quickly degrades when shifting.
• The proposed filtering decreases the degradation. Bin-7 is largely consistent across all spatial positions.
• On the right, we plot the accuracy when making diagonal shifts to the input. As increased filtering is added, classification accuracy becomes consistent in all positions.

Classification variation distribution The consistency metric in the main paper looks at the hard classification, discounting classifier confidence. Similar to Azulay & Weiss (2018), we also compute the variation in probability of correct classification (the traces shown in Figure 3 in the main paper), given different shifts. We can capture the variation across all possible shifts: √(Var_{h,w}({P_{correct class}(Shift_{h,w}(X))})).

In Figure 11, we show the distribution of classification variations, before and after adding in the low-pass filter. Even with a small 2 × 2 filter, variation immediately decreases. As the filter size is increased, the output classification variation continues to decrease. This has a larger effect when training without data augmentation, but is still observable when training with data augmentation.

Training with data augmentation with the baseline network reduces variation. Anti-aliasing the networks reduces variation in both scenarios. More aggressive filtering further decreases variation.
Making Convolutional Networks Shift-Invariant Again

Train without data augmentation Train with data augmentation


1600 MaxPool [1] Delta (Baseline) 1000
MaxBlurPool [1 1] Rect-2
1400 MaxBlurPool [1 2 1] Tri-3
MaxBlurPool [1 3 3 1] Binomial-4 800
Count in CIFAR10 Test Set

Count in CIFAR10 Test Set


1200 MaxBlurPool [1 4 6 4 1] Binomial-5
MaxBlurPool [1 5 10 10 5 1] Binomial-6
1000 MaxBlurPool [1 6 15 20 15 6 1] Binomial-7
600
800
600 400

400
200
200
0 0
10 5 10 4 10 3 10 2 10 1 100 10 5 10 4 10 3 10 2 10 1 100
Variation in probability of correct classification Variation in probability of correct classification
Figure 11. Distribution of per-image classification variation. We show the distribution of classification variation in the test set, (left)
without and (right) with data augmentation at training. Lower variation means more consistent classifications (and increased shift-
invariance). Training with data augmentation drastically reduces variation in classification. Adding filtering further decreases variation.
Train without Data Augmentation Train with Data Augmentation
1.0 1.0

0.9 0.9

0.8 0.8

0.7 0.7
Accuracy

0.6 0.6

0.5 MaxPool [1] Delta 0.5 MaxPool [1] Delta


MaxBlurPool [1 1] Rect-2 MaxBlurPool [1 1] Rect-2
MaxBlurPool [1 2 1] Triangle-3 MaxBlurPool [1 2 1] Triangle-3
MaxBlurPool [1 3 3 1] Binomial-4 MaxBlurPool [1 3 3 1] Binomial-4
0.4 MaxBlurPool [1 4 6 4 1] Binomial-5 0.4 MaxBlurPool [1 4 6 4 1] Binomial-5
MaxBlurPool [1 5 10 10 5 1] Binomial-6 MaxBlurPool [1 5 10 10 5 1] Binomial-6
MaxBlurPool [1 6 15 20 15 6 1] Binomial-7 MaxBlurPool [1 6 15 20 15 6 1] Binomial-7
0.3 0.3
0 2 4 6 8 10 12 14 16 0 2 4 6 8 10 12 14 16
Max adversarial shift [pix] Max adversarial shift [pix]
Figure 12. Robustness to shift-based adversarial attack. Classification accuracy as a function of the number of pixels an adversary is allowed to shift the image. Applying our proposed filtering increases robustness, both without (left) and with (right) data augmentation.

Robustness to shift-based adversary. In the main paper, we show that anti-aliasing the networks increases classification consistency, while maintaining accuracy. A logical consequence is increased accuracy in the presence of a shift-based adversary. We empirically confirm this in Figure 12 for VGG13 on CIFAR10. We compute classification accuracy as a function of the maximum adversarial shift. A max shift of 2 means the adversary can choose any of the 25 positions within a 5 × 5 window. For the classifier to “win”, it must classify all of them correctly. A max shift of 0 means that there is no adversary. Conversely, a max shift of 16 means the image must be correctly classified at all 32 × 32 = 1024 positions.

Our primary observations are as follows:

• As seen in Figure 12 (left), the baseline network (gray) is very sensitive to the adversary.
• Adding larger Binomial filters (from red to purple) increases robustness to the adversary. In fact, the Bin-7 filter (purple) without augmentation outperforms the baseline (black) with augmentation.
• As seen in Figure 12 (right), adding larger Binomial filters also increases adversarial robustness, even when training with augmentation.
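To make the evaluation protocol concrete, the following is a minimal sketch of the shift-adversary accuracy described above; the circular-shift convention and the `model`/`loader` helpers are illustrative assumptions rather than the exact evaluation script.

```python
import torch

def adversarial_shift_accuracy(model, loader, max_shift, device="cuda"):
    """Sketch: an image counts as correct only if it is classified correctly
    at every shift (dh, dw) with |dh|, |dw| <= max_shift (circular shifts).
    Note: at max_shift=16 on 32x32 inputs, opposite offsets wrap to the same image."""
    model.eval().to(device)
    n_correct, n_total = 0, 0
    offsets = range(-max_shift, max_shift + 1)
    with torch.no_grad():
        for images, labels in loader:
            images, labels = images.to(device), labels.to(device)
            all_correct = torch.ones_like(labels, dtype=torch.bool)
            for dh in offsets:
                for dw in offsets:
                    shifted = torch.roll(images, shifts=(dh, dw), dims=(2, 3))
                    preds = model(shifted).argmax(dim=1)
                    all_correct &= preds.eq(labels)
            n_correct += all_correct.sum().item()
            n_total += labels.numel()
    return 100.0 * n_correct / n_total
```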
These results corroborate the findings in the main paper, and demonstrate a use case: increased robustness to a shift-based adversarial attack.

A.2. Alternatives to MaxBlurPool

In the paper, we follow signal processing first principles to arrive at our solution of MaxBlurPool, with a fixed blurring kernel. Here, we explore possible alternatives – swapping the max and blur operations, combining max and blur in parallel through soft-gating, and learning the blur filter.

Swapping max and blur. We blur after max, immediately before subsampling, which has solid theoretical backing in sampling theory. What happens when the operations are swapped? The signal before the max operator is undoubtedly related to the signal after. Thus, blurring before max provides “second-hand” anti-aliasing and still increases shift-invariance over the baseline. However, the swapped order performs worse than the proposed order of max, then blur. For example, for Bin-7, accuracy (93.2 → 92.6) and consistency (98.8 → 98.6) both decrease. We consistently observe this across filters.
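For reference, a minimal sketch of the two orderings compared above is given below (dense max, then blur and subsample, versus blur before max); the module structure and names are ours rather than those of the released implementation, and padding details are simplified.

```python
import torch
import torch.nn as nn

def blur_pool(channels, kernel_1d=(1., 2., 1.), stride=2):
    """Sketch: fixed low-pass (anti-aliasing) filter + subsampling as a depthwise conv."""
    k = torch.tensor(kernel_1d)
    k2d = torch.outer(k, k)
    k2d = (k2d / k2d.sum()).expand(channels, 1, -1, -1).clone()
    conv = nn.Conv2d(channels, channels, kernel_size=len(kernel_1d), stride=stride,
                     padding=(len(kernel_1d) - 1) // 2, groups=channels, bias=False)
    with torch.no_grad():
        conv.weight.copy_(k2d)
    conv.weight.requires_grad_(False)  # fixed filter, not learned
    return conv

def max_blur_pool(channels):
    # Proposed order: evaluate max densely (stride 1), then blur + subsample.
    return nn.Sequential(nn.MaxPool2d(kernel_size=2, stride=1),
                         blur_pool(channels, stride=2))

def blur_max_pool(channels):
    # Swapped order: blur densely first, then max + subsample.
    return nn.Sequential(blur_pool(channels, stride=1),
                         nn.MaxPool2d(kernel_size=2, stride=2))
```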
Softly gating between max-pool and average-pool. Lee et al. (2016) investigate combining MaxPool and AvgPool in parallel, with a soft-gating mechanism, called “Mixed” Max-AvgPool. We instead combine them in series. We conduct additional experiments here. On CIFAR (VGG with augmentation, see Table 5), MixedPool can offer improvements over the MaxPool baseline (96.6→97.2 consistency). However, by softly weighting AvgPool, some antialiasing capability is left on the table. MaxBlurPool provides higher invariance (97.6). All have similar accuracy – 93.8, 93.7, and 93.7 for the baseline MaxPool, MixedPool, and our MaxBlurPool, respectively. We use our Rect-2 variant here for a clean comparison.

Importantly, our paper proposes a methodology, not a pooling layer. The same technique used to modify MaxPool (reduce the stride, then BlurPool) applies to the MixedPool layer, increasing its shift-invariance (97.2→97.8).

Learning the blur filter. We have shown that adding anti-aliasing filtering improves shift-equivariance. What if the blur kernel were learned? We initialize the filters with our fixed weights, Tri-3 and Bin-5, and allow them to be adjusted during training (while constraining the kernel to be symmetric). The function space has more degrees of freedom and is strictly more general. However, we find that while accuracy holds, consistency decreases: relative to the fixed filters, we see 98.0 → 97.5 for length-3 and 98.4 → 97.3 for length-5. While shift-invariance can be learned, there is no explicit incentive to do so. Analogously, a fully connected network can learn convolution, but does not do so in practice.
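The sketch below illustrates one way such a learnable, symmetry-constrained blur kernel could be parameterized, initialized from the fixed Tri-3 weights; the exact parameterization used in our experiments may differ, so treat this as illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LearnableBlurPool(nn.Module):
    """Sketch: a learnable 1-D blur kernel, constrained to be symmetric,
    expanded to a 2-D outer-product filter and applied depthwise before subsampling."""

    def __init__(self, channels, init=(1., 2., 1.), stride=2):
        super().__init__()
        self.channels, self.stride = channels, stride
        k = torch.tensor(init)
        # Only the first half (including the center tap) is free; the rest is mirrored.
        self.half = nn.Parameter(k[: (len(init) + 1) // 2].clone())

    def forward(self, x):
        k = torch.cat([self.half, self.half[:-1].flip(0)])  # enforce symmetry
        k = k / k.sum()                                      # keep unit DC gain
        k2d = torch.outer(k, k).expand(self.channels, 1, -1, -1).contiguous()
        pad = (k.numel() - 1) // 2
        return F.conv2d(x, k2d, stride=self.stride, padding=pad, groups=self.channels)
```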
B. ImageNet Classification

We show expanded results and visualizations.

Classification and shift-invariance results. In Table 6, we show expanded results. These results are plotted in Figure 6 in the main paper. All pretrained models are available at https://richzhang.github.io/antialiased-cnns/.
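As a rough illustration of how the consistency numbers in Table 6 could be measured, the sketch below checks how often a model assigns the same top-1 class to two differently shifted copies of the same image; the circular-shift convention and helper names are assumptions for illustration, not the exact evaluation protocol.

```python
import random
import torch

def shift_consistency(model, loader, max_shift=32, device="cuda"):
    """Sketch: percentage of images whose top-1 prediction agrees between
    two randomly shifted (circularly rolled) copies."""
    model.eval().to(device)
    agree, total = 0, 0
    with torch.no_grad():
        for images, _ in loader:
            images = images.to(device)
            dh1, dw1 = random.randrange(max_shift), random.randrange(max_shift)
            dh2, dw2 = random.randrange(max_shift), random.randrange(max_shift)
            p1 = model(torch.roll(images, (dh1, dw1), dims=(2, 3))).argmax(1)
            p2 = model(torch.roll(images, (dh2, dw2), dims=(2, 3))).argmax(1)
            agree += p1.eq(p2).sum().item()
            total += images.size(0)
    return 100.0 * agree / total
```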
Robustness results. In the main paper, we show aggregated results for robustness tests on the ImageNet-C/P datasets (Hendrycks et al., 2019). In Tables 7 and 8, we show expanded results, separated by each corruption and perturbation type.

Antialiasing is motivated by shift-invariance. Indeed, using the Bin-5 antialiasing filter reduces the flip rate for translations by 22.3%. Table 8 indicates increased stability to other perturbation types as well. We observe higher stability to geometric perturbations – rotation, tilting, and scaling. In addition, antialiasing also helps stability to noise. This is somewhat expected, as low-pass filtering can average away spurious noise. Surprisingly, adding blurring within the network also increases resilience to blurred images. In total, antialiasing increases stability almost across the board – 9 of the 10 perturbations are reliably stabilized.

We also observe increased accuracy in the face of corruptions, as shown in Table 7. Again, adding low-pass filtering helps smooth away spurious noise on the input, helping better maintain performance. Other high-frequency corruptions, such as pixelation and JPEG compression, are also consistently improved with antialiasing. Overall, antialiasing increases robustness to corruptions – 13 of the 15 corruptions are reliably improved.

In total, these results indicate that adding antialiasing provides a smoother feature extractor, which is more stable and robust to out-of-distribution perturbations.

C. Qualitative examples for Labels→Facades

In the main paper, we discussed the tension between needing to generate high-frequency content and low-pass filtering for shift-invariance. Here, we show an example of applying increasingly aggressive filters. In general, generation quality is maintained with the Rect-2 and Tri-3 filters, and then degrades with additional filtering.
             AlexNet                         VGG16                           VGG16bn
           Accuracy       Consistency     Accuracy       Consistency     Accuracy       Consistency
Filter     Abs     ∆      Abs     ∆       Abs     ∆      Abs     ∆       Abs     ∆      Abs     ∆
Baseline   56.55   –      78.18   –       71.59   –      88.52   –       73.36   –      89.24   –
Rect-2     57.24  +0.69   81.33  +3.15    72.15  +0.56   89.24  +0.72    74.01  +0.65   90.72  +1.48
Tri-3      56.90  +0.35   82.15  +3.97    72.20  +0.61   89.60  +1.08    73.91  +0.55   91.10  +1.86
Bin-5      56.58  +0.03   82.51  +4.33    72.33  +0.74   90.19  +1.67    74.05  +0.69   91.35  +2.11

             ResNet18                        ResNet34                        ResNet50
           Accuracy       Consistency     Accuracy       Consistency     Accuracy       Consistency
Filter     Abs     ∆      Abs     ∆       Abs     ∆      Abs     ∆       Abs     ∆      Abs     ∆
Baseline   69.74   –      85.11   –       73.30   –      87.56   –       76.16   –      89.20   –
Rect-2     71.39  +1.65   86.90  +1.79    74.46  +1.16   89.14  +1.58    76.81  +0.65   89.96  +0.76
Tri-3      71.69  +1.95   87.51  +2.40    74.33  +1.03   89.32  +1.76    76.83  +0.67   90.91  +1.71
Bin-5      71.38  +1.64   88.25  +3.14    74.20  +0.90   89.49  +1.93    77.04  +0.88   91.31  +2.11

             ResNet101                       DenseNet121                     MobileNetv2
           Accuracy       Consistency     Accuracy       Consistency     Accuracy       Consistency
Filter     Abs     ∆      Abs     ∆       Abs     ∆      Abs     ∆       Abs     ∆      Abs     ∆
Baseline   77.37   –      89.81   –       74.43   –      88.81   –       71.88   –      86.50   –
Rect-2     77.82  +0.45   91.04  +1.23    75.04  +0.61   89.53  +0.72    72.63  +0.75   87.33  +0.83
Tri-3      78.13  +0.76   91.62  +1.81    75.14  +0.71   89.78  +0.97    72.59  +0.71   87.46  +0.96
Bin-5      77.92  +0.55   91.74  +1.93    75.03  +0.60   90.39  +1.58    72.50  +0.62   87.79  +1.29
Table 6. ImageNet Classification. We show 1000-way classification accuracy and consistency (higher is better), across nine architectures, with anti-aliasing filtering added. We test 3 possible filters, in addition to the off-the-shelf reference models. This shows the results plotted in Figure 6 in the main paper. Abs is the absolute performance, and ∆ is the difference to the baseline. As designed, classification consistency is improved across all methods. Interestingly, accuracy is also improved.

ResNet50 on ImageNet-C (Hendrycks et al., 2019)


Corruption Error (CE) (lower is better)
Noise Blur Weather Digital Mean
Gauss Shot Impulse Defocus Glass Motion Zoom Snow Frost Fog Bright Contrast Elastic Pixel Jpeg Unnorm Norm
Baseline 68.70 71.10 74.04 61.40 73.39 61.43 63.93 67.76 62.08 54.61 32.04 61.25 55.24 55.24 46.32 60.57 76.43
Rect-2 65.81 68.27 70.49 60.01 72.14 62.19 63.96 68.00 61.83 54.95 32.09 60.25 55.56 53.89 43.62 59.54 75.16
Tri-3 63.86 66.07 69.15 58.36 71.70 60.74 61.58 66.78 60.29 54.40 31.48 58.09 55.26 53.89 43.62 58.35 73.73
Bin-5 64.31 66.39 69.88 60.31 71.37 61.60 61.25 66.82 59.82 51.84 31.51 58.12 55.29 50.81 42.84 58.14 73.41

Corruption Error, Percentage reduced from Baseline ResNet50 (higher is better)


Noise Blur Weather Digital Mean
Gauss Shot Impulse Defocus Glass Motion Zoom Snow Frost Fog Bright Contrast Elastic Pixel Jpeg Unnorm Norm
Baseline 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
Rect-2 4.21 3.98 4.79 2.26 1.70 -1.24 -0.05 -0.35 0.40 -0.62 -0.16 1.63 -0.58 2.44 5.83 1.62 1.32
Tri-3 7.05 7.07 6.60 4.95 2.30 1.12 3.68 1.45 2.88 0.38 1.75 5.16 -0.04 2.44 5.83 3.51 3.34
Bin-5 6.39 6.62 5.62 1.78 2.75 -0.28 4.19 1.39 3.64 5.07 1.65 5.11 -0.09 8.02 7.51 3.96 3.70

Table 7. Generalization to Corruptions. (Top) Corruption error rate (lower is better) of ResNet50 on ImageNet-C. With antialiasing, the error rate decreases, oftentimes significantly, on most corruptions. (Bottom) The percentage reduction relative to the baseline ResNet50 (higher is better). The right two columns show the mean across corruptions. “Unnorm” is the raw average. “Norm” is normalized by the errors made by AlexNet, as proposed by Hendrycks et al. (2019).
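For reference, a minimal sketch of the AlexNet-normalized corruption error used in the “Norm” column, assuming per-corruption, per-severity top-1 error rates have already been computed (the dictionary layout is an illustrative assumption):

```python
def normalized_mce(model_errs, alexnet_errs):
    """Sketch: mean corruption error (mCE), normalized by AlexNet.

    Both arguments map corruption name -> list of top-1 error rates,
    one per severity level (e.g., 5 severities on ImageNet-C).
    """
    ces = []
    for corruption, errs in model_errs.items():
        ce = sum(errs) / sum(alexnet_errs[corruption])  # per-corruption CE
        ces.append(ce)
    return 100.0 * sum(ces) / len(ces)  # mean over corruptions, in percent
```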
ResNet50 on ImageNet-P (Hendrycks et al., 2019)


Flip Rate (FR) (lower is better)
Noise Blur Weather Geometric Mean
Gauss Shot Motion Zoom Snow Bright Translate Rotate Tilt Scale Unnorm Norm
Baseline 14.04 17.38 6.00 4.29 7.54 3.03 4.86 6.79 4.01 11.32 7.92 57.99
Rect-2 14.08 17.16 5.98 4.21 7.34 3.20 4.42 6.43 3.80 10.61 7.72 56.70
Tri-3 12.59 15.57 5.39 3.79 6.98 3.01 3.95 5.80 3.53 9.90 7.05 51.91
Bin-5 12.39 15.22 5.44 3.72 6.76 3.15 3.78 5.67 3.44 9.45 6.90 51.18

Flip Rate (FR) [Percentage reduced from Baseline] (higher is better)


Noise Blur Weather Geometric Mean
Gauss Shot Motion Zoom Snow Bright Translate Rotate Tilt Scale Unnorm Norm
Baseline 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
Rect-2 -0.25 1.27 0.30 1.73 2.65 -5.75 9.21 5.34 5.16 6.20 2.55 2.22
Tri-3 10.35 10.41 10.09 11.58 7.42 0.53 18.89 14.55 12.02 12.50 11.03 10.48
Bin-5 11.81 12.42 9.27 13.28 10.28 -4.10 22.27 16.59 14.11 16.50 12.91 11.75

Top-5 Distance (T5D) (lower is better)


Noise Blur Weather Geometric Mean
Gauss Shot Motion Zoom Snow Bright Translate Rotate Tilt Scale Unnorm Norm
Baseline 3.92 4.55 1.63 1.20 1.95 1.00 1.68 2.15 1.40 3.01 2.25 78.36
Rect-2 3.94 4.54 1.63 1.19 1.91 1.06 1.56 2.07 1.34 2.89 2.21 77.40
Tri-3 3.67 4.28 1.50 1.10 1.85 1.00 1.43 1.92 1.25 2.72 2.07 72.36
Bin-5 3.65 4.22 1.53 1.09 1.78 1.04 1.39 1.89 1.25 2.66 2.05 71.86

Top-5 Distance (T5D) [Percentage reduced from Baseline] (higher is better)


Noise Blur Weather Geometric Mean
Gauss Shot Motion Zoom Snow Bright Translate Rotate Tilt Scale Unnorm Norm
Baseline 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
Rect-2 -0.41 0.09 -0.12 0.39 1.71 -5.83 7.19 3.74 3.90 3.93 1.51 1.22
Tri-3 6.53 5.82 7.95 8.10 5.21 -0.65 15.11 10.82 10.26 9.80 7.86 7.65
Bin-5 7.03 7.26 6.24 9.15 8.45 -4.13 17.73 12.15 10.62 11.80 8.91 8.30
Table 8. Stability to Perturbations. Flip Rate (FR) and Top-5 Distance (T5D) of ResNet50 on ImageNet-P. Though our antialiasing is
motivated by shift-invariance (“translate”), it adds additional stability across many other perturbation types.
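As an illustration of the Flip Rate metric in Table 8, the sketch below counts how often the top-1 prediction changes between consecutive frames of a perturbation sequence; the input format is an illustrative assumption, and the official ImageNet-P protocol handles noise sequences slightly differently (comparing each frame to the first rather than to its neighbor).

```python
def flip_rate(per_sequence_preds):
    """Sketch: percentage of consecutive-frame pairs whose top-1 prediction differs.

    per_sequence_preds: list of sequences, each a list of top-1 class indices
    predicted on the frames of one ImageNet-P perturbation video.
    """
    flips, pairs = 0, 0
    for preds in per_sequence_preds:
        flips += sum(p1 != p2 for p1, p2 in zip(preds[:-1], preds[1:]))
        pairs += len(preds) - 1
    return 100.0 * flips / pairs
```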

[Figure 13 panels: input label map, and generated facades for Delta (Baseline) [1], Rectangle-2 [1 1], Triangle-3 [1 2 1], Binomial-4 [1 3 3 1], and Binomial-5 [1 4 6 4 1] filters.]

Figure 13. Example generations. We show generations with U-Nets trained with 5 different filters. In general, generation quality is well-maintained up to the Tri-3 filter, but decreases noticeably with the Bin-4 and Bin-5 filters due to oversmoothing.
