ConvNets Match Vision Transformers at Scale
Many researchers believe that ConvNets perform well on small or moderately sized datasets, but are not competitive with Vision Transformers when given access to web-scale datasets. We challenge this
belief by evaluating a performant ConvNet architecture pre-trained on JFT-4B, a large labelled dataset of
images often used for training foundation models. We consider pre-training compute budgets between
0.4k and 110k TPU-v4 core compute hours, and train a series of networks of increasing depth and width
from the NFNet model family. We observe a log-log scaling law between held out loss and compute
budget. After fine-tuning on ImageNet, NFNets match the reported performance of Vision Transformers
with comparable compute budgets. Our strongest fine-tuned model achieves a Top-1 accuracy of 90.4%.
Keywords: ConvNets, CNN, Convolution, Transformer, Vision, ViTs, NFNets, JFT, Scaling, Image
The base learning rate is tuned separately for each epoch budget on a small logarithmic grid.

In Figure 2, we provide the validation loss at the end of training on a held out set of 130k images, plotted against the compute budget required to train each model². We note that F7 has the same width as F3, but is double the depth. Similarly, F3 is double the depth of F1, and F1 is double the depth of F0. F3+ and F7+ have the same depths as F3 and F7 but larger width. We train using SGD with Momentum and Adaptive Gradient Clipping (AGC) at batch size 4096, and we use an image resolution of 224×224 during training and 256×256 at evaluation. For additional details describing the NFNet architecture and training pipeline we refer the reader to the original paper (Brock et al., 2021), including the pre-training framework for JFT described in Section 6.2. Note that we removed near-duplicates of images in the training and validation sets of ImageNet from JFT-4B before training (Kolesnikov et al., 2020).

Figure 2 shows a clear linear trend, consistent with a log-log scaling law between validation loss and pre-training compute. This matches the log-log scaling laws previously observed when performing language modelling with transformers (Brown et al., 2020; Hoffmann et al., 2022).

The optimal model size and the optimal epoch budget (which achieve the lowest validation loss) both increase as the compute budget increases. We found that a reliable rule of thumb is to scale the model size and the number of training epochs at the same rate, as previously observed for language modelling by Hoffmann et al. (2022). We note that the optimal epoch budget was greater than 1 for overall compute budgets greater than roughly 5k TPU-v4 core hours.

In Figure 3 we plot the observed optimal learning rate (which minimizes validation loss) for 3 of our models, across a range of epoch budgets.³ Note that we tuned the learning rate on a logarithmic grid spaced by factors of 2. We find that all models in the NFNet family show a similar optimal learning rate 𝛼 ≈ 1.6 for small epoch budgets. However, the optimal learning rate falls as the epoch budget rises, and for large models the optimal learning rate falls more quickly. In practice, one can efficiently tune the learning rate within 2 trials by assuming that the optimal learning rate falls slowly but monotonically as both the model size and the epoch budget increase.
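For illustration, the sketch below turns this two-trial heuristic into code: starting from the best learning rate found for the next-smaller configuration, only that value and the next step down on the factor-of-2 grid are tried. This is a minimal sketch, not the tuning code used for the paper; the helper train_and_evaluate is a placeholder for a full training run.

```python
# Minimal sketch of the two-trial learning-rate search described above.
# Assumption: `train_and_evaluate(lr)` trains one configuration and returns
# its held-out validation loss; it is a placeholder, not part of any library.

def tune_learning_rate(train_and_evaluate, previous_best_lr=1.6):
    """Pick a base learning rate on a factor-of-2 grid with at most 2 trials.

    Heuristic: the optimal learning rate falls slowly but monotonically as
    model size and epoch budget grow, so it is either the value that was
    optimal for the previous (smaller) setting, or one grid step below it.
    """
    candidates = [previous_best_lr, previous_best_lr / 2.0]
    losses = {lr: train_and_evaluate(lr) for lr in candidates}
    return min(losses, key=losses.get)
```

For the smallest configurations, 𝛼 ≈ 1.6 is a natural starting point, since the text reports it as near-optimal across the NFNet family at small epoch budgets.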
² We estimate the compute required to train each model by eye from the typical steps per second achieved by each model during training (when not pre-empted).

³ The optimal learning rate showed very similar trends for all models. We select 3 models here for visual clarity.
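As a rough illustration of the estimate in footnote 2, a core-hour figure can be derived from the typical training throughput as follows; the step count, throughput and core count below are placeholders, not measurements from the paper.

```python
# Back-of-the-envelope compute estimate from training throughput (footnote 2).
# All inputs here are illustrative placeholders.

def tpu_core_hours(total_steps, steps_per_second, num_cores):
    """Estimate TPU core hours as (wall-clock seconds) * (number of cores) / 3600."""
    wall_clock_seconds = total_steps / steps_per_second
    return wall_clock_seconds * num_cores / 3600.0

# Example: a hypothetical run of 200k steps at 2 steps/s on 256 cores.
print(tpu_core_hours(total_steps=200_000, steps_per_second=2.0, num_cores=256))
```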
Figure 2 | Held out loss of NFNets on JFT-4B, plotted against the compute used during training. Both axes are log-scaled, and each curve denotes a different model trained for a range of epoch budgets. We observe a linear trend, matching the scaling laws observed for language modelling.

Figure 3 | The optimal learning rate behaves predictably and is easy to tune. All models show similar optimal learning rates 𝛼 ∼ 1.6 when the epoch budget is small. The learning rate falls slowly as model size and epoch budget increase.
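The linear trend in Figure 2 corresponds to a power law of the form loss ≈ a · C^(−b), which can be recovered by a straight-line fit in log-log space. The sketch below shows such a fit with numpy; the data arrays are illustrative placeholders rather than the measurements behind Figure 2.

```python
import numpy as np

# Fit a log-log (power-law) trend: loss ~ a * compute**(-b).
# `compute` (TPU-v4 core hours) and `loss` are illustrative placeholders,
# not the values plotted in Figure 2.
compute = np.array([4e2, 2e3, 1e4, 5e4, 1.1e5])
loss = np.array([3.0, 2.8, 2.65, 2.5, 2.45])

# A linear fit of log(loss) against log(compute) gives slope -b and intercept log(a).
slope, intercept = np.polyfit(np.log(compute), np.log(loss), deg=1)
a, b = np.exp(intercept), -slope
print(f"loss ≈ {a:.2f} * C^(-{b:.3f})")
```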
Finally, we note that some pre-trained models in Figure 2 perform less well than expected. For example, the curve for NFNet-F7+ models at different pre-training budgets is not smooth. We believe this arises because our data loading pipeline did not guarantee that each training example would be sampled once per epoch if the training run was pre-empted/restarted, potentially causing some training examples to be under-sampled if a training run was restarted multiple times.

Fine-tuned NFNets are competitive with Vision Transformers on ImageNet

In Figure 1, we fine-tune our pre-trained NFNets on ImageNet, and plot the Top-1 error against the compute used during pre-training. We fine-tune each model for 50 epochs using sharpness aware minimization (SAM) (Foret et al., 2020) with stochastic depth and dropout. We train at resolution 384 × 384 and evaluate at 480 × 480.
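A minimal configuration capturing this fine-tuning recipe might look as follows. The field names are ours, and the stochastic depth and dropout rates are placeholders, since their exact values are not specified here.

```python
from dataclasses import dataclass

@dataclass
class FinetuneConfig:
    """Fine-tuning setup described above. Field names and the unspecified
    regularisation strengths are assumptions, not values from the paper."""
    epochs: int = 50                    # fine-tune for 50 epochs
    train_resolution: int = 384         # train at 384 x 384
    eval_resolution: int = 480          # evaluate at 480 x 480
    use_sam: bool = True                # sharpness aware minimization (Foret et al., 2020)
    stochastic_depth_rate: float = 0.1  # placeholder value
    dropout_rate: float = 0.2           # placeholder value

config = FinetuneConfig()
print(config)
```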
The ImageNet Top-1 accuracy consistently improves as the compute budget increases. Our most expensive pre-trained model, an NFNet-F7+ pre-trained for 8 epochs, achieves an ImageNet Top-1 accuracy of 90.3% while requiring roughly 110k TPU-v4 core hours to pre-train and 1.6k TPU-v4 core hours to fine-tune. Furthermore, we achieve 90.4% Top-1 accuracy if we additionally introduce repeated augmentation during fine-tuning (Fort et al., 2021; Hoffer et al., 2019) with augmentation multiplicity 4.⁴ For comparison, the best reported Top-1 accuracy of an NFNet on ImageNet without extra data is 86.8% (Fort et al., 2021), achieved by an NFNet-F5 with repeated augmentation. This demonstrates that NFNets benefit substantially from large scale pre-training.

Despite the substantial differences between the two model architectures, the performance of pre-trained NFNets at scale is remarkably similar to the performance of pre-trained Vision Transformers. For example, Zhai et al. (2022) achieve 90.2% Top-1 on ImageNet with a ViT-g/14, after pre-training on JFT-3B for 210k TPU-v3 core hours, and 90.45% with a ViT-G/14 after pre-training on JFT-3B for over 500k TPU-v3 core hours. In a recent work, Alabdulmohsin et al. (2023) optimize the ViT architecture and achieve 90.3% Top-1 with a SoViT-400m/14 after pre-training on JFT-3B for 230k TPU-v3 core hours.

We evaluated the pre-training speed for these models on TPU-v4 (using the original authors' codebase), and estimate that ViT-g/14 would take 120k TPU-v4 core hours to pre-train, while ViT-G/14 would take 280k TPU-v4 core hours and SoViT-400m/14 would take 130k TPU-v4 core hours. We use these estimates to compare the pre-training efficiency of ViTs and NFNets in Figure 1. We note however that NFNets were optimized for TPU-v4, and perform less well when evaluated on other devices. For example, we estimate that NFNet-F7+ would require 250k TPU-v3 core hours to pre-train for 8 epochs in our codebase.

⁴ When using repeated augmentation, we reduce the number of passes through the data such that the total computational cost of fine-tuning is constant.
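To illustrate footnote 4, the sketch below builds one batch for repeated augmentation with multiplicity 4: each sampled image appears four times (each copy would receive an independent augmentation in a real pipeline), so with the number of optimization steps held fixed, only a quarter as many unique images are visited and the total fine-tuning cost is unchanged. The dataset, batch size and augment function are placeholders, not the authors' data pipeline.

```python
import random

def repeated_augmentation_batch(dataset, batch_size=16, multiplicity=4, augment=lambda x: x):
    """Build one batch where each unique example appears `multiplicity` times.

    Each repeated copy is passed through `augment` (which would apply an
    independent random augmentation in a real pipeline). Holding the number
    of optimization steps fixed, only 1/multiplicity as many unique examples
    are seen per step, so total compute matches standard training.
    `dataset` and `augment` are placeholders for a real input pipeline.
    """
    unique = random.sample(dataset, batch_size // multiplicity)
    return [augment(x) for x in unique for _ in range(multiplicity)]

# Example with toy "images": 4 unique items, each repeated 4 times.
print(repeated_augmentation_batch(list(range(100))))
```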
Finally, we note that the pre-trained checkpoints achieving the lowest validation loss on JFT-4B did not always achieve the highest Top-1 accuracy on ImageNet after fine-tuning. In particular, we found that, under a fixed pre-training compute budget, the fine-tuning regime consistently favoured slightly larger models and slightly smaller epoch budgets. Intuitively, larger models have more capacity and are therefore better able to adapt to the new task. In some cases, slightly larger learning rates (during pre-training) also achieved better performance after fine-tuning.

References

R. Bavishi, E. Elsen, C. Hawthorne, M. Nye, A. Odena, A. Somani, and S. Taşırlar. Introducing our multimodal models, 2023. URL https://www.adept.ai/blog/fuyu-8b.

A. Brock, S. De, S. L. Smith, and K. Simonyan. High-performance large-scale image recognition without normalization. In International Conference on Machine Learning, pages 1059–1071. PMLR, 2021.

T. Brown, B. Mann, N. Ryder, M. Subbiah, J. D. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, et al. Language models are few-shot learners. Advances in Neural Information Processing Systems, 33:1877–1901, 2020.