Figure 1. AIM scaling behavior. (Left) As we scale the capacity of AIM, we observe improved performance for the pre-training objective, which directly correlates with stronger downstream performance. (Right) AIM exhibits stronger downstream performance when trained using larger sets of uncurated web data [32, 33]. The downstream performance is the average attentive probe top-1 accuracy over a diverse set of 15 image recognition benchmarks. All models are trained for the same number of updates.
Abstract

This paper introduces AIM, a collection of vision models pre-trained with an autoregressive objective. These models are inspired by their textual counterparts, i.e., Large Language Models (LLMs), and exhibit similar scaling properties. Specifically, we highlight two key findings: (1) the performance of the visual features scales with both the model capacity and the quantity of data, (2) the value of the objective function correlates with the performance of the model on downstream tasks. We illustrate the practical implication of these findings by pre-training a 7 billion parameter AIM on 2 billion images that achieves 84.0% on ImageNet-1k with a frozen trunk. Interestingly, even at this scale, we observe no sign of saturation in performance, suggesting that AIM potentially represents a new frontier for training large-scale vision models. The pre-training of AIM is similar to the pre-training of LLMs, and does not require any image-specific strategy to stabilize the training at scale.

1. Introduction

Pre-training task-agnostic models has become the standard in Natural Language Processing with the recent revolution of large language models (LLMs) [13, 64, 75]. These models can solve complex reasoning tasks from a few examples [13], follow instructions [59], and now serve as the engine of widely used AI assistants such as ChatGPT. A key factor contributing to their success is the ability to consistently improve as the capacity (i.e., number of parameters) or the amount of pre-training data [64] increases.

The scaling behavior of these models is remarkable for two key reasons. First, even though these models are trained with a simple objective – predicting the next word in a sentence given its past – they are able to learn intricate patterns over long contexts. Second, the scalability of this autoregressive objective is mostly observed when used in conjunction with certain architectures, in particular Transformers [79], highlighting the potential synergy between autoregressive pre-training and this architecture.

These observations naturally raise the follow-up question of whether the success of scaling Transformers with an autoregressive objective is exclusive to text. This is particularly significant considering that none of the aforementioned elements are inherently specific to language modeling.

∗ Work done while at Apple. Now at Google DeepMind.
Autoregressive objectives take their roots in the data compression literature [69], and similar approaches have been investigated in audio [57] and images [18, 76]. The Transformer architecture has also been successfully used in other domains, in particular computer vision with the success of Vision Transformers (ViT) [29]. Therefore, as a first step towards generalizing the findings of LLMs, we explore whether training ViT models with an autoregressive objective leads to competitive performance, in terms of learning representations, with the same scaling ability as LLMs.

In this paper, we introduce Autoregressive Image Models (AIM), an autoregressive approach for large-scale pre-training of visual features. We revisit prior work in autoregressive representation learning such as iGPT [18] using a modern toolset that includes vision transformers, collections of large-scale web data [32, 33], and recent advances in LLM pre-training [43, 75]. Additionally, we introduce two architectural modifications to adapt autoregressive pre-training to visual features. First, instead of restricting the self-attention to be fully causal as is typically the case for LLMs, we adopt prefix attention, as in T5 [66]. This choice enables moving to fully bidirectional attention during downstream tasks. Second, we use a heavily parameterized token-level prediction head, inspired by the heads used in contrastive learning [19]. We observe that this modification significantly improves the quality of the subsequent features with little overhead during training. Overall, the training of AIM is similar to the training of recent LLMs and does not rely on any stability-inducing techniques [24, 45, 74] that supervised [24, 74] or self-supervised [5, 58] methods need.

We provide a study of a series of models, ranging from 600M to 7B parameters, pre-trained using 2B uncurated images with permissive licenses. Our AIM models exhibit strong scaling behavior w.r.t. the model size, as shown in Figure 1, where higher-capacity models achieve better downstream performance, measured as the average accuracy over 15 image recognition benchmarks. More importantly, there is a correlation between the value of our objective function on a validation set and the quality of the subsequent frozen features. This observation confirms that the autoregressive objective is adequate for the training of visual features. Furthermore, we observe consistent improvement in downstream performance as we train on more images, with no sign of saturation. Overall, these observations are aligned with previous studies on scaling large language models.

2. Related Work

Autoregressive models. While most of the literature on autoregressive models comes from language modeling [9, 53, 64] or speech [56, 57], few works have explored the potential of this approach for images [18, 49, 61, 68, 76]. Of particular interest, Van den Oord et al. [76] show that using an architecture adapted to images, e.g., a convolutional network, significantly improved over autoregressive models built with more generic architectures [77], e.g., a recurrent network [31]. Parmar et al. [61] further improve the quality of these autoregressive models by adopting the transformer architecture [79]. More recently, Chen et al. [18] have shown that scaling with more compute leads to continuous improvements. Our work follows this line of research, and we benefit from training on significantly more data, and further improvements in architecture design [29], training [73, 75], and understanding of the scaling law [43]. Concurrent to our work, Bai et al. [3] demonstrate the effectiveness of large-scale autoregressive vision models for in-context pixel prediction tasks (e.g., semantic segmentation, depth estimation).

Self-supervised pre-training. Pre-training vision models on datasets of images without supervision has been a fruitful area of research in recent years [10, 27, 28, 34, 54, 87, 88]. Different approaches have been employed, focusing on various proxy tasks for feature learning. For example, Noroozi and Favaro [55] learn to re-arrange the order of shuffled image patches. Some other works have relied on clustering [7, 14, 17, 83]. Another popular approach involves the use of a contrastive objective, resembling predictive coding, where the objective is to identify each image [19, 40]. Most recent contrastive approaches include DINO [58], BYOL [38], or iBot [88]. In a similar vein, some works have proposed predictive approaches [2, 6] or a form of feature whitening [85]. Closer to our approach are works inspired by BERT [26] where patches are masked and predicted with an autoencoder in either their discrete [5] or pixel [41] form.

Other generative pre-training. Autoregressive modeling is a form of generative modeling, and a few other generative approaches have been considered to learn visual features. The first category leverages some form of autoencoding where the pretext task corresponds to some denoising task. For instance, the noise can be salt-and-pepper [81] or masking [5, 62]. Another line of work leverages Generative Adversarial Networks (GANs) [35]. Most notably, BigGAN [12] trains a large GAN and re-uses the image discriminator to produce image features. More recently, DiffMAE [82] used diffusion models to learn image features.
Pre-training at scale. There are numerous works on scaling the pre-training of visual features with no supervision [15, 36, 37, 58, 70, 72]. The most salient work in this area is DINOv2, which produces the best self-supervised features by scaling the iBot method [88] on a private dataset of 142M images with a 460M parameter model. The conclusion from this work is that a carefully tuned contrastive method scales reasonably well, but these methods do not exhibit the scaling law that we observe with language modeling. They also rely on an intricate implementation of contrastive learning to avoid the pitfalls described by Chen et al. [20]. In parallel, Singh et al. [70] study the scaling of Masked Autoencoders (MAE) [39]. While the study focuses on a weakly-supervised setup, it does not showcase strong improvements to the self-supervised pre-training as the data is scaled to billions of images. In contrast, we observe a clear benefit of scale on the quality of our features, even at a scale of a few billion parameters and billions of images.

3. Pre-training Dataset

We pre-train our models on the DFN dataset introduced by Fang et al. [32]. This dataset is composed of a larger collection of 12.8B image-text pairs [33] filtered from Common Crawl. The data has been pre-processed to remove NSFW content, blur faces, and reduce contamination by deduplicating against the evaluation sets. A data filtering network [32] ranks the samples in the 12.8B collection according to the alignment score between images and their corresponding captions. A subset of 2B images, called DFN-2B, has been extracted from the DataComp 12.8B dataset [33] by keeping the top 15% of samples. Note that, other than the privacy and safety filters, this process does not include any additional curation based on the image content. Since our pre-training does not require text, our method could be pre-trained using larger image collections that are not paired with captions or have low image-text alignment, such as the rest of DataComp 12.8B.

Motivated by the common practice in LLM pre-training [75] of oversampling high-quality data sources such as Wikipedia and Books, during pre-training we sample images from DFN-2B with a probability of p = 0.8 and sample images from ImageNet-1k with a probability of p = 0.2. We refer to this dataset as DFN-2B+.
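A minimal sketch of this sampling scheme, assuming two (effectively infinite) image iterators; the function name and interface are illustrative and not taken from the paper.

import random

def sample_dfn2b_plus(dfn2b_iter, in1k_iter, p_dfn=0.8, seed=None):
    """Yield images from a probabilistic mixture of two infinite iterators (DFN-2B+)."""
    rng = random.Random(seed)
    while True:
        # draw from DFN-2B with probability p_dfn, otherwise from ImageNet-1k
        source = dfn2b_iter if rng.random() < p_dfn else in1k_iter
        yield next(source)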
Figure 2. AIM pre-training overview. Input images are split into non-overlapping patches and embedded linearly following Dosovitskiy et al. [29]. The patch features are fed to a transformer in which the self-attention operation is causally masked to prevent attending to subsequent positions. Afterward, a heavily parameterized MLP processes each of the patch features independently and finally projects it to pixel space. The targets correspond to the input sequence shifted one position to the left, requiring the model to predict the next patch in raster order.

4. Approach

4.1. Training Objective

Our training objective follows that of a standard autoregressive model applied on a sequence of image patches. More precisely, an image x is split into a grid of K non-overlapping patches x_k, k ∈ [1, K], which collectively form a sequence of tokens. We assume that the sequence order is fixed across all images, and we use a raster (row-major) ordering by default unless otherwise specified. Given the above order, the probability of an image can be factorized as a product of patch conditional probabilities:

P(x) = \prod_{k=1}^{K} P(x_k \mid x_{<k}), (1)

where x_{<k} denotes the set of the first k − 1 patches, and is the context used to predict the k-th patch. As opposed to language modeling, our sequences have a fixed length of K that fits in memory and hence we do not need to truncate the context length. The training loss over a set X of images is then defined as the negative log-likelihood (NLL):

\sum_{x \in \mathcal{X}} \sum_{k=1}^{K} -\log P(x_k \mid x_{<k}).

Minimizing this objective over an infinite amount of images, with no further assumptions, is theoretically equivalent to learning the true underlying image distribution.

Prediction loss. Our training objective naturally gives rise to certain variants of losses, each corresponding to a choice of the distribution P(x_k | x_{<k}). By default, we adopt a normalized pixel-level regression loss similar to He et al. [41]. This loss corresponds to setting P(x_k | x_{<k}) as Gaussian distributions with a constant variance. Namely, given x̂_k(θ) as the prediction of the k-th patch from a network parameterized with θ, and x_k as its corresponding ground-truth value, our objective is to minimize the sum of ℓ2 squared distances between the predictions and the ground truth:

\min_{\theta} \frac{1}{K} \sum_{k=1}^{K} \|\hat{x}_k(\theta) - x_k\|_2^2. (2)

We also consider a cross-entropy loss with patches converted to discrete tokens using an offline tokenizer. Our ablation studies show that these designs work, although they do not produce as strong features as the pixel-wise loss.
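The objective above can be summarized in a short sketch (PyTorch; an illustration under the stated setup, not the authors' implementation): patches are extracted in raster order, the targets are the input sequence shifted by one position, and each target patch is normalized before the ℓ2 regression loss.

import torch
import torch.nn.functional as F

def patchify(images, patch_size=14):
    """(B, C, H, W) -> (B, K, C*patch_size*patch_size), patches in raster (row-major) order."""
    B, C, H, W = images.shape
    p = patch_size
    x = images.unfold(2, p, p).unfold(3, p, p)            # (B, C, H/p, W/p, p, p)
    x = x.permute(0, 2, 3, 1, 4, 5).reshape(B, -1, C * p * p)
    return x

def normalized_pixel_targets(patches, eps=1e-6):
    """Per-patch normalization (zero mean, unit variance), following He et al. [41]."""
    mean = patches.mean(dim=-1, keepdim=True)
    var = patches.var(dim=-1, keepdim=True)
    return (patches - mean) / (var + eps).sqrt()

def aim_regression_loss(predictions, patches):
    """predictions[:, k] estimates patch k+1 given patches up to k (next-patch prediction)."""
    targets = normalized_pixel_targets(patches)[:, 1:]    # shift targets by one position
    return F.mse_loss(predictions[:, :-1], targets)       # eq. (2) up to the 1/K factor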
Figure 3. Prefix causal attention. During pre-training we uniformly sample a prefix length S. The attention for the first S patches is set to be bidirectional, and the loss is only computed for the remaining patches in the image. During adaptation to downstream tasks, this allows us to drop the attention causal mask, improving the downstream performance.

Model      #Params   Hidden size   Layers   LR     #Patches   Batch size
AIM-0.6B   0.6B      1536          24       1e-3   0.5T       4096
AIM-1B     1.2B      2048          24       1e-3   1.2T       4096
AIM-3B     2.7B      3072          24       1e-3   1.2T       4096
AIM-7B     6.5B      4096          32       1e-3   1.2T       4096

Table 1. Model specifications. We provide the embedding dimension, number of layers, and parameter count for all AIM variants. We also provide the learning rate and batch size during pre-training. For AIM with 1B parameters and higher, the pre-training process involves 1.2M iterations, which corresponds to 1.2 trillion patches, or 5B images, seen during pre-training.
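As a rough consistency check of the #Patches column (assuming 224 × 224 inputs split into 14 × 14 patches, i.e., 256 patches per image, which the table does not state explicitly):

1.2\text{M iterations} \times 4096 \text{ images/iteration} \approx 4.9\text{B images}, \qquad 4.9\text{B images} \times 256 \text{ patches/image} \approx 1.26\text{T patches},

in line with the 1.2 trillion patches and 5B images quoted in the caption (and, analogously, 500k iterations give roughly 0.5T patches for AIM-0.6B).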
4.2. Architecture

As the backbone, we adopt the Vision Transformer architecture (ViT) [28]. For scaling the model capacity, we follow the common practice in language modeling and prioritize expanding width rather than depth [64, 75]. In Table 1, we provide an overview of the design parameters of AIM, including its depth and width, as well as the amount of data and optimization scheme for each model capacity. The overall model is illustrated in Figure 2.

During pre-training, we apply causal masks to the self-attention layers to model the probability of a patch given the preceding patches. More precisely, given a self-attention layer, the embedding for patch i is computed by:

y_i = \sum_{k=1}^{K} a_{ik} v_k, (3)

where a_{ik} is the attention weight and v_k the value embedding. To enforce the desired constraints, we utilize a causal mask for the attention weights, where a_{ik} = 0 for k > i, and \sum_{k=1}^{K} a_{ik} = 1. This approach enables us to process the image with a single forward pass during training, without incurring additional computational overhead.

Prefix Transformer. The autoregressive objective in pre-training requires a causal mask in the self-attention operation. However, this differs from the standard usage of ViT models in downstream tasks, where bidirectional self-attention is employed. This discrepancy leads to a decrease in performance, irrespective of whether the causal mask is retained during downstream adaptation or not (as shown in the ablations presented in Table 3). To address this issue, we propose to consider the initial patches of the sequence, referred to as the prefix, as a context for predicting the remaining patches, following the PrefixLM formulation of Raffel et al. [65]. The prefix patches are excluded from the autoregressive prediction and therefore are not constrained to be causal. More precisely, we select a prefix length of size S ∈ [1, K − 1], and remove the causal mask, i.e., a_{i,k} > 0 for k < S. This modification helps the model to work in the absence of causal masking, allowing it to be removed during downstream adaptation. This approach improves the performance of the model in downstream tasks and eliminates the need for architectural changes to ViT. Figure 3 illustrates the difference between causal and prefix attention.
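A minimal sketch of this masking scheme (PyTorch; illustrative rather than the authors' implementation). Starting from a standard causal mask, the key columns corresponding to the prefix are unmasked, which makes the prefix bidirectional while keeping the remaining positions causal; the loss is then computed only on the non-prefix patches.

import torch

def prefix_causal_mask(K, prefix_len):
    """Boolean (K, K) mask where entry [i, j] = True means position i may attend to j
    (the convention of torch.nn.functional.scaled_dot_product_attention)."""
    mask = torch.ones(K, K).tril().bool()   # standard causal mask
    mask[:, :prefix_len] = True             # prefix keys are visible everywhere -> prefix is bidirectional
    return mask

def sample_prefix_length(K):
    """Uniformly sample S in [1, K - 1] during pre-training."""
    return int(torch.randint(1, K, (1,)))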
MLP prediction heads. It is a common practice to adopt certain prediction heads during pre-training, which are discarded when transferring to downstream tasks [16, 17, 19, 20, 38]. The purpose of these heads is to prevent the trunk features from becoming too specialized in the pre-training objective, thus enhancing their suitability for downstream transfer. We opt for a simple design where we use N blocks of MLP on top of the final transformer layer, processing each patch independently. We observed that this design strikes a good balance between performance and the additional costs incurred during pre-training.
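A minimal sketch of such a prediction head (PyTorch). The residual connections and normalization placement are illustrative assumptions; the paper only specifies a stack of MLP blocks applied to each patch independently, followed by a projection to pixel space.

import torch
import torch.nn as nn

class MLPBlock(nn.Module):
    def __init__(self, dim, expansion=4):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.fc1 = nn.Linear(dim, expansion * dim, bias=False)   # bias terms dropped
        self.act = nn.GELU()
        self.fc2 = nn.Linear(expansion * dim, dim, bias=False)

    def forward(self, x):                       # x: (B, K, dim), one feature per patch
        return x + self.fc2(self.act(self.fc1(self.norm(x))))

class PixelPredictionHead(nn.Module):
    def __init__(self, dim=2048, depth=12, patch_size=14, channels=3):
        super().__init__()
        self.blocks = nn.Sequential(*[MLPBlock(dim) for _ in range(depth)])
        self.proj = nn.Linear(dim, channels * patch_size * patch_size, bias=False)

    def forward(self, x):                       # x: (B, K, dim)
        return self.proj(self.blocks(x))        # (B, K, channels * patch_size**2)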
Straightforward implementation. It is worth noting that AIM does not require particular optimization stability-inducing mechanisms such as LayerScale [74], stochastic depth [45], QK-Norm [24], or freezing the patch projector [20]. These mechanisms have been crucial for the success of other methods, either supervised or self-supervised. On the contrary, we observe that AIM scales using the same set of optimization hyperparameters across model sizes with no further tuning (see Table 1).

We add sinusoidal positional embeddings [79] to the input patches before the transformer and before the MLP head. We use a standard expansion ratio of 4 for all the MLP blocks in the trunk and the head. We drop the bias term for simplicity, and unlike the original ViT, we do not append a classification token to the input. By default, we use 12 blocks for the MLP head for all model capacities. The pixel targets are normalized per patch before the loss computation, following He et al. [41]. We train our model using bfloat16 precision. We use the AdamW [52] optimizer with linear warmup and a cosine decay schedule. We detail the hyperparameters used for pre-training and downstream adaptation in Appendix D.
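A minimal sketch of the optimization recipe described above (AdamW with linear warmup followed by a cosine decay). The warmup length and final learning rate below are placeholders; the actual values are the ones listed in Appendix D.

import math
import torch

def warmup_cosine_lr(step, total_steps, warmup_steps, base_lr=1e-3, final_lr=0.0):
    """Linear warmup to base_lr, then cosine decay to final_lr."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return final_lr + 0.5 * (base_lr - final_lr) * (1.0 + math.cos(math.pi * progress))

# usage sketch:
# optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=...)
# at every step: for g in optimizer.param_groups: g["lr"] = warmup_cosine_lr(step, ...)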
Downstream adaptation. Pre-training large-scale models is a resource-intensive process, and even fine-tuning them
Figure 4. AIM pre-training across model sizes. We observe a clear improvement in the performance of the pre-training objective with increasing capacity of AIM. Moreover, the downstream performance (IN-1k top-1) is monotonically improving for higher-capacity models as well as with longer pre-training. We do not observe clear signs of plateauing during pre-training even after training for 500k iterations, indicating that AIM can benefit from even longer pre-training schedules. Note that the loss saturation at the very end of training is caused by the cosine decay schedule, where the learning rate is effectively zero.
Figure 5. Dataset impact on pre-training performance. On the one hand, pre-training using IN-1k leads to overfitting, even for the AIM-0.6B model. On the other hand, pre-training using the uncurated DFN-2B dataset prevents overfitting but converges to a similar point due to the distributional shift. Pre-training on DFN-2B+, a data mixture that predominantly consists of DFN-2B with a small presence of IN-1k samples, leads to the best performance.

Figure 6. Scaling in FLOPs. The total number of FLOPs during training correlates with the final validation loss, suggesting a compute-driven scaling law similar to Hoffmann et al. [43].

the performance that eventually surpasses pre-training on IN-1k. We confirm that the resulting model also leads to a better downstream performance in Table 2.

pre-training dataset   IN-1k   DFN-2B   DFN-2B+
attentive              73.5    74.5     75.6

Table 2. Dataset impact on downstream performance (15 benchmarks). The behavior in Figure 5 is consistent with the downstream performance, where we observe that using a data mixture of DFN-2B and IN-1k results in the best performance.

Compute-optimal pre-training. Since we do not observe signs of overfitting when we train using the DFN-2B+ dataset, we proceed to examine the impact of extending the length of our pre-training schedule. In Figure 6, we study the impact of increasing the length of the pre-training schedule from 500k to 1.2M iterations, i.e., 2B to 5B images seen during pre-training. We observe that models pre-trained with a longer schedule achieve significantly lower validation loss. This suggests that one can improve the performance of AIM either by increasing the model capacity or by pre-training for longer schedules. Interestingly, we find that lower-capacity models trained for a longer schedule achieve validation loss comparable to higher-capacity models trained for a shorter schedule, while using a similar amount of FLOPs. This finding is consistent with Hoffmann et al. [43] and implies that AIM could follow similar scaling laws. However, we defer further investigations of this aspect to future work.

5.2. Architecture and Design

In this section, we investigate the impact of some variations in our model and training objective. These ablations are conducted using an AIM-0.6B model, which has been pre-trained and evaluated on the IN-1k dataset. The results of these ablations are presented in Table 3.

Targets and objective (a). We explore various potential representations for the target patches. One approach is to utilize the raw pixel values and train the model with a mean squared error (MSE) regression loss. A second option, proposed by He et al. [41], involves using per-patch normalized pixel values instead of the raw signal with the same MSE loss. Finally, another option is to use a discretized representation of the patches, either using k-means or a discrete VAE [67, 78]. In this case, the model is trained using a cross-entropy objective similar to language modeling. Our experiments show that AIM performs best when using the MSE objective with normalized pixel values.

Autoregression pattern (b). Autoregressive pre-training typically follows a specific order of traversal to facilitate the prediction of the next token. In the case of language, the traversal pattern is clear, as text is read and written one word at a time in a sequential manner (e.g., left to right for English). However, for images, determining the traversal pattern is less obvious. We explore various deterministic patterns, including raster, spiraling out, checkerboard, and randomly pre-sampled patterns. Detailed examples of each pattern are found in Appendix B. Even though our model performs reasonably well with each pattern, we observe that the raster pattern leads to significantly higher performance. To gain deeper insights into this result, we examine the difficulty of predicting patches along sequences for each pattern. This can be done by measuring the loss value per patch as we progress along a sequence, as illustrated in Figure 7. Our observation is that patterns that present a more uniform distribution of difficulty across patches result in superior models, as compared to patterns where the prediction becomes progressively easier as the sequence unfolds. We attribute this to the difficulty of predicting patches throughout the sequence, which forces the model to retain more information about the image. This leads to better patch features, and consequently, to a better image representation as a whole.
(a) Targets.
target      pixels   norm. pixel [41]   KMeans   dVAE [67]
linear      67.5     70.0               66.6     64.0
attentive   76.2     78.2               75.9     74.5

(b) Autoregression Pattern (causal).
pattern     raster   spiral   checkerboard   random
linear      69.5     67.7     68.2           65.8
attentive   77.4     76.3     76.0           75.7

(c) Crop Scale.
crop scale   0.08   0.4    1.0
linear       68.4   70.0   49.6
attentive    77.7   78.2   63.5

(d) Attention Structure.
pre-training attn.   causal                    prefix
inference attn.      causal   bidirectional    causal   bidirectional
linear               69.5     30.9             68.4     70.0
attentive            77.4     52.3             76.9     78.2

(e) Head Design.
head        None   MLP    Transformer
linear      64.0   70.0   70.5
attentive   75.4   78.2   78.5

(f) Architecture.
architecture   deep   wide
linear         68.8   70.0
attentive      77.9   78.2

Table 3. Ablations. We investigate various design choices of AIM. We use an AIM-0.6B model that is pre-trained and evaluated using IN-1k. We report the linear and attentive probing results. The default settings for AIM used for the main results are highlighted in gray.

(a) MLP width.
width       512    1024   2048
linear      69.4   69.6   70.0
attentive   77.7   78.1   78.2

(b) MLP depth.
depth       6      8      12
linear      65.3   68.1   70.0
attentive   76.2   77.1   78.2

Table 4. MLP design. We vary the capacity of the MLP head by changing the number of MLP blocks (i.e., depth) or the embedding size (i.e., width). Downstream performance improves with more capacity in either width or depth, but depth has more impact.

            autoregressive   masked image modeling (ratio=50%)   masked image modeling (ratio=75%)
attentive   78.2             70.3                                 77.8

Table 5. Autoregressive vs. Masking. We evaluate the IN-1k performance of the autoregressive objective of AIM in comparison to the masking objective [5, 26]. We keep all the other architectural and optimization components fixed. We observe that, under the same pre-training settings, the frozen-trunk performance of the autoregressive objective outperforms masking.

Figure 7. Autoregression patterns. We explore a number of patterns for the autoregressive traversal of an image. The set of image patches is broken into equal-sized chunks, and the validation loss is measured per chunk. We observe that the way the task difficulty is distributed across chunks varies strongly among patterns.

Cropping scale (c). We explore the impact of the information content of each patch by adjusting the lower bound of the cropping scale. On the one hand, opting for a cropping scale that is too small leads to an easier next-patch-prediction task, as the similarity between neighboring patches increases. On the other hand, using a large cropping scale can lead to severe overfitting unless the dataset size is sufficiently large. Since this study is conducted using IN-1k, we observe a clear drop in performance due to overfitting.

Causal vs. Prefix Attention (d). We measure the impact of incorporating prefix attention during pre-training, as opposed to using standard causal attention. We observe that pre-training with causal self-attention produces models that are effective in downstream transfer tasks only when the causal mask is preserved. These models experience a significant decline in performance when bidirectional attention is employed. However, pre-training with prefix attention leads to models that operate effectively in both causal and bidirectional modes. Notably, the best performance is achieved when combining prefix attention during pre-training with bidirectional attention during downstream adaptation.

Head design (e). We consider different types of heads on top of the backbone to make predictions at the pixel level. Using no head (i.e., None) performs reasonably well, but adding an MLP further improves the quality of the backbone. Replacing the MLP with a transformer of the same depth and width only yields a marginal performance improvement but at a significantly higher computational cost. Therefore, we opt to use an MLP head in our approach. We hypothesize that these heads specialize in capturing the low-level signals necessary for accurate pixel-level prediction. By incorporating a head, the trunk can learn higher-level features that are more suitable for downstream transfer. A similar design was employed for contrastive learning to prevent the backbone from specializing in predicting specific image transformations [19].

Deeper vs. Wider architecture (f). We present the design specifications of AIM in Table 1, outlining its width and depth. Unlike the original design of ViT [29], where the depth is scaled more rapidly than the width, we adopt a scaling strategy similar to that of Llama [75]. This allows us to scale our model more gracefully while maintaining a reasonable depth. We validate the effectiveness of a wider architecture in Table 3f. Our findings indicate that even for the relatively small-scale AIM-0.6B model, a wider architecture not only delivers strong performance but also improves training stability. This observation supports the notion that some of the insights gained from training LLMs can be similarly applied to other domains.
Model         Arch.            Data      IN-1k  iNAT-18  Cifar10  Cifar100  Food101  DTD   Pets  Cars  iWildCam  Camelyon17  PCAM  RxRx1  EuroSAT  fMoW  Infographic  Avg
DINO [17]     ViT-B/8          IN-1k     80.1   66.0     97.8     87.3      89.5     78.4  92.3  89.2  58.5      93.7        90.2  6.1    98.2     57.0  41.1         75.0
iBOT [88]     ViT-L/16         IN-21k    83.5   70.5     99.2     93.3      93.5     81.6  92.8  90.8  61.8      94.5        90.0  5.9    98.0     60.3  47.7         77.6
DINOv2 [58]   ViT-g/14 (516)   LVD       86.4   84.5     99.6     95.2      96.3     86.3  96.4  95.6  68.2      96.5        90.7  8.0    98.6     66.7  58.8         81.9
BEiT [5]      ViT-L/14         IN-21k    62.2   44.4     94.4     78.7      79.0     64.0  80.9  69.5  52.0      92.8        88.2  4.2    97.5     47.7  25.9         65.4
MAE [41, 70]  ViT-H/14         IN-1k     80.9   64.6     97.1     85.8      90.2     78.1  95.0  93.7  58.1      94.2        89.8  5.4    98.1     56.9  42.2         75.3
MAE [41, 70]  ViT-2B/14        IG-3B     82.2   70.8     97.5     87.3      93.4     81.2  95.1  94.9  57.8      94.4        90.3  7.3    98.2     60.1  50.2         77.4
AIM-0.6B      ViT-H/14         DFN-2B+   78.5   64.0     97.2     86.8      90.1     80.1  93.0  93.0  57.9      94.3        90.0  7.8    98.4     58.3  45.2         75.6
AIM-1B        ViT-1B/14        DFN-2B+   80.6   67.2     98.2     88.3      91.6     81.8  93.4  93.9  58.6      94.5        90.0  9.0    98.6     59.8  47.5         76.9
AIM-3B        ViT-3B/14        DFN-2B+   82.2   69.7     98.4     89.9      92.7     81.9  94.1  93.8  58.8      94.3        90.4  9.7    98.5     60.9  48.9         77.6
AIM-7B        ViT-7B/14        DFN-2B+   82.4   70.9     98.6     90.0      93.1     82.3  93.8  92.1  59.5      93.6        90.7  10.1   98.6     61.7  49.6         77.8
AIM-7B†       ViT-7B/14        DFN-2B+   84.0   75.5     98.9     91.8      94.1     85.6  95.4  95.0  61.4      94.2        90.5  8.4    98.5     63.5  57.7         79.6

Table 6. Downstream evaluation with a frozen trunk. We assess the quality of AIM features by evaluating against a diverse set of 15 image recognition benchmarks. AIM and the baseline methods are evaluated using attentive probing with a frozen trunk. AIM models exhibit a strong performance across all benchmarks, especially AIM-7B. AIM outperforms all other methods, using joint-embedding or generative approaches, except for DINOv2, which utilizes higher-resolution images that typically result in a 1-1.5% improvement on ImageNet, for instance. †: Extracting features from the 20th layer instead of the last (32nd); see Table 7 for more details.
Attentive vs. Linear probe. For all ablations we report the linear and attentive probing results. We observe that, consistently across all experiments, attentive pooling provides a significant boost to performance, as it allows for a more nuanced aggregation of local features, circumventing one of the main weaknesses of generative pre-training: the absence of an image-level global descriptor.
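The attentive probe can be illustrated with a small sketch. The design below is an assumption about its general form (a single learnable query that cross-attends over the frozen patch features, followed by a linear classifier), not the exact probe used in the paper; only the probe parameters are trained.

import torch
import torch.nn as nn

class AttentiveProbe(nn.Module):
    def __init__(self, dim, num_classes, num_heads=8):
        super().__init__()
        self.query = nn.Parameter(torch.randn(1, 1, dim) * 0.02)   # learnable pooling query
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.classifier = nn.Linear(dim, num_classes)

    def forward(self, patch_features):                  # (B, K, dim), output of the frozen trunk
        q = self.query.expand(patch_features.size(0), -1, -1)
        pooled, _ = self.attn(q, patch_features, patch_features)
        return self.classifier(pooled.squeeze(1))       # image-level logits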
Structure of the MLP. The MLP plays an important role, as ablated in Table 3e. In Table 4, we further investigate the capacity of the MLP head and how it impacts downstream performance. We vary the capacity of the head by either changing the number of MLP blocks or their width. By default, we use a head of 12 blocks and an embedding dimension of 2048. First, we observe that increasing the capacity of the MLP either through depth or width leads to consistent improvement in the downstream performance. Second, we find that increasing the number of MLP blocks, with a fixed width, leads to a larger improvement compared to increasing the width for a fixed depth. Interestingly, we could not find a point where increasing the MLP capacity failed to yield further improvements. We did not explore higher capacities beyond those reported in Table 4, as it would lead to models with disproportionate head and trunk capacity.
5.3. Pre-training objective

Autoregressive vs. Masking. We conduct a comparison between our architecture trained with an autoregressive objective and the masking objective popularized by BERT [26] for language, and by BEiT and MAE for vision. It is important to note that we applied the masking objective in the same setting as AIM, thereby isolating the impact on performance of the pre-training objective from other design choices that differ between AIM and other approaches. In the masking baseline, we randomly sample masks and replace the masked patches with learnable mask tokens.

In Table 5, we show that AIM performs better with an autoregressive objective than with a masking objective. This is consistent with the results reported by Chen et al. [18], providing further evidence that our improvements stem from the utilization of an autoregressive objective.
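The masking baseline above admits a compact illustration. The sketch below (PyTorch, not the authors' code) replaces a random subset of patch embeddings with a learnable mask token; the reconstruction loss is then restricted to the masked positions.

import torch

def apply_random_mask(patch_embeddings, mask_token, ratio=0.75):
    """patch_embeddings: (B, K, D); mask_token: (D,). Returns masked input and boolean mask."""
    B, K, D = patch_embeddings.shape
    mask = torch.rand(B, K, device=patch_embeddings.device) < ratio   # True = masked position
    masked = torch.where(mask.unsqueeze(-1), mask_token.view(1, 1, D), patch_embeddings)
    return masked, mask   # compute the loss only where mask is True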
5.4. Comparison with other methods

In Table 6, we compare the attentive probing performance of AIM to other state-of-the-art methods across a set of 15 diverse benchmarks that are detailed in Appendix A.

Generative methods. AIM provides a strong performance compared to its generative counterparts. AIM outperforms BEiT [5] by a large margin. Additionally, AIM-0.6B provides a better performance, averaged across all benchmarks, compared to MAE-H [41], which has an equivalent capacity. Moreover, we compare against the MAE-2B [70] model, which has been pre-trained on IG-3B, a private dataset of 3 billion images from Instagram. We find that both AIM-3B and AIM-7B outperform MAE-2B, with AIM-7B exhibiting a particularly large improvement. It is worth noting that, similar to AIM, two other generative approaches, BEiT and MAE, benefit from attentive probing, thereby narrowing the gap between generative and joint embedding methods.

Joint embedding methods. AIM provides a competitive performance with joint embedding methods such as DINO [17], iBOT [88], and DINOv2 [58]. In terms of average accuracy across all benchmarks, AIM outperforms DINO and iBOT. However, it falls behind DINOv2, which achieves its results by evaluating with higher-resolution inputs. Note that AIM attains such competitive performance using higher-capacity trunks. Nevertheless, AIM's pre-training is significantly simpler and can be trivially scaled in terms of parameters and data, yielding consistent improvements. On the contrary, state-of-the-art joint embedding methods like DINOv2 heavily rely on a number of tricks, such as multi-crop augmentation, KoLeo regularization, LayerScale, stochastic depth, schedules for teacher momentum and weight decay, and high-resolution fine-tuning, in order to achieve strong performance.

Extracting stronger features. We observe that higher-quality features can be extracted from shallower layers compared to the last layer's features. This is likely due to the generative nature of the pre-training objective, which is inherently different from the discriminative downstream tasks; therefore, the features with the highest semantic content do not necessarily concentrate around the last layer. In Table 7, we report the IN-1k top-1 accuracy for features extracted from the last layer compared to the layer with the highest performance. A more detailed analysis of this phenomenon is provided in Appendix D.

             AIM-0.6B   AIM-1B   AIM-3B   AIM-7B
last layer   78.5       80.6     82.2     82.4
best layer   79.4       82.3     83.3     84.0

Table 7. Feature extraction. The highest quality features after AIM pre-training typically reside in shallower layers than the last. Extracting features from earlier layers leads to a non-negligible boost to the recognition performance on IN-1k.

5.5. Low-Rank Adaptation

In addition to frozen-trunk evaluation, we examine Low-Rank Adaptation (LoRA) [44], a popular and efficient fine-tuning method. We report the results of LoRA fine-tuning of AIM in Table 8. We observe that LoRA is compatible with AIM, leading to a large boost in performance compared to frozen-trunk evaluation. For example, AIM-7B improves by 3.9% (compared to the last layer's performance) while fine-tuning only 0.1% of the trunk parameters.

                AIM-0.6B   AIM-1B   AIM-3B   AIM-7B
attentive       78.5       80.6     82.2     82.4
LoRA (rank=8)   81.0       83.6     85.5     86.3

Table 8. Low-rank adaptation (IN-1k). AIM is compatible with LoRA, showing large gains compared to frozen-trunk evaluations.

6. Discussion

In this paper, we presented a simple and scalable method for pre-training vision models at scale without supervision. We employed a generative autoregressive objective during pre-training and proposed several technical contributions to better adapt it for downstream transfer. Consequently, we observed a number of desirable properties for our Autoregressive Image Models. First, the capacity of our models can be effortlessly scaled to 7 billion parameters using a vanilla transformer implementation, without resorting to stability-inducing techniques or extensive adjustments of hyperparameters for each model scale. Second, AIM's performance on the pre-training task has a strong correlation with downstream performance. Third, AIM achieves strong performance across 15 recognition benchmarks, outperforming prior state-of-the-art methods like MAE and significantly narrowing the gap between generative and joint embedding pre-training approaches. Finally, we did not observe any clear signs of saturation as we scale either in terms of parameters or data, suggesting that there is potential for further performance improvements with larger models trained for even longer schedules. We hope that AIM serves as a seed for future research in scalable vision models that effectively leverage uncurated datasets without any bias towards object-centric images or strong dependence on captions.

Limitations. AIM excels in its seamless scalability and its effective utilization of large volumes of uncurated image data. However, alternative methods can offer different trade-offs. MAE [41] provides high sample efficiency and can learn good representations using a small amount of pre-training data, reducing the risk of overfitting [30], in contrast to our approach. Contrastive methods [17, 58, 88] currently result in stronger representations for a given model size compared to generative approaches such as MAE and AIM, but pose significant challenges in terms of scalability and loss tractability due to the complexity of their objective.

Acknowledgements

The authors would like to thank Brandon McKinzie, Samira Abnar, Preetum Nakkiran, and Jiatao Gu for valuable feedback throughout the project. We thank Edouard Grave and Hervé Jegou for their inspiring discussions during the earlier stages of the project. We thank Marco Cuturi, James Thornton, Pierre Ablin, and Eugene Ndiaye for their support and for many fruitful discussions throughout the project. Finally, we would like to thank the entire Machine Learning Research team at Apple for many helpful discussions and assistance with infra and data.
References

[1] Anonymous. V-JEPA: Latent video prediction for visual representation learning. Submitted to The Twelfth International Conference on Learning Representations, 2023.
[2] Mahmoud Assran, Quentin Duval, Ishan Misra, Piotr Bojanowski, Pascal Vincent, Michael Rabbat, Yann LeCun, and Nicolas Ballas. Self-supervised learning from images with a joint-embedding predictive architecture. arXiv preprint arXiv:2301.08243, 2023.
[3] Yutong Bai, Xinyang Geng, Karttikeya Mangalam, Amir Bar, Alan Yuille, Trevor Darrell, Jitendra Malik, and Alexei A Efros. Sequential modeling enables scalable learning for large vision models. arXiv preprint arXiv:2312.00785, 2023.
[4] Peter Bandi, Oscar Geessink, Quirine Manson, Marcory Van Dijk, Maschenka Balkenhol, Meyke Hermsen, Babak Ehteshami Bejnordi, Byungjae Lee, Kyunghyun Paeng, Aoxiao Zhong, et al. From detection of individual metastases to classification of lymph node status at the patient level: the Camelyon17 challenge. IEEE Transactions on Medical Imaging, 2018.
[5] Hangbo Bao, Li Dong, and Furu Wei. BEiT: BERT pre-training of image transformers. In ICLR, 2022.
[6] Adrien Bardes, Jean Ponce, and Yann LeCun. VICReg: Variance-invariance-covariance regularization for self-supervised learning. In ICLR, 2022.
[7] Miguel A Bautista, Artsiom Sanakoyeu, Ekaterina Tikhoncheva, and Bjorn Ommer. CliqueCNN: Deep unsupervised exemplar learning. Advances in Neural Information Processing Systems, 29, 2016.
[8] Sara Beery, Elijah Cole, and Arvi Gjoka. The iWildCam 2020 competition dataset. arXiv preprint arXiv:2004.10340, 2020.
[9] Yoshua Bengio, Réjean Ducharme, and Pascal Vincent. A neural probabilistic language model. Advances in Neural Information Processing Systems, 13, 2000.
[10] Piotr Bojanowski and Armand Joulin. Unsupervised learning by predicting noise. In International Conference on Machine Learning, pages 517–526. PMLR, 2017.
[11] Lukas Bossard, Matthieu Guillaumin, and Luc Van Gool. Food-101 – mining discriminative components with random forests. In ECCV, 2014.
[12] Andrew Brock, Jeff Donahue, and Karen Simonyan. Large scale GAN training for high fidelity natural image synthesis. arXiv preprint arXiv:1809.11096, 2018.
[13] Tom B Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. arXiv preprint arXiv:2005.14165, 2020.
[14] Mathilde Caron, Piotr Bojanowski, Armand Joulin, and Matthijs Douze. Deep clustering for unsupervised learning of visual features. In ECCV, 2018.
[15] Mathilde Caron, Piotr Bojanowski, Julien Mairal, and Armand Joulin. Unsupervised pre-training of image features on non-curated data. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 2959–2968, 2019.
[16] Mathilde Caron, Ishan Misra, Julien Mairal, Priya Goyal, Piotr Bojanowski, and Armand Joulin. Unsupervised learning of visual features by contrasting cluster assignments. In NeurIPS, 2020.
[17] Mathilde Caron, Hugo Touvron, Ishan Misra, Hervé Jégou, Julien Mairal, Piotr Bojanowski, and Armand Joulin. Emerging properties in self-supervised vision transformers. In ICCV, 2021.
[18] Mark Chen, Alec Radford, Rewon Child, Jeffrey Wu, Heewoo Jun, David Luan, and Ilya Sutskever. Generative pretraining from pixels. In ICML, 2020.
[19] Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoffrey Hinton. A simple framework for contrastive learning of visual representations. In ICML, 2020.
[20] Xinlei Chen, Saining Xie, and Kaiming He. An empirical study of training self-supervised vision transformers. In ICCV, 2021.
[21] Gordon Christie, Neil Fendley, James Wilson, and Ryan Mukherjee. Functional map of the world. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018.
[22] M. Cimpoi, S. Maji, I. Kokkinos, S. Mohamed, and A. Vedaldi. Describing textures in the wild. In CVPR, 2014.
[23] Ekin D Cubuk, Barret Zoph, Dandelion Mane, Vijay Vasudevan, and Quoc V Le. AutoAugment: Learning augmentation strategies from data. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 113–123, 2019.
[24] Mostafa Dehghani, Josip Djolonga, Basil Mustafa, Piotr Padlewski, Jonathan Heek, Justin Gilmer, Andreas Peter Steiner, Mathilde Caron, Robert Geirhos, Ibrahim Alabdulmohsin, et al. Scaling vision transformers to 22 billion parameters. In ICML. PMLR, 2023.
[25] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. ImageNet: A large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pages 248–255. IEEE, 2009.
[26] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. In NAACL, 2018.
[27] Carl Doersch, Abhinav Gupta, and Alexei A Efros. Unsupervised visual representation learning by context prediction. In ICCV, 2015.
[28] Alexey Dosovitskiy, Jost Tobias Springenberg, Martin Riedmiller, and Thomas Brox. Discriminative unsupervised feature learning with convolutional neural networks. Advances in Neural Information Processing Systems, 27, 2014.
[29] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. In ICLR, 2021.
[30] Alaaeldin El-Nouby, Gautier Izacard, Hugo Touvron, Ivan Laptev, Hervé Jegou, and Edouard Grave. Are large-scale datasets necessary for self-supervised pre-training? arXiv preprint arXiv:2112.10740, 2021.
[31] Jeffrey L Elman. Finding structure in time. Cognitive Science, 14(2):179–211, 1990.
[32] Alex Fang, Albin Madappally Jose, Amit Jain, Ludwig Schmidt, Alexander Toshev, and Vaishaal Shankar. Data filtering networks. arXiv preprint arXiv:2309.17425, 2023.
[33] Samir Yitzhak Gadre, Gabriel Ilharco, Alex Fang, Jonathan Hayase, Georgios Smyrnis, Thao Nguyen, Ryan Marten, Mitchell Wortsman, Dhruba Ghosh, Jieyu Zhang, et al. DataComp: In search of the next generation of multimodal datasets. arXiv preprint arXiv:2304.14108, 2023.
[34] Spyros Gidaris, Praveer Singh, and Nikos Komodakis. Unsupervised representation learning by predicting image rotations. arXiv preprint arXiv:1803.07728, 2018.
[35] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. Advances in Neural Information Processing Systems, 27, 2014.
[36] Priya Goyal, Dhruv Mahajan, Abhinav Gupta, and Ishan Misra. Scaling and benchmarking self-supervised visual representation learning. In ICCV, 2019.
[37] Priya Goyal, Quentin Duval, Isaac Seessel, Mathilde Caron, Mannat Singh, Ishan Misra, Levent Sagun, Armand Joulin, and Piotr Bojanowski. Vision models are more robust and fair when pretrained on uncurated images without supervision. arXiv preprint arXiv:2202.08360, 2022.
[38] Jean-Bastien Grill, Florian Strub, Florent Altché, Corentin Tallec, Pierre Richemond, Elena Buchatskaya, Carl Doersch, Bernardo Avila Pires, Zhaohan Guo, Mohammad Gheshlaghi Azar, et al. Bootstrap your own latent: A new approach to self-supervised learning. NeurIPS, 2020.
[39] Kaiming He, Georgia Gkioxari, Piotr Dollár, and Ross Girshick. Mask R-CNN. In ICCV, 2017.
[40] Kaiming He, Haoqi Fan, Yuxin Wu, Saining Xie, and Ross Girshick. Momentum contrast for unsupervised visual representation learning. In CVPR, 2020.
[41] Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Dollár, and Ross Girshick. Masked autoencoders are scalable vision learners. In CVPR, 2022.
[42] Patrick Helber, Benjamin Bischke, Andreas Dengel, and Damian Borth. EuroSAT: A novel dataset and deep learning benchmark for land use and land cover classification, 2017.
[43] Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Trevor Cai, Eliza Rutherford, Diego de Las Casas, Lisa Anne Hendricks, Johannes Welbl, Aidan Clark, et al. Training compute-optimal large language models. arXiv preprint arXiv:2203.15556, 2022.
[44] Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. LoRA: Low-rank adaptation of large language models. arXiv preprint arXiv:2106.09685, 2021.
[45] Gao Huang, Yu Sun, Zhuang Liu, Daniel Sedra, and Kilian Q Weinberger. Deep networks with stochastic depth. In ECCV, 2016.
[46] iNaturalist 2018 competition dataset. https://github.com/visipedia/inat_comp/tree/master/2018, 2018.
[47] Jonathan Krause, Michael Stark, Jia Deng, and Li Fei-Fei. 3D object representations for fine-grained categorization. In 4th International IEEE Workshop on 3D Representation and Recognition (3dRR-13), Sydney, Australia, 2013.
[48] Alex Krizhevsky, Geoffrey Hinton, et al. Learning multiple layers of features from tiny images. 2009.
[49] Hugo Larochelle and Iain Murray. The neural autoregressive distribution estimator. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pages 29–37. JMLR Workshop and Conference Proceedings, 2011.
[50] Juho Lee, Yoonho Lee, Jungtaek Kim, Adam Kosiorek, Seungjin Choi, and Yee Whye Teh. Set transformer: A framework for attention-based permutation-invariant neural networks. In ICML, 2019.
[51] Ilya Loshchilov and Frank Hutter. SGDR: Stochastic gradient descent with warm restarts. In ICLR, 2017.
[52] Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101, 2017.
[53] Tomas Mikolov, Martin Karafiát, Lukas Burget, Jan Cernockỳ, and Sanjeev Khudanpur. Recurrent neural network based language model. In Interspeech, 2010.
[54] Ishan Misra and Laurens van der Maaten. Self-supervised learning of pretext-invariant representations. In CVPR, 2020.
[55] Mehdi Noroozi and Paolo Favaro. Unsupervised learning of visual representations by solving jigsaw puzzles. In ECCV, 2016.
[56] Aaron van den Oord, Sander Dieleman, Heiga Zen, Karen Simonyan, Oriol Vinyals, Alex Graves, Nal Kalchbrenner, Andrew Senior, and Koray Kavukcuoglu. WaveNet: A generative model for raw audio. arXiv preprint arXiv:1609.03499, 2016.
[57] Aaron van den Oord, Yazhe Li, and Oriol Vinyals. Representation learning with contrastive predictive coding. In NeurIPS, 2018.
[58] Maxime Oquab, Timothée Darcet, Theo Moutakanni, Huy V. Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, Russell Howes, Po-Yao Huang, Hu Xu, Vasu Sharma, Shang-Wen Li, Wojciech Galuba, Mike Rabbat, Mido Assran, Nicolas Ballas, Gabriel Synnaeve, Ishan Misra, Herve Jegou, Julien Mairal, Patrick Labatut, Armand Joulin, and Piotr Bojanowski. DINOv2: Learning robust visual features without supervision, 2023.
[59] Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback. NeurIPS, 2022.
[60] O. M. Parkhi, A. Vedaldi, A. Zisserman, and C. V. Jawahar. Cats and dogs. In CVPR, 2012.
[61] Niki Parmar, Ashish Vaswani, Jakob Uszkoreit, Lukasz Kaiser, Noam Shazeer, Alexander Ku, and Dustin Tran. Image transformer. In ICML, 2018.
[62] Deepak Pathak, Philipp Krahenbuhl, Jeff Donahue, Trevor Darrell, and Alexei A Efros. Context encoders: Feature learning by inpainting. In CVPR, 2016.
[63] Xingchao Peng, Qinxun Bai, Xide Xia, Zijun Huang, Kate Saenko, and Bo Wang. Moment matching for multi-source domain adaptation. In ICCV, 2019.
[64] Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever, et al. Language models are unsupervised multitask learners. OpenAI blog, 2019.
[65] Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J Liu. Exploring the limits of transfer learning with a unified text-to-text transformer. The Journal of Machine Learning Research, 21(1):5485–5551, 2020.
[66] Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J Liu. Exploring the limits of transfer learning with a unified text-to-text transformer. The Journal of Machine Learning Research, 2020.
[67] Aditya Ramesh, Mikhail Pavlov, Gabriel Goh, Scott Gray, Chelsea Voss, Alec Radford, Mark Chen, and Ilya Sutskever. Zero-shot text-to-image generation. In International Conference on Machine Learning, pages 8821–8831. PMLR, 2021.
[68] Tim Salimans, Andrej Karpathy, Xi Chen, and Diederik P Kingma. PixelCNN++: Improving the PixelCNN with discretized logistic mixture likelihood and other modifications. arXiv preprint arXiv:1701.05517, 2017.
[69] Claude E Shannon. Prediction and entropy of printed English. Bell System Technical Journal, 30(1):50–64, 1951.
[70] Mannat Singh, Quentin Duval, Kalyan Vasudev Alwala, Haoqi Fan, Vaibhav Aggarwal, Aaron Adcock, Armand Joulin, Piotr Dollár, Christoph Feichtenhofer, Ross Girshick, et al. The effectiveness of MAE pre-pretraining for billion-scale pretraining. arXiv preprint arXiv:2303.13496, 2023.
[71] J. Taylor, B. Earnshaw, B. Mabey, M. Victors, and J. Yosinski. RxRx1: An image set for cellular morphological variation across many experimental batches. In ICLR, 2019.
[72] Yonglong Tian, Olivier J Henaff, and Aäron van den Oord. Divide and contrast: Self-supervised learning from uncurated data. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 10063–10074, 2021.
[73] Hugo Touvron, Matthieu Cord, Matthijs Douze, Francisco Massa, Alexandre Sablayrolles, and Hervé Jégou. Training data-efficient image transformers & distillation through attention. In ICML, 2021.
[74] Hugo Touvron, Matthieu Cord, Alexandre Sablayrolles, Gabriel Synnaeve, and Hervé Jégou. Going deeper with image transformers. arXiv preprint arXiv:2103.17239, 2021.
[75] Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023.
[76] Aaron van den Oord, Nal Kalchbrenner, Lasse Espeholt, Oriol Vinyals, Alex Graves, et al. Conditional image generation with PixelCNN decoders. Advances in Neural Information Processing Systems, 29, 2016.
[77] Aäron van den Oord, Nal Kalchbrenner, and Koray Kavukcuoglu. Pixel recurrent neural networks. In International Conference on Machine Learning, pages 1747–1756. PMLR, 2016.
[78] Aaron van den Oord, Oriol Vinyals, et al. Neural discrete representation learning. Advances in Neural Information Processing Systems, 2017.
[79] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. In NeurIPS, 2017.
[80] Bastiaan S Veeling, Jasper Linmans, Jim Winkens, Taco Cohen, and Max Welling. Rotation equivariant CNNs for digital pathology. In Medical Image Computing and Computer Assisted Intervention – MICCAI 2018: 21st International Conference, Granada, Spain, September 16-20, 2018, Proceedings, Part II, pages 210–218. Springer, 2018.
[81] Pascal Vincent, Hugo Larochelle, Isabelle Lajoie, Yoshua Bengio, Pierre-Antoine Manzagol, and Léon Bottou. Stacked denoising autoencoders: Learning useful representations in a deep network with a local denoising criterion. Journal of Machine Learning Research, 11(12), 2010.
[82] Chen Wei, Karttikeya Mangalam, Po-Yao Huang, Yanghao Li, Haoqi Fan, Hu Xu, Huiyu Wang, Cihang Xie, Alan Yuille, and Christoph Feichtenhofer. Diffusion models as masked autoencoders. arXiv preprint arXiv:2304.03283, 2023.
[83] Xueting Yan, Ishan Misra, Abhinav Gupta, Deepti Ghadiyaram, and Dhruv Mahajan. ClusterFit: Improving generalization of visual representations. In CVPR, 2020.
[84] Jiahui Yu, Zirui Wang, Vijay Vasudevan, Legg Yeung, Mojtaba Seyedhosseini, and Yonghui Wu. CoCa: Contrastive captioners are image-text foundation models. TMLR, 2022.
[85] Jure Zbontar, Li Jing, Ishan Misra, Yann LeCun, and Stéphane Deny. Barlow twins: Self-supervised learning via redundancy reduction. In ICML, 2021.
[86] Hongyi Zhang, Moustapha Cisse, Yann N Dauphin, and David Lopez-Paz. mixup: Beyond empirical risk minimization. In ICLR, 2018.
[87] Richard Zhang, Phillip Isola, and Alexei A Efros. Colorful image colorization. In ECCV, 2016.
[88] Jinghao Zhou, Chen Wei, Huiyu Wang, Wei Shen, Cihang Xie, Alan Yuille, and Tao Kong. iBOT: Image BERT pre-training with online tokenizer. In ICLR, 2022.
A. Datasets

To assess the effectiveness and general applicability of the representations learned by AIM, we measure its recognition accuracy on a varied collection of 15 benchmarks in Table 6. The specifics of each benchmark can be found in Table 9. These benchmarks include datasets for tasks such as fine-grained recognition, medical imaging, satellite imagery, images in natural environments, and infographic images.

Table 9.
Dataset            train      test    classes
Imagenet-1k [25]   1,281,167  50,000  1000
iNAT-18 [46]       437,513    24,426  8142
CIFAR-10 [48]      50,000     10,000  10
CIFAR-100 [48]     50,000     10,000  100
Food101 [11]       75,750     25,250  101
DTD [22]           3,760      1,880   47
Pets [60]          3,680      3,669   37
Cars [47]          8,144      8,041   196
iWildCam [8]       129,809    14,961  182
Camelyon17 [4]     302,436    34,904  2
PCAM [80]          262,144    32,768  2
RxRx1 [71]         40,612     9,854   1139
EuroSAT [42]       16,200     5,400   10
fMoW [21]          76,863     19,915  62
B. Autoregression Patterns

We investigate different patterns that can be used to traverse an image during pre-training in Table 3b. All patterns used in this investigation are illustrated in Figure 8.

Figure 8. Autoregression patterns. We illustrate the different autoregression patterns studied in this work, including raster, spiral, checkerboard, and fixed random.
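The patterns of Figure 8 can be thought of as fixed permutations of the raster-order patch indices. The sketch below generates such permutations for an n × n patch grid; it is an illustration, and the exact constructions used in the paper (in particular the checkerboard and spiral orders) may differ in detail.

import random

def raster(n):
    return [i * n + j for i in range(n) for j in range(n)]

def checkerboard(n):
    even = [i * n + j for i in range(n) for j in range(n) if (i + j) % 2 == 0]
    odd = [i * n + j for i in range(n) for j in range(n) if (i + j) % 2 == 1]
    return even + odd

def fixed_random(n, seed=0):
    order = raster(n)
    random.Random(seed).shuffle(order)   # sampled once and kept fixed for all images
    return order

def spiral(n):
    # walk the grid boundary inwards, then reverse so the traversal spirals out from the center
    top, bottom, left, right, order = 0, n - 1, 0, n - 1, []
    while top <= bottom and left <= right:
        order += [top * n + j for j in range(left, right + 1)]
        order += [i * n + right for i in range(top + 1, bottom + 1)]
        if top < bottom:
            order += [bottom * n + j for j in range(right - 1, left - 1, -1)]
        if left < right:
            order += [i * n + left for i in range(bottom - 1, top, -1)]
        top, bottom, left, right = top + 1, bottom - 1, left + 1, right - 1
    return order[::-1]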
C. Additional Analysis

C.1. Raster pattern validation loss

In Figure 7, we noticed that the validation loss of the raster pattern across chunks surprisingly declined for the second chunk before increasing again. We investigated this further in Figure 9 and observed that this behavior is a side-effect of using the IN-1k validation set. In particular, we observed that the top rows of the image, aside from the first one, typically have a lower loss, whether the loss is computed over the regular image or its vertically flipped counterpart.

Figure 9. Raster pattern across patches. We compute the IN-1k validation loss per chunk of 16 patches (i.e., a row) for AIM-0.6B, pre-trained using a raster pattern. We measure the same loss for the vertically flipped images of the validation set. We observe that, for the IN-1k validation set, the patches from the top rows of the image are easier to predict, with a lower loss, likely due to the concentration of background patches in that region.
detail in Figure 10. We find that for all AIM variants, we extract the highest quality features, with respect to the downstream transfer, from layers roughly at two-thirds of the way into the model depth. However, it is important to note that the performance of deeper layers does not experience a steep decline and continues to exhibit strong performance.

Figure 10. Downstream performance across layers. The highest quality features in terms of transfer to downstream recognition tasks can be extracted from layers different than the last, with the peak performance achieved by extracting features from roughly two-thirds of the model depth. Deeper layers still retain a strong performance and no sharp decline is observed.

Table 10. Pre-training hyperparameters. All AIM variants of different capacities have been trained using the same set of hyperparameters detailed above.

Table 11. Attentive probe hyperparameters. We detail the hyperparameters used for attentive probing AIM as well as the baselines. For all experiments, we search over different learning rate values and report the best for both AIM and the baselines.

trained only for the shorter schedule of 500k iterations. We did not observe any instability while scaling the capacity of our model, thereby not requiring any further tuning of the optimization hyperparameters.

Attentive Probing. Downstream evaluation for AIM and the baselines has been primarily conducted via attentive probing, as described in § 4. We report the hyperparameters used to probe all methods in Table 11. For a fair comparison with other baselines, we search over different values for the learning rate and report the best performance of each method, similar to [58]. For AIM and other generative baselines, we average the features of the last 6 layers of the model before feeding them to the attention-probing head, which leads to a modest gain in performance. Note that the descriptor dimensionality remains the same, which is different from the practice of concatenating features similar to iGPT [18], which indirectly inflates the capacity of the evaluation head.

Low-rank adaptation. For LoRA finetuning, we use the same hyperparameters as reported in Table 11, in addition to mixup [86] (alpha=0.8). We apply LoRA adaptation, with rank=8, only to the parameters of the attention blocks; in particular, the weight matrices for the queries, values, and output projection.
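A minimal sketch of this LoRA configuration (PyTorch): rank-8 adapters added to the frozen query, value, and output projections of each attention block. The wrapper below is illustrative rather than the authors' implementation, and the scaling convention is an assumption.

import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, rank=8, alpha=8.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                      # pre-trained weights stay frozen
        self.lora_a = nn.Parameter(torch.zeros(rank, base.in_features))
        self.lora_b = nn.Parameter(torch.zeros(base.out_features, rank))
        nn.init.normal_(self.lora_a, std=0.02)           # lora_b stays zero: no change at init
        self.scale = alpha / rank

    def forward(self, x):
        # base projection plus the low-rank update, applied only to q, v, and output projections
        return self.base(x) + self.scale * (x @ self.lora_a.T) @ self.lora_b.T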