SynCLR: Learning Vision From Models Rivals Learning Vision From Data (Google, MIT 2023)
Yonglong Tian 1,†  Lijie Fan 2,†,*  Kaifeng Chen 1  Dina Katabi 2  Dilip Krishnan 1  Phillip Isola 2
1 Google Research   2 MIT CSAIL   † equal contribution   * work done while interning at Google
GitHub repo: https://github.com/google-research/syn-rep-learn
arXiv:2312.17742v1 [cs.CV] 28 Dec 2023
Abstract

We introduce SynCLR, a novel approach for learning visual representations exclusively from synthetic images and synthetic captions, without any real data. We synthesize a large dataset of image captions using LLMs, then use an off-the-shelf text-to-image model to generate multiple images corresponding to each synthetic caption. We perform visual representation learning on these synthetic images via contrastive learning, treating images sharing the same caption as positives. The resulting representations transfer well to many downstream tasks, competing favorably with other general-purpose visual representation learners such as CLIP and DINO v2 on image classification. On dense prediction tasks such as semantic segmentation, SynCLR outperforms previous self-supervised methods by a significant margin, e.g., improving over MAE pre-trained on ImageNet by 6.2 and 4.1 mIoU on ADE20k for ViT-B and ViT-L.

Figure 1. Top row: Traditional methods, such as CLIP [71], learn only from real data (IN lin. acc 80.2%); Middle row: Recent methods, such as StableRep [91], learn from real text and generated images (76.7%); Bottom row: Our method, SynCLR, learns from synthetic text and synthetic images (80.7%), and rivals the linear transfer performance of CLIP on ImageNet despite not directly observing any real data.

1. Introduction

Representation learning extracts and organizes information from raw, often unlabeled data. The quality, quantity, and diversity of the data determine how good a representation the model can learn. The model becomes a reflection of the collective intelligence that exists in the data. We get what we feed in.

Unsurprisingly, the current best-performing visual representation learning methods [68, 71] rely on large-scale real datasets. However, the collection of real data has its own dilemmas. Collecting large-scale uncurated data [80] is relatively cheap and thus quite achievable. However, for self-supervised representation learning, this approach exhibits poor scaling behavior, i.e., adding more uncurated data has little effect at large data scales [38, 90]. Collecting small-scale curated data [24] is also achievable, but models trained in this way are limited to relatively narrow tasks. The ideal would be large-scale curated datasets of real images, and recent work has indeed shown that this can lead to strong performance gains at scale [68], but this path is costly to pursue.

To alleviate the cost, in this paper we ask whether synthetic data, sampled from off-the-shelf generative models, is a viable path toward large-scale curated datasets that can train state-of-the-art visual representations.

We call such a paradigm learning from models, in contrast to directly learning from data. Models have several advantages as a data source for building large-scale training sets: via their latent variables, conditioning variables, and hyperparameters, they provide new controls for curating data; we will make use of these controls in the method we propose. Models can also be easier to share and store (because models are more compressed than data), and can
produce an unlimited number of data samples (albeit with finite diversity). A growing literature has studied these properties and other advantages (and disadvantages) of using generative models as a data source for training downstream models [3, 30, 45, 48, 78, 91]. Some of these methods use a hybrid mode, either mixing real and synthetic datasets [3] or needing a real dataset to generate another synthetic dataset [91]. Other methods try to learn representations from purely synthetic data [78] but lag far behind the best-performing models. Instead, we show that learning from models, without training on any real data, can yield representations that match the top-performing representations learnt from real data. For instance, as illustrated in Figure 1, representations learnt by our method are able to transfer as well as OpenAI's CLIP [71] on ImageNet (both methods using ViT-B [28]).
Our approach leverages generative models to re-define the granularity of visual classes. As shown in Figure 2, consider we have four images generated using two prompts: "a golden retriever, wearing sunglasses and a beach hat, rides a bike" and "a cute golden retriever sits in a house made of sushi". Traditional self-supervised methods such as SimCLR [13] will treat each of these images as a different class; embeddings for different images are pushed apart with no explicit consideration of the shared semantics between images. On the other extreme, supervised learning methods (i.e., SupCE) will regard all these images as a single class (e.g., "golden retriever"). This ignores nuances in the semantics of the images, such as the fact that the dogs are riding a bike in one pair of images and sitting inside a sushi house in the other pair of images. Instead, our method, SynCLR, treats captions as classes, i.e., each caption describes a visual class (this level of granularity was also explored in StableRep [91]). This allows us to group images by the concepts of "riding a bike" and "sitting in a sushi house", in addition to grouping by a coarser class label like "golden retriever". This level of granularity is difficult to mine in real data, since collecting multiple images described by a given caption is non-trivial, especially when scaling up the number of captions. However, text-to-image diffusion models are fundamentally built with this ability: simply by conditioning on the same caption and using different noise inputs, a text-to-image diffusion model will produce different images that all match the same caption. In our experiments, we find the caption-level granularity outperforms both SimCLR and supervised training. Another advantage is that this definition of visual classes has good scalability. Unlike ImageNet-1k/21k where a given number of classes is fixed, we can augment existing classes (or data) in an online fashion, and theoretically scale up to as many classes as needed.

Figure 2. Different learning objectives treat classification granularity differently. These images are generated by two prompts: "a golden retriever, wearing sunglasses and a beach hat, rides a bike" and "a cute golden retriever sits in a house made of sushi". SimCLR treats each image as a class, while supervised cross-entropy treats them all as the same "golden retriever" class. The former does not consider shared semantics between images, and the latter is coarse-grained and ignores actions or relationships between subjects/background. Our approach, SynCLR, defines visual classes by sentences.

Our system consists of three steps. The first step is to synthesize a large corpus of image captions. We design a scalable approach by leveraging the in-context learning capability of large language models (LLMs), where we present examples of word-to-caption translations. Next, a text-to-image diffusion model is adopted to synthesize multiple images for each synthetic caption. This yields a synthetic dataset of 600M images. Then we train visual representation models by a combination of multi-positive contrastive learning [50] and masked image modeling [110].

Our learned representations transfer well. With SynCLR pre-training, our ViT-B and ViT-L models achieve 80.7% and 83.0% top-1 linear probing accuracy on ImageNet-1K, respectively, which is on par with OpenAI's CLIP [71]. On fine-grained classification tasks, SynCLR outperforms CLIP by 3.3% for ViT-B and 1.5% for ViT-L, and performs similarly to DINO v2 [68] models, which are distilled from a pre-trained ViT-g model. For semantic segmentation on ADE20k, SynCLR outperforms MAE pre-trained on ImageNet by 6.2 and 4.1 mIoU for ViT-B and ViT-L under the same setup, showing strong transfer ability for dense prediction tasks, similar to DINO v2, which additionally involves a training period on 518x518 resolution images that SynCLR does not have.

2. Related Works

Self-supervised representation learning approaches in vision develop domain-specific pre-text tasks, such as colorization [106], rotation prediction [36], and solving jigsaw
puzzles [65]. Domain-agnostic approaches have been popular, such as contrastive learning [6, 13, 40, 43, 66, 88, 97] and masked image modeling [2, 4, 5, 33, 44, 96, 100, 110]. Contrastive learning promotes invariance [89] for two views of the same image and pushes apart representations for different images [95] (or only invariance [11, 39]); the resulting representations yield strong performance for linear or zero-shot transfer. Masked image modeling reconstructs the pixels [44, 100] or local features [4], often producing excellent fine-tuning transfer performance, especially in dense prediction tasks [44]. The state-of-the-art DINO v2 [68] leverages both approaches, and our approach shares a similar spirit.

Supervised learning [41, 52, 84] used to be the dominant approach for learning transferable visual representations for various tasks [26, 37, 81]. Recent studies [42, 57] have shown that the transferability of representations learned in this way is limited, e.g., pre-training offers no improvement over random initialization for dense prediction tasks (e.g., object detection) when the fine-tuning is long enough. This limitation continues when the model has been scaled up to 22B parameters [23]. An alternative paradigm learns visual representations from text supervision [49, 71], e.g., CLIP [71]. This approach is more flexible (i.e., not requiring classes) and provides richer supervision, often learning generalizable representations.

Generative models as representation learners. A number of papers have explored the representations that are learned by generative models for various recognition tasks [25, 56]. As might be expected intuitively, such models indeed learn especially good representations for dense tasks, such as optical flow estimation [79], semantic segmentation [8, 101], and depth estimation [107]. Another line of work [19, 55] adapts pre-trained diffusion models for zero-shot image recognition via analysis-by-synthesis. These approaches may need to be adapted when the architectures of the generative models change or a new family of generative models emerges. Our approach treats images as universal interfaces, with the hope of better generality.

Learning from synthetic data from generative models. Synthetic data has been explored to train machine learning models in various domains [31, 53, 62, 63, 74, 75, 83, 87, 102]. In computer vision, the utilization of synthetic data for training models is common, ranging from optical flow [61] and autonomous driving [1] to semantic segmentation [15] and human pose estimation [94]. Others [48, 58] have explored synthetic data for representation learning, with the predominant approach of altering the latent variables of deep generative models. Our approach aligns with this research paradigm, but it diverges in its use of text-to-image models, which have also been investigated by other researchers [45, 78, 111]. But they use synthetic data for supervised learning [30, 78]. The closest work is StableRep [91], which also conducts representation learning but still needs a real text dataset.

3. Approach

In this paper, we study the problem of learning a visual encoder f in the absence of real images or textual data. Our approach hinges on the utilization of three key resources: a language generation model (g1), a text-to-image generative model (g2), and a curated list of visual concepts (C). Our exploration includes three steps: (1) we employ g1 to synthesize a comprehensive set of image descriptions T, which encompass the range of visual concepts in C; (2) for each caption in T, we generate multiple images using g2, culminating in an extensive synthetic image dataset X; (3) we train on X to obtain a visual representation encoder f.

We use Llama-2 7B [93] and Stable Diffusion 1.5 [73] as g1 and g2, respectively, because of their fast inference speed. We anticipate that better g1 and g2 in the future will further enhance the effectiveness of this approach.

3.1. Synthesizing captions

To harness the capability of powerful text-to-image models for generating a substantial dataset of training images, we initially require a collection of captions that not only precisely depict an image but also exhibit diversity to encompass a broad spectrum of visual concepts.

We have developed a scalable approach to create such a large collection of captions, leveraging the in-context learning capability of LLMs [9]. Our method involves crafting specific prompt engineering templates that guide the LLM to produce the required captions. We start by gathering the concept list C from some existing datasets, such as ImageNet-21k [24] and Places-365 [108]. For each concept c ∈ C, we consider three straightforward templates to generate captions effectively (a minimal prompt-construction sketch is given after Table 1):
• c –> caption. As the most direct and simple approach, we have the Llama-2 model sample a sentence for the concept c.
• c, bg –> caption. We combine the visual concept c with a background or setting bg. A naïve approach would randomly select both c and bg, where bg may correspond to a class name from a places dataset like [108]. However, this method often leads to unlikely combinations in the real world, such as a blue whale in a football field. Our ablation experiments demonstrate that this strategy results in suboptimal performance, likely because the generated captions fall far outside the training distribution of g2. Instead, we employ GPT-4 [67] to generate a list of suitable backgrounds for the chosen concepts. This approach increases the likelihood of generating more plausible combinations, such as a tiger in a forest or a cat in a kitchen, enhancing the overall quality of the results.
• c, rel –> caption. Given a visual concept c, we consider pairing it with a positional relationship word, rel. Take, for instance, c signifying cat and rel translating to in front of: the synthesized caption then places the concept in front of some other object or scene, as in the kit fox example in Table 1.
c –> caption:
  revolver –> Multiple antique revolvers lie on a wooden table, gleaming under soft, ambient light.
  closet –> The compact closet, brimming with clothes and shoes, exudes a feeling of organization.
  zebra –> A zebra is gallantly trotting across the vast, sunlit plains of the African savannah, creating a captivating black and white spectacle.
  bus station –> The bustling bus station thrums with restless energy, as travelers navigate through the crowded space, awaiting their journeys amid the echoes of departing buses.
c, bg –> caption:
  tiger, forest –> Two tigers are running together in the forest.
  lighter, motorhome –> In the cozy, cluttered environment of a well-traveled motorhome, a sleek silver lighter holds dominion on the rustic wooden table.
  sunset, lake –> Golden sunset hues reflect on a calm lake, silhouetting a lone canoeist against a backdrop of fiery clouds.
c, rel –> caption:
  kit fox, in front of –> A group of small, fluffy, golden kit foxes is playfully gathered in front of a lush, green, towering forest backdrop.
  cabbage, besides –> A vibrant image portrays a lush, green cabbage, glistening with dewdrops, nestled besides a rustic, wooden crate full of freshly harvested vegetables.

Table 1. We show in-context examples for the three synthesis templates. Such examples are used as demonstrations for Llama-2 to perform the in-context learning task. We have 176 such examples in total. Most of them are generated by prompting GPT-4 [67], while a handful of others are human generated (in a 10M-scale pilot study of synthetic captions, we did not notice significant differences between including or excluding the human-generated examples).
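The in-context synthesis above can be summarized with a short sketch. This is a minimal illustration and not the released SynCLR pipeline: the prompt format, sampling parameters, and helper names (build_prompt, IN_CONTEXT_EXAMPLES) are assumptions; the demonstration strings are copied from Table 1.

```python
# Minimal, assumed sketch of assembling an in-context prompt for the
# caption synthesis templates of Section 3.1 (not the released code).
import random

IN_CONTEXT_EXAMPLES = {
    "c -> caption": [
        ("revolver", "Multiple antique revolvers lie on a wooden table, gleaming under soft, ambient light."),
        ("zebra", "A zebra is gallantly trotting across the vast, sunlit plains of the African savannah, "
                  "creating a captivating black and white spectacle."),
    ],
    "c,bg -> caption": [
        ("tiger, forest", "Two tigers are running together in the forest."),
        ("sunset, lake", "Golden sunset hues reflect on a calm lake, silhouetting a lone canoeist "
                         "against a backdrop of fiery clouds."),
    ],
}

def build_prompt(template: str, query: str, num_shots: int = 2) -> str:
    """Concatenate a few demonstrations and leave the caption of the query blank,
    so the LLM (e.g., Llama-2 7B) completes it as the synthetic caption."""
    pool = IN_CONTEXT_EXAMPLES[template]
    shots = random.sample(pool, k=min(num_shots, len(pool)))
    lines = [f"{inp} --> {caption}" for inp, caption in shots]
    lines.append(f"{query} -->")  # the model's continuation becomes the new caption
    return "\n".join(lines)

if __name__ == "__main__":
    # "kit fox, forest" pairs a concept with a GPT-4-proposed background.
    print(build_prompt("c,bg -> caption", "kit fox, forest"))
```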
Figure 4. Random examples of synthetic captions and images generated in our SynCLR pipeline. Each caption comes with 4 images. Example captions: "A plate of paella, a mixed rice dish with chicken, beans, and seafood"; "An industrial power plant with its smokestacks belching black smoke"; "A fluffy, black and white junco bird perches on a snow-covered fence, overlooking a dark forest"; "A vintage electric locomotive rolls along a railway line through a quaint paddy field in a tranquil rural landscape"; "On a desk, a glass water bed is surrounded by a chaotic, messy workspace"; "A combine harvester pulling a trailer full of hay, driving along a narrow road with a lake in the distance".
3.3. Representation learning

Following the multi-positive contrastive objective of StableRep [91], an encoded anchor crop a is contrasted against a set of K encoded crops {b_j}; we compute a predicted assignment q and a ground-truth assignment p between a and b (a is allowed to match multiple b):

  q_i = \frac{\exp(a \cdot b_i / \tau)}{\sum_{j=1}^{K} \exp(a \cdot b_j / \tau)}    (1)

  p_i = \frac{\mathbb{1}_{\mathrm{match}(a, b_i)}}{\sum_{j=1}^{K} \mathbb{1}_{\mathrm{match}(a, b_j)}}    (2)

where τ ∈ R+ is the scalar temperature, a and all b have been ℓ2 normalized, and the indicator function 1_match(·,·) indicates whether two samples are from the same caption. The contrastive loss for a is given as

  L(a) = H(p, q) = -\sum_{i=1}^{K} p_i \log q_i    (3)

iBOT [110] is a masked image modeling objective, wherein a localized patch is masked, and the model is tasked with predicting the tokenized representation of said masked patch. It adapts the DINO [11] objective from the image level to the patch level. We follow [76] to replace the softmax-centering method with the iterative Sinkhorn-Knopp (SK) algorithm [22]. We run SK for 3 iterations to build the prediction target.

Exponential Moving Average (EMA) was first introduced into self-supervised learning by MoCo [43]. We use EMA to encode crops as b and to produce the targets for the iBOT loss. We update the EMA model as θ_ema ← λθ_ema + (1 − λ)θ, following a cosine schedule for λ from 0.994 to 1 during training [39, 68]. We find the EMA module not only increases the final performance, but also improves the training stability for long training schedules.

Multi-crop strategy is introduced by [10] as a smart way to improve computation efficiency, and is adopted in this paper. For these local crops, we only employ the contrastive loss, omitting the iBOT loss. Local crops are encoded only by the student network, and matched to global crops from the same caption encoded by the EMA model. Such reuse of global crops saves computation. For each image x, where we generate a single global crop x^g alongside n local crops x^l, the final loss can be expressed as follows:

  L(x^g) + \frac{1}{n}\sum_{i=1}^{n} L(x^l_i) + L^{iBOT}(x^g)    (4)

3.4. Implementation

Concept list. We concatenate class names from various datasets, including IN-1k [24], IN-21k (we keep the most frequent 13k classes), Aircraft [60], Cars [51], DTD [18], Flowers [64], Pets [69], Sun397 [98], Caltech-101 [34], Food-101 [7], and Places-365 [108]. If the concept is a place (i.e., SUN397 and Places) or a texture (i.e., DTD), we only apply the c –> caption template. For fine-grained classes such as pets or flowers, we employ GPT-4 to generate a consolidated list of probable backgrounds, rather than producing distinct lists for each specific class. We favor more frequent sampling from IN-1k, Food101, Cars, Aircraft, and Flowers.

Batches. For each training batch, we sample 2048 captions (except when noted), and use all of the 4 images generated by each caption. We generate 1 global and 4 local crops for each image. As a result, each batch contains 8192 global crops, which is similar to prior work [13, 14, 39, 91].

Masking. For the iBOT loss, we randomly choose 50% of the images inside a batch to mask, and randomly mask 50% of the tokens in each chosen image. We use 65536 prototypes. While the target from the EMA model is ascertained using the SK algorithm, we apply softmax normalization to the output of the student model.
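A minimal PyTorch sketch of the multi-positive contrastive loss of Eqs. (1)–(3) follows. It is our reading of the objective rather than the released SynCLR code; it assumes ℓ2-normalized crop embeddings and an integer caption id per crop (crops generated from the same caption count as positives), mirroring the batch construction described above.

```python
# Hedged sketch of Eqs. (1)-(3): cross-entropy H(p, q) between the ground-truth
# caption assignment p and the softmax similarity assignment q. Not the official code.
import torch
import torch.nn.functional as F

def multi_positive_contrastive_loss(anchors, candidates, anchor_caps, cand_caps, tau=0.08):
    """anchors: (N, D) embeddings of anchor crops (student); candidates: (K, D)
    embeddings of candidate crops (e.g., EMA-encoded global crops); *_caps: integer
    caption ids. Embeddings are assumed to be l2-normalized."""
    logits = anchors @ candidates.t() / tau                        # q_i ∝ exp(a·b_i / τ), Eq. (1)
    match = (anchor_caps[:, None] == cand_caps[None, :]).float()   # 1_match(a, b_i)
    p = match / match.sum(dim=1, keepdim=True)                     # ground-truth assignment, Eq. (2)
    log_q = F.log_softmax(logits, dim=1)
    return -(p * log_q).sum(dim=1).mean()                          # H(p, q), Eq. (3)

if __name__ == "__main__":
    # Toy batch: 8 crops from 4 captions (2 crops per caption), 16-dim embeddings.
    torch.manual_seed(0)
    emb = F.normalize(torch.randn(8, 16), dim=1)
    caps = torch.tensor([0, 0, 1, 1, 2, 2, 3, 3])
    print(multi_positive_contrastive_loss(emb, emb, caps, caps).item())
```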
Projection heads. We follow the design in MoCo v3 [14] and DINO [11] for the contrastive and iBOT loss heads, respectively, ensuring consistency with established methods.

Other hyper-parameters. We set the temperature in the contrastive loss to 0.08. For the temperature used in the iBOT loss, we linearly increase it from 0.04 to 0.07 over 4000 iterations, and keep it at 0.07 afterwards, as in DINO [11]. Additionally, the weight decay parameter is incrementally adjusted from 0.04 to 0.2, adhering to a cosine schedule.
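These schedules, together with the EMA momentum schedule from Section 3.3, can be sketched as simple functions of the training step. This is an illustrative sketch under stated assumptions, not the paper's code; the helper names are ours, and the exact cosine parameterization beyond the endpoints given in the text is an assumption.

```python
# Hedged sketch of the training-time schedules described in Sec. 3.3 / 3.4.
import math

def ibot_temperature(step: int, warmup: int = 4000, start: float = 0.04, end: float = 0.07) -> float:
    """Linear warmup from 0.04 to 0.07 over the first 4000 iterations, then constant."""
    if step >= warmup:
        return end
    return start + (end - start) * step / warmup

def cosine_schedule(step: int, total: int, start: float, end: float) -> float:
    """Cosine interpolation from `start` to `end` over `total` steps."""
    progress = min(step / total, 1.0)
    return end + 0.5 * (start - end) * (1 + math.cos(math.pi * progress))

total_steps = 500_000
# weight decay: 0.04 -> 0.2; EMA momentum λ: 0.994 -> 1.0 (θ_ema <- λ·θ_ema + (1-λ)·θ)
wd  = [cosine_schedule(s, total_steps, 0.04, 0.2)  for s in (0, 250_000, 500_000)]
lam = [cosine_schedule(s, total_steps, 0.994, 1.0) for s in (0, 250_000, 500_000)]
print(ibot_temperature(2000), wd, lam)
```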
4. Experiment

We first perform an ablation study to evaluate the efficacy of various designs and modules within our pipeline. Then we proceed to scale up the volume of synthetic data.

4.1. Study different components

We analyze each component of SynCLR, and ablate their effectiveness with two measurements: (1) linear probing performance on IN-1k; (2) average accuracy of linear transfer on the fine-grained datasets Aircraft [60], Cars [51], DTD [18], Flowers [64], Pets [69], Sun397 [98], Caltech-101 [34], Food-101 [7], and Pascal VOC [29]. For analysis conducted in this subsection, we train ViT-B/16 [28] models for 85000 iterations, and use the cls token as the image representation.

Synthesize captions. Following [91], we use cc12m [12] real captions as our baseline, which has 10M sentences. To synthesize captions, we design the following variants: (a) IN+h+Places randomly combines one IN class plus its hypernyms in the WordNet graph with one place class; (b) IN+Places+LLM uses the c, bg –> caption in-context synthesis template with c from IN and bg from Places; (c) IN+OurBG+LLM uses the background classes output by GPT-4, instead of Places; (d) ours means our full configuration specified in Section 3.1. For each of these configs, we generate 10M captions; if there are not enough unique captions, we duplicate.

Results are summarized in Table 2, where we train both StableRep and SynCLR to avoid biases favored by a single method. Compared to a real caption dataset cc12m, simply concatenating IN and Places class names improves the ImageNet linear accuracy but reduces the fine-grained classification performance. Interestingly, naively asking Llama to combine IN and Places classes into captions yields the worst performance. Replacing the random backgrounds from Places with GPT-generated backgrounds improves the accuracy. This shows the importance of synthesizing captions that follow the distribution of real captions, which were used to train the text-to-image model. Finally, our full configuration achieves the best accuracy on both ImageNet and fine-grained classification. Another advantage of our synthesis method is its scalability – we can scale up to hundreds of millions of captions with little duplication. In contrast, if we concatenate IN classes with Places classes, there are at most 365k unique captions.

Synthesize images. There are two major parameters in this process: the number of images per caption and the classifier-free guidance (CFG) scale. For the former, we find generating 4 images is almost able to reproduce StableRep [91]'s performance (10 images) when using cc12m captions (ours 73.0% vs. StableRep 73.5% on ImageNet). Thus we stick to 4. For the guidance scale, we briefly find the contrastive loss is not very sensitive to CFG in a pilot study, as shown in Table 3. Thus we stick to 2.5, similar to StableRep [91].

Model components. We present the improvement of accuracy brought by different modules in Table 4.

Table 2. Comparison of different caption synthesis strategies. We report top-1 ImageNet linear evaluation accuracy (IN) and the average accuracy over 9 fine-grained datasets (avg.). Every item here includes 10M captions and 4 images per caption.
captions           | StableRep IN / avg. | SynCLR IN / avg.
cc12m              | 73.0 / 81.6         | 77.1 / 85.3
IN+h+Places        | 75.4 / 80.0         | 78.7 / 83.0
IN+Places+LLM      | 73.7 / 76.9         | 77.6 / 81.8
IN+OurBG+LLM       | 75.3 / 78.5         | 78.2 / 81.9
our final config.  | 75.8 / 85.7         | 78.8 / 88.1

Table 3. Classifier-free guidance scale (CFG). The contrastive loss prefers a small CFG scale but is not very sensitive to it.
CFG      | 2    | 3    | 4
IN top-1 | 72.8 | 72.6 | 72.6

Table 4. Important components for our model. ViT-B/16 models are trained for 85000 iterations. We study the modules that affect the ImageNet linear evaluation (IN), the fine-grained classification (avg.), and ADE20k segmentation.
method    | EMA | iBOT | MC | IN   | avg. | ADE20k
StableRep |     |      |    | 75.8 | 85.7 | -
          | ✓   |      |    | 76.7 | 86.7 | 48.0
          | ✓   | ✓    |    | 77.6 | 87.1 | 50.5
          | ✓   |      | ✓  | 78.6 | 87.8 | 49.5
SynCLR    | ✓   | ✓    | ✓  | 78.8 | 88.1 | 50.8

Table 5. Comparison of different learning objectives. These objectives assume different levels of classification granularity, as shown in Figure 2. Our modeling, i.e., defining classes as captions, outperforms the other two. To accommodate Supervised CE training, all items here use the IN+OurBG+LLM entry in Table 2.
method        | IN   | avg.
Supervised CE | 71.9 | 75.0
SimCLR        | 63.6 | 67.9
SynCLR        | 75.3 | 78.5
Table 6. Comparison on ImageNet linear evaluation and fine-grained classification. SynCLR achieves comparable results with OpenAI's CLIP and DINO v2 models, despite only using synthetic data. *DINO v2 models are distilled from a ViT-g model, thus advantageous in this comparison. † we rerun using only the cls token instead of concatenating multiple layers as presented in the original DINO v2 paper [68].
Columns: ImageNet, Aircraft, Cars, DTD, Flowers, Pets, SUN397, Caltech-101, Food-101, VOC2007, Average.
StableRep (real text, syn img, 100M)   ViT-B/16: 75.7, 59.2, 83.5, 80.1, 97.3, 88.3, 74.3, 94.7, 85.1, 87.9, 83.4
CLIP (real text, real img, 400M)       ViT-B/16: 80.2, 59.5, 86.7, 79.2, 98.1, 93.1, 78.4, 94.7, 92.8, 89.2, 85.7
CLIP (real text, real img, 400M)       ViT-L/14: 83.9, 69.4, 90.9, 82.1, 99.2, 95.1, 81.8, 96.5, 95.2, 89.6, 88.9
OpenCLIP (real text, real img, 400M)   ViT-B/16: 78.9, 61.1, 92.3, 81.9, 98.2, 91.5, 77.9, 95.2, 90.9, 88.0, 86.3
OpenCLIP (real text, real img, 400M)   ViT-L/14: 82.3, 67.1, 94.0, 83.6, 98.8, 92.5, 81.0, 96.4, 93.4, 88.8, 88.4
OpenCLIP (real text, real img, 2B)     ViT-L/14: 83.4, 71.7, 95.3, 85.3, 99.0, 94.2, 82.2, 97.5, 94.1, 88.9, 89.8
DINO v2* (no text, real img, 142M)     ViT-B/14: 83.9†, 79.4, 88.2, 83.3, 99.6, 96.2, 77.3, 96.1, 92.8, 88.2, 89.0
DINO v2* (no text, real img, 142M)     ViT-L/14: 85.7†, 81.5, 90.1, 84.0, 99.7, 96.6, 78.7, 97.5, 94.3, 88.3, 90.1
SynCLR (syn text, syn img, 600M)       ViT-B/16: 80.7, 81.7, 93.8, 79.9, 99.1, 93.6, 76.2, 95.3, 91.6, 89.4, 89.0
SynCLR (syn text, syn img, 600M)       ViT-L/14: 83.0, 85.6, 94.2, 82.1, 99.2, 94.1, 78.4, 96.1, 93.4, 90.3, 90.4
Table 7. ADE20K semantic segmentation (mIoU) using UperNet, with single scale at 512x512 resolution. † uses a patch size of 14x14, thus adapted to 518x518 resolution.
method    | pre-train data       | distill | ViT-B | ViT-L
StableRep | hybrid, 100M         |         | 49.4  | -
MoCo v3   | real, IN1K-1M        |         | 47.3  | 49.1
BEiT      | real, IN1K-1M+DALLE  |         | 47.1  | 53.3
MAE       | real, IN1K-1M        |         | 48.1  | 53.6
iBOT      | real, IN1K-1M        |         | 50.0  | -
CLIP      | real, WIT-400M       |         | 52.6  | -
BEiT v2   | real, WIT-400M, IN1K | ✓       | 53.1  | 56.7
DINO v2   | real, LVD-142M       | ✓       | 54.4† | 57.5†
SynCLR    | synthetic, 600M      |         | 54.3  | 57.7†

Table 9. Generalization to concepts not seen by DINO v2 and SynCLR. SynCLR outperforms DINO v2. CLIP achieves the best accuracy, possibly because its training data includes similar concepts as these datasets.
method  | arch     | EuroSAT | GTSRB | Country211 | MNIST | RESISC45 | KITTI | Average
CLIP    | ViT-B/16 | 97.1    | 86.6  | 33.3       | 99.0  | 92.7     | 64.7  | 78.9
CLIP    | ViT-L/14 | 98.2    | 92.5  | 42.9       | 99.2  | 94.1     | 69.2  | 82.7
DINO v2 | ViT-B/14 | 96.0    | 72.8  | 21.6       | 98.6  | 92.5     | 75.3  | 76.1
DINO v2 | ViT-L/14 | 96.7    | 74.1  | 24.1       | 98.2  | 93.8     | 76.9  | 77.3
SynCLR  | ViT-B/16 | 96.6    | 78.6  | 21.0       | 98.4  | 93.7     | 77.3  | 77.6
SynCLR  | ViT-L/14 | 96.7    | 79.2  | 24.3       | 98.5  | 93.8     | 78.0  | 78.4
Figure 5. PCA visualization (each panel pair shows DINO v2 on the left and SynCLR (ours) on the right). Following DINO v2 [68], we compute a PCA between the patches of the images from the same set and colorize them by their first 3 components. Compared to DINO v2, SynCLR produces more accurate maps for cars (e.g., zoom in to see the two bars on the roof of the first car, and the three side windows of the third car) and airplanes (e.g., the boundaries), while being slightly worse for dogs (e.g., heads). We use ViT-L/14 for both methods. Images are resized to 336x448 resolution before being fed into the networks, yielding 24x32 visualization grids.
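The procedure in the caption can be sketched roughly as follows. This is a hedged illustration of the general recipe (fit a 3-component PCA over the patch tokens of a set of images and map the components to RGB), not the authors' script; the 24x32 grid matches 336x448 inputs with 14x14 patches, and the feature extractor is left as a placeholder.

```python
# Hedged sketch of the Figure 5 style PCA visualization over ViT patch tokens.
import numpy as np
from sklearn.decomposition import PCA

def pca_colorize(patch_feats, grid_hw=(24, 32)):
    """patch_feats: list of (num_patches, dim) arrays, one per image from the same set,
    with num_patches == grid_hw[0] * grid_hw[1]. Returns one (H, W, 3) map per image."""
    stacked = np.concatenate(patch_feats, axis=0)      # fit a single PCA across the whole set
    comps = PCA(n_components=3).fit_transform(stacked)
    lo, hi = comps.min(0), comps.max(0)
    comps = (comps - lo) / (hi - lo + 1e-8)            # rescale each component to [0, 1] for RGB
    maps, start = [], 0
    for feats in patch_feats:
        n = feats.shape[0]
        maps.append(comps[start:start + n].reshape(*grid_hw, 3))
        start += n
    return maps

if __name__ == "__main__":
    # Stand-in features: 3 "images" of 24*32 = 768 patches with 1024-dim embeddings.
    rng = np.random.default_rng(0)
    fake = [rng.normal(size=(768, 1024)) for _ in range(3)]
    print([m.shape for m in pca_colorize(fake)])        # [(24, 32, 3), ...]
```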
5. Discussions and Conclusion

Why learn from generative models? One compelling reason is that a generative model can act like hundreds of datasets simultaneously. Traditionally, researchers have to spend separate effort collecting datasets for different image categories, e.g., cars, flowers, cats, dogs, and so on. DINO v2 [68] achieves robust representations by curating and amalgamating numerous such datasets. Such a process introduces complexities such as clustering and search challenges. In contrast, advanced text-to-image generative models like Stable Diffusion [72] or Imagen [77] have the capability to generate many diverse datasets. These models provide the flexibility to produce an infinite number of samples (albeit finite diversity) and control the generation process through textual input. Thus, generative models offer a convenient and effective method for curating training data. In our study, we harness this advantage to synthesize images encompassing a broad spectrum of visual concepts.

What can be further improved? Enhanced caption sets can be achieved through various methods, such as enriching the set of in-context examples, optimizing the sampling ratios among different concepts, and utilizing more advanced LLMs. In terms of the learning process, one approach is to distill knowledge from a larger model, and incorporate an additional high-resolution training phase (as discussed in [68]) or an intermediate IN-21k fine-tuning stage (as per [5, 70]). Regarding architectural improvements, the integration of SwiGLU and LayerScale, coupled with superior model initialization strategies (referenced in [32]), can be beneficial. However, due to limited resources and the scope of this paper not being focused on achieving the highest possible metrics, we propose these areas for further exploration in future research endeavors.

In summary, this paper studies a new paradigm for visual representation learning – learning from generative models. Without using any real data, SynCLR learns visual representations that are comparable with those achieved by state-of-the-art general-purpose visual representation learners.

References
[1] Hassan Abu Alhaija, Siva Karthik Mustikovela, Lars Mescheder, Andreas Geiger, and Carsten Rother. Augmented reality meets computer vision: Efficient data generation for urban driving scenes. IJCV, 2018. 3
[2] Mahmoud Assran, Mathilde Caron, Ishan Misra, Piotr Bojanowski, Florian Bordes, Pascal Vincent, Armand Joulin, Mike Rabbat, and Nicolas Ballas. Masked siamese networks for label-efficient learning. In ECCV, 2022. 3
[3] Shekoofeh Azizi, Simon Kornblith, Chitwan Saharia, Mohammad Norouzi, and David J Fleet. Synthetic data from diffusion models improves imagenet classification. arXiv preprint arXiv:2304.08466, 2023. 2
[4] Alexei Baevski, Wei-Ning Hsu, Qiantong Xu, Arun Babu, Jiatao Gu, and Michael Auli. Data2vec: A general framework for self-supervised learning in speech, vision and language. In ICML, 2022. 3, 8
[5] Hangbo Bao, Li Dong, Songhao Piao, and Furu Wei. Beit: Bert pre-training of image transformers. arXiv preprint arXiv:2106.08254, 2021. 3, 8, 10, 14
[6] Suzanna Becker and Geoffrey E Hinton. Self-organizing neural network that discovers surfaces in random-dot stereograms. Nature, 1992. 3
[7] Lukas Bossard, Matthieu Guillaumin, and Luc Van Gool. Food-101–mining discriminative components with random forests. In ECCV, 2014. 5, 6
[8] Emmanuel Asiedu Brempong, Simon Kornblith, Ting Chen, Niki Parmar, Matthias Minderer, and Mohammad Norouzi. Denoising pretraining for semantic segmentation. In CVPR, 2022. 3
[9] Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. NeurIPS, 2020. 3
[10] Mathilde Caron, Ishan Misra, Julien Mairal, Priya Goyal, Piotr Bojanowski, and Armand Joulin. Unsupervised learning of visual features by contrasting cluster assignments. In NeurIPS, 2020. 5
[11] Mathilde Caron, Hugo Touvron, Ishan Misra, Hervé Jégou, Julien Mairal, Piotr Bojanowski, and Armand Joulin. Emerging properties in self-supervised vision transformers. In ICCV, 2021. 3, 5, 6, 14
[12] Soravit Changpinyo, Piyush Sharma, Nan Ding, and Radu Soricut. Conceptual 12m: Pushing web-scale image-text pre-training to recognize long-tail visual concepts. In CVPR, 2021. 6
[13] Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoffrey Hinton. A simple framework for contrastive learning of visual representations. In ICML, 2020. 2, 3, 5, 15
[14] Xinlei Chen, Saining Xie, and Kaiming He. An empirical study of training self-supervised vision transformers. In ICCV, 2021. 5, 6, 8, 14
[15] Yuhua Chen, Wen Li, Xiaoran Chen, and Luc Van Gool. Learning semantic segmentation from synthetic data: A geometrically guided input-output adaptation approach. In CVPR, 2019. 3
[16] Gong Cheng, Junwei Han, and Xiaoqiang Lu. Remote sensing image scene classification: Benchmark and state of the art. Proceedings of the IEEE, 2017. 8
[17] Mehdi Cherti, Romain Beaumont, Ross Wightman, Mitchell Wortsman, Gabriel Ilharco, Cade Gordon, Christoph Schuhmann, Ludwig Schmidt, and Jenia Jitsev. Reproducible scaling laws for contrastive language-image learning. In CVPR, 2023. 7
[18] Mircea Cimpoi, Subhransu Maji, Iasonas Kokkinos, Sammy Mohamed, and Andrea Vedaldi. Describing textures in the wild. In CVPR, 2014. 5, 6
[19] Kevin Clark and Priyank Jaini. Text-to-image diffusion models are zero-shot classifiers. arXiv preprint arXiv:2303.15233, 2023. 3
[20] Kevin Clark, Minh-Thang Luong, Quoc V Le, and Christo- [36] Spyros Gidaris, Praveer Singh, and Nikos Komodakis. Un-
pher D Manning. Electra: Pre-training text encoders supervised representation learning by predicting image rota-
as discriminators rather than generators. arXiv preprint tions. In ICLR, 2018. 2
arXiv:2003.10555, 2020. 14 [37] Ross Girshick, Jeff Donahue, Trevor Darrell, and Jitendra
[21] Ekin D Cubuk, Barret Zoph, Jonathon Shlens, and Quoc V Malik. Rich feature hierarchies for accurate object detection
Le. Randaugment: Practical automated data augmentation and semantic segmentation. In CVPR, 2014. 3
with a reduced search space. In CVPR workshops, 2020. 14 [38] Priya Goyal, Dhruv Mahajan, Abhinav Gupta, and Ishan
[22] Marco Cuturi. Sinkhorn distances: Lightspeed computation Misra. Scaling and benchmarking self-supervised visual
of optimal transport. In NeurIPS, 2013. 5 representation learning. In ICCV, 2019. 1
[23] Mostafa Dehghani, Josip Djolonga, Basil Mustafa, Piotr [39] Jean-Bastien Grill, Florian Strub, Florent Altché, Corentin
Padlewski, Jonathan Heek, Justin Gilmer, Andreas Peter Tallec, Pierre Richemond, Elena Buchatskaya, Carl Doersch,
Steiner, Mathilde Caron, Robert Geirhos, Ibrahim Alabdul- Bernardo Avila Pires, Zhaohan Guo, Mohammad Ghesh-
mohsin, et al. Scaling vision transformers to 22 billion laghi Azar, et al. Bootstrap your own latent-a new approach
parameters. In ICML, 2023. 3 to self-supervised learning. In NeurIPS, 2020. 3, 5, 14, 15
[24] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li,
[40] Raia Hadsell, Sumit Chopra, and Yann LeCun. Dimension-
and Li Fei-Fei. Imagenet: A large-scale hierarchical image
ality reduction by learning an invariant mapping. In CVPR,
database. In CVPR, 2009. 1, 3, 5
2006. 3
[25] Jeff Donahue and Karen Simonyan. Large scale adversarial
representation learning. NeurIPS, 2019. 3 [41] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun.
[26] Jeff Donahue, Yangqing Jia, Oriol Vinyals, Judy Hoffman, Deep residual learning for image recognition. In CVPR,
Ning Zhang, Eric Tzeng, and Trevor Darrell. Decaf: A deep 2016. 3
convolutional activation feature for generic visual recogni- [42] Kaiming He, Ross Girshick, and Piotr Dollár. Rethinking
tion. In ICML, 2014. 3 imagenet pre-training. In ICCV, 2019. 3
[27] Xiaoyi Dong, Jianmin Bao, Ting Zhang, Dongdong Chen, [43] Kaiming He, Haoqi Fan, Yuxin Wu, Saining Xie, and Ross
Weiming Zhang, Lu Yuan, Dong Chen, Fang Wen, Nenghai Girshick. Momentum contrast for unsupervised visual rep-
Yu, and Baining Guo. Peco: Perceptual codebook for bert resentation learning. In CVPR, 2020. 3, 5, 14
pre-training of vision transformers. In AAAI, 2023. 8 [44] Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Pi-
[28] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, otr Dollár, and Ross Girshick. Masked autoencoders are
Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, scalable vision learners. In CVPR, 2022. 3, 8, 14
Mostafa Dehghani, Matthias Minderer, Georg Heigold, Syl- [45] Ruifei He, Shuyang Sun, Xin Yu, Chuhui Xue, Wenqing
vain Gelly, et al. An image is worth 16x16 words: Trans- Zhang, Philip Torr, Song Bai, and Xiaojuan Qi. Is synthetic
formers for image recognition at scale. arXiv preprint data from generative models ready for image recognition?
arXiv:2010.11929, 2020. 2, 6 arXiv preprint arXiv:2210.07574, 2022. 2, 3
[29] Mark Everingham, Luc Van Gool, Christopher KI Williams, [46] Patrick Helber, Benjamin Bischke, Andreas Dengel, and
John Winn, and Andrew Zisserman. The pascal visual object Damian Borth. Eurosat: A novel dataset and deep learning
classes (voc) challenge. IJCV, 2010. 6 benchmark for land use and land cover classification. IEEE
[30] Lijie Fan, Kaifeng Chen, Dilip Krishnan, Dina Katabi, Journal of Selected Topics in Applied Earth Observations
Phillip Isola, and Yonglong Tian. Scaling laws of synthetic and Remote Sensing, 2019. 8
images for model training ... for now. arXiv:2312.04567,
[47] Gao Huang, Yu Sun, Zhuang Liu, Daniel Sedra, and Kilian Q
2023. 2, 3, 8
Weinberger. Deep networks with stochastic depth. In ECCV,
[31] Lijie Fan, Dilip Krishnan, Phillip Isola, Dina Katabi, and
2016. 14
Yonglong Tian. Improving clip training with language
rewrites. In NeurIPS, 2023. 3 [48] Ali Jahanian, Xavier Puig, Yonglong Tian, and Phillip Isola.
[32] Yuxin Fang, Quan Sun, Xinggang Wang, Tiejun Huang, Xin- Generative models as a data source for multiview represen-
long Wang, and Yue Cao. Eva-02: A visual representation tation learning. arXiv preprint arXiv:2106.05258, 2021. 2,
for neon genesis. arXiv preprint arXiv:2303.11331, 2023. 3
10 [49] Chao Jia, Yinfei Yang, Ye Xia, Yi-Ting Chen, Zarana Parekh,
[33] Yuxin Fang, Wen Wang, Binhui Xie, Quan Sun, Ledell Wu, Hieu Pham, Quoc Le, Yun-Hsuan Sung, Zhen Li, and Tom
Xinggang Wang, Tiejun Huang, Xinlong Wang, and Yue Cao. Duerig. Scaling up visual and vision-language representa-
Eva: Exploring the limits of masked visual representation tion learning with noisy text supervision. In ICML, 2021.
learning at scale. In CVPR, 2023. 3 3
[34] Li Fei-Fei, Rob Fergus, and Pietro Perona. Learning gener- [50] Prannay Khosla, Piotr Teterwak, Chen Wang, Aaron Sarna,
ative visual models from few training examples: An incre- Yonglong Tian, Phillip Isola, Aaron Maschinot, Ce Liu, and
mental bayesian approach tested on 101 object categories. Dilip Krishnan. Supervised contrastive learning. In NeurIPS,
In CVPR, 2004. 5, 6 2020. 2, 4
[35] Andreas Geiger, Philip Lenz, and Raquel Urtasun. Are we [51] Jonathan Krause, Jia Deng, Michael Stark, and Li Fei-Fei.
ready for autonomous driving? the kitti vision benchmark Collecting a large-scale dataset of fine-grained cars. tech
suite. In CVPR, 2012. 8 report, 2013. 5, 6
[52] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. arXiv preprint arXiv:2304.07193, 2023. 1, 2, 3, 5, 7, 9, 10,
Imagenet classification with deep convolutional neural net- 15
works. In NeurIPS, 2012. 3 [69] Omkar M Parkhi, Andrea Vedaldi, Andrew Zisserman, and
[53] Varun Kumar, Ashutosh Choudhary, and Eunah Cho. Data CV Jawahar. Cats and dogs. In CVPR, 2012. 5, 6
augmentation using pre-trained transformer models. arXiv [70] Zhiliang Peng, Li Dong, Hangbo Bao, Qixiang Ye, and Furu
preprint arXiv:2003.02245, 2020. 3 Wei. Beit v2: Masked image modeling with vector-quantized
[54] Yann LeCun. The mnist database of handwritten digits. visual tokenizers. arXiv preprint arXiv:2208.06366, 2022.
http://yann. lecun. com/exdb/mnist/, 1998. 8 8, 10
[55] Alexander C Li, Mihir Prabhudesai, Shivam Duggal, Ellis [71] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya
Brown, and Deepak Pathak. Your diffusion model is secretly Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry,
a zero-shot classifier. arXiv preprint arXiv:2303.16203, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning
2023. 3 transferable visual models from natural language supervi-
[56] Tianhong Li, Huiwen Chang, Shlok Mishra, Han Zhang, sion. In ICML, 2021. 1, 2, 3, 7, 8, 9
Dina Katabi, and Dilip Krishnan. Mage: Masked generative [72] Robin Rombach, Andreas Blattmann, Dominik Lorenz,
encoder to unify representation learning and image synthesis. Patrick Esser, and Björn Ommer. High-resolution image
In CVPR, 2023. 3 synthesis with latent diffusion models. In CVPR, 2022. 10
[57] Yanghao Li, Saining Xie, Xinlei Chen, Piotr Dollar, Kaim- [73] Robin Rombach, Andreas Blattmann, Dominik Lorenz,
ing He, and Ross Girshick. Benchmarking detection Patrick Esser, and Björn Ommer. High-resolution image
transfer learning with vision transformers. arXiv preprint synthesis with latent diffusion models. In CVPR, 2022. 3
arXiv:2111.11429, 2021. 3 [74] Andrew Rosenberg, Yu Zhang, Bhuvana Ramabhadran, Ye
[58] Hao Liu, Tom Zahavy, Volodymyr Mnih, and Satinder Singh. Jia, Pedro Moreno, Yonghui Wu, and Zelin Wu. Speech
Palm up: Playing in the latent manifold for unsupervised recognition with augmented synthesized speech. In ASRU,
pretraining. arXiv preprint arXiv:2210.10913, 2022. 3 2019. 3
[59] Ilya Loshchilov and Frank Hutter. Decoupled weight decay [75] Nick Rossenbach, Albert Zeyer, Ralf Schlüter, and Hermann
regularization. arXiv preprint arXiv:1711.05101, 2017. 14, Ney. Generating synthetic audio data for attention-based
15 speech recognition systems. In ICASSP, 2020. 3
[60] Subhransu Maji, Esa Rahtu, Juho Kannala, Matthew [76] Yangjun Ruan, Saurabh Singh, Warren Morningstar, Alexan-
Blaschko, and Andrea Vedaldi. Fine-grained visual clas- der A Alemi, Sergey Ioffe, Ian Fischer, and Joshua V Dillon.
sification of aircraft. arXiv:1306.5151, 2013. 5, 6 Weighted ensemble self-supervised learning. arXiv preprint
[61] Nikolaus Mayer, Eddy Ilg, Philip Hausser, Philipp Fischer, arXiv:2211.09981, 2022. 5
Daniel Cremers, Alexey Dosovitskiy, and Thomas Brox. A [77] Chitwan Saharia, William Chan, Saurabh Saxena, Lala
large dataset to train convolutional networks for disparity, Li, Jay Whang, Emily L Denton, Kamyar Ghasemipour,
optical flow, and scene flow estimation. In CVPR, 2016. 3 Raphael Gontijo Lopes, Burcu Karagol Ayan, Tim Salimans,
[62] Yu Meng, Jiaxin Huang, Yu Zhang, and Jiawei Han. Gener- et al. Photorealistic text-to-image diffusion models with
ating training data with language models: Towards zero-shot deep language understanding. In NeurIPS, 2022. 10
language understanding. arXiv preprint arXiv:2202.04538, [78] Mert Bulent Sariyildiz, Karteek Alahari, Diane Larlus, and
2022. 3 Yannis Kalantidis. Fake it till you make it: Learning trans-
[63] Masato Mimura, Sei Ueno, Hirofumi Inaguma, Shinsuke ferable representations from synthetic imagenet clones. In
Sakai, and Tatsuya Kawahara. Leveraging sequence-to- CVPR, 2023. 2, 3, 8
sequence speech synthesis for enhancing acoustic-to-word [79] Saurabh Saxena, Charles Herrmann, Junhwa Hur, Abhishek
speech recognition. In SLT, 2018. 3 Kar, Mohammad Norouzi, Deqing Sun, and David J Fleet.
[64] Maria-Elena Nilsback and Andrew Zisserman. Automated The surprising effectiveness of diffusion models for opti-
flower classification over a large number of classes. In cal flow and monocular depth estimation. arXiv preprint
Indian Conference on Computer Vision, Graphics & Image arXiv:2306.01923, 2023. 3
Processing, 2008. 5, 6 [80] Christoph Schuhmann, Romain Beaumont, Richard Vencu,
[65] Mehdi Noroozi and Paolo Favaro. Unsupervised learning of Cade Gordon, Ross Wightman, Mehdi Cherti, Theo
visual representations by solving jigsaw puzzles. In ECCV, Coombes, Aarush Katta, Clayton Mullis, Mitchell Worts-
2016. 3 man, et al. Laion-5b: An open large-scale dataset for training
[66] Aaron van den Oord, Yazhe Li, and Oriol Vinyals. Repre- next generation image-text models. In NeurIPS, 2022. 1
sentation learning with contrastive predictive coding. arXiv [81] Ali Sharif Razavian, Hossein Azizpour, Josephine Sullivan,
preprint arXiv:1807.03748, 2018. 3 and Stefan Carlsson. Cnn features off-the-shelf: an astound-
[67] OpenAI. Gpt-4 technical report. arXiv preprint ing baseline for recognition. In CVPR workshops, 2014.
arXiv:2303.08774, 2023. 3, 4 3
[68] Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy [82] Noam Shazeer. Glu variants improve transformer. arXiv
Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, preprint arXiv:2002.05202, 2020. 7
Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, et al. [83] David Silver, Julian Schrittwieser, Karen Simonyan, Ioannis
Dinov2: Learning robust visual features without supervision. Antonoglou, Aja Huang, Arthur Guez, Thomas Hubert, Lu-
cas Baker, Matthew Lai, Adrian Bolton, et al. Mastering the [100] Zhenda Xie, Zheng Zhang, Yue Cao, Yutong Lin, Jianmin
game of go without human knowledge. Nature, 2017. 3 Bao, Zhuliang Yao, Qi Dai, and Han Hu. Simmim: A simple
[84] Karen Simonyan and Andrew Zisserman. Very deep convo- framework for masked image modeling. In CVPR, 2022. 3,
lutional networks for large-scale image recognition. arXiv 8
preprint arXiv:1409.1556, 2014. 3 [101] Jiarui Xu, Sifei Liu, Arash Vahdat, Wonmin Byeon, Xiao-
[85] Johannes Stallkamp, Marc Schlipsing, Jan Salmen, and long Wang, and Shalini De Mello. Open-vocabulary panop-
[85] …Christian Igel. The german traffic sign recognition benchmark: a multi-class classification competition. In IJCNN, 2011. 8
[86] Andreas Steiner, Alexander Kolesnikov, Xiaohua Zhai, Ross Wightman, Jakob Uszkoreit, and Lucas Beyer. How to train your vit? data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270, 2021. 7
[87] Rohan Taori, Ishaan Gulrajani, Tianyi Zhang, Yann Dubois, Xuechen Li, Carlos Guestrin, Percy Liang, and Tatsunori B Hashimoto. Alpaca: A strong, replicable instruction-following model. Stanford Center for Research on Foundation Models, 2023. 3
[88] Yonglong Tian, Dilip Krishnan, and Phillip Isola. Contrastive multiview coding. arXiv:1906.05849, 2019. 3, 14
[89] Yonglong Tian, Chen Sun, Ben Poole, Dilip Krishnan, Cordelia Schmid, and Phillip Isola. What makes for good views for contrastive learning? In NeurIPS, 2020. 3
[90] Yonglong Tian, Olivier J Henaff, and Aäron van den Oord. Divide and contrast: Self-supervised learning from uncurated data. In ICCV, 2021. 1
[91] Yonglong Tian, Lijie Fan, Phillip Isola, Huiwen Chang, and Dilip Krishnan. Stablerep: Synthetic images from text-to-image models make strong visual representation learners. In NeurIPS, 2023. 1, 2, 3, 4, 5, 6, 7, 14
[92] Hugo Touvron, Matthieu Cord, Alexandre Sablayrolles, Gabriel Synnaeve, and Hervé Jégou. Going deeper with image transformers. In ICCV, 2021. 7
[93] Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023. 3, 4
[94] Gul Varol, Javier Romero, Xavier Martin, Naureen Mahmood, Michael J Black, Ivan Laptev, and Cordelia Schmid. Learning from synthetic humans. In CVPR, 2017. 3
[95] Tongzhou Wang and Phillip Isola. Understanding contrastive representation learning through alignment and uniformity on the hypersphere. In ICML, 2020. 3
[96] Chen Wei, Haoqi Fan, Saining Xie, Chao-Yuan Wu, Alan Yuille, and Christoph Feichtenhofer. Masked feature prediction for self-supervised visual pre-training. In CVPR, 2022. 3
[97] Zhirong Wu, Yuanjun Xiong, Stella X Yu, and Dahua Lin. Unsupervised feature learning via non-parametric instance discrimination. In CVPR, 2018. 3
[98] Jianxiong Xiao, James Hays, Krista A Ehinger, Aude Oliva, and Antonio Torralba. Sun database: Large-scale scene recognition from abbey to zoo. In CVPR, 2010. 5, 6
[99] Tete Xiao, Yingcheng Liu, Bolei Zhou, Yuning Jiang, and Jian Sun. Unified perceptual parsing for scene understanding. In ECCV, 2018. 8, 14
[101] …tic segmentation with text-to-image diffusion models. In CVPR, 2023. 3
[102] Yiben Yang, Chaitanya Malaviya, Jared Fernandez, Swabha Swayamdipta, Ronan Le Bras, Ji-Ping Wang, Chandra Bhagavatula, Yejin Choi, and Doug Downey. Generative data augmentation for commonsense reasoning. arXiv preprint arXiv:2004.11546, 2020. 3
[103] Sangdoo Yun, Dongyoon Han, Seong Joon Oh, Sanghyuk Chun, Junsuk Choe, and Youngjoon Yoo. Cutmix: Regularization strategy to train strong classifiers with localizable features. In ICCV, 2019. 14
[104] Xiaohua Zhai, Alexander Kolesnikov, Neil Houlsby, and Lucas Beyer. Scaling vision transformers. In CVPR, 2022. 7
[105] Hongyi Zhang, Moustapha Cisse, Yann N Dauphin, and David Lopez-Paz. mixup: Beyond empirical risk minimization. arXiv preprint arXiv:1710.09412, 2017. 14
[106] Richard Zhang, Phillip Isola, and Alexei A Efros. Colorful image colorization. In ECCV, 2016. 2
[107] Wenliang Zhao, Yongming Rao, Zuyan Liu, Benlin Liu, Jie Zhou, and Jiwen Lu. Unleashing text-to-image diffusion models for visual perception. arXiv preprint arXiv:2303.02153, 2023. 3
[108] Bolei Zhou, Aditya Khosla, Agata Lapedriza, Aude Oliva, and Antonio Torralba. Learning deep features for discriminative localization. In CVPR, 2016. 3, 5
[109] Bolei Zhou, Hang Zhao, Xavier Puig, Tete Xiao, Sanja Fidler, Adela Barriuso, and Antonio Torralba. Semantic understanding of scenes through the ade20k dataset. IJCV, 2019. 8, 14
[110] Jinghao Zhou, Chen Wei, Huiyu Wang, Wei Shen, Cihang Xie, Alan Yuille, and Tao Kong. ibot: Image bert pre-training with online tokenizer. arXiv preprint arXiv:2111.07832, 2021. 2, 3, 5, 8
[111] Yongchao Zhou, Hshmat Sahak, and Jimmy Ba. Training on thin air: Improve image classification with generated data. arXiv preprint arXiv:2305.15316, 2023. 3
A. Concept Sampling

The concepts used to synthesize captions are randomly sampled from the names of various datasets. The rough ratios are presented in Table 11. It is likely that different combinations of these ratios lead to different results, but we do not optimize over this dimension. For example, we simply concatenate IN-21k concepts with the classes of other datasets (e.g., Caltech-101, Pets), and do uniform sampling from the concatenated list. This may lead to under-sampling for other datasets, as the list is dominated by IN-21k classes.

source prob.
IN-1k 0.47
Aircraft 0.05
Cars 0.05
Food 0.05
Flowers 0.03
Places-365, SUN397 0.09
IN-21k and others 0.26

Table 11. Rough concept sampling probabilities.
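As a concrete illustration of the sampling scheme described above, the minimal Python sketch below concatenates per-dataset concept lists and draws concepts uniformly from the combined list; the variable names and the toy lists are ours for illustration, not the actual concept files.

```python
import random

# Hypothetical per-dataset concept lists; in practice these would be the
# full class-name lists of IN-1k, IN-21k, Aircraft, Cars, Food, Flowers, etc.
concept_sources = {
    "IN-1k": ["coucal", "bee eater", "zebra"],
    "Aircraft": ["Boeing 737", "A380"],
    "Places-365": ["bakery", "heliport"],
}

# Uniform sampling over the concatenated list: the effective per-source
# probability (Table 11) is simply proportional to each list's length.
all_concepts = [c for names in concept_sources.values() for c in names]

def sample_concepts(n: int, seed: int = 0) -> list[str]:
    """Draw n concepts uniformly (with replacement) from the combined list."""
    rng = random.Random(seed)
    return [rng.choice(all_concepts) for _ in range(n)]

print(sample_concepts(5))
```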
B. Implementation Details

B.1. Pre-training

The setting for our final long-schedule training in Section 4.2 is summarized in Table 12, where models are trained for 500k steps with a batch size of 8192 captions. For the ablation studies presented in Section 4.1, we only train for 85k steps with a batch size of 2048 captions; for the scaling plots in Section 4.3, we train all models for 300k steps with a batch size of 2048.

config value
batch size 8192
optimizer AdamW [59]
peak learning rate 2e-3 (B), 1.5e-3 (L)
weight decay 0.04 –> 0.2, cosine
optimizer momentum β1, β2 = 0.9, 0.999
learning rate schedule cosine decay
steps 500k
warmup steps 80k
stoch. depth [47] 0.1 (B), 0.4 (L)
augmentation Downsample [91] + BYOL Aug. [39]

Table 12. SynCLR pre-training settings.
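Table 12 lists a weight decay that increases from 0.04 to 0.2 under a cosine schedule. The exact parameterization is not spelled out here; the sketch below shows one common per-step cosine interpolation, assuming the schedule is applied per training step.

```python
import math

def cosine_weight_decay(step: int, total_steps: int,
                        wd_start: float = 0.04, wd_end: float = 0.2) -> float:
    """Cosine interpolation of weight decay from wd_start to wd_end.

    Assumed per-step schedule: returns wd_start at step 0 and wd_end at the
    final step, following the usual half-cosine shape.
    """
    t = min(max(step / max(total_steps, 1), 0.0), 1.0)
    return wd_end + 0.5 * (wd_start - wd_end) * (1.0 + math.cos(math.pi * t))

# e.g. the 500k-step schedule from Table 12
print(cosine_weight_decay(0, 500_000), cosine_weight_decay(500_000, 500_000))
```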
B.2. ImageNet Linear Probing

…which tries to concatenate cls token with average pooled patch tokens and sweep over whether to use multiple layers. We follow prior work [11, 14] to train the linear classifier. It has been generally observed that regularization such as weight decay hurts the performance [43, 88]. Therefore, we set weight decay to 0, and we sweep the base_lr over {0.1, 0.2, 0.5, 1, 2, 5, 10, 20, 50} × 10−2.

config value
batch size 1024
optimizer SGD
base learning rate sweep
peak learning rate blr × bsz/256
weight decay 0
optimizer momentum 0.9
learning rate schedule cosine decay
epochs 90
augmentation RandomResizedCrop, Flip

Table 13. ImageNet linear probing settings.

B.3. End-to-End ImageNet fine-tuning

Following common practice [5, 44], we append a linear classifier on top of the CLS token of the last transformer block, and fine-tune the whole network. We use layer-wise lr decay [20]. Table 14 shows the settings.

config value
optimizer AdamW [59]
base learning rate 5e-5
peak learning rate blr × bsz/256
optimizer momentum β1, β2 = 0.9, 0.999
layer-wise lr decay 0.65 (B), 0.8 (L)
batch size 1024
learning rate schedule cosine decay
warmup epochs 20 (B), 5 (L)
epochs 100 (B), 50 (L)
RandAugment [21] 9/0.5
label smoothing 0.1 (B), 0.2 (L)
erasing prob. 0.25
mixup [105] 0.8
cutmix [103] 1.0
stoch. depth [47] 0.1 (B), 0.3 (L)
test crop ratio 0.95 (B), 1.0 (L)
ema 0.9999

Table 14. ImageNet end-to-end fine-tuning settings.
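Both probing and fine-tuning scale the peak learning rate as blr × bsz/256, and fine-tuning additionally applies layer-wise lr decay [20]. The sketch below illustrates the usual convention (the decay factor raised to the distance from the last block); the helper functions are ours, not part of the released code.

```python
def peak_lr(base_lr: float, batch_size: int) -> float:
    """Linear lr scaling rule used in Tables 13 and 14: blr x bsz / 256."""
    return base_lr * batch_size / 256

def layerwise_lrs(peak: float, num_blocks: int = 12, decay: float = 0.65) -> list[float]:
    """Per-block learning rates under the common layer-wise decay convention:
    blocks closer to the output keep a larger fraction of the peak lr.
    Index 0 is the first block, index num_blocks-1 the last block."""
    return [peak * decay ** (num_blocks - 1 - i) for i in range(num_blocks)]

# e.g. ViT-B fine-tuning from Table 14: blr=5e-5, bsz=1024, decay=0.65
lr = peak_lr(5e-5, 1024)
print(lr, layerwise_lrs(lr, num_blocks=12, decay=0.65)[:3])
```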
…of 512×512 for models with a patch size of 16×16 and a resolution of 518×518 for models with a patch size of 14×14. The hyper-parameters are summarized in Table 15.

config value
batch size 32 (B), 16 (L)
optimizer AdamW [59]
peak learning rate 8e-5
optimizer momentum β1, β2 = 0.9, 0.999
weight decay 0.05
layer-wise lr decay 0.6 (B), 0.8 (L)
steps 60k (B), 160k (L)
warmup steps 1500
stoch. depth 0.1 (B), 0.2 (L)
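The two input resolutions above are presumably chosen so that the image divides evenly into ViT patches (512 = 32 × 16 and 518 = 37 × 14); the helper below illustrates that assumption by rounding a target resolution to the nearest multiple of the patch size (our own utility, not from the paper).

```python
def resolution_for_patch(target: int, patch: int) -> int:
    """Round `target` to the nearest multiple of `patch` so the image
    divides evenly into patches (e.g. 512 for patch 16, 518 for patch 14)."""
    return round(target / patch) * patch

assert resolution_for_patch(512, 16) == 512   # 32 x 32 patches
assert resolution_for_patch(512, 14) == 518   # 37 x 37 patches
```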
Table 16. Detailed in-context learning examples for Template 1: c –> Caption. Here c is the concept.
1 coucal –> A vibrant coucal is perched on the branch of a lush green tree, surrounded by wildflowers.
2 bee eater –> A lively bee eater is elegantly perched on a branch, peering intently.
3 three-toed sloth –> A three-toed sloth is lazily hanging from a sturdy, tropical rainforest tree.
4 hay –> In the serene countryside, hundreds of neatly stacked hay bales lay scattered under the
softly glowing golden sunset sky.
5 station wagon –> A shiny, red station wagon is parked under the dappled shade of a large oak tree,
highlighting its spacious and family-friendly design.
6 zebra –> A zebra is gallantly trotting across the vast, sunlit plains of the African savannah, creating
a captivating black and white spectacle.
7 vase –> In the well-lit living room, a beautifully designed, delicate vase stands out as the centerpiece, exuding an aura of elegance.
8 barber chair –> A shiny black barber chair sits invitingly in a bustling, well-lit barbershop.
9 carbonara –> A heaping plate of creamy carbonara pasta topped with fresh parsley sprigs.
10 mink –> In the midst of a dense forest with shimmering green leaves, a sleek mink gracefully
navigates the underbrush, showcasing its rich, brown fur.
11 small white butterfly –> A small white butterfly gracefully flutters amongst vibrant, blooming summer flowers.
12 christmas stocking –> A vibrant red Christmas stocking is hanging delicately from a festively decorated mantelpiece.
13 horse-drawn vehicle –> An antique horse-drawn vehicle is stationed amidst a peaceful country landscape, its
rustic wooden structure gleaming under the warm afternoon sun.
14 ruler measuring stick –> A manual craftsman is precisely measuring a wooden log with a ruler stick.
15 picket fence –> A tranquil suburban scene featuring multiple white picket fences surrounding well-
maintained green lawns, punctuated by diverse, colorful flowerbeds.
16 suspension bridge –> Depicting a long suspension bridge, its steel cables elegantly stretching towards the sky,
connecting two ends over a scenic river.
17 brain coral –> A vibrant brain coral stands out amidst the serene backdrop of underwater marine life.
18 revolver –> Multiple antique revolvers lie on a wooden table, gleaming under soft, ambient light.
19 slip-on shoe –> A pair of slip-on shoes, with their sleek, black leather exterior and comfortable, cushioned
interior, are neatly placed on a wooden floor.
20 hand-held computer –> A hand-held computer, compact and portable, rests on a well-lit desk, surrounded by
various technological paraphernalia and a steaming cup of coffee.
21 mattress –> A teddy bear lying face down on a bedspread covered mattress in front of a window.
22 refrigerator –> A nicely decorated kitchen with metallic refrigerator and blue counter.
23 ball –> Silver balls are lined up in the sand as people mill about in the background.
24 wheel –> The motorcycle’s gleaming steering wheel, vivid red door reflected in the side mirror,
and a youth passing by, creating a dynamic urban tableau.
25 plane –> A group of trick planes turned upside down leaving smoke trails.
26 vehicle –> Army vehicles, including a U.S. Army jeep and aircraft in a hangar or on display
27 boy –> a little boy wearing sunglasses laying on a shelf in a basement.
28 fence –> a man standing near a fence as reflected in a side-view mirror of a red car.
29 wood table –> A footed glass with water in front of a glass with ice tea, and green serpentine bottle
with pink flowers, all on a wood table in front of chair, with a window to city view.
30 toilet –> A black and white toilet sitting in a bathroom next to a plant filled with waste.
31 table lamp –> A textured brass table lamp, casting a warm, golden glow, accents a cozy reading nook
beside a leather armchair and a stack of books.
32 hair dryer –> A modern sleek and white hair dryer, with a textured grip, stands next to a set of
hairbrushes.
33 street sign –> The street signs indicate which way a car can and cannot turn while the signal light
controls traffic.
34 instrument –> Man dressed in Native American clothes protecting musical instruments from the rain
with an umbrella.
35 train –> A man and a cow’s faces are near each other as a train passes by on a bridge.
36 giraffe –> A couple of large giraffe standing next to each other.
37 red admiral butterfly –> a red admiral butterfly, alights upon a dew-kissed sunflower, wings glistening under the
soft morning light.
38 stupa –> Surrounded by verdant foliage, a white stupa rises, adorned with golden accents and
intricate patterns, while devotees circle its base offering prayers.
39 elephant –> A group of elephants being led into the water.
40 bottle –> Motorcycles parked on a street with a bottle sitting on the seat of the nearest the camera.
41 trombone –> On a polished wooden stage, a gleaming brass trombone rests, its slide extended, next to
scattered sheet music and a muted trumpet.
42 keyboard –> Sleek black keyboard with illuminated backlit keys, a soft wrist rest, and a nearby
wireless mouse on a textured matte desk surface.
43 bear –> The brown bear sits watching another bear climb the rocks
44 snowboard –> A man standing next to his snowboard posing for the camera.
45 railway –> a woman and her son walking along the tracks of a disused railway.
46 sand –> the waves and the sand on the beach close up
47 pixel –> very colorful series of squares or pixels in all the colors of the spectrum , from light to
dark
48 cigar –> a burning cigar in a glass ashtray with a blurred background.
49 music –> happy girl listening music on headphones and using tablet in the outdoor cafe.
50 earring –> this gorgeous pair of earrings were featured in april issue.
51 cliff –> Steep cliff, jagged edges against azure sky, with seabirds soaring and waves crashing
below.
52 corn cob –> Fresh corn cob, golden kernels glistening with dew, nestled amid green husks in a sunlit
field.
53 archaeological excavation –> In this intriguing scene, archaeologists meticulously uncover ancient relics at an archaeological excavation site filled with historical secrets and enigmas.
54 formal garden –> This is an immaculately kept formal garden, with perfectly trimmed hedges, colorful,
well-arranged flower beds, and classic statuary, giving a vibe of tranquil sophistication.
55 veterinarians office –> The busy veterinarian’s office is a hive of activity with pets awaiting treatment and care.
56 elevator –> A modern, well-lit elevator interior with shiny metal walls and sleek buttons.
57 heliport –> Situated in a lively area, the heliport stands out with numerous helicopters taking off and
landing against the city’s skyline.
58 airport terminal –> In the spacious airport terminal, travelers hurriedly navigate through check-ins and
security, making it a hive of constant activity.
59 car interior –> Inside the car, the leather seats exude luxury, contrasted by the high-tech dashboard,
creating an atmosphere of sleek comfort and convenience.
60 train interior –> The inside of the train offers a spacious setting with numerous comfortable seats.
61 candy store –> The sweet aroma of sugared treats fills the air in a vibrant candy store, adorned with
colourful candies and cheerful customers.
62 bus station –> The bustling bus station thrums with restless energy, as travelers navigate through the
crowded space, awaiting their journeys amid the echoes of departing buses.
63 castle –> Nestled amidst towering mountains, the majestic castle spews ancient grandeur, with its
stone walls and towering turrets exuding tranquility and timeless mystique.
64 palace –> The grand palace exudes regality, radiant under the sun, showcasing ornate decorations,
intricate sculptures, and exquisite architectural sophistication.
65 kitchen –> The heart of the home unfolds in the kitchen, characterized by stainless steel appliances,
navy blue cabinets, and a patterned tile backsplash.
66 raceway –> The high-speed adrenaline-filled atmosphere of the raceway is pulsing with the roars of
powerful engines and excited cheering fans.
67 bakery –> The warm, inviting bakery is filled with the intoxicating aroma of fresh bread, assorted
pastries, and brewing coffee.
68 medina –> This ancient, labyrinth-like medina exudes an air of mystique with its vibrantly decorated
shops lining narrow, stone-cobbled pathways.
69 skyscraper –> The city skyline is dominated by towering skyscrapers, creating a captivating blend of
technology and architectural innovation.
70 supermarket –> The supermarket scene is lively, filled with individuals scanning shelves, children reaching for treats, and clerks restocking fresh produce.
71 closet –> The compact closet, brimming with clothes and shoes, exudes a feeling of organization.
72 assembly line –> In the heart of a busy factory, an orderly assembly line hums with continuous activity,
filled with workers focused on their precision tasks.
73 palace room –> A man in military dress uniform stands in an ornate palace room with antique furniture
and Christmas decorations.
74 barn doorway –> A farmer holding an animal back while another farmer stands in a barn doorway.
75 food court –> A bustling food court with a variety of culinary stalls, featuring vibrant signage, aromatic
dishes, and communal seating, creates a diverse dining experience.
76 mountain –> Majestic mountains, their peaks dusted with snow, overlook a serene alpine lake where
hikers and photographers gather to enjoy the breathtaking scenery.
77 squash court –> Against a clear glass wall, a squash court with gleaming wooden floors, white boundary
lines, and two rackets awaits players.
78 subway station –> Dimly lit subway station with graffiti-covered walls, commuters waiting
79 restaurant –> Cozy restaurant with wooden tables, ambient lighting, patrons chatting, and plates filled
with colorful dishes, framed by exposed brick walls and hanging green plants.
80 field –> there is a large heard of cows and a man standing on a field.
81 aquarium –> Amidst vivid coral formations, an aquarium teems with colorful fish, shimmering under
soft blue lights.
82 market –> A large group of bananas on a table outside in the market.
83 park –> a young boy is skating on ramps at a park
84 beach –> old fishing boats beached on a coastal beach in countryside.
85 grass –> little boy sitting on the grass with drone and remote controller.
86 woven –> The woven basket’s intricate pattern creates a visually captivating and tactile surface.
87 knitted –> The knitted blanket envelops with cozy warmth
88 flecked –> The stone surface was flecked, giving it a uniquely speckled and rough appearance.
89 bubbly –> The liquid gleamed, showcasing its bubbly, effervescent texture vividly.
90 cobwebbed –> The dusty corner was cobwebbed, displaying years of untouched, eerie beauty.
91 stained –> A weather-worn wall manifests an intriguing pattern of stained texture.
92 scaly –> The image showcases a close-up of a lizard’s scaly, rough texture.
93 meshed –> A patterned image depicting the intricate, tightly-knit texture of meshed fabric.
94 waffled –> A fresh, golden-brown waffle displays its distinct crisply waffled texture invitingly.
95 pitted –> The image portrays an intriguing terrain, characterized by a pitted, moon-like surface.
96 studded –> A studded leather jacket gleams, highlighting its rough, tactile texture.
97 crystalline –> The picture showcases an exquisite, crystalline texture with stunning brilliance and
clarity.
98 gauzy –> A delicate veil of gauzy texture enhances the ethereal, dreamy atmosphere.
99 zigzagged –> The photo captures the zigzagged texture, emphasizing the rhythmic, sharp-edged patterns.
100 pleated –> A flowing skirt delicately showcasing the intricate detail of pleated texture.
101 veined –> A detailed image showcasing the intricate, veined texture of a leaf.
102 spiralled –> The spiralled texture of the seashell creates a captivating, tactile pattern.
103 lacelike –> The delicate veil features an intricate, lacelike texture, exuding elegant sophistication.
104 smeared –> A wall coated with thick, smeared paint exudes a rough texture.
105 crosshatched –> A worn, vintage book cover, richly crosshatched, exuding old-world charm.
106 particle –> abstract background of a heart made up of particles.
Table 17. Detailed in-context learning examples for Template 2: c,bg –> caption. Here c is the concept, and bg is the background.
107 stick insect, undergrowth –> A stick insect, masterfully camouflaged, clings to a fern amidst the sprawling, dense undergrowth of a lush, tropical forest.
108 black swan, public garden –> In the peaceful ambiance of a lush public garden, a majestic black swan gracefully glides across a shimmering emerald-green pond.
109 st. bernard, family photo –> In the heartwarming family photo, a gregarious St. Bernard dog is seen joyfully nestled among his adoring human companions.
110 measuring cup, food prep area –> In the food prep area, multiple transparent measuring cups are neatly organized on the marble countertop.
111 can opener, hotel room –> A sleek, stainless steel can opener is sitting on the glossy dark-wood kitchenette counter of a modern, well-appointed hotel room.
112 small white butterfly, pond side –> A delicate, small white butterfly flutters gracefully above the tranquil pond side, creating a serene image amidst lush greenery.
113 hair dryer, theatre –> A sleek, professional hair dryer is positioned center stage amidst the dramatic velvet curtains and ornate details of a bustling theatre.
114 water bottle, airport –> A reusable water bottle sits on the glossy surface of a bustling airport terminal counter, amidst a backdrop of hurried travelers and departure screens.
115 leonberger, horse ranch –> Several Leonbergers are joyfully romping around a bustling horse ranch.
116 lighter, motorhome –> In the cozy, cluttered environment of a well-traveled motorhome, a sleek silver lighter holds dominion on the rustic wooden table.
117 slug, foliage –> A solitary, glistening slug meanders slowly amidst lush, dense green foliage, leaving a slimy trail on dewy leaves in its path.
118 ring binder, education department –> The ring binder, filled with important documents, sits prominently on a well-organized desk in the bustling education department.
119 weimaraner, pet store –> A sleek, silver-gray Weimaraner is spotted curiously sniffing around various pet supplies in a well-stocked and vibrant pet store.
120 norfolk terrier, countryside –> A lively Norfolk terrier joyfully bounds across a lush, green countryside, its red fur contrasting vividly with the vast open surroundings.
121 dalmatian, apple orchard –> A lively Dalmatian is playfully darting amongst the lush rows of a bountiful apple orchard, its spots contrasting against the ruby fruits.
122 television, mountain lodge –> A sleek, modern television sits prominently against the rustic, wooden walls of an inviting mountain lodge, surrounded by pine-furnished decor.
123 guillotine, horror story –> In the shadowy landscape of a suspenseful horror story, a grim, menacing guillotine looms ominously, exuding a petrifying sense of imminent dread.
124 hot tub, condominium –> A luxurious hot tub is nestled in the private balcony of a high-rise condominium, boasting spectacular cityscape views.
125 leaf beetle, plant nurseries –> A vibrant leaf beetle is diligently navigating through a lush plant nursery, its metallic sheen contrasting against the abundant green foliage.
126 carolina anole, hiking trails –> A small Carolina Anole lizard basks in the warm sunlight, gracefully draped over a gnarled tree root next to a bustling hiking trail.
127 girl, laboratory –> teenage girl and boy working in a laboratory on an experiment.
128 tiger, forest –> Two tigers are running together in the forest.
129 sunset, lake –> Golden sunset hues reflect on a calm lake, silhouetting a lone canoeist against a backdrop
of fiery clouds.
130 building, mountain –> town of skyline over roofs of historic buildings with the mountains in the background.
131 block plane, weathered wood –> A block plane, its sharp blade gleaming, rests on weathered wood
132 olive tree, soil –> single olive tree planted in the center of a dry and cracked soil
133 hamster, pet store –> A curious hamster peers out, with pet store shelves stacked with supplies behind.
134 bag, factory –> plastic bags production line in a factory.
135 restaurant, ocean –> young pretty couple dining in a romantic atmosphere at restaurant on the boat with ocean
on the background
136 helicopter, burning forest –> a helicopter flies over a portion of burning forest.
137 pipe organ, commemoration event –> striking pipe organ dominates with its notes resonating, while a somber commemoration event unfolds in the backdrop
138 rotisserie, wedding reception –> Rotisserie turning golden meats, with a bustling wedding reception, twinkling lights, and guests mingling.
139 duck, taiga –> A group of ducks paddle on a tranquil pond, dense taiga and towering conifers looming
in the background.
140 tiger beetle, rice fields –> Amidst verdant rice fields, a shimmering tiger beetle perches prominently on a dew-kissed blade of grass.
141 girl, barn –> slow motion clip of a girl walking with her horse through a barn
142 headmaster, graduation ceremony –> the headmaster addresses the graduating seniors during graduation ceremonies.
143 businessperson, music festival –> businessperson and guest attend music festival.
144 fountain, park –> Water cascades from an ornate fountain, surrounded by autumn-hued trees in a serene
park.
145 speedboat, water –> A sleek speedboat glides on shimmering waters, powered by twin high-horsepower
outboard motors.
146 pipe, beach –> a rusty water pipe on the beach.
147 pretzel, home kitchen –> Golden pretzel rests on a wooden board, with a cozy home kitchen, pots and tiled
backsplash, behind.
148 forklift, paper mill –> A forklift transports hefty paper rolls amidst the industrial bustling paper mill.
149 lotion, therapy center –> Blue lotion bottles lined up at a thalasso therapy center by the ocean.
150 guinea pig, sand dunes –> Guinea pig exploring vast golden sand dunes, with tiny footprints trailing behind.
151 groom, wedding ceremony –> father of groom congratulating him after the wedding ceremony.
152 fishing boat, village –> fishing boats moored at fishing village a suburb of capital of the state,
153 red fox, yard –> wild red fox sitting on a partially snow covered front yard of a house in the suburbs of a
small city
154 grey wolf, woodland areas –> A grey wolf prowls silently, eyes alert, through dense, misty woodland areas with moss-covered trees.
155 cheetah, edges of swamplands –> A cheetah crouches, poised and watchful, at the lush edges of murky swamplands.
156 wine bottle, living room –> in the living room, a person is opening a wine bottle with corkscrew with wooden barrel
Table 18. Detailed in-context learning examples for Template 3: c,rel –> caption. Here c is the concept, and rel is the relation.
157 product packet / packaging, next to –> A vibrant product packet, adorned with colorful labels and intricate designs, is neatly placed next to an elegant crystal glass.
158 croquet ball, behind –> A vivid, red croquet ball rests serenely, hiding behind a worn, rustic wooden fence in a
sun-kissed, lush green lawn.
159 bassoon, in front of –> A beautifully crafted bassoon stands elegantly in front of a backdrop of velvet curtains,
ready to perform at a concert.
160 grand piano, above –> A gorgeous, antique chandelier is suspended above the glossy black grand piano, illumi-
nating it with warm, opulent light.
161 bolo tie, behind –> A beautifully crafted bolo tie is casually hung, indicating its previous use, behind a rustic,
well-polished wooden shelf.
162 waffle iron, next to –> A large, black waffle iron is placed next to a sparkling glass jar filled with golden maple
syrup on a wooden countertop.
163 komodo dragon, below –> A young child grins excitedly, peering down from a secure bridge, as a colossal Komodo dragon sprawls lazily below in the wildlife park.
164 vaulted or arched ceiling, besides –> Besides the grand marble statue, glimpses of an intricate vaulted or arched ceiling add to the room’s majestic charm.
165 gossamer-winged butterfly, next to –> A lovely, vibrant gossamer-winged butterfly is gently perched next to a dew-kissed red rose in an early morning garden.
166 kit fox, in front of –> A group of small, fluffy, golden kit foxes is playfully gathered in front of a lush, green,
towering forest backdrop.
167 koala, in –> A cute, fuzzy koala is visibly relaxed, nestled contentedly in the crook of a towering,
lush green eucalyptus tree.
168 centipede, above –> A vibrant green centipede is effortlessly crawling on a tree branch, positioned distinctly
above a patch of untouched fern leaves.
169 mountain bike, above –> A mountain bike is displayed prominently above the rustic mantlepiece, showcasing its
sleek design and intricate details.
170 wallaby, above –> A fluffy, brown wallaby is leaping high, appearing as if it is effortlessly floating above a
lush, green Australian field.
171 giant panda, on –> A playful giant panda is perched on a sturdy tree branch, munching on fresh green
bamboo amidst the tranquil forest ambiance.
172 beagle, on –> A pack of adorable beagles are spotted lounging on an expansive, sunbathed meadow
with colorful wildflowers sprouting around them.
173 beach, on –> A vivid sunset is on display over a sprawling beach, casting warm hues on the waves
gently lapping at the sandy shore.
174 grey whale, on –> A voluminous grey whale is majestically breaching, its massive body on display against
the azure backdrop of the expansive ocean.
175 tractor, in front of –> A bright red tractor is parked in front of a rustic, weathered barn, casting long shadows
under the golden afternoon sun.
176 cabbage, besides –> A vibrant image portrays a lush, green cabbage, glistening with dewdrops, nestled
besides a rustic, wooden crate full of freshly harvested vegetables.
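Tables 16-18 list the in-context examples for the three caption templates (c –> caption, c,bg –> caption, and c,rel –> caption). The sketch below shows one way such examples could be assembled into a few-shot prompt for a language model; the string format, the `build_prompt` helper, and the number of shots are our assumptions for illustration, not the exact prompt used in the paper.

```python
import random

# A few (input, caption) pairs in the style of Tables 16-18.  For Template 1
# the input is just the concept c; for Templates 2 and 3 it is "c, bg" / "c, rel".
EXAMPLES_T2 = [
    ("black swan, public garden",
     "In the peaceful ambiance of a lush public garden, a majestic black swan "
     "gracefully glides across a shimmering emerald-green pond."),
    ("slug, foliage",
     "A solitary, glistening slug meanders slowly amidst lush, dense green foliage, "
     "leaving a slimy trail on dewy leaves in its path."),
    ("duck, taiga",
     "A group of ducks paddle on a tranquil pond, dense taiga and towering conifers "
     "looming in the background."),
]

def build_prompt(query: str, examples=EXAMPLES_T2, k: int = 3, seed: int = 0) -> str:
    """Assemble k in-context examples plus the new query into one prompt string."""
    rng = random.Random(seed)
    shots = rng.sample(examples, k=min(k, len(examples)))
    lines = [f"{inp} -> {cap}" for inp, cap in shots]
    lines.append(f"{query} ->")
    return "\n".join(lines)

print(build_prompt("red fox, yard"))
```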