0% found this document useful (0 votes)

59 views21 pages

Learning Vision From Models Rivals Learning Vision From Data (Google, MIT 2023 ) SynCLR

The document introduces SynCLR, a novel approach for learning visual representations exclusively from synthetic images and text datasets without using real image data. SynCLR learns visual concepts by training models on large synthetic datasets generated from text prompts using diffusion models. This allows vision models to be learned solely from synthetic data, rivaling models trained on large real image datasets.

Uploaded by

Joo-Haeng Lee

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

59 views21 pages

Learning Vision From Models Rivals Learning Vision From Data (Google, MIT 2023 ) SynCLR

Uploaded by

Joo-Haeng Lee

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

You are on page 1/ 21

Learning Vision from Models Rivals Learning Vision from Data

Yonglong Tian1,† Lijie Fan2,†, * Kaifeng Chen1 Dina Katabi2 Dilip Krishnan1 Phillip Isola2
1 2 †
Google Research, MIT CSAIL, equal contribution
Github Repo: https://github.com/google-research/syn-rep-learn
arXiv:2312.17742v1 [cs.CV] 28 Dec 2023

Image
Abstract Text dataset dataset

Learner
We introduce SynCLR, a novel approach for learning vi- Learning
sual representations exclusively from synthetic images and from data !
<latexit sha1_base64="sVBkjs/c+hJlwPgmxP0/MoyXMvk=">AAAB8nicbVBNS8NAEJ3Ur1q/qh69LBbBU0lE0GPRi8cK9gPSUDbbTbt0kw27E6WE/gwvHhTx6q/x5r9x2+agrQ8GHu/NMDMvTKUw6LrfTmltfWNzq7xd2dnd2z+oHh61jco04y2mpNLdkBouRcJbKFDybqo5jUPJO+H4duZ3Hrk2QiUPOEl5ENNhIiLBKFrJ72kxHCHVWj31qzW37s5BVolXkBoUaParX72BYlnME2SSGuN7bopBTjUKJvm00ssMTykb0yH3LU1ozE2Qz0+ekjOrDEiktK0EyVz9PZHT2JhJHNrOmOLILHsz8T/PzzC6DnKRpBnyhC0WRZkkqMjsfzIQmjOUE0so08LeStiIasrQplSxIXjLL6+S9kXdc+ve/WWtcVPEUYYTOIVz8OAKGnAHTWgBAwXP8ApvDjovzrvzsWgtOcXMMfyB8/kDwruRjQ==</latexit>
<latexit
!f :X !Y
<latexit sha1_base64="sVBkjs/c+hJlwPgmxP0/MoyXMvk=">AAAB8nicbVBNS8NAEJ3Ur1q/qh69LBbBU0lE0GPRi8cK9gPSUDbbTbt0kw27E6WE/gwvHhTx6q/x5r9x2+agrQ8GHu/NMDMvTKUw6LrfTmltfWNzq7xd2dnd2z+oHh61jco04y2mpNLdkBouRcJbKFDybqo5jUPJO+H4duZ3Hrk2QiUPOEl5ENNhIiLBKFrJ72kxHCHVWj31qzW37s5BVolXkBoUaParX72BYlnME2SSGuN7bopBTjUKJvm00ssMTykb0yH3LU1ozE2Qz0+ekjOrDEiktK0EyVz9PZHT2JhJHNrOmOLILHsz8T/PzzC6DnKRpBnyhC0WRZkkqMjsfzIQmjOUE0so08LeStiIasrQplSxIXjLL6+S9kXdc+ve/WWtcVPEUYYTOIVz8OAKGnAHTWgBAwXP8ApvDjovzrvzsWgtOcXMMfyB8/kDwruRjQ==</latexit>
<latexit

<latexit sha1_base64="HYPSq+DctLdg9hrnUlynFDqraoE=">AAACTnicfZDfShtBFMZn0z/GaGu0l70ZjIKUEnZFULyS1oveFBWMRrIhnJ2c3QzOziwzZ1vDklfprX2c3vZFeic6iSm0Kh4Y+PGdb2bO+ZJCSUdh+DuovXj56vVCfbGxtPzm7Upzde3MmdIK7AijjO0m4FBJjR2SpLBbWIQ8UXieXH6e9s+/oXXS6FMaF9jPIdMylQLIS4PmWrrPuzy2MhsRWGu+84tBsxW2w1nxxxDNocXmdTxYDTbioRFljpqEAud6UVhQvwJLUiicNOLSYQHiEjLsedSQo+tXs+EnfNMrQ54a648mPlP/vVFB7tw4T7wzBxq5h72p+FSvV1K616+kLkpCLe4/SkvFyfBpEnwoLQpSYw8grPSzcjECC4J8Xo1GfIh+GYtf/cNHBVogYz9UMdgsh6uJXy6LP07pOaPUf42efK7RwxQfw9l2Owrb0clO6+DTPOE6e8/W2RaL2C47YF/YMeswwa7YD3bNfga/gj/BTXB7b60F8zvv2H9Vq98BlhezeQ==</latexit>

e.g. CLIP 80.2%

synthetic captions, without any real data. We synthesize a <latexit sha1_base64="ypyoh2GfdwbXu/VtH3OnwnnQzzw=">AAACEnicjVC7SgNBFL3rM8ZX1NJmMASslt00pgzaWCqYByQhzE7uJmNmZ5eZWSEs+QcLG3/FRsTWys6/cZJsoYmFBwYO55zLnXuCRHBtPO/LWVvf2NzaLuwUd/f2Dw5LR8dNHaeKYYPFIlbtgGoUXGLDcCOwnSikUSCwFYyvZn7rAZXmsbwzkwR7ER1KHnJGjZWaNc+tdiv9UtlzvTnIKvFzUoYc/4v3S5/dQczSCKVhgmrd8b3E9DKqDGcCp8VuqjGhbEyH2LFU0gh1L5ufNCUVqwxIGCv7pCFz9edERiOtJ1FgkxE1I73szcS/vE5qwlov4zJJDUq2WBSmgpiYzPohA66QGTGxhDLF7V8JG1FFmbEtFu3p/vKhq6RZdX3P9W+r5fpl3lkBTuEMzsGHC6jDNdxAAxjcwyM8w6vz5Lw4b877Irrm5DMn8AvOxzdj35V2</latexit>
sha1_base64="DREih0KZjV6Xct/SSu8ZhKXVD3E=">AAAB7XicbVBNS8NAEJ3Ur1q/oh69LJaCp5D0Yo9FLx4r2A9oQ9lsN+3azW7Y3Qgl9D948aCIV/+PN/+N2zYHbX0w8Hhvhpl5UcqZNr7/7ZS2tnd298r7lYPDo+MT9/Sso2WmCG0TyaXqRVhTzgRtG2Y47aWK4iTitBtNbxd+94kqzaR4MLOUhgkeCxYzgo2VOg3fqw9qQ7fqe/4SaJMEBalCgdbQ/RqMJMkSKgzhWOt+4KcmzLEyjHA6rwwyTVNMpnhM+5YKnFAd5str56hmlRGKpbIlDFqqvydynGg9SyLbmWAz0eveQvzP62cmboQ5E2lmqCCrRXHGkZFo8ToaMUWJ4TNLMFHM3orIBCtMjA2oYkMI1l/eJJ26F/hecF+vNm+KOMpwAZdwBQFcQxPuoAVtIPAIz/AKb450Xpx352PVWnKKmXP4A+fzB94ljf0=</latexit>

IN lin. acc
large dataset of image captions using LLMs, then use an off-
the-shelf text-to-image model to generate multiple images Text dataset Image
Generator
corresponding to each synthetic caption. We perform visual Learner
representation learning on these synthetic images via con- Hybrid !
<latexit sha1_base64="sVBkjs/c+hJlwPgmxP0/MoyXMvk=">AAAB8nicbVBNS8NAEJ3Ur1q/qh69LBbBU0lE0GPRi8cK9gPSUDbbTbt0kw27E6WE/gwvHhTx6q/x5r9x2+agrQ8GHu/NMDMvTKUw6LrfTmltfWNzq7xd2dnd2z+oHh61jco04y2mpNLdkBouRcJbKFDybqo5jUPJO+H4duZ3Hrk2QiUPOEl5ENNhIiLBKFrJ72kxHCHVWj31qzW37s5BVolXkBoUaParX72BYlnME2SSGuN7bopBTjUKJvm00ssMTykb0yH3LU1ozE2Qz0+ekjOrDEiktK0EyVz9PZHT2JhJHNrOmOLILHsz8T/PzzC6DnKRpBnyhC0WRZkkqMjsfzIQmjOUE0so08LeStiIasrQplSxIXjLL6+S9kXdc+ve/WWtcVPEUYYTOIVz8OAKGnAHTWgBAwXP8ApvDjovzrvzsWgtOcXMMfyB8/kDwruRjQ==</latexit>
<latexit

e.g.
!f :X !Y
<latexit sha1_base64="sVBkjs/c+hJlwPgmxP0/MoyXMvk=">AAAB8nicbVBNS8NAEJ3Ur1q/qh69LBbBU0lE0GPRi8cK9gPSUDbbTbt0kw27E6WE/gwvHhTx6q/x5r9x2+agrQ8GHu/NMDMvTKUw6LrfTmltfWNzq7xd2dnd2z+oHh61jco04y2mpNLdkBouRcJbKFDybqo5jUPJO+H4duZ3Hrk2QiUPOEl5ENNhIiLBKFrJ72kxHCHVWj31qzW37s5BVolXkBoUaParX72BYlnME2SSGuN7bopBTjUKJvm00ssMTykb0yH3LU1ozE2Qz0+ekjOrDEiktK0EyVz9PZHT2JhJHNrOmOLILHsz8T/PzzC6DnKRpBnyhC0WRZkkqMjsfzIQmjOUE0so08LeStiIasrQplSxIXjLL6+S9kXdc+ve/WWtcVPEUYYTOIVz8OAKGnAHTWgBAwXP8ApvDjovzrvzsWgtOcXMMfyB8/kDwruRjQ==</latexit>
<latexit

trastive learning, treating images sharing the same caption StableRep 76.7%
<latexit sha1_base64="k3qOOvZfdDfdUvYR+hDWYeaK2zM=">AAACEnicjVC7TsMwFL0ur1JeBUYWi6oSU5R0oIwVLIwg0YfURpXjOq2p40S2g1RF/QcGFn6FBSFWJjb+BrfNAC0DR7J0dM65ur4nSATXxnW/UGFtfWNzq7hd2tnd2z8oHx61dJwqypo0FrHqBEQzwSVrGm4E6ySKkSgQrB2Mr2Z++4EpzWN5ZyYJ8yMylDzklBgrternTr1X7ZcrruPOgVeJl5MK5PhfvF/+7A1imkZMGiqI1l3PTYyfEWU4FWxa6qWaJYSOyZB1LZUkYtrP5idNcdUqAxzGyj5p8Fz9OZGRSOtJFNhkRMxIL3sz8S+vm5rwws+4TFLDJF0sClOBTYxn/eABV4waMbGEUMXtXzEdEUWosS2W7One8qGrpFVzPNfxbmuVxmXeWRFO4BTOwIM6NOAabqAJFO7hEZ7hFT2hF/SG3hfRAspnjuEX0Mc3dJaVgA==</latexit>
sha1_base64="aPXIoNODWf5x+LURXtNXl3iFWBA=">AAAB7XicbVA9TwJBEJ3DL8Qv1NJmIyGxutxRCCXRxhIT+UjgQvaWPVjZ273s7pmQC//BxkJjbP0/dv4bF7hCwZdM8vLeTGbmhQln2njet1PY2t7Z3Svulw4Oj45PyqdnHS1TRWibSC5VL8SaciZo2zDDaS9RFMchp91wervwu09UaSbFg5klNIjxWLCIEWys1Klfu/VBdViueK63BNokfk4qkKM1LH8NRpKkMRWGcKx13/cSE2RYGUY4nZcGqaYJJlM8pn1LBY6pDrLltXNUtcoIRVLZEgYt1d8TGY61nsWh7Yyxmeh1byH+5/VTEzWCjIkkNVSQ1aIo5chItHgdjZiixPCZJZgoZm9FZIIVJsYGVLIh+Osvb5JOzfU917+vVZo3eRxFuIBLuAIf6tCEO2hBGwg8wjO8wpsjnRfn3flYtRacfOYc/sD5/AHtao4H</latexit>

as positive pairs. The resulting representations transfer well IN lin. acc

to many downstream tasks, competing favorably with other

Image
general-purpose visual representation learners such as CLIP LLM Generator
Learning
Learner
and DINO v2 in image classification tasks. Furthermore,
in dense prediction tasks such as semantic segmentation,
from !
<latexit sha1_base64="sVBkjs/c+hJlwPgmxP0/MoyXMvk=">AAAB8nicbVBNS8NAEJ3Ur1q/qh69LBbBU0lE0GPRi8cK9gPSUDbbTbt0kw27E6WE/gwvHhTx6q/x5r9x2+agrQ8GHu/NMDMvTKUw6LrfTmltfWNzq7xd2dnd2z+oHh61jco04y2mpNLdkBouRcJbKFDybqo5jUPJO+H4duZ3Hrk2QiUPOEl5ENNhIiLBKFrJ72kxHCHVWj31qzW37s5BVolXkBoUaParX72BYlnME2SSGuN7bopBTjUKJvm00ssMTykb0yH3LU1ozE2Qz0+ekjOrDEiktK0EyVz9PZHT2JhJHNrOmOLILHsz8T/PzzC6DnKRpBnyhC0WRZkkqMjsfzIQmjOUE0so08LeStiIasrQplSxIXjLL6+S9kXdc+ve/WWtcVPEUYYTOIVz8OAKGnAHTWgBAwXP8ApvDjovzrvzsWgtOcXMMfyB8/kDwruRjQ==</latexit>
<latexit
!f :X !Y
<latexit sha1_base64="sVBkjs/c+hJlwPgmxP0/MoyXMvk=">AAAB8nicbVBNS8NAEJ3Ur1q/qh69LBbBU0lE0GPRi8cK9gPSUDbbTbt0kw27E6WE/gwvHhTx6q/x5r9x2+agrQ8GHu/NMDMvTKUw6LrfTmltfWNzq7xd2dnd2z+oHh61jco04y2mpNLdkBouRcJbKFDybqo5jUPJO+H4duZ3Hrk2QiUPOEl5ENNhIiLBKFrJ72kxHCHVWj31qzW37s5BVolXkBoUaParX72BYlnME2SSGuN7bopBTjUKJvm00ssMTykb0yH3LU1ozE2Qz0+ekjOrDEiktK0EyVz9PZHT2JhJHNrOmOLILHsz8T/PzzC6DnKRpBnyhC0WRZkkqMjsfzIQmjOUE0so08LeStiIasrQplSxIXjLL6+S9kXdc+ve/WWtcVPEUYYTOIVz8OAKGnAHTWgBAwXP8ApvDjovzrvzsWgtOcXMMfyB8/kDwruRjQ==</latexit>
<latexit

80.2%
models
<latexit sha1_base64="HYPSq+DctLdg9hrnUlynFDqraoE=">AAACTnicfZDfShtBFMZn0z/GaGu0l70ZjIKUEnZFULyS1oveFBWMRrIhnJ2c3QzOziwzZ1vDklfprX2c3vZFeic6iSm0Kh4Y+PGdb2bO+ZJCSUdh+DuovXj56vVCfbGxtPzm7Upzde3MmdIK7AijjO0m4FBJjR2SpLBbWIQ8UXieXH6e9s+/oXXS6FMaF9jPIdMylQLIS4PmWrrPuzy2MhsRWGu+84tBsxW2w1nxxxDNocXmdTxYDTbioRFljpqEAud6UVhQvwJLUiicNOLSYQHiEjLsedSQo+tXs+EnfNMrQ54a648mPlP/vVFB7tw4T7wzBxq5h72p+FSvV1K616+kLkpCLe4/SkvFyfBpEnwoLQpSYw8grPSzcjECC4J8Xo1GfIh+GYtf/cNHBVogYz9UMdgsh6uJXy6LP07pOaPUf42efK7RwxQfw9l2Owrb0clO6+DTPOE6e8/W2RaL2C47YF/YMeswwa7YD3bNfga/gj/BTXB7b60F8zvv2H9Vq98BlhezeQ==</latexit>

<latexit sha1_base64="9CT3PjVopMh/zXd+oW32lZVtduw=">AAACInicjVDLTsJAFJ3iC/FVHzs3E4EEN6RlI0siG01cYCLQpDRkOkxhwnTazExNsOFfXLjxV9wYdWXixzhAFwouPMkkJ+ecmzv3+DGjUlnWp5FbW9/Y3MpvF3Z29/YPzMOjjowSgUkbRywSjo8kYZSTtqKKEScWBIU+I11/3Jz53XsiJI34nZrExAvRkNOAYqS01DdPSnWrWuuVS7DSvLluQddxvPO+WbSq1hxwldgZKYIM/4v3zffeIMJJSLjCDEnp2lasvBQJRTEj00IvkSRGeIyGxNWUo5BIL52fOIVlrQxgEAn9uIJz9edEikIpJ6GvkyFSI7nszcS/PDdRQd1LKY8TRTheLAoSBlUEZ33BARUEKzbRBGFB9V8hHiGBsNKtFvTp9vKhq6RTq9pW1b6tFRuXWWd5cArOQAXY4AI0wBVogTbA4AE8gmfwajwZL8ab8bGI5oxs5hj8gvH1DbbvmZw=</latexit>
sha1_base64="V5/ITBdyyCU58U0qJSmYFqO7oEM=">AAAB/XicbVDLTsJAFJ36RHzVx87NRCDBTdOykSWRjSYuMBFoUhoyHaYwYTptZqYm2BB/xY0LjXHrf7jzbxygCwVPcpOTc+7NvfcECaNS2fa3sba+sbm1Xdgp7u7tHxyaR8cdGacCkzaOWSzcAEnCKCdtRRUjbiIIigJGusG4OfO7D0RIGvN7NUmIH6EhpyHFSGmpb56W67ZV61XKsNq8vWlBz3X9i75Zsi17DrhKnJyUQI5W3/zqDWKcRoQrzJCUnmMnys+QUBQzMi32UkkShMdoSDxNOYqI9LP59VNY0coAhrHQxRWcq78nMhRJOYkC3RkhNZLL3kz8z/NSFdb9jPIkVYTjxaIwZVDFcBYFHFBBsGITTRAWVN8K8QgJhJUOrKhDcJZfXiWdmuXYlnNXKzWu8jgK4AycgypwwCVogGvQAm2AwSN4Bq/gzXgyXox342PRumbkMyfgD4zPH6yHkiM=</latexit>

Ours 80.7%
SynCLR outperforms previous self-supervised methods by a IN lin. acc
<latexit sha1_base64="7aUnXkl7DSvTHSwrXSuHBGJijAg=">AAACEnicjVC7SgNBFL0bXzG+opY2gyFgtcyKkJRBG0sF84BkCbOT2WTM7OwyMyuEJf9gYeOv2IjYWtn5N06SLTSx8MDA4ZxzuXNPkAiuDcZfTmFtfWNzq7hd2tnd2z8oHx61dJwqypo0FrHqBEQzwSVrGm4E6ySKkSgQrB2Mr2Z++4EpzWN5ZyYJ8yMylDzklBgrterYrfWq/XIFu3gOtEq8nFQgx//i/fJnbxDTNGLSUEG07no4MX5GlOFUsGmpl2qWEDomQ9a1VJKIaT+bnzRFVasMUBgr+6RBc/XnREYirSdRYJMRMSO97M3Ev7xuasK6n3GZpIZJulgUpgKZGM36QQOuGDViYgmhitu/IjoiilBjWyzZ073lQ1dJ69z1sOvdXlQal3lnRTiBUzgDD2rQgGu4gSZQuIdHeIZX58l5cd6c90W04OQzx/ALzsc3bNaVfQ==</latexit>
sha1_base64="W8xjdoQtjD9zFBtJopo6ywChb/A=">AAAB7XicbVBNSwMxEJ2tX7V+tOrRS7AUPC27IrTHohePFewHtEvJptk2NpssSVYoS/+DFw+KePX/ePPfmLZ70NYHA4/3ZpiZFyacaeN5305ha3tnd6+4Xzo4PDouV05OO1qmitA2kVyqXog15UzQtmGG016iKI5DTrvh9Hbhd5+o0kyKBzNLaBDjsWARI9hYqdPw3PqgNqxUPddbAm0SPydVyNEaVr4GI0nSmApDONa673uJCTKsDCOczkuDVNMEkyke076lAsdUB9ny2jmqWWWEIqlsCYOW6u+JDMdaz+LQdsbYTPS6txD/8/qpiRpBxkSSGirIalGUcmQkWryORkxRYvjMEkwUs7ciMsEKE2MDKtkQ/PWXN0nnyvU917+/rjZv8jiKcA4XcAk+1KEJd9CCNhB4hGd4hTdHOi/Ou/Oxai04+cwZ/IHz+QPmY44E</latexit>

significant margin, e.g., improving over MAE and iBOT by 76.

6.2 and 4.3 mIoU on ADE20k for ViT-B/16. Figure 1. Three paradigms for visual representation learning. Top
<latexit sha1_base64="4fMNMAWUS8cXS10eQk/Zp+Gu08M=">AAACJ3icjVDLSsNAFJ34rPUVdSVuBttC3YSkC+uy6Malr7aBNJTJdNIOnUzCzEQoofg1Ltz4K4KI6NI/cdpmoa0LDwwczj2HO+cGCaNS2fansbS8srq2Xtgobm5t7+yae/stGacCkyaOWSzcAEnCKCdNRRUjbiIIigJG2sHwYjJv3xMhaczv1CghfoT6nIYUI6WlrnlYrp9a9U6lDKu3CunUDUmg57r+Sdcs2ZY9BVwkTk5KIMf/7F3ztdOLcRoRrjBDUnqOnSg/Q0JRzMi42EklSRAeoj7xNOUoItLPpj3HsKKVHgxjoR9XcKr+TGQoknIUBdoZITWQ87OJ+NfMS1V45meUJ6kiHM8WhSmDKoaTo8EeFQQrNtIEYUH1XyEeIIGw0qct6urOfNFF0qpZjm0517VS4zy/WQEcgWNQBQ6ogwa4BFegCTB4AI/gGbwZT8aL8W58zKxLRp45AL9gfH0DjjScMg==</latexit>
sha1_base64="BzdP2sPYHqx7DcniQvc0ekvPJNQ=">AAACAnicbVC7TsMwFHV4lvIKMCEWi7ZSWaKkA2WsYGEsj7aR0qhyXKe16jiR7SBVUcXCr7AwgBArX8HG3+C2GaDlSJaOzrlH1/cECaNS2fa3sbK6tr6xWdgqbu/s7u2bB4dtGacCkxaOWSzcAEnCKCctRRUjbiIIigJGOsHoaup3HoiQNOb3apwQP0IDTkOKkdJSzzwu18+terdShtU7hXTqliTQc13/rGeWbMueAS4TJyclkKPZM7+6/RinEeEKMySl59iJ8jMkFMWMTIrdVJIE4REaEE9TjiIi/Wx2wgRWtNKHYSz04wrO1N+JDEVSjqNAT0ZIDeWiNxX/87xUhRd+RnmSKsLxfFGYMqhiOO0D9qkgWLGxJggLqv8K8RAJhJVurahLcBZPXibtmuXYlnNTKzUu8zoK4AScgipwQB00wDVoghbA4BE8g1fwZjwZL8a78TEfXTHyzBH4A+PzByqflLk=</latexit>

row: Traditional methods, such as CLIP [71], learn only from real
data; Middle row: Recent methods, such as StableRep [91], learn
from real text and generated images; Bottom row: Our method,
1. Introduction SynCLR, learns from synthetic text and synthetic images, and rival
80.3%
<latexit sha1_base64="NhPQy1UcpRZvqvYTVpX9kRNSYQg=">AAACHXicjVC7TgJBFL3rE/HBqqXNRCDBhuxiISXRxk5N5JHAhswOszBh9pF5mJANX2Jh46/YGGNhY/wbB9hCwcKTTHJyzrm5c4+fcCaV43xZa+sbm1vbuZ387t7+QcE+PGrJWAtCmyTmsej4WFLOItpUTHHaSQTFoc9p2x9fzfz2AxWSxdG9miTUC/EwYgEjWBmpbxdKdad63iuXUOVGC3nWt4tO1ZkDrRI3I0XI8L943/7oDWKiQxopwrGUXddJlJdioRjhdJrvaUkTTMZ4SLuGRjik0kvn101R2SgDFMTCvEihufpzIsWhlJPQN8kQq5Fc9mbiX15Xq6DupSxKtKIRWSwKNEcqRrOq0IAJShSfGIKJYOaviIywwESZQvPmdHf50FXSqlVdp+re1YqNy6yzHJzAKVTAhQtowDXcQhMIaHiEZ3i1nqwX6816X0TXrGzmGH7B+vwGN3eYZA==</latexit>
sha1_base64="OtU8mCdAbBed0EwPuAPmtjYgpXE=">AAAB+HicbVBNT8JAEJ36ifhB1aOXjUCCF9LiQY5EL97ERD4SaMh22cKG7bbZ3Zpgwy/x4kFjvPpTvPlvXKAHBV8yyct7M5mZ58ecKe0439bG5tb2zm5uL79/cHhUsI9P2ipKJKEtEvFIdn2sKGeCtjTTnHZjSXHoc9rxJzdzv/NIpWKReNDTmHohHgkWMIK1kQZ2oVR3qpf9cglV7hKpLgZ20ak6C6B14makCBmaA/urP4xIElKhCcdK9Vwn1l6KpWaE01m+nygaYzLBI9ozVOCQKi9dHD5DZaMMURBJU0Kjhfp7IsWhUtPQN50h1mO16s3F/7xeooO6lzIRJ5oKslwUJBzpCM1TQEMmKdF8aggmkplbERljiYk2WeVNCO7qy+ukXau6TtW9rxUb11kcOTiDc6iAC1fQgFtoQgsIJPAMr/BmPVkv1rv1sWzdsLKZU/gD6/MHU6aQ6w==</latexit>

Representation learning extracts and organizes information the linear transfer performance of CLIP on ImageNet despite not
from raw, often unlabeled data. The quality, quantity, and directly observing any real data.
diversity of the data determines how good a representation
the model can learn. The model becomes a reflection of the recent work has indeed shown that this can lead to strong
collective intelligence that exists in the data. We get what performance gains at scale [68], but this path is costly to
we feed in. pursue.
Unsurprisingly, the current best-performing visual rep-
To alleviate the cost, in this paper we ask if synthetic
resentation learning methods [68, 71] rely on large scale
data, sampled from off-the-shelf generative models, is a
real datasets. However, the collection of real data has its
viable path toward large scale curated datasets that can train
own dilemmas. Collecting large scale uncurated data [80] is
state-of-the-art visual representations.
relatively cheap and thus quite achievable. However, for self-
supervised representation learning, this approach exhibits We call such a paradigm learning from models, in con-
poor scaling behavior –i.e., adding more uncurated data has trast to directly learning from data. Models have several
little effect at large data scales [38, 90]. Collecting small advantages as a data source for building large scale train-
scale curated data [24] also is achievable, but models trained ing sets: via their latent variables, conditioning variables,
in this way are limited to relatively narrow tasks. The ideal and hyperparameters, they provide new controls for curat-
would be large scale curated datasets of real images, and ing data; we will make use of these controls in the method
we propose. Models also can be easier to share and store
* Work done while interning at Google. (because models are more compressed than data), and can

1
produce an unlimited number of data samples (albeit with class 1 class 2 class 3 class 4
finite diversity). A growing literature has studied these
properties and other advantages (and disadvantages) of us- SimCLR
ing generative models as a data source for training down-
stream models [3, 30, 45, 48, 78, 91]. Some of these meth-
ods use a hybrid mode – either mixing real and synthetic class 1
datasets [3] or needing a real dataset to generate another
synthetic dataset [91]. Other methods try to learn represen- Supervised
tations from purely synthetic data [78] but lag far behind CE
the best performing models. Instead, we show that learning
from models, without training on any real data, can yield rep-
class 1 class 2
resentations that match the top-performing representations
learnt from real data. For instance, as illustrated in Figure 1,
SynCLR
representations learnt by our method are able to transfer as
well as OpenAI’s CLIP [71] on ImageNet (both methods
using ViT-B [28]).
Our approach leverages generative models to re-define Figure 2. Different learning objectives treat classification granu-
the granularity of visual classes. As shown in Figure 2, con- larity differently. These images are generated by two prompts “a
golden retriever, wearing sunglasses and a beach hat, rides a bike"
sider we have four images generated using two prompts: “a
and “a cute golden retriever sits in a house made of sushi". Sim-
golden retriever, wearing sunglasses and a beach hat, rides CLR treats each image as a class, while supervised cross-entropy
a bike" and “a cute golden retriever sits in a house made treats them all as the same “golden retrieval” class. The former
of sushi". Traditional self-supervised method such as Sim- does not consider shared semantics between images, and the latter
CLR [13] will treat each of these images as a different class; is coarse-grained and ignores actions or relationships between sub-
embeddings for different images are pushed apart with no jects/background. Our approach, SynCLR, defines visual classes
explicit consideration of the shared semantics between im- by sentences.
ages. On the other extreme, supervised learning methods
(i.e. SupCE) will regard all these images as a single class
pability of large language models (LLMs), where we present
(e.g., “golden retriever”). This ignores nuances in the se-
examples of word-to-caption translations. Next, a text-to-
mantics of the images, such as the fact that the dogs are
image diffusion model is adopted to synthesize multiple
riding a bike in one pair of images and sitting inside a sushi
images for each synthetic caption. This yields a synthetic
house in the other pair of images. Instead, our method, Syn-
dataset of 600M images. Then we train visual representa-
CLR, treats captions as classes, i.e., each caption describes
tion models by a combination of multi-positive contrastive
a visual class (this level of granularity was also explored
learning [50] and masked image modeling [110].
in StableRep [91]). This allows us to group images by the
Our learned representations transfer well. With Syn-
concepts of “riding a bike” and “sitting in a sushi house”,
CLR pre-training, our ViT-B and ViT-L models achieve
in addition to grouping by a coarser class label like “golden
80.7% and 83.0% top-1 linear probing accuracy on
retrieval”. This level of granularity is difficult to mine in real
ImageNet-1K, respectively, which is on par with OpenAI’s
data, since collecting multiple images described by a given
CLIP [71]. On fine-grained classification tasks, SynCLR out-
caption is non-trivial, especially when scaling up the number
performs CLIP by 3.3% for ViT-B and 1.5% for ViT-L, and
of captions. However, text-to-image diffusion models are
performs similarly to DINO v2 [68] models, which are dis-
fundamentally built with this ability; simply by conditioning
tilled from a pre-trained ViT-g model. For semantic segmen-
on the same caption and using different noise inputs, a text-
tation on ADE20k, SynCLR outperforms MAE pre-trained
to-image diffusion model will produce different images that
on ImageNet by 6.2 and 4.1 in mIoU for ViT-B and ViT-L
all match the same caption. In our experiments, we find the
under the same setup, showing strong transfer ability for
caption-level granularity outperforms both SimCLR and su-
dense prediction tasks similar to DINO v2, which addition-
pervised training. Another advantage is that this definition of
ally involves a training period on 518x518 resolution images
visual classes has good scalability. Unlike ImageNet-1k/21k
that SynCLR does not have.
where a given number of classes is fixed, we can augment ex-
isting classes (or data) in an online fashion, and theoretically
scale up to as many classes as needed. 2. Related Works
Our system consists of three steps. The first step is to Self-supervised representation learning approaches in vi-
synthesize a large corpus of image captions. We design a sion develop domain-specific pre-text tasks, such as col-
scalable approach by leveraging the in-context learning ca- orization [106], rotation prediction [36], and solving jigsaw

2
puzzles [65]. Domain-agnostic approaches have been pop- 3. Approach
ular, such as contrastive learning [6, 13, 40, 43, 66, 88, 97]
and masked image modeling [2, 4, 5, 33, 44, 96, 100, 110]. In this paper, we study the problem of learning a visual en-
Contrastive learning promotes invariance [89] for two views coder f in the absence of real images or textual data. Our
of the same image and pushes apart representations for dif- approach hinges on the utilization of three key resources: a
ferent images [95] (or only invariance [11, 39]); the resulting language generation model (g1 ), a text-to-image generative
representations yield strong performance for linear or zero- model (g2 ), and a curated list of visual concepts (C). Our ex-
shot transfer. Masked image modeling reconstructs the pix- ploration include three steps: (1) we employ g1 to synthesize
els [44, 100] or local features [4], often producing excellent a comprehensive set of image descriptions T , which encom-
fine-tuning transfer performance, especially in dense predic- pass the range of visual concepts in C; (2) for each caption
tion tasks [44]. The state of the art DINO v2 [68] leverages in T , we generate multiple images using g2 , culminating in
both approaches, and our approach shares a similar spirit. an extensive synthetic image dataset X; (3) we train on X
to obtain a visual representation encoder f .
Supervised learning [41, 52, 84] used to be the dominant
We use Llama-2 7B [93] and Stable Diffusion 1.5 [73] as
approach for learning transferable visual representations for
g1 and g2 , respectively, because of their fast inference speed.
various tasks [26, 37, 81]. Recent studies [42, 57] has shown
We anticipate that better g1 and g2 in the future will further
that, the transferability of representations learned in this way
enhance the effectiveness of this approach.
is limited, e.g., pre-training has no improvement over random
initialization for dense prediction tasks (e.g., object detec-
3.1. Synthesizing captions
tion) when the fine-tuning is long enough. Such limitation
continues when the model has been scaled up to 22B [23]. To harness the capability of powerful text-to-image models
An alternative paradigm learns visual representations from for generating a substantial dataset of training images, we ini-
text supervision [49, 71], e.g., CLIP [71]. This approach is tially require a collection of captions that not only precisely
more flexible (i.e., not requiring classes) and provides richer depict an image but also exhibit diversity to encompass a
supervision, often learning generalizable representations. broad spectrum of visual concepts.
Generative models as representation learners. A number We have developed a scalable approach to create such a
of papers have explored the representations that are learned large collection of captions, leveraging the in-context learn-
by generative models for various recognition tasks [25, 56]. ing capability of LLMs [9]. Our method involves crafting
As might be expected intuitively, such models indeed learn specific prompt engineering templates that guide the LLM to
especially good representations for dense tasks, such as opti- produce the required captions. We start by gathering the con-
cal flow estimation [79], semantic segmentation [8, 101], and cept list C from some existing datasets, such as ImageNet-
depth estimation [107]. Another line of work [19, 55] adapt 21k [24] and Places-365 [108]. For each concept c ∈ C, we
pre-trained diffusion models for zero-shot image recognition consider three straightforward templates to generate captions
via analysis-by-synthesis. These approaches may need to effectively.
be adapted when the architectures of the generative models • c –> caption. As the most direct and simple approach, we
change or a new family of generative model emerge. Our have the Llama-2 model sample a sentence for the concept
approach treats images as universal interfaces with the hope c.
of better generality. • c, bg –> caption. We combine the visual concept c with
Learning from synthetic data from generative models. a background or setting bg. A naïve approach would ran-
Synthetic data has been explored to train machine learn- domly select both c and bg, where bg may correspond to a
ing models in various domains [31, 53, 62, 63, 74, 75, 83, class name from a places dataset like [108]. However, this
87, 102]. In computer vision, the utilization of synthetic method often leads to unlikely combinations in the real
data for training models is common, ranging from optical world, such as a blue whale in a football field. Our abla-
flow [61] and autonomous driving [1] to semantic segmentation experiments demonstrate that this strategy results in
tion [15] and human pose estimation [94]. Others [48, 58] suboptimal performance, likely because the generated cap-
have explored synthetic data for representation learning, tions fall far outside the training distribution of g2 . Instead,
with the predominant approach of altering the latent vari- we employ GPT-4 [67] to generate a list of suitable back-
ables of deep generative models. Our approach aligns with grounds for the chosen concepts. This approach increases
this research paradigm, but it diverges in its use of text-to- the likelihood of generating more plausible combinations,
image models, which have also been investigated by other re- such as a tiger in a forest or a cat in a kitchen, enhancing
searchers [45, 78, 111]. But they use synthetic data for super- the overall quality of the results.
vised learning [30, 78]. The closet work is StableRep [91], • c, rel –> caption. Given a visual concept c, we consider
which also conducts representation learning but still needs a pairing it with a positional relationship word, rel. Take
real text dataset. for instance, if c signifies cat and rel translates to in front

3
Templates In context examples
c –> caption revolver –> Multiple antique revolvers lie on a wooden table, gleaming under soft, ambient light.
closet –> The compact closet, brimming with clothes and shoes, exudes a feeling of organization.
zebra –> A zebra is gallantly trotting across the vast, sunlit plains of the African savannah, creating a
captivating black and white spectacle.
bus station –> The bustling bus station thrums with restless energy, as travelers navigate through the crowded
space, awaiting their journeys amid the echoes of departing buses.
c,bg –> caption tiger, forest –> Two tigers are running together in the forest.
lighter, motorhome –> In the cozy, cluttered environment of a well-traveled motorhome, a sleek silver lighter
holds dominion on the rustic wooden table.
sunset, lake –> Golden sunset hues reflect on a calm lake, silhouetting a lone canoeist against a backdrop of
fiery clouds.
c,rel –> caption kit fox, in front of –> A group of small, fluffy, golden kit foxes is playfully gathered in front of a lush, green,
towering forest backdrop.
cabbage, besides –> A vibrant image portrays a lush, green cabbage, glistening with dewdrops, nestled
besides a rustic, wooden crate full of freshly harvested vegetables.

Table 1. We show examples for the three synthesis templates. Such examples are used as demonstrations for Llama-2 to perform the
in-context learning task. We have 176 such examples in total. Most of them are generated by prompting GPT-4 [67], while a handful of
others are human generated (in a 10M scale pilot study of synthetic captions, we do not notice significant differences between including or
excluding human generated examples.)

3.2. Synthesizing Images

For each text caption, we generate a variety of images by

initiating the reverse diffusion process with different random
noise. The Classifier-Free Guidance (CFG) scale is a crucial
factor in this process. A higher CFG scale enhances the qual-
ity of the samples and the alignment between text and image,
whereas a lower scale results in more diverse samples and
Figure 3. In-context caption generation using Llama-2 [93]. We
better adherence to the original conditional distribution of
randomly sample three in-context examples for each inference run. images based on the given text. Following the approach used
in StableRep [91], we opt for a lower CFG scale, specifically
2.5, and produce 4 images for each caption. Examples of
of, our objective is to prompt the LLM to create captions these images can be seen in Figure 4.
such as a cute yellow cat is enjoying the fish in front of the
sofa. To add variety, we have a selection of 10 different
positional relationship words that we randomly choose 3.3. Representation Learning
from.
Our representation learning method is built upon Sta-
For each of the three templates, we have prepared multi- bleRep [91]. The key component of our approach is the
ple demonstration examples that serve as instructions for the multi-positive contrastive learning loss [50] which works by
LLM to complete the caption synthesis task. Table 1 shows a aligning (in the embedding space) images generated from the
couple of examples for each template. In total, we have 106 same caption. We additionally combine multiple techniques
examples for c–>prompt, 50 examples for c, bg–>prompt, from other self-supervised learning methods, including a
and 20 examples for c, rel–>prompt. Such examples are patch-level masked image modeling objective. We briefly
mostly collected by prompting GPT-4, with a handful from review StableRep and elaborate on the added modules.
human. In a pilot study, we do not observe difference be- StableRep [91] minimizes the cross-entropy loss between
tween including or excluding human generated examples. a ground-truth assignment distribution and a contrastive as-
In the stage of generating captions in-context, we select a signment distribution. Consider an encoded anchor sample
concept and one of the three templates. Next, we randomly a and a set of encoded candidates {b1 , b2 , ..., bK }. The con-
pick three examples from the chosen template and frame the trastive assignment distribution q describes how likely the
caption generation as a text completion task. This process is model predicts a and each b to be generated from the same
illustrated in Figure 3. caption, and the ground-truth distribution is the actual match

4
A plate of paella, a mixed rice dish with chicken, beans, and seafood A vintage electric locomotive rolls along a railway line through a quaint paddy
field in a tranquil rural landscape.

An industrial power plant with its smokestacks belching black smoke. On a desk, a glass water bed is surrounded by a chaotic, messy workspace.

A fluffy, black and white junco bird perches on a snow-covered fence, A combine harvester pulling a trailer full of hay, driving along a narrow road
overlooking a dark forest. with a lake in the distance.

Figure 4. Random examples of synthetic captions and images generated in our SynCLR pipeline. Each caption comes with 4 images.

between a and b (a is allowed to match multiple b): For these local crops, we only employ the contrastive loss,
omitting the iBOT loss. Local crops are encoded only by
exp(a · bi /τ )
q i = PK (1) the student network, and matched to global crops from the
j=1 exp(a · bj /τ ) same caption encoded by the EMA model. Such reuse of
1match(a,bi ) global crops saves computation. For each image x, where
pi = PK (2) we generate a single global crop xg alongside n local crops
j=1 1match(a,bj )
xl , the final loss can be expressed as follows:
where τ ∈ R+ is the scalar temperature, a and all b have
been ℓ2 normalized, and the indicator function 1match(·,·)
n
1X
L(xg ) + L(xli ) + LiBOT (xg ) (4)
indicates whether two samples are from the same caption. n i=1
The contrastive loss for a is given as
K
X
3.4. Implementation
L(a) = H(p, q) = − pi log qi (3) Concept list. We concatenate class names from various
i=1
datasets, including IN-1k [24], IN-21k (we keep the most fre-
iBOT [110] is a masked image modeling objective, wherein quent 13k classes), Aircraft [60], Cars [51], DTD [18], Flow-
a localized patch is masked, and the model is tasked with ers [64], Pets [69], Sun397 [98], Caltech-101 [34], Food-
predicting the tokenized representation of said masked patch. 101 [7], and Places-365 [108]. If the concept is a place (i.e.
It adapts the DINO [11] objective from the image level into SUN397 and Places) or a texture (i.e. DTD), we only apply
the patch level. We follow [76] to replace the softmax- the c –> caption template. For fine-grained classes such
centering method with the iterative Sinkhorn-Knopp (SK) as pets or flowers, we employ GPT-4 to generate a consol-
algorithm [22]. We run SK for 3 iterations to build the idated list of probable backgrounds, rather than producing
prediction target. distinct lists for each specific class. We favor more frequent
Exponential Moving Average (EMA) is firstly introduced sampling from IN-1k, Food101, Cars, Aircraft, and Flowers.
into self-supervised learning by MoCo [43]. We use EMA to Batches. For each training batch, we sample 2048 captions
encode crops as b and to produce the targets for iBOT loss. (except when noted), and use all of the 4 images generated
We update the EMA model as θema ← λθema + (1 − λ)θ, by each caption. We generate 1 global and 4 local crops for
following a cosine schedule for λ from 0.994 to 1 during each image. As a result, each batch contains 8192 global
training [39, 68]. We find the EMA module not only in- crops, which is similar with prior work [13, 14, 39, 91].
creases the final performance, but also improves the training Masking. For the iBOT loss, we randomly choose 50%
stability for long training schedules. images inside a batch to mask, and randomly mask 50% of
Multi-crop strategy is introduced by [10] as a smart way to the tokens in each chosen image. We use 65536 prototypes.
improve computation efficiency, and is adopted in this paper. While the target from the EMA model is ascertained using

5
StableRep SynCLR method EMA iBOT MC IN avg. ADE20k
captions
IN avg. IN avg.
StableRep 75.8 85.7 -
cc12m 73.0 81.6 77.1 85.3
✓ 76.7 86.7 48.0
IN+h+Places 75.4 80.0 78.7 83.0
✓ ✓ 77.6 87.1 50.5
IN+Places+LLM 73.7 76.9 77.6 81.8
✓ ✓ 78.6 87.8 49.5
IN+OurBG+LLM 75.3 78.5 78.2 81.9
SynCLR ✓ ✓ ✓ 78.8 88.1 50.8
our final config. 75.8 85.7 78.8 88.1
Table 4. Important components for our model. ViT-B/16 models
Table 2. Comparison of different caption synthesis strategies. are trained for 85000 iterations. We study the modules that af-
We report top-1 ImageNet linear evaluation accuracy and the aver- fect the ImageNet linear evaluation, the fine-grained classification
age accuracy over 9 fine-grained datasets. Every item here includes (avg.), and ADE20k segmentation.
10M captions and 4 images per caption.

method IN avg.
CFG 2 3 4
Supervised CE 71.9 75.0
IN top-1 72.8 72.6 72.6 SimCLR 63.6 67.9
Table 3. Classifier-free guidance scale (CFG). Contrastive loss SynCLR 75.3 78.5
prefers small CFG scale but is not very sensitive to it.
Table 5. Comparison of different learning objectives. These
the SK algorithm, we apply softmax normalization to the objectives assume different level of classification granularity, as
output of the student model. shown in Figure 2. Our modeling, i.e., defining classes as captions,
Projection heads. We follow the design in MoCo v3 [14] outperforms the other two. To accomondate Supervised CE training,
and DINO [11] for the contrastive and iBOT loss heads, all items here used IN+OurBG+LLM entry in Table 2.
respectively, ensuring consistency with established methods.
Other hyper-parameters. We set the temperature in the con- ration specified in Section 3.1. For each of the config, we
trastive loss to 0.08. For the temperature used in the iBOT generate 10M captions. If not enough, we do duplication.
loss, we linearly increase it from 0.04 to 0.07 over 4000
Results are summarized in Table 2, where we train both
iterations, and keep it as 0.07 afterwards, as in DINO [11].
StableRep and SynCLR to avoid biases favored by a single
Additionally, the weight decay parameter is incrementally
method. Compared to a real caption dataset cc12m, sim-
adjusted from 0.04 to 0.2, adhering to a cosine schedule.
ply concatenating IN and Places class names improves the
ImageNet linear accuracy but reduces the fine-grained classi-
4. Experiment fication performance. Interestingly, naively asking Llama to
We first perform an ablation study to evaluate the efficacy of combine IN and Places classes into captions yields the worst
various designs and modules within our pipeline. Then we performance. Replacing random background from places
proceed to scale up the volume of synthetic data. with GPT generated background improves the accuracy. This
shows the importance of synthesizing captions that follow
4.1. Study different components the distribution of real captions, which were used to train the
We analyze each component of SynCLR, and ablate their text-to-image model. Finally, our full configuration achieves
effectiveness in two measurements: (1) linear probing perfor- the best accuracy on both ImageNet and fine-grained classi-
mance on IN-1k; (2) average accuracy of linear transfer on fication. Another advantage of our synthesis method is its
fine-grained datasets Aircraft [60], Cars [51], DTD [18], scalability – scale up to hundreds of millions of captions with
Flowers [64], Pets [69], Sun397 [98], Caltech-101 [34], little duplication. In contrast, if we concatenate IN classes
Food-101 [7], and Pascal VOC [29]. For analysis conducted with Places classes, there are at most 365k unique captions.
in this subsection, we train ViT-B/16 [28] models for 85000 Synthesize images. There are two major parameters in this
iterations, and use the cls token as image representation. process: number of images per caption and classifier free
Synthesize captions. Following [91], we use cc12m [12] guidance scale. For the former, we find generating 4 images
real captions as our baseline, which has 10M sentences. is almost able to reproduce StableRep [91]’s performance
To synthesize captions, we design the following variants: (10 images) when using cc12m captions (ours 73.0% v.s.
(a) IN+h+Places randomly combines one IN class plus its StableRep 73.5% on ImageNet). Thus we stick to 4. For
hypernyms in WordNet graph, with one place class; (b) guidance scale, we briefly find the contrastive loss is not very
IN+Places+LLM uses the c, bg –> caption in-context syn- sensitive to CFG in a pilot study, as shown in Table 3. Thus
thesis template with c from IN and bg from places; (c) we stick to 2.5, similar as StableRep [91].
IN+ourBG+LLM uses the background classes output by Model components. We present the improvement of accu-
GPT-4, instead of Places; (d) ours means our full configu- racy brought by different modules in Table 4. Compared

6
Caltech-101
ImageNet

VOC2007
Food-101
SUN397

Average
Flowers
Aircraft

DTD
Cars

Pets
text img # imgs
StableRep real syn 100M ViT-B/16 75.7 59.2 83.5 80.1 97.3 88.3 74.3 94.7 85.1 87.9 83.4
ViT-B/16 80.2 59.5 86.7 79.2 98.1 93.1 78.4 94.7 92.8 89.2 85.7
CLIP real real 400M
ViT-L/14 83.9 69.4 90.9 82.1 99.2 95.1 81.8 96.5 95.2 89.6 88.9
400M ViT-B/16 78.9 61.1 92.3 81.9 98.2 91.5 77.9 95.2 90.9 88.0 86.3
OpenCLIP real real 400M ViT-L/14 82.3 67.1 94.0 83.6 98.8 92.5 81.0 96.4 93.4 88.8 88.4
2B ViT-L/14 83.4 71.7 95.3 85.3 99.0 94.2 82.2 97.5 94.1 88.9 89.8
ViT-B/14 83.9† 79.4 88.2 83.3 99.6 96.2 77.3 96.1 92.8 88.2 89.0
DINO v2* - real 142M
ViT-L/14 85.7† 81.5 90.1 84.0 99.7 96.6 78.7 97.5 94.3 88.3 90.1
ViT-B/16 80.7 81.7 93.8 79.9 99.1 93.6 76.2 95.3 91.6 89.4 89.0
SynCLR syn syn 600M
ViT-L/14 83.0 85.6 94.2 82.1 99.2 94.1 78.4 96.1 93.4 90.3 90.4

Table 6. Comparison on ImageNet linear evaluation and fine-grained classificaton. SynCLR achieves comparable results with OpenAI’s
CLIP and DINO v2 models, despite only using synthetic data. *DINO v2 modes are distilled from a ViT-g model, thus advantageous in this
comparison. † we rerun only using cls token instead of concatenating multiple layers presented in the original DINO v2 paper [68].

to the baseline StableRep, adding a teacher EMA model 4.2. Scaling up

improves the IN linear accuracy by 0.9%. Further adding
iBOT local objective or the multi-crop strategy increases the After we have ablated different components, we scale up our
accuracy by 0.9% and 1.9%, respectively. Combining all experiments. Specifically, we synthesize a dataset of 150M
of them results in our full SynCLR model, which achieves captions, called SynCaps-150M, from which we generate
78.8% top-1 IN linear accuracy. The fine-grained classifica- 600M images. We train both ViT-B/16 and ViT-L/14 (no
tion performance follows a similar trend, and reaches 88.1%. SwiGLU [82] or LayerScale [92]), and extend the training
Besides, we test the transfer ability to semantic segmenta- schedules to 500k steps with a batch size of 8192 captions.
tion on ADE20k. The iBOT objective brings 1.0 more mIoU We use 224x224 resolution for all pre-training tasks.
than multi-crop strategy, demonstrating the effectiveness of We compare SynCLR with OpenAI’s CLIP [71], Open-
masked image modeling for dense prediction tasks. CLIP [17], and DINO v2 [68], which represent learning
from data. We note that ViT-B/14 and ViT-L/14 from DINO
Compare to SimCLR and supervised training. We com- v2 are distilled from a ViT-g [104] model, which makes
pare the three different representation learning objectives DINO v2 advantageous in our comparison. We also includes
shown in Figure 2, which classify images at different lev- StableRep [91], which uses the hybrid paradigm.
els of granularity. Since supervised cross-entropy training
ImageNet linear evaluation. For fair comparison, cls
requires a fixed set of balanced classes (indeed both fixed
token from the last block is used as representation across all
set and balance are limitations of such method), we use
models (whereas in DINO v2, results are from concatenating
the IN+ourBG+LLM configuration where we have 1000
multiple layers). As shown in Table 6, SynCLR achieves
balanced classes (i.e., each class has 40k images). The su-
80.7% with ViT-B and 83.0% with ViT-L. This is similar
pervised training recipe follows [86]. For a fair compari-
as CLIP, but still lags behind DINO v2 by 3.2% and 2.7%,
son with SimCLR, we remove all unmatched modules (i.e.,
respectively, partially because of the extra distillation in
EMA, iBOT, and MC) to make sure that the only difference
DINO v2. We note SynCLR has already outperformed other
between SimCLR and our SynCLR is the classification gran-
self-supervised methods pre-trained directly on ImageNet-
ularity defined by the contrastive loss. For all of them, we
1k (e.g., DINO achieves 78.2% with ViT-B/16 and iBOT
do pre-training and then linear probing on the target dataset.
reaches 81.0% with ViT-L/16).
Table 5 presents the comparison. Our multi-positive ob- Fine-grained classification. On the nine fine-grained
jective, which defines images as the same class if they are datasets we have evaluated in Table 6, SynCLR achieves
generated by the same caption, achieves the best perfor- very similar average accuracy as DINO v2, e.g., 89.0% v.s.
mance. It outperforms supervised cross-entropy training 89.0% for ViT-B, and 90.1% vs 90.4% for ViT-L. Both Syn-
and SimCLR by 3.4% and 11.7% for top-1 accuracy on CLR and DINO v2 have curated the pre-training data to
ImageNet linear evaluation, and by 3.5% and 10.6% on fine- include the distribution for these datasets (but in different
grained classification tasks. Besides, our objective does not ways and portions), and end up with similar performance.
require balance between samples from a fixed set of classes, Interestingly, SynCLR outperforms others on Aircraft and
making it easier to scale up. Cars, possibly because we favor more frequent sampling

7
Country211

RESISC45
method pre-train data distill ViT-B ViT-L

EuroSAT

Average
GTSRB

MNIST

KITTI
StableRep hybrid, 100M 49.4 -
MoCo v3 real, IN1K-1M 47.3 49.1
BEiT real, IN1K-1M+DALLE 47.1 53.3 ViT-B/16 97.1 86.6 33.3 99.0 92.7 64.7 78.9
CLIP
MAE real, IN1K-1M 48.1 53.6 ViT-L/14 98.2 92.5 42.9 99.2 94.1 69.2 82.7
iBOT real, IN1K-1M 50.0 - ViT-B/14 96.0 72.8 21.6 98.6 92.5 75.3 76.1
DINO v2
CLIP real, WIT-400M 52.6 - ViT-L/14 96.7 74.1 24.1 98.2 93.8 76.9 77.3
BEiT v2 real, WIT-400M, IN1K ✓ 53.1 56.7 ViT-B/16 96.6 78.6 21.0 98.4 93.7 77.3 77.6
DINO v2 real, LVD-142M ✓ 54.4 † 57.5† SynCLR
ViT-L/14 96.7 79.2 24.3 98.5 93.8 78.0 78.4
SynCLR synthetic, 600M 54.3 57.7 †
Table 9. Generalization to concepts not seen by DINO v2 and
Table 7. ADE20K semantic segmentation (mIoU) using UperNet, SynCLR. SynCLR outperforms DINO v2. CLIP achieves the
with single scale at 512x512 resolution. † use patch size of 14x14, best accuracy, possibly because its training data includes similar
thus adapt to 518x518 resolution. concepts as these datasets.

method pre-train data ViT-B ViT-L SynCLR CLIP

IN avg. IN avg.
MoCo v3 real, IN1K-1M 83.2 84.1
SimMIM real, IN1K-1M 83.8 - SynCaps-150M 80.7 89.0 78.3 87.7
MAE real, IN1K-1M 83.6 85.9 Laion-400M 78.9 86.5 76.6 84.9
PeCo real, IN1K-1M 83.6 85.9
data2vec real, IN1K-1M 84.2 86.6 Table 10. Compare SynCLR with CLIP on the same synthetic
iBOT real, IN21K-14M 84.4 86.6 data. We observe that: (1) SynCLR outperforms CLIP; (2) in our
BEiT v2 real, WIT-400M+IN1k-1M 85.5 87.3 setup, i.e., generating 4 images per caption, SynCaps-150M yields
CLIP real, WIT-400M 85.2 87.5† better representations for both SynCLR and CLIP.
real, LAION-400M 85.0 86.6†
OpenCLIP
real, LAION-2B - 87.1† state of the art self-supervised methods [4, 5, 14, 27, 44, 100,
SynCLR synthetic, 600M 85.8 87.9† 110] in Table 8. Our SynCLR outperforms models trained on
ImageNet images or large scale image datasets. Specifically,
Table 8. Top-1 accuracy on ImageNet with fine-tuning evalu- SynCLR outperforms OpenCLIP ViT-L trained on Laion-2B,
ation.. Models are fine-tuned at 224x224 resolution. † use patch which is the dataset Stable Diffusion (the text2image model
size of 14x14. we used) is trained on. This contrasts with [30, 78], which
shows that directly training a classifier on synthetic images
towards them. This can be an advantage for synthetic data yields bad classification accuracy. Our finding suggests syn-
when we know what downstream tasks to solve. Besides, thetic images are good for training representations, which
SynCLR outperforms CLIP and StableRep by 3.3% and by later can be easily adapted to a downstream task with limited
5.6% for ViT-B, respectively. amount of real data.
Semantic segmentation. To evaluate the pixel-level under-
standing ability of SynCLR, we fine-tune the pre-trained 4.3. Further analysis
models on ADE20k [109], following the setup in [5, 44]. SynCLR requires a list of concepts C to start off. But how
UperNet [99] is used as the task layer, and we evaluate with will SynCLR transfer to concepts outside our list?
a single-scale, i.e. 512x512. Besides CLIP and DINO v2, Generalize to unseen concepts. We consider additional
we also compare to self-supervised methods pre-trained on datasets whose classes are outside the synthesis list, in-
ImageNet, as well as BEiT v2 [70], which distills from CLIP. cluding EuroSAT [46], GTSRB [85], Country211 [71],
Table 7 shows that our SynCLR outperforms self-supervised MNIST [54], RESISC45 [16], and KITTI distances [35].
methods trained on IN-1k by a clear marge, e.g., 4.3 higher These datasets, except for KITTI, are also outside the cura-
mIoU than iBOT. Despite not involving a high resolution pre- tion list of DINO v2. Therefore, it is also a generalization
training period like DINO v2 (e.g., 518x518), SynCLR per- test for DINO v2. Table 9 shows the linear probing results.
forms similarly with DINO v2 (0.1 lower for ViT-B possibly SynCLR outperforms DINO v2 by 1.5% for ViT-B and 1.1%
because DINO v2 uses a smaller patch size of 14x14, but for ViT-L, respectively. This suggests the representations of
0.2 higher for ViT-L). This suggests SynCLR pre-training is SynCLR generalize. CLIP outperforms SynCLR and DINO
suitable for dense prediction tasks. v2, with most gains coming from Country211. An explana-
ImageNet fine-tuning. We evaluate the fine-tuning transfer tion is CLIP’s training data contains similar country flags
ability of SynCLR on ImageNet. We compare with other which are not in the training sets of SynCLR and DINO v2.

8
Dino v2 SynCLR (ours) Dino v2 SynCLR (ours) Dino v2 SynCLR (ours)

(a) (b) (c)

Figure 5. PCA visualization. Follow DINO v2 [68], we compute a PCA between the patches of the images from the same set and colorize
by their first 3 components. Compared to DINO v2, SynCLR produces more accurate maps for cars (e.g., zoom-in to see the two bars on the
roof of the first car, and the three side windows of the third car) and airplanes (e.g., the boundaries), while being slightly worse for dogs (e.g.,
heads). We use ViT-L/14 for both methods. Images are resized to 336x448 resolution before being fed into the networks, yielding 24x32
visualization grids.

82 ViT-B/16 out of the 4 synthesized images in each iteration. Following

ViT-L/14
common practice [71], we train for 32 epochs with a batch
ImageNet Linear Acc. (%)

80 size of 32768. This model achieves 44.4% zero-shot accu-

racy on IN-1k. The SynCaps-150M row in Table 10 presents
78 the linear probing results. Synthetic CLIP learns reason-
ably good features, reaching 78.3% on IN-1k and 87.7% on
76 fine-grained datasets. However, SynCLR is still better.
We have also repeated our experiments with Laion-400M
74
captions, i.e., generate 4 images for each caption and train
1 3 10 40 150
Number of Synthetic Captions (M) SynCLR and CLIP. The comparison between rows SynCaps-
150M and Laion-400M in Table 10 suggests synthetic cap-
Figure 6. ImageNet linear accuracy w/ different training scales. tions are also favorable on a large scale.
PCA visualization. Following the method used in DINO
90 v2 [68], we present visualizations derived from the Principal
ViT-B/16
ViT-L/14 Component Analysis (PCA) conducted on patch features
89
Fine-grained Cls. Acc. (%)

extracted using our model SynCLR. As depicted in Figure 5,

88 a comparative analysis is conducted between SynCLR and
DINO v2, both utilizing the ViT-L/14 architecture. The
87
results demonstrate that SynCLR effectively accentuates the
86 features of cars and planes, while efficiently minimizing
background clutter.
85
Scaling behavior. We train ViT-BViT-L models using ran-
1 3 10 40 150 dom subsets of varying sizes: 1M, 3M, 10M, 40M, and the
Number of Synthetic Captions (M)
comprehensive 150M (measured in the number of captions).
Figure 7. Fine-grained classification w/ different training scales. These models are trained over a reduced schedule of 300,000
steps and utilizes a smaller batch size of 2048. The outcomes
of linear probing are illustrated in Figures 6 and 7. These
Given that both captions and images are synthesized, a results indicate that the ViT-B model delivers robust perfor-
natural question arises: how would CLIP training perform mance at the 10M scale, with diminishing returns observed
on such data? beyond this point. In contrast, the ViT-L model exhibits a
Compare to CLIP training. We use the same data to train a greater demand for data (i.e., it underperforms ViT-B at the
ViT-B CLIP model. For each caption, we randomly choose 1 3M scale) and scales better with data.

9
5. Discussions and Conclusion atao Gu, and Michael Auli. Data2vec: A general framework
for self-supervised learning in speech, vision and language.
Why learn from generative models? One compelling rea- In ICML, 2022. 3, 8
son is that a generative model can act like hundreds of [5] Hangbo Bao, Li Dong, Songhao Piao, and Furu Wei. Beit:
datasets simultaneously. Traditionally, researchers have to Bert pre-training of image transformers. arXiv preprint
spend separate effort collecting datasets for different image arXiv:2106.08254, 2021. 3, 8, 10, 14
categories, e.g., cars, flowers, cats, dogs, and so on. DINO [6] Suzanna Becker and Geoffrey E Hinton. Self-organizing
v2 [68] achieves robust representations by curating and amal- neural network that discovers surfaces in random-dot stere-
gamating numerous such datasets. Such a process introduces ograms. Nature, 1992. 3
complexities such as clustering and search challenges. In [7] Lukas Bossard, Matthieu Guillaumin, and Luc Van Gool.
contrast, advanced text-to-image generative models like Sta- Food-101–mining discriminative components with random
ble Diffusion [72] or Imagen [77] have the capability to forests. In ECCV, 2014. 5, 6
generate many diverse datasets. These models provide the [8] Emmanuel Asiedu Brempong, Simon Kornblith, Ting Chen,
flexibility to produce an infinite number of samples (albeit Niki Parmar, Matthias Minderer, and Mohammad Norouzi.
finite diversity) and control the generation process through Denoising pretraining for semantic segmentation. In CVPR,
textual input. Thus, generative models offer a convenient and 2022. 3
effective method for curating training data. In our study, we [9] Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah,
harness this advantage to synthesize images encompassing a Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan,
Pranav Shyam, Girish Sastry, Amanda Askell, et al. Lan-
broad spectrum of visual concepts.
guage models are few-shot learners. NeurIPS, 2020. 3
What can be further improved? Enhanced caption sets
[10] Mathilde Caron, Ishan Misra, Julien Mairal, Priya Goyal,
can be achieved through various methods, such as enriching
Piotr Bojanowski, and Armand Joulin. Unsupervised learn-
the set of in-context examples, optimizing the sampling ra- ing of visual features by contrasting cluster assignments. In
tios among different concepts, and utilizing more advanced NeurIPS, 2020. 5
LLMs. In terms of the learning process, one approach is to [11] Mathilde Caron, Hugo Touvron, Ishan Misra, Hervé Jé-
distill knowledge from a larger model, and incorporate an ad- gou, Julien Mairal, Piotr Bojanowski, and Armand Joulin.
ditional high-resolution training phase (as discussed in [68]) Emerging properties in self-supervised vision transformers.
or an intermediate IN-21k fine-tuning stage (as per [5, 70]). In ICCV, 2021. 3, 5, 6, 14
Regarding architectural improvements, the integration of [12] Soravit Changpinyo, Piyush Sharma, Nan Ding, and Radu
SwiGLU and LayerScale, coupled with superior model ini- Soricut. Conceptual 12m: Pushing web-scale image-text
tialization strategies (referenced in [32]), can be beneficial. pre-training to recognize long-tail visual concepts. In CVPR,
However, due to limited resources and the scope of this 2021. 6
paper not being focused on achieving the highest possible [13] Ting Chen, Simon Kornblith, Mohammad Norouzi, and Ge-
metrics, we propose these areas for further exploration in offrey Hinton. A simple framework for contrastive learning
future research endeavors. of visual representations. In ICML, 2020. 2, 3, 5, 15
In summary, this paper studies a new paradigm for visual [14] Xinlei Chen, Saining Xie, and Kaiming He. An empirical
representation learning – learning from generative models. study of training self-supervised vision transformers. In
Without using any real data, SynCLR learns visual represen- ICCV, 2021. 5, 6, 8, 14
tations that are comparable with those achieved by state of [15] Yuhua Chen, Wen Li, Xiaoran Chen, and Luc Van Gool.
the art general-purpose visual representation learners. Learning semantic segmentation from synthetic data: A
geometrically guided input-output adaptation approach. In
CVPR, 2019. 3
References [16] Gong Cheng, Junwei Han, and Xiaoqiang Lu. Remote
[1] Hassan Abu Alhaija, Siva Karthik Mustikovela, Lars sensing image scene classification: Benchmark and state of
Mescheder, Andreas Geiger, and Carsten Rother. Aug- the art. Proceedings of the IEEE, 2017. 8
mented reality meets computer vision: Efficient data genera- [17] Mehdi Cherti, Romain Beaumont, Ross Wightman, Mitchell
tion for urban driving scenes. IJCV, 2018. 3 Wortsman, Gabriel Ilharco, Cade Gordon, Christoph Schuh-
[2] Mahmoud Assran, Mathilde Caron, Ishan Misra, Piotr Bo- mann, Ludwig Schmidt, and Jenia Jitsev. Reproducible
janowski, Florian Bordes, Pascal Vincent, Armand Joulin, scaling laws for contrastive language-image learning. In
Mike Rabbat, and Nicolas Ballas. Masked siamese networks CVPR, 2023. 7
for label-efficient learning. In ECCV, 2022. 3 [18] Mircea Cimpoi, Subhransu Maji, Iasonas Kokkinos, Sammy
[3] Shekoofeh Azizi, Simon Kornblith, Chitwan Saharia, Mo- Mohamed, and Andrea Vedaldi. Describing textures in the
hammad Norouzi, and David J Fleet. Synthetic data from wild. In CVPR, 2014. 5, 6
diffusion models improves imagenet classification. arXiv [19] Kevin Clark and Priyank Jaini. Text-to-image diffu-
preprint arXiv:2304.08466, 2023. 2 sion models are zero-shot classifiers. arXiv preprint
[4] Alexei Baevski, Wei-Ning Hsu, Qiantong Xu, Arun Babu, Ji- arXiv:2303.15233, 2023. 3

10
[20] Kevin Clark, Minh-Thang Luong, Quoc V Le, and Christo- [36] Spyros Gidaris, Praveer Singh, and Nikos Komodakis. Un-
pher D Manning. Electra: Pre-training text encoders supervised representation learning by predicting image rota-
as discriminators rather than generators. arXiv preprint tions. In ICLR, 2018. 2
arXiv:2003.10555, 2020. 14 [37] Ross Girshick, Jeff Donahue, Trevor Darrell, and Jitendra
[21] Ekin D Cubuk, Barret Zoph, Jonathon Shlens, and Quoc V Malik. Rich feature hierarchies for accurate object detection
Le. Randaugment: Practical automated data augmentation and semantic segmentation. In CVPR, 2014. 3
with a reduced search space. In CVPR workshops, 2020. 14 [38] Priya Goyal, Dhruv Mahajan, Abhinav Gupta, and Ishan
[22] Marco Cuturi. Sinkhorn distances: Lightspeed computation Misra. Scaling and benchmarking self-supervised visual
of optimal transport. In NeurIPS, 2013. 5 representation learning. In ICCV, 2019. 1
[23] Mostafa Dehghani, Josip Djolonga, Basil Mustafa, Piotr [39] Jean-Bastien Grill, Florian Strub, Florent Altché, Corentin
Padlewski, Jonathan Heek, Justin Gilmer, Andreas Peter Tallec, Pierre Richemond, Elena Buchatskaya, Carl Doersch,
Steiner, Mathilde Caron, Robert Geirhos, Ibrahim Alabdul- Bernardo Avila Pires, Zhaohan Guo, Mohammad Ghesh-
mohsin, et al. Scaling vision transformers to 22 billion laghi Azar, et al. Bootstrap your own latent-a new approach
parameters. In ICML, 2023. 3 to self-supervised learning. In NeurIPS, 2020. 3, 5, 14, 15
[24] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li,
[40] Raia Hadsell, Sumit Chopra, and Yann LeCun. Dimension-
and Li Fei-Fei. Imagenet: A large-scale hierarchical image
ality reduction by learning an invariant mapping. In CVPR,
database. In CVPR, 2009. 1, 3, 5
2006. 3
[25] Jeff Donahue and Karen Simonyan. Large scale adversarial
representation learning. NeurIPS, 2019. 3 [41] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun.
[26] Jeff Donahue, Yangqing Jia, Oriol Vinyals, Judy Hoffman, Deep residual learning for image recognition. In CVPR,
Ning Zhang, Eric Tzeng, and Trevor Darrell. Decaf: A deep 2016. 3
convolutional activation feature for generic visual recogni- [42] Kaiming He, Ross Girshick, and Piotr Dollár. Rethinking
tion. In ICML, 2014. 3 imagenet pre-training. In ICCV, 2019. 3
[27] Xiaoyi Dong, Jianmin Bao, Ting Zhang, Dongdong Chen, [43] Kaiming He, Haoqi Fan, Yuxin Wu, Saining Xie, and Ross
Weiming Zhang, Lu Yuan, Dong Chen, Fang Wen, Nenghai Girshick. Momentum contrast for unsupervised visual rep-
Yu, and Baining Guo. Peco: Perceptual codebook for bert resentation learning. In CVPR, 2020. 3, 5, 14
pre-training of vision transformers. In AAAI, 2023. 8 [44] Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Pi-
[28] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, otr Dollár, and Ross Girshick. Masked autoencoders are
Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, scalable vision learners. In CVPR, 2022. 3, 8, 14
Mostafa Dehghani, Matthias Minderer, Georg Heigold, Syl- [45] Ruifei He, Shuyang Sun, Xin Yu, Chuhui Xue, Wenqing
vain Gelly, et al. An image is worth 16x16 words: Trans- Zhang, Philip Torr, Song Bai, and Xiaojuan Qi. Is synthetic
formers for image recognition at scale. arXiv preprint data from generative models ready for image recognition?
arXiv:2010.11929, 2020. 2, 6 arXiv preprint arXiv:2210.07574, 2022. 2, 3
[29] Mark Everingham, Luc Van Gool, Christopher KI Williams, [46] Patrick Helber, Benjamin Bischke, Andreas Dengel, and
John Winn, and Andrew Zisserman. The pascal visual object Damian Borth. Eurosat: A novel dataset and deep learning
classes (voc) challenge. IJCV, 2010. 6 benchmark for land use and land cover classification. IEEE
[30] Lijie Fan, Kaifeng Chen, Dilip Krishnan, Dina Katabi, Journal of Selected Topics in Applied Earth Observations
Phillip Isola, and Yonglong Tian. Scaling laws of synthetic and Remote Sensing, 2019. 8
images for model training ... for now. arXiv:2312.04567,
[47] Gao Huang, Yu Sun, Zhuang Liu, Daniel Sedra, and Kilian Q
2023. 2, 3, 8
Weinberger. Deep networks with stochastic depth. In ECCV,
[31] Lijie Fan, Dilip Krishnan, Phillip Isola, Dina Katabi, and
2016. 14
Yonglong Tian. Improving clip training with language
rewrites. In NeurIPS, 2023. 3 [48] Ali Jahanian, Xavier Puig, Yonglong Tian, and Phillip Isola.
[32] Yuxin Fang, Quan Sun, Xinggang Wang, Tiejun Huang, Xin- Generative models as a data source for multiview represen-
long Wang, and Yue Cao. Eva-02: A visual representation tation learning. arXiv preprint arXiv:2106.05258, 2021. 2,
for neon genesis. arXiv preprint arXiv:2303.11331, 2023. 3
10 [49] Chao Jia, Yinfei Yang, Ye Xia, Yi-Ting Chen, Zarana Parekh,
[33] Yuxin Fang, Wen Wang, Binhui Xie, Quan Sun, Ledell Wu, Hieu Pham, Quoc Le, Yun-Hsuan Sung, Zhen Li, and Tom
Xinggang Wang, Tiejun Huang, Xinlong Wang, and Yue Cao. Duerig. Scaling up visual and vision-language representa-
Eva: Exploring the limits of masked visual representation tion learning with noisy text supervision. In ICML, 2021.
learning at scale. In CVPR, 2023. 3 3
[34] Li Fei-Fei, Rob Fergus, and Pietro Perona. Learning gener- [50] Prannay Khosla, Piotr Teterwak, Chen Wang, Aaron Sarna,
ative visual models from few training examples: An incre- Yonglong Tian, Phillip Isola, Aaron Maschinot, Ce Liu, and
mental bayesian approach tested on 101 object categories. Dilip Krishnan. Supervised contrastive learning. In NeurIPS,
In CVPR, 2004. 5, 6 2020. 2, 4
[35] Andreas Geiger, Philip Lenz, and Raquel Urtasun. Are we [51] Jonathan Krause, Jia Deng, Michael Stark, and Li Fei-Fei.
ready for autonomous driving? the kitti vision benchmark Collecting a large-scale dataset of fine-grained cars. tech
suite. In CVPR, 2012. 8 report, 2013. 5, 6

11
[52] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. arXiv preprint arXiv:2304.07193, 2023. 1, 2, 3, 5, 7, 9, 10,
Imagenet classification with deep convolutional neural net- 15
works. In NeurIPS, 2012. 3 [69] Omkar M Parkhi, Andrea Vedaldi, Andrew Zisserman, and
[53] Varun Kumar, Ashutosh Choudhary, and Eunah Cho. Data CV Jawahar. Cats and dogs. In CVPR, 2012. 5, 6
augmentation using pre-trained transformer models. arXiv [70] Zhiliang Peng, Li Dong, Hangbo Bao, Qixiang Ye, and Furu
preprint arXiv:2003.02245, 2020. 3 Wei. Beit v2: Masked image modeling with vector-quantized
[54] Yann LeCun. The mnist database of handwritten digits. visual tokenizers. arXiv preprint arXiv:2208.06366, 2022.
http://yann. lecun. com/exdb/mnist/, 1998. 8 8, 10
[55] Alexander C Li, Mihir Prabhudesai, Shivam Duggal, Ellis [71] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya
Brown, and Deepak Pathak. Your diffusion model is secretly Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry,
a zero-shot classifier. arXiv preprint arXiv:2303.16203, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning
2023. 3 transferable visual models from natural language supervi-
[56] Tianhong Li, Huiwen Chang, Shlok Mishra, Han Zhang, sion. In ICML, 2021. 1, 2, 3, 7, 8, 9
Dina Katabi, and Dilip Krishnan. Mage: Masked generative [72] Robin Rombach, Andreas Blattmann, Dominik Lorenz,
encoder to unify representation learning and image synthesis. Patrick Esser, and Björn Ommer. High-resolution image
In CVPR, 2023. 3 synthesis with latent diffusion models. In CVPR, 2022. 10
[57] Yanghao Li, Saining Xie, Xinlei Chen, Piotr Dollar, Kaim- [73] Robin Rombach, Andreas Blattmann, Dominik Lorenz,
ing He, and Ross Girshick. Benchmarking detection Patrick Esser, and Björn Ommer. High-resolution image
transfer learning with vision transformers. arXiv preprint synthesis with latent diffusion models. In CVPR, 2022. 3
arXiv:2111.11429, 2021. 3 [74] Andrew Rosenberg, Yu Zhang, Bhuvana Ramabhadran, Ye
[58] Hao Liu, Tom Zahavy, Volodymyr Mnih, and Satinder Singh. Jia, Pedro Moreno, Yonghui Wu, and Zelin Wu. Speech
Palm up: Playing in the latent manifold for unsupervised recognition with augmented synthesized speech. In ASRU,
pretraining. arXiv preprint arXiv:2210.10913, 2022. 3 2019. 3
[59] Ilya Loshchilov and Frank Hutter. Decoupled weight decay [75] Nick Rossenbach, Albert Zeyer, Ralf Schlüter, and Hermann
regularization. arXiv preprint arXiv:1711.05101, 2017. 14, Ney. Generating synthetic audio data for attention-based
15 speech recognition systems. In ICASSP, 2020. 3
[60] Subhransu Maji, Esa Rahtu, Juho Kannala, Matthew [76] Yangjun Ruan, Saurabh Singh, Warren Morningstar, Alexan-
Blaschko, and Andrea Vedaldi. Fine-grained visual clas- der A Alemi, Sergey Ioffe, Ian Fischer, and Joshua V Dillon.
sification of aircraft. arXiv:1306.5151, 2013. 5, 6 Weighted ensemble self-supervised learning. arXiv preprint
[61] Nikolaus Mayer, Eddy Ilg, Philip Hausser, Philipp Fischer, arXiv:2211.09981, 2022. 5
Daniel Cremers, Alexey Dosovitskiy, and Thomas Brox. A [77] Chitwan Saharia, William Chan, Saurabh Saxena, Lala
large dataset to train convolutional networks for disparity, Li, Jay Whang, Emily L Denton, Kamyar Ghasemipour,
optical flow, and scene flow estimation. In CVPR, 2016. 3 Raphael Gontijo Lopes, Burcu Karagol Ayan, Tim Salimans,
[62] Yu Meng, Jiaxin Huang, Yu Zhang, and Jiawei Han. Gener- et al. Photorealistic text-to-image diffusion models with
ating training data with language models: Towards zero-shot deep language understanding. In NeurIPS, 2022. 10
language understanding. arXiv preprint arXiv:2202.04538, [78] Mert Bulent Sariyildiz, Karteek Alahari, Diane Larlus, and
2022. 3 Yannis Kalantidis. Fake it till you make it: Learning trans-
[63] Masato Mimura, Sei Ueno, Hirofumi Inaguma, Shinsuke ferable representations from synthetic imagenet clones. In
Sakai, and Tatsuya Kawahara. Leveraging sequence-to- CVPR, 2023. 2, 3, 8
sequence speech synthesis for enhancing acoustic-to-word [79] Saurabh Saxena, Charles Herrmann, Junhwa Hur, Abhishek
speech recognition. In SLT, 2018. 3 Kar, Mohammad Norouzi, Deqing Sun, and David J Fleet.
[64] Maria-Elena Nilsback and Andrew Zisserman. Automated The surprising effectiveness of diffusion models for opti-
flower classification over a large number of classes. In cal flow and monocular depth estimation. arXiv preprint
Indian Conference on Computer Vision, Graphics & Image arXiv:2306.01923, 2023. 3
Processing, 2008. 5, 6 [80] Christoph Schuhmann, Romain Beaumont, Richard Vencu,
[65] Mehdi Noroozi and Paolo Favaro. Unsupervised learning of Cade Gordon, Ross Wightman, Mehdi Cherti, Theo
visual representations by solving jigsaw puzzles. In ECCV, Coombes, Aarush Katta, Clayton Mullis, Mitchell Worts-
2016. 3 man, et al. Laion-5b: An open large-scale dataset for training
[66] Aaron van den Oord, Yazhe Li, and Oriol Vinyals. Repre- next generation image-text models. In NeurIPS, 2022. 1
sentation learning with contrastive predictive coding. arXiv [81] Ali Sharif Razavian, Hossein Azizpour, Josephine Sullivan,
preprint arXiv:1807.03748, 2018. 3 and Stefan Carlsson. Cnn features off-the-shelf: an astound-
[67] OpenAI. Gpt-4 technical report. arXiv preprint ing baseline for recognition. In CVPR workshops, 2014.
arXiv:2303.08774, 2023. 3, 4 3
[68] Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy [82] Noam Shazeer. Glu variants improve transformer. arXiv
Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, preprint arXiv:2002.05202, 2020. 7
Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, et al. [83] David Silver, Julian Schrittwieser, Karen Simonyan, Ioannis
Dinov2: Learning robust visual features without supervision. Antonoglou, Aja Huang, Arthur Guez, Thomas Hubert, Lu-

12
cas Baker, Matthew Lai, Adrian Bolton, et al. Mastering the [100] Zhenda Xie, Zheng Zhang, Yue Cao, Yutong Lin, Jianmin
game of go without human knowledge. Nature, 2017. 3 Bao, Zhuliang Yao, Qi Dai, and Han Hu. Simmim: A simple
[84] Karen Simonyan and Andrew Zisserman. Very deep convo- framework for masked image modeling. In CVPR, 2022. 3,
lutional networks for large-scale image recognition. arXiv 8
preprint arXiv:1409.1556, 2014. 3 [101] Jiarui Xu, Sifei Liu, Arash Vahdat, Wonmin Byeon, Xiao-
[85] Johannes Stallkamp, Marc Schlipsing, Jan Salmen, and long Wang, and Shalini De Mello. Open-vocabulary panop-
Christian Igel. The german traffic sign recognition bench- tic segmentation with text-to-image diffusion models. In
mark: a multi-class classification competition. In IJCNN, CVPR, 2023. 3
2011. 8 [102] Yiben Yang, Chaitanya Malaviya, Jared Fernandez, Swabha
[86] Andreas Steiner, Alexander Kolesnikov, Xiaohua Zhai, Ross Swayamdipta, Ronan Le Bras, Ji-Ping Wang, Chandra Bha-
Wightman, Jakob Uszkoreit, and Lucas Beyer. How to train gavatula, Yejin Choi, and Doug Downey. Generative data
your vit? data, augmentation, and regularization in vision augmentation for commonsense reasoning. arXiv preprint
transformers. arXiv preprint arXiv:2106.10270, 2021. 7 arXiv:2004.11546, 2020. 3
[87] Rohan Taori, Ishaan Gulrajani, Tianyi Zhang, Yann Dubois, [103] Sangdoo Yun, Dongyoon Han, Seong Joon Oh, Sanghyuk
Xuechen Li, Carlos Guestrin, Percy Liang, and Tatsunori B Chun, Junsuk Choe, and Youngjoon Yoo. Cutmix: Regu-
Hashimoto. Alpaca: A strong, replicable instruction- larization strategy to train strong classifiers with localizable
following model. Stanford Center for Research on Founda- features. In ICCV, 2019. 14
tion Models., 2023. 3 [104] Xiaohua Zhai, Alexander Kolesnikov, Neil Houlsby, and
[88] Yonglong Tian, Dilip Krishnan, and Phillip Isola. Con- Lucas Beyer. Scaling vision transformers. In CVPR, 2022.
trastive multiview coding. arXiv:1906.05849, 2019. 3, 14 7
[89] Yonglong Tian, Chen Sun, Ben Poole, Dilip Krishnan, [105] Hongyi Zhang, Moustapha Cisse, Yann N Dauphin, and
Cordelia Schmid, and Phillip Isola. What makes for good David Lopez-Paz. mixup: Beyond empirical risk minimiza-
views for contrastive learning? In NeurIPS, 2020. 3 tion. arXiv preprint arXiv:1710.09412, 2017. 14
[90] Yonglong Tian, Olivier J Henaff, and Aäron van den Oord. [106] Richard Zhang, Phillip Isola, and Alexei A Efros. Colorful
Divide and contrast: Self-supervised learning from uncu- image colorization. In ECCV, 2016. 2
rated data. In ICCV, 2021. 1 [107] Wenliang Zhao, Yongming Rao, Zuyan Liu, Benlin Liu,
[91] Yonglong Tian, Lijie Fan, Phillip Isola, Huiwen Chang, and Jie Zhou, and Jiwen Lu. Unleashing text-to-image dif-
Dilip Krishnan. Stablerep: Synthetic images from text-to- fusion models for visual perception. arXiv preprint
image models make strong visual representation learners. In arXiv:2303.02153, 2023. 3
NeurIPS, 2023. 1, 2, 3, 4, 5, 6, 7, 14 [108] Bolei Zhou, Aditya Khosla, Agata Lapedriza, Aude Oliva,
[92] Hugo Touvron, Matthieu Cord, Alexandre Sablayrolles, and Antonio Torralba. Learning deep features for discrimi-
Gabriel Synnaeve, and Hervé Jégou. Going deeper with native localization. In CVPR, 2016. 3, 5
image transformers. In ICCV, 2021. 7 [109] Bolei Zhou, Hang Zhao, Xavier Puig, Tete Xiao, Sanja
[93] Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Fidler, Adela Barriuso, and Antonio Torralba. Semantic
Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, understanding of scenes through the ade20k dataset. IJCV,
Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. 2019. 8, 14
Llama 2: Open foundation and fine-tuned chat models. arXiv [110] Jinghao Zhou, Chen Wei, Huiyu Wang, Wei Shen, Cihang
preprint arXiv:2307.09288, 2023. 3, 4 Xie, Alan Yuille, and Tao Kong. ibot: Image bert pre-training
[94] Gul Varol, Javier Romero, Xavier Martin, Naureen Mah- with online tokenizer. arXiv preprint arXiv:2111.07832,
mood, Michael J Black, Ivan Laptev, and Cordelia Schmid. 2021. 2, 3, 5, 8
Learning from synthetic humans. In CVPR, 2017. 3 [111] Yongchao Zhou, Hshmat Sahak, and Jimmy Ba. Training on
[95] Tongzhou Wang and Phillip Isola. Understanding contrastive thin air: Improve image classification with generated data.
representation learning through alignment and uniformity arXiv preprint arXiv:2305.15316, 2023. 3
on the hypersphere. In ICML, 2020. 3
[96] Chen Wei, Haoqi Fan, Saining Xie, Chao-Yuan Wu, Alan
Yuille, and Christoph Feichtenhofer. Masked feature predic-
tion for self-supervised visual pre-training. In CVPR, 2022.
3
[97] Zhirong Wu, Yuanjun Xiong, Stella X Yu, and Dahua Lin.
Unsupervised feature learning via non-parametric instance
discrimination. In CVPR, 2018. 3
[98] Jianxiong Xiao, James Hays, Krista A Ehinger, Aude Oliva,
and Antonio Torralba. Sun database: Large-scale scene
recognition from abbey to zoo. In CVPR, 2010. 5, 6
[99] Tete Xiao, Yingcheng Liu, Bolei Zhou, Yuning Jiang, and
Jian Sun. Unified perceptual parsing for scene understanding.
In ECCV, 2018. 8, 14

13
A. Concept Sampling which tries to concatenate cls token with average pooled
patch tokens and sweep over whether to use multiple layers.
The concepts used to synthesize captions are randomly sam-
We follow prior work [11, 14] to train the linear classifier.
pled from the names of various datasets. The rough ratios
It has been generally observed that regularization such as
are presented in Table 11. It is likely that different combina-
weight decay hurts the performance [43, 88]. Therefore,
tions of these ratios lead to different results, but we do not
we set weight decay as 0, and we sweep the base_lr over
optimize over this dimension. For example, we simply con-
{0.1, 0.2, 0.5, 1, 2, 5, 10, 20, 50} × 10−2 .
catenate IN-21k concepts with the classes of other datasets
(e.g., Caltech-101, Pets), and do uniform sampling from the config value
concatenated list. This may lead to under-sampling for other batch size 1024
datasets, as the list is dominated by IN-21 classes. optimizer SGD
base learning rate sweep
source prob. peak learning rate blr × bsz/256
IN-1k 0.47 weight decay 0
Aircraft 0.05 optimizer momentum 0.9
Cars 0.05 learning rate schedule cosine decay
epochs 90
Food 0.05
augmentation RandomResizedCrop, Flip
Flowers 0.03
Places-365, SUN397 0.09
IN-21k and others 0.26 Table 13. ImageNet linear probing settings.

Table 11. Rough concept sampling probabilities. B.3. End-to-End ImageNet fine-tuning
Following common practice [5, 44], we append a linear
B. Implementation Details classifier on top of the CLS token of the last transformer
block, and fine-tune the whole network. We use layer-wise
B.1. Pre-training lr decay [20]. Table 14 shows the settings.
The setting for our final long schedule training in Section
4.2 is summarized in Table 12, where models are trained for config value
500k steps with a batch size of 8192 captions. For ablation optimizer AdamW [59]
study present in Section 4.1, we only train for 85k steps with base learning rate 5e-5
peak learning rate blr × bsz/256
a batch size of 2048 captions; for the scaling plots in Section
optimizer momentum β1 , β2 =0.9, 0.999
4.3, we train all models for 300k steps with a batch size of
layer-wise lr decay 0.65 (B), 0.8 (L)
2048. batch size 1024
config value learning rate schedule cosine decay
warmup epochs 20 (B), 5 (L)
batch size 8192
epochs 100 (B), 50 (L)
optimizer AdamW [59] RandAugment [21] 9/0.5
peak learning rate 2e-3 (B), 1.5e-3 (L) label smoothing 0.1 (B), 0.2 (L)
weight decay 0.04 –> 0.2, cosine erasing prob. 0.25
optimizer momentum β1 , β2 =0.9, 0.999 mixup [105] 0.8
learning rate schedule cosine decay cutmix [103] 1.0
steps 500k stoch. depth [47] 0.1 (B), 0.3 (L)
warmup steps 80k test crop ratio 0.95 (B), 1.0 (L)
stoch. depth [47] 0.1 (B), 0.4 (L) ema 0.9999
augmentation Downsample [91] + BYOL Aug. [39]
Table 14. ImageNet end-to-end fine-tuning settings.
Table 12. SynCLR pre-training settings.

B.4. Semantic segmentation on ADE20k

B.2. ImageNet linear probing
We conduct the experiments on ADE20k [109]. Follow-
We use the cls token from the final transformer block as ing [5, 44], we use UperNet [99] as the task adaptation layer.
the image representation. This is different from DINO v2, We use the common single-scale [5] setup, with a resolution

14
of 512×512 for models with a patch size of 16×16 and a res-
olution of 518×518 for models with a patch size of 14×14.
The hyper-parameters are summarized in Table 15.

config value
batch size 32 (B), 16 (L)
optimizer AdamW [59]
peak learning rate 8e-5
optimizer momentum β1 , β2 =0.9, 0.999
weight decay 0.05
layer-wise lr decay 0.6 (B), 0.8 (L)
steps 60k (B), 160k (L)
warmup steps 1500
stoch. depth 0.1 (B), 0.2 (L)

Table 15. ADE20k semantic segmentation settings.

B.5. Fine-grained linear classification

Following prior works [13, 39], we train a regularized multi-
nomial logistic regression model upon the output CLS to-
ken. In training and testing, we do not perform any data
augmentation; images are resized to 224 pixels along the
shorter side, followed by a center crop of 224×224. We
minimize the cross-entropy objective using L-BFGS with
ℓ2 -regularization. We select this ℓ2 -regularization constant
on the validation set over 45 logarithmically spaced values
between 10−6 and 105 . The maximum number of L-BFGS
iterations is set to 1000, similar as that in DINO v2 [68].

C. In-context Learning Examples

All of the three types of in-context examples are summarized
in Table 16, Table 17, and Table 18, respectively.

15
Table 16. Detailed in-context learning examples for Template 1: c –> Caption. Here c is the concept.

1 coucal –> A vibrant coucal is perched on the branch of a lush green tree, surrounded by wildflowers.
2 bee eater –> A lively bee eater is elegantly perched on a branch, peering intently.
3 three-toed sloth –> A three-toed sloth is lazily hanging from a sturdy, tropical rainforest tree.
4 hay –> In the serene countryside, hundreds of neatly stacked hay bales lay scattered under the
softly glowing golden sunset sky.
5 station wagon –> A shiny, red station wagon is parked under the dappled shade of a large oak tree,
highlighting its spacious and family-friendly design.
6 zebra –> A zebra is gallantly trotting across the vast, sunlit plains of the African savannah, creating
a captivating black and white spectacle.
7 vase –> In the well-lit living room, a beautifully designed, delicate vase stands out as the center-
piece, exuding an aura of elegance.
8 barber chair –> A shiny black barber chair sits invitingly in a bustling, well-lit barbershop.
9 carbonara –> A heaping plate of creamy carbonara pasta topped with fresh parsley sprigs.
10 mink –> In the midst of a dense forest with shimmering green leaves, a sleek mink gracefully
navigates the underbrush, showcasing its rich, brown fur.
11 small white butterfly –> A small white butterfly gracefully flutters amongst vibrant, blooming summer flowers.
12 christmas stocking –> A vibrant red Christmas stocking is hanging delicately from a festively decorated man-
telpiece.
13 horse-drawn vehicle –> An antique horse-drawn vehicle is stationed amidst a peaceful country landscape, its
rustic wooden structure gleaming under the warm afternoon sun.
14 ruler measuring stick –> A manual craftsman is precisely measuring a wooden log with a ruler stick.
15 picket fence –> A tranquil suburban scene featuring multiple white picket fences surrounding well-
maintained green lawns, punctuated by diverse, colorful flowerbeds.
16 suspension bridge –> Depicting a long suspension bridge, its steel cables elegantly stretching towards the sky,
connecting two ends over a scenic river.
17 brain coral –> A vibrant brain coral stands out amidst the serene backdrop of underwater marine life.
18 revolver –> Multiple antique revolvers lie on a wooden table, gleaming under soft, ambient light.
19 slip-on shoe –> A pair of slip-on shoes, with their sleek, black leather exterior and comfortable, cushioned
interior, are neatly placed on a wooden floor.
20 hand-held computer –> A hand-held computer, compact and portable, rests on a well-lit desk, surrounded by
various technological paraphernalia and a steaming cup of coffee.
21 mattress –> A teddy bear lying face down on a bedspread covered mattress in front of a window.
22 refrigerator –> A nicely decorated kitchen with metallic refrigerator and blue counter.
23 ball –> Silver balls are lined up in the sand as people mill about in the background.
24 wheel –> The motorcycle’s gleaming steering wheel, vivid red door reflected in the side mirror,
and a youth passing by, creating a dynamic urban tableau.
25 plane –> A group of trick planes turned upside down leaving smoke trails.
26 vehicle –> Army vehicles, including a U.S. Army jeep and aircraft in a hangar or on display
27 boy –> a little boy wearing sunglasses laying on a shelf in a basement.
28 fence –> a man standing near a fence as reflected in a side-view mirror of a red car.
29 wood table –> A footed glass with water in front of a glass with ice tea, and green serpentine bottle
with pink flowers, all on a wood table in front of chair, with a window to city view.
30 toilet –> A black and white toilet sitting in a bathroom next to a plant filled with waste.
31 table lamp –> A textured brass table lamp, casting a warm, golden glow, accents a cozy reading nook
beside a leather armchair and a stack of books.
32 hair dryer –> A modern sleek and white hair dryer, with a textured grip, stands next to a set of
hairbrushes.
33 street sign –> The street signs indicate which way a car can and cannot turn while the signal light
controls traffic.
34 instrument –> Man dressed in Native American clothes protecting musical instruments from the rain
with an umbrella.

16
35 train –> A man and a cow’s faces are near each other as a train passes by on a bridge.
36 giraffe –> A couple of large giraffe standing next to each other.
37 red admiral butterfly –> a red admiral butterfly, alights upon a dew-kissed sunflower, wings glistening under the
soft morning light.
38 stupa –> Surrounded by verdant foliage, a white stupa rises, adorned with golden accents and
intricate patterns, while devotees circle its base offering prayers.
39 elephant –> A group of elephants being led into the water.
40 bottle –> Motorcycles parked on a street with a bottle sitting on the seat of the nearest the camera.
41 trombone –> On a polished wooden stage, a gleaming brass trombone rests, its slide extended, next to
scattered sheet music and a muted trumpet.
42 keyboard –> Sleek black keyboard with illuminated backlit keys, a soft wrist rest, and a nearby
wireless mouse on a textured matte desk surface.
43 bear –> The brown bear sits watching another bear climb the rocks
44 snowboard –> A man standing next to his snowboard posing for the camera.
45 railway –> a woman and her son walking along the tracks of a disused railway.
46 sand –> the waves and the sand on the beach close up
47 pixel –> very colorful series of squares or pixels in all the colors of the spectrum , from light to
dark
48 cigar –> a burning cigar in a glass ashtray with a blurred background.
49 music –> happy girl listening music on headphones and using tablet in the outdoor cafe.
50 earring –> this gorgeous pair of earrings were featured in april issue.
51 cliff –> Steep cliff, jagged edges against azure sky, with seabirds soaring and waves crashing
below.
52 corn cob –> Fresh corn cob, golden kernels glistening with dew, nestled amid green husks in a sunlit
field.
53 archaeological exca- –> In this intriguing scene, archaeologists meticulously uncover ancient relics at an archaeo-
vation logical excavation site filled with historical secrets and enigmas.
54 formal garden –> This is an immaculately kept formal garden, with perfectly trimmed hedges, colorful,
well-arranged flower beds, and classic statuary, giving a vibe of tranquil sophistication.
55 veterinarians office –> The busy veterinarian’s office is a hive of activity with pets awaiting treatment and care.
56 elevator –> A modern, well-lit elevator interior with shiny metal walls and sleek buttons.
57 heliport –> Situated in a lively area, the heliport stands out with numerous helicopters taking off and
landing against the city’s skyline.
58 airport terminal –> In the spacious airport terminal, travelers hurriedly navigate through check-ins and
security, making it a hive of constant activity.
59 car interior –> Inside the car, the leather seats exude luxury, contrasted by the high-tech dashboard,
creating an atmosphere of sleek comfort and convenience.
60 train interior –> The inside of the train offers a spacious setting with numerous comfortable seats.
61 candy store –> The sweet aroma of sugared treats fills the air in a vibrant candy store, adorned with
colourful candies and cheerful customers.
62 bus station –> The bustling bus station thrums with restless energy, as travelers navigate through the
crowded space, awaiting their journeys amid the echoes of departing buses.
63 castle –> Nestled amidst towering mountains, the majestic castle spews ancient grandeur, with its
stone walls and towering turrets exuding tranquility and timeless mystique.
64 palace –> The grand palace exudes regality, radiant under the sun, showcasing ornate decorations,
intricate sculptures, and exquisite architectural sophistication.
65 kitchen –> The heart of the home unfolds in the kitchen, characterized by stainless steel appliances,
navy blue cabinets, and a patterned tile backsplash.
66 raceway –> The high-speed adrenaline-filled atmosphere of the raceway is pulsing with the roars of
powerful engines and excited cheering fans.
67 bakery –> The warm, inviting bakery is filled with the intoxicating aroma of fresh bread, assorted
pastries, and brewing coffee.

17
68 medina –> This ancient, labyrinth-like medina exudes an air of mystique with its vibrantly decorated
shops lining narrow, stone-cobbled pathways.
69 skyscraper –> The city skyline is dominated by towering skyscrapers, creating a captivating blend of
technology and architectural innovation.
70 supermarket –> The supermarket scene is lively, filled with individuals scanning shelves, children reach-
ing for treats, and clerks restocking fresh produce.
71 closet –> The compact closet, brimming with clothes and shoes, exudes a feeling of organization.
72 assembly line –> In the heart of a busy factory, an orderly assembly line hums with continuous activity,
filled with workers focused on their precision tasks.
73 palace room –> A man in military dress uniform stands in an ornate palace room with antique furniture
and Christmas decorations.
74 barn doorway –> A farmer holding an animal back while another farmer stands in a barn doorway.
75 food court –> A bustling food court with a variety of culinary stalls, featuring vibrant signage, aromatic
dishes, and communal seating, creates a diverse dining experience.
76 mountain –> Majestic mountains, their peaks dusted with snow, overlook a serene alpine lake where
hikers and photographers gather to enjoy the breathtaking scenery.
77 squash court –> Against a clear glass wall, a squash court with gleaming wooden floors, white boundary
lines, and two rackets awaits players.
78 subway station –> Dimly lit subway station with graffiti-covered walls, commuters waiting
79 restaurant –> Cozy restaurant with wooden tables, ambient lighting, patrons chatting, and plates filled
with colorful dishes, framed by exposed brick walls and hanging green plants.
80 field –> there is a large heard of cows and a man standing on a field.
81 aquarium –> Amidst vivid coral formations, an aquarium teems with colorful fish, shimmering under
soft blue lights.
82 market –> A large group of bananas on a table outside in the market.
83 park –> a young boy is skating on ramps at a park
84 beach –> old fishing boats beached on a coastal beach in countryside.
85 grass –> little boy sitting on the grass with drone and remote controller.
86 woven –> The woven basket’s intricate pattern creates a visually captivating and tactile surface.
87 knitted –> The knitted blanket envelops with cozy warmth
88 flecked –> The stone surface was flecked, giving it a uniquely speckled and rough appearance.
89 bubbly –> The liquid gleamed, showcasing its bubbly, effervescent texture vividly.
90 cobwebbed –> The dusty corner was cobwebbed, displaying years of untouched, eerie beauty.
91 stained –> A weather-worn wall manifests an intriguing pattern of stained texture.
92 scaly –> The image showcases a close-up of a lizard’s scaly, rough texture.
93 meshed –> A patterned image depicting the intricate, tightly-knit texture of meshed fabric.
94 waffled –> A fresh, golden-brown waffle displays its distinct crisply waffled texture invitingly.
95 pitted –> The image portrays an intriguing terrain, characterized by a pitted, moon-like surface.
96 studded –> A studded leather jacket gleams, highlighting its rough, tactile texture.
97 crystalline –> The picture showcases an exquisite, crystalline texture with stunning brilliance and
clarity.
98 gauzy –> A delicate veil of gauzy texture enhances the ethereal, dreamy atmosphere.
99 zigzagged –> The photo captures the zigzagged texture, emphasizing the rhythmic, sharp-edged pat-
terns.
100 pleated –> A flowing skirt delicately showcasing the intricate detail of pleated texture.
101 veined –> A detailed image showcasing the intricate, veined texture of a leaf.
102 spiralled –> The spiralled texture of the seashell creates a captivating, tactile pattern.
103 lacelike –> The delicate veil features an intricate, lacelike texture, exuding elegant sophistication.
104 smeared –> A wall coated with thick, smeared paint exudes a rough texture.
105 crosshatched –> A worn, vintage book cover, richly crosshatched, exuding old-world charm.
106 particle –> abstract background of a heart made up of particles.

18
Table 17. Detailed in-context learning examples for Template 2: c,bg –> caption. Here c is the concept, and bg is the background.

107 stick insect, under- –> A stick insect, masterfully camouflaged, clings to a fern amidst the sprawling, dense
growth undergrowth of a lush, tropical forest.
108 black swan, public –> In the peaceful ambiance of a lush public garden, a majestic black swan gracefully glides
garden across a shimmering emerald-green pond.
109 st. bernard, family- –> In the heartwarming family photo, a gregarious St. Bernard dog is seen joyfully nestled
photo among his adoring human companions.
110 measuring cup, food –> In the food prep area, multiple transparent measuring cups are neatly organized on the
prep area marble countertop.
111 can opener, hotel –> A sleek, stainless steel can opener is sitting on the glossy dark-wood kitchenette counter
room of a modern, well-appointed hotel room.
112 small white butterfly, –> A delicate, small white butterfly flutters gracefully above the tranquil pond side, creating
pond side a serene image amidst lush greenery.
113 hair dryer, theatre –> A sleek, professional hair dryer is positioned center stage amidst the dramatic velvet
curtains and ornate details of a bustling theatre.
114 water bottle, airport –> A reusable water bottle sits on the glossy surface of a bustling airport terminal counter,
amidst a backdrop of hurried travelers and departure screens.
115 leonberger, horse –> Several Leonbergers are joyfully romping around a bustling horse ranch.
ranch
116 lighter, motorhome –> In the cozy, cluttered environment of a well-traveled motorhome, a sleek silver lighter
holds dominion on the rustic wooden table.
117 slug, foliage –> A solitary, glistening slug meanders slowly amidst lush, dense green foliage, leaving a
slimy trail on dewy leaves in its path.
118 ring binder, educa- –> The ring binder, filled with important documents, sits prominently on a well-organized
tion department desk in the bustling education department.
119 weimaraner, pet store –> A sleek, silver-gray Weimaraner is spotted curiously sniffing around various pet supplies
in a well-stocked and vibrant pet store.
120 norfolk terrier, coun- –> A lively Norfolk terrier joyfully bounds across a lush, green countryside, its red fur
tryside contrasting vividly with the vast open surroundings.
121 dalmatian, apple or- –> A lively Dalmatian is playfully darting amongst the lush rows of a bountiful apple
chard orchard, its spots contrasting against the ruby fruits.
122 television, mountain –> A sleek, modern television sits prominently against the rustic, wooden walls of an
lodge inviting mountain lodge, surrounded by pine-furnished decor.
123 guillotine, horror –> In the shadowy landscape of a suspenseful horror story, a grim, menacing guillotine
story looms ominously, exuding a petrifying sense of imminent dread.
124 hot tub, condo- –> A luxurious hot tub is nestled in the private balcony of a high-rise condominium, boasting
minium spectacular cityscape views.
125 leaf beetle, plant nurs- –> A vibrant leaf beetle is diligently navigating through a lush plant nursery, its metallic
eries sheen contrasting against the abundant green foliage.
126 carolina anole, hiking –> A small Carolina Anole lizard basks in the warm sunlight, gracefully draped over a
trails gnarled tree root next to a bustling hiking trail.
127 girl, laboratory –> teenage girl and boy working in a laboratory on an experiment.
128 tiger, forest –> Two tigers are running together in the forest.
129 sunset, lake –> Golden sunset hues reflect on a calm lake, silhouetting a lone canoeist against a backdrop
of fiery clouds.
130 building, mountain –> town of skyline over roofs of historic buildings with the mountains in the background.
131 block plane, weath- –> A block plane, its sharp blade gleaming, rests on weathered wood
ered wood
132 olive tree, soil –> single olive tree planted in the center of a dry and cracked soil
133 hamster, pet store –> A curious hamster peers out, with pet store shelves stacked with supplies behind.
134 bag, factory –> plastic bags production line in a factory.

19
135 restaurant, ocean –> young pretty couple dining in a romantic atmosphere at restaurant on the boat with ocean
on the background
136 helicopter, burning –> a helicopter flies over a portion of burning forest.
forest
137 pipe organ, commem- –> striking pipe organ dominates with its notes resonating, while a somber commemoration
oration event event unfolds in the backdrop
138 rotisserie, wedding –> Rotisserie turning golden meats, with a bustling wedding reception, twinkling lights, and
reception guests mingling.
139 duck, taiga –> A group of ducks paddle on a tranquil pond, dense taiga and towering conifers looming
in the background.
140 tiger beetle, rice –> Amidst verdant rice fields, a shimmering tiger beetle perches prominently on a dew-
fields kissed blade of grass.
141 girl, barn –> slow motion clip of a girl walking with her horse through a barn
142 headmaster, gradua- –> the headmaster addresses the graduating seniors during graduation ceremonies.
tion ceremony
143 businessperson, mu- –> businessperson and guest attend music festival.
sic festival
144 fountain, park –> Water cascades from an ornate fountain, surrounded by autumn-hued trees in a serene
park.
145 speedboat, water –> A sleek speedboat glides on shimmering waters, powered by twin high-horsepower
outboard motors.
146 pipe, beach –> a rusty water pipe on the beach.
147 pretzel, home kitchen –> Golden pretzel rests on a wooden board, with a cozy home kitchen, pots and tiled
backsplash, behind.
148 forklift, paper mill –> A forklift transports hefty paper rolls amidst the industrial bustling paper mill.
149 lotion, therapy center –> Blue lotion bottles lined up at a thalasso therapy center by the ocean.
150 guinea pig, sand –> Guinea pig exploring vast golden sand dunes, with tiny footprints trailing behind.
dunes
151 groom, wedding cere- –> father of groom congratulating him after the wedding ceremony.
mony
152 fishing boat, village –> fishing boats moored at fishing village a suburb of capital of the state,
153 red fox, yard –> wild red fox sitting on a partially snow covered front yard of a house in the suburbs of a
small city
154 grey wolf, woodland –> A grey wolf prowls silently, eyes alert, through dense, misty woodland areas with
areas moss-covered trees.
155 cheetah, edges of –> A cheetah crouches, poised and watchful, at the lush edges of murky swamplands.
swamplands
156 wine bottle, living –> in the living room, a person si opening a wine bottle with corkscrew with wooden barrel
room

Table 18. Detailed in-context learning examples for Template 3: c,rel –> caption. Here c is the concept, and rel is the relation.

157 product packet / pack- –> A vibrant product packet, adorned with colorful labels and intricate designs, is neatly
aging, next to placed next to an elegant crystal glass.
158 croquet ball, behind –> A vivid, red croquet ball rests serenely, hiding behind a worn, rustic wooden fence in a
sun-kissed, lush green lawn.
159 bassoon, in front of –> A beautifully crafted bassoon stands elegantly in front of a backdrop of velvet curtains,
ready to perform at a concert.
160 grand piano, above –> A gorgeous, antique chandelier is suspended above the glossy black grand piano, illumi-
nating it with warm, opulent light.
161 bolo tie, behind –> A beautifully crafted bolo tie is casually hung, indicating its previous use, behind a rustic,
well-polished wooden shelf.

20
162 waffle iron, next to –> A large, black waffle iron is placed next to a sparkling glass jar filled with golden maple
syrup on a wooden countertop.
163 komodo dragon, be- –> A young child grins excitedly, peering down from a secure bridge, as a colossal Komodo
low dragon sprawls lazily below in the wildlife park.
164 vaulted or arched ceil- –> Besides the grand marble statue, glimpses of an intricate vaulted or arched ceiling add to
ing, besides the room’s majestic charm.
165 gossamer-winged –> A lovely, vibrant gossamer-winged butterfly is gently perched next to a dew-kissed red
butterfly, next to rose in an early morning garden.
166 kit fox, in front of –> A group of small, fluffy, golden kit foxes is playfully gathered in front of a lush, green,
towering forest backdrop.
167 koala, in –> A cute, fuzzy koala is visibly relaxed, nestled contentedly in the crook of a towering,
lush green eucalyptus tree.
168 centipede, above –> A vibrant green centipede is effortlessly crawling on a tree branch, positioned distinctly
above a patch of untouched fern leaves.
169 mountain bike, above –> A mountain bike is displayed prominently above the rustic mantlepiece, showcasing its
sleek design and intricate details.
170 wallaby, above –> A fluffy, brown wallaby is leaping high, appearing as if it is effortlessly floating above a
lush, green Australian field.
171 giant panda, on –> A playful giant panda is perched on a sturdy tree branch, munching on fresh green
bamboo amidst the tranquil forest ambiance.
172 beagle, on –> A pack of adorable beagles are spotted lounging on an expansive, sunbathed meadow
with colorful wildflowers sprouting around them.
173 beach, on –> A vivid sunset is on display over a sprawling beach, casting warm hues on the waves
gently lapping at the sandy shore.
174 grey whale, on –> A voluminous grey whale is majestically breaching, its massive body on display against
the azure backdrop of the expansive ocean.
175 tractor, in front of –> A bright red tractor is parked in front of a rustic, weathered barn, casting long shadows
under the golden afternoon sun.
176 cabbage, besides –> A vibrant image portrays a lush, green cabbage, glistening with dewdrops, nestled
besides a rustic, wooden crate full of freshly harvested vegetables.

Digital Printable ADHD Manual - The ADHD Brain Explained
97% (35)
Digital Printable ADHD Manual - The ADHD Brain Explained
81 pages
Robert Gagne's Conditions of Learning
No ratings yet
Robert Gagne's Conditions of Learning
21 pages
Cranial Nerve Examination OSCE Guide
No ratings yet
Cranial Nerve Examination OSCE Guide
26 pages
Adult Learning Theory and Principles University of Queensland
100% (1)
Adult Learning Theory and Principles University of Queensland
11 pages
Logical Question Paper
No ratings yet
Logical Question Paper
2 pages
Final ML
No ratings yet
Final ML
49 pages
CS3491 - Aiml - Unit V Neural Networks
No ratings yet
CS3491 - Aiml - Unit V Neural Networks
42 pages
Edexcel iGCSE ICT Software Mindmap
No ratings yet
Edexcel iGCSE ICT Software Mindmap
1 page
L03 Perceptron Slides
No ratings yet
L03 Perceptron Slides
66 pages
Orion the Hunter: Giant
From Everand
Orion the Hunter: Giant
Scott Davis
No ratings yet
Orbit: The Cast of Doctor Who #1
From Everand
Orbit: The Cast of Doctor Who #1
Paul J. Salamoff
5/5 (1)
Extreme Rhyming Poetry: Over 400 Inspirational Poems of Wit, Wisdom, and Humor (Five Books in One)
From Everand
Extreme Rhyming Poetry: Over 400 Inspirational Poems of Wit, Wisdom, and Humor (Five Books in One)
Darrell L. Price
No ratings yet
Sweet Pea Saves the Rainbows
From Everand
Sweet Pea Saves the Rainbows
Kim Berreckman
No ratings yet
Wakan Cekiye Odowan (The Dakota Hymnal)
From Everand
Wakan Cekiye Odowan (The Dakota Hymnal)
The Episcopal Diocese of South Dakota
No ratings yet
The Christmas Fish
From Everand
The Christmas Fish
Melba Harris
No ratings yet
Orbit: Mark Zuckerberg, Creator of Facebook
From Everand
Orbit: Mark Zuckerberg, Creator of Facebook
Jerome Maida
No ratings yet
Mighty Morphin Power Rangers Archive Vol. 1
From Everand
Mighty Morphin Power Rangers Archive Vol. 1
Kyle Higgins
No ratings yet
Juliet and the Two Talking Tennis Balls Who Made Her a World Champion!
From Everand
Juliet and the Two Talking Tennis Balls Who Made Her a World Champion!
Don DeNevi
No ratings yet
Blessed Days, Volume 6: Blessed Days, #6
From Everand
Blessed Days, Volume 6: Blessed Days, #6
Inky Moondrop
No ratings yet
Political Power: Ted Kennedy
From Everand
Political Power: Ted Kennedy
Brent Sprecher
No ratings yet
Three-Fold Cord: Creation Redemption Dominion
From Everand
Three-Fold Cord: Creation Redemption Dominion
Michael P Hays
No ratings yet
Blackbeard Legacy #2 Volume 2
From Everand
Blackbeard Legacy #2 Volume 2
Eric Arvin
No ratings yet
The Way of Courage
From Everand
The Way of Courage
Janet Hallagin
No ratings yet
Biblical Lessons from Grandpa: Preparing the Next Generations
From Everand
Biblical Lessons from Grandpa: Preparing the Next Generations
Michael F. Schmidt
No ratings yet
Tribute: Jerry Garcia
From Everand
Tribute: Jerry Garcia
Michael L. Frizell
No ratings yet
Eastern National Parks and Seashores: Test Your Knowledge
From Everand
Eastern National Parks and Seashores: Test Your Knowledge
Donald W. Linzey
No ratings yet
Odyssey Presents: Anthology #2
From Everand
Odyssey Presents: Anthology #2
Chad Rebmann
No ratings yet
Blackbeard Legacy #2 Volume 1
From Everand
Blackbeard Legacy #2 Volume 1
Darren G. Davis
No ratings yet
Logan's Run: Aftermath #1
From Everand
Logan's Run: Aftermath #1
William F. Nolan
No ratings yet
Legend of Isis Gallery #2
From Everand
Legend of Isis Gallery #2
Derek Ruiz
No ratings yet
Flying Saucers Vs. the Earth #4
From Everand
Flying Saucers Vs. the Earth #4
Ryan Burton
No ratings yet
Flying Saucers Vs. the Earth #1
From Everand
Flying Saucers Vs. the Earth #1
Ryan Burton
No ratings yet
The Adventures of Eli and Jake
From Everand
The Adventures of Eli and Jake
Linda Hoffman
No ratings yet
The Full Christmas Story
From Everand
The Full Christmas Story
Danny Haag
No ratings yet
Violet Rose #0
From Everand
Violet Rose #0
Emma Davis
No ratings yet
The Art of Southwest Landscaping
From Everand
The Art of Southwest Landscaping
Dawn Layna Fried
No ratings yet
Odyssey Presents: Gallery
From Everand
Odyssey Presents: Gallery
Chad Rebmann
No ratings yet
Vincent Price Presents: Gallery #4
From Everand
Vincent Price Presents: Gallery #4
Joel Robinson
No ratings yet
What Squirt Teaches Me about Jesus: Kids Learning about Jesus while Playing with Fido
From Everand
What Squirt Teaches Me about Jesus: Kids Learning about Jesus while Playing with Fido
Verneda S. Harris
No ratings yet
Vincent Price Presents: Phibes
From Everand
Vincent Price Presents: Phibes
Mel Smith
No ratings yet
Legend of Isis #10: Volume 2
From Everand
Legend of Isis #10: Volume 2
Aaron Stueve
No ratings yet
Your Guide To: Fearless Entrepreneurship
From Everand
Your Guide To: Fearless Entrepreneurship
Nina Nova
No ratings yet
Primordia
From Everand
Primordia
John Fultz
No ratings yet
Tales from William F. Nolan's Dark Universe
From Everand
Tales from William F. Nolan's Dark Universe
William F. Nolan
No ratings yet
10th use: Giant
From Everand
10th use: Giant
Darren G. Davis
No ratings yet
Vincent Price Presents #04
From Everand
Vincent Price Presents #04
Chad Helder
No ratings yet
15 Minutes: Kim Kardashian
From Everand
15 Minutes: Kim Kardashian
Marc Shapiro
No ratings yet
Odyssey Presents: Anthology #1
From Everand
Odyssey Presents: Anthology #1
Chad Rebmann
No ratings yet
Mindfulness Full: Relaxing word search puzzles for adults that will keep your mind calm and positive, 50 brain teasers with more than 600 words
From Everand
Mindfulness Full: Relaxing word search puzzles for adults that will keep your mind calm and positive, 50 brain teasers with more than 600 words
Asomoo Ebooks
No ratings yet
Birds: Our Fine Feathered Friends: Seen by Sue and Drew
From Everand
Birds: Our Fine Feathered Friends: Seen by Sue and Drew
Gene Crumbley
No ratings yet
Legend of Isis: Image Introduces
From Everand
Legend of Isis: Image Introduces
Darren G. Davis
No ratings yet
The Adventures of Lizzy and Chuck
From Everand
The Adventures of Lizzy and Chuck
Maria Stanley
No ratings yet
From the Heart
From Everand
From the Heart
J. Bauman
No ratings yet
Morals for Minions
From Everand
Morals for Minions
Dr. Debra Wilson
No ratings yet
Flying Saucers Vs. the Earth #2
From Everand
Flying Saucers Vs. the Earth #2
Ryan Burton
No ratings yet
Space Women Beyond the Stratosphere #3
From Everand
Space Women Beyond the Stratosphere #3
Scott Amundson
No ratings yet
Crossed Wires: Team-Up
From Everand
Crossed Wires: Team-Up
Chad Rebmann
No ratings yet
Monster’s Among Us: A War of Witches
From Everand
Monster’s Among Us: A War of Witches
Andrew Shayde
No ratings yet
Our Dream House
From Everand
Our Dream House
Janet Lombard Clements
No ratings yet
Female Force: Gabrielle Giffords
From Everand
Female Force: Gabrielle Giffords
CW Cooke
No ratings yet
Adventure Awaits: Around the World
From Everand
Adventure Awaits: Around the World
Samuel West
No ratings yet
Run to Win
From Everand
Run to Win
Eric D. Johnson
No ratings yet
Breastfeeding and Parenting: Your baby will teach you how
From Everand
Breastfeeding and Parenting: Your baby will teach you how
Sue Cox
No ratings yet
Blackbeard Legacy Gallery
From Everand
Blackbeard Legacy Gallery
Darren G. Davis
No ratings yet
Everything Is Predetermined
No ratings yet
Everything Is Predetermined
24 pages
Cognitive Behavioral Therapy
No ratings yet
Cognitive Behavioral Therapy
7 pages
Define Constructivist Theory
No ratings yet
Define Constructivist Theory
3 pages
Neural Control and Coordination
No ratings yet
Neural Control and Coordination
3 pages
Year 12 LP Dement and Kleitman (1957)
No ratings yet
Year 12 LP Dement and Kleitman (1957)
23 pages
7 Euthenics Part 8 EmotionsMentalHealth MariaYvette
No ratings yet
7 Euthenics Part 8 EmotionsMentalHealth MariaYvette
12 pages
Emotion: Nature and Management
No ratings yet
Emotion: Nature and Management
33 pages
Research Question:What Are The Different Historical Perspectives of The Case of John Wayne Gacy
No ratings yet
Research Question:What Are The Different Historical Perspectives of The Case of John Wayne Gacy
5 pages
MILIEU THERAPY
100% (5)
MILIEU THERAPY
35 pages
Cultivating Compassion ADHD Project: A Mentalization Informed Psychodynamic Psychotherapy Approach
No ratings yet
Cultivating Compassion ADHD Project: A Mentalization Informed Psychodynamic Psychotherapy Approach
12 pages
Case History Performa Modified
No ratings yet
Case History Performa Modified
8 pages
Effect of Gate Pass of Bongabon Essential School Students English Comprehension
No ratings yet
Effect of Gate Pass of Bongabon Essential School Students English Comprehension
9 pages
GR 3 Bridging Through Ten Activities
No ratings yet
GR 3 Bridging Through Ten Activities
12 pages
The Dynamic Self in Psychoanalysis Neuroscientific Foundations and Clinical Cases - 1st Edition Scribd Download
100% (16)
The Dynamic Self in Psychoanalysis Neuroscientific Foundations and Clinical Cases - 1st Edition Scribd Download
15 pages
Create, Label, and Illustrate The Brain
No ratings yet
Create, Label, and Illustrate The Brain
6 pages
Responsive Settling
No ratings yet
Responsive Settling
4 pages
Introduction To Psychology (Steven Pinker)
No ratings yet
Introduction To Psychology (Steven Pinker)
13 pages
Factors Affecting Classical Conditioning
No ratings yet
Factors Affecting Classical Conditioning
5 pages
3 Everyday Memory and Memory Errors
No ratings yet
3 Everyday Memory and Memory Errors
53 pages
Assessment of Core competencies-Age-Based
No ratings yet
Assessment of Core competencies-Age-Based
66 pages
Lesson 11 - Self-Efficacy
100% (2)
Lesson 11 - Self-Efficacy
44 pages
EI and Servant Leadership
No ratings yet
EI and Servant Leadership
21 pages
Kelleys Theory of Attribution
No ratings yet
Kelleys Theory of Attribution
14 pages
Prem Vocab Drills
No ratings yet
Prem Vocab Drills
194 pages
Freud Without Oedipus - The Cognitive Unconscious Alfred I. Tauber
No ratings yet
Freud Without Oedipus - The Cognitive Unconscious Alfred I. Tauber
12 pages
Resveratrol Therapy For Epilepsy
No ratings yet
Resveratrol Therapy For Epilepsy
37 pages

Learning Vision From Models Rivals Learning Vision From Data (Google, MIT 2023 ) SynCLR

Uploaded by

Learning Vision From Models Rivals Learning Vision From Data (Google, MIT 2023 ) SynCLR

Uploaded by

Learning Vision from Models Rivals Learning Vision from Data

e.g. CLIP 80.2%

as positive pairs. The resulting representations transfer well IN lin. acc

to many downstream tasks, competing favorably with other

significant margin, e.g., improving over MAE and iBOT by 76.

3.2. Synthesizing Images

For each text caption, we generate a variety of images by

to the baseline StableRep, adding a teacher EMA model 4.2. Scaling up

method pre-train data ViT-B ViT-L SynCLR CLIP

(a) (b) (c)

82 ViT-B/16 out of the 4 synthesized images in each iteration. Following

80 size of 32768. This model achieves 44.4% zero-shot accu-

extracted using our model SynCLR. As depicted in Figure 5,

B.4. Semantic segmentation on ADE20k

Table 15. ADE20k semantic segmentation settings.

B.5. Fine-grained linear classification

C. In-context Learning Examples

You might also like