
Figure 1.2: Larger models make increasingly efficient use of in-context information. We show in-context learning performance on a simple task requiring the model to remove random symbols from a word, both with and without a natural language task description (see Sec. 3.9.2). The steeper "in-context learning curves" for large models demonstrate improved ability to learn a task from contextual information. We see qualitatively similar behavior across a wide range of tasks.
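As a concrete illustration of the setup in this figure, the Python sketch below shows what the two prompt variants – with and without a natural language task description – might look like for the symbol-removal task. The example words and the "scrambled = clean" layout are assumptions for illustration, not the exact prompts of Sec. 3.9.2.

    # Hypothetical prompts for the symbol-removal task of Figure 1.2.
    # In both variants the model is expected to complete the final line.
    with_description = ("Please remove the symbols from the word.\n"
                        "s.u!c/c!e.s s i/o/n = succession\n"
                        "c;o*m+p?u-t@e!r =")   # desired completion: computer
    without_description = ("s.u!c/c!e.s s i/o/n = succession\n"
                           "c;o*m+p?u-t@e!r =")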

sufficient to enable a human to perform a new task to at least a reasonable degree of competence. Aside from pointing
to a conceptual limitation in our current NLP techniques, this adaptability has practical advantages – it allows humans
to seamlessly mix together or switch between many tasks and skills, for example performing addition during a lengthy
dialogue. To be broadly useful, we would someday like our NLP systems to have this same fluidity and generality.
One potential route towards addressing these issues is meta-learning¹ – which in the context of language models means the model develops a broad set of skills and pattern recognition abilities at training time, and then uses those abilities at inference time to rapidly adapt to or recognize the desired task (illustrated in Figure 1.1). Recent work [RWC+19] attempts to do this via what we call "in-context learning", using the text input of a pretrained language model as a form of task specification: the model is conditioned on a natural language instruction and/or a few demonstrations of the task and is then expected to complete further instances of the task simply by predicting what comes next.
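As a minimal sketch of this inference-time procedure, the Python snippet below conditions a small pretrained causal language model on an instruction and one demonstration, then lets it predict what comes next. The Hugging Face transformers library and GPT-2 are stand-ins chosen for public availability – assumptions for illustration, not the models or tooling used in this paper.

    # In-context learning as pure inference: no gradient updates, the task
    # is specified entirely through the text of the prompt.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("gpt2")
    model = AutoModelForCausalLM.from_pretrained("gpt2")

    prompt = ("Translate English to French.\n"   # natural language instruction
              "sea otter => loutre de mer\n"     # one demonstration of the task
              "cheese =>")                       # instance the model must complete

    inputs = tokenizer(prompt, return_tensors="pt")
    # Greedy decoding of a few tokens; any "learning" happens in the forward pass.
    outputs = model.generate(**inputs, max_new_tokens=5, do_sample=False)
    print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:]))

A small model like this is of course unreliable at the task itself; the point is only the shape of the interface: instruction and demonstrations in, completion out.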
While it has shown some initial promise, this approach still achieves results far inferior to fine-tuning – for example [RWC+19] achieves only 4% on Natural Questions, and even its 55 F1 CoQA result is now more than 35 points behind the state of the art. Meta-learning clearly requires substantial improvement in order to be viable as a practical method of solving language tasks.
Another recent trend in language modeling may offer a way forward. In recent years the capacity of transformer language models has increased substantially, from 100 million parameters [RNSS18], to 300 million parameters [DCLT18], to 1.5 billion parameters [RWC+19], to 8 billion parameters [SPP+19], 11 billion parameters [RSR+19], and finally 17 billion parameters [Tur20]. Each increase has brought improvements in text synthesis and/or downstream NLP tasks, and there is evidence suggesting that log loss, which correlates well with many downstream tasks, follows a smooth trend of improvement with scale [KMH+20]. Since in-context learning involves absorbing many skills and tasks within the parameters of the model, it is plausible that in-context learning abilities might show similarly strong gains with scale.
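For reference, the scaling trend cited here is well approximated by a power law in the non-embedding parameter count N; the form and constants below are indicative values reported in [KMH+20]:

    L(N) \approx \left( \frac{N_c}{N} \right)^{\alpha_N},
    \qquad \alpha_N \approx 0.076, \quad N_c \approx 8.8 \times 10^{13}

where L is the test cross-entropy loss in nats. Under this fit, each order-of-magnitude increase in N yields a roughly constant multiplicative reduction in loss.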
¹ In the context of language models this has sometimes been called "zero-shot transfer", but this term is potentially ambiguous: the method is "zero-shot" in the sense that no gradient updates are performed, but it often involves providing inference-time demonstrations to the model, so is not truly learning from zero examples. To avoid this confusion, we use the term "meta-learning" to capture the inner-loop / outer-loop structure of the general method, and the term "in-context learning" to refer to the inner loop of meta-learning. We further specialize the description to "zero-shot", "one-shot", or "few-shot" depending on how many demonstrations are provided at inference time. These terms are intended to remain agnostic on the question of whether the model learns new tasks from scratch at inference time or simply recognizes patterns seen during training – this is an important issue which we discuss later in the paper, but "meta-learning" is intended to encompass both possibilities, and simply describes the inner-outer loop structure.
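To make the zero-shot / one-shot / few-shot distinction concrete, hypothetical prompts for the same translation task under each setting might look as follows; the formatting is an assumption for illustration, and only the number of inference-time demonstrations differs:

    # Hypothetical prompts; no gradient updates occur in any of the settings.
    zero_shot = ("Translate English to French.\n"
                 "cheese =>")
    one_shot = ("Translate English to French.\n"
                "sea otter => loutre de mer\n"
                "cheese =>")
    few_shot = ("Translate English to French.\n"
                "sea otter => loutre de mer\n"
                "peppermint => menthe poivrée\n"
                "cheese =>")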
