
The Impact of Demonstrations on Multilingual In-Context Learning:
A Multidimensional Analysis

Miaoran Zhang[1], Vagrant Gautam[1], Mingyang Wang[2,3,4], Jesujoba O. Alabi[1],
Xiaoyu Shen[5], Dietrich Klakow[1], Marius Mosbach[6]
[1] Saarland University, Saarland Informatics Campus
[2] Bosch Center for AI  [3] LMU Munich  [4] Munich Center for Machine Learning (MCML)
[5] Eastern Institute of Technology, Ningbo  [6] Mila, McGill University
{mzhang,vgautam,jalabi,dietrich.klakow}@lsv.uni-saarland.de
[email protected] [email protected] [email protected]
arXiv:2402.12976v2 [cs.CL] 7 Jun 2024

Abstract

In-context learning is a popular inference strategy where large language models solve a task using only a few labeled demonstrations without needing any parameter updates. Although there have been extensive studies on English in-context learning, multilingual in-context learning remains under-explored, and we lack an in-depth understanding of the role of demonstrations in this context. To address this gap, we conduct a multidimensional analysis of multilingual in-context learning, experimenting with 5 models from different model families, 9 datasets covering classification and generation tasks, and 56 typologically diverse languages. Our results reveal that the effectiveness of demonstrations varies significantly across models, tasks, and languages. We also find that strong instruction-following models including Llama 2-Chat, GPT-3.5, and GPT-4 are largely insensitive to the quality of demonstrations. Instead, a carefully crafted template often eliminates the benefits of demonstrations for some tasks and languages altogether. These findings show that the importance of demonstrations might be overestimated. Our work highlights the need for granular evaluation across multiple axes towards a better understanding of in-context learning.[1]

1 Introduction

An intriguing property of large language models (LLMs) is their ability to perform in-context learning (Brown et al., 2020), i.e., solve a task conditioned on a few demonstrations at inference time, without updating the model parameters. It has been shown to be an efficient alternative to fine-tuning when adapting models to diverse tasks and domains (Dong et al., 2022; Min et al., 2022b; Si et al., 2023, inter alia). In light of the success of in-context learning, there has been increased interest in better understanding the factors that influence its success, such as demonstration selection (Liu et al., 2022; Rubin et al., 2022; Wang et al., 2023c), prompt design (Min et al., 2022a; Wei et al., 2022), and more generally in understanding how and why in-context learning works (Xie et al., 2022; Bansal et al., 2023; Hendel et al., 2023; Pan et al., 2023; Wang et al., 2023b).

However, most recent work on in-context learning predominantly focuses on English, and the exploration of multilingual in-context learning generally lags behind. This is problematic, as results that apply to English might not hold for other languages, especially those that are less represented in LLM training data. While there have been a few studies on in-context learning that go beyond English, they either focus on benchmarking LLMs on multilingual tasks without in-depth exploration, e.g., MEGA (Ahuja et al., 2023) and BUFFET (Asai et al., 2023), or zoom in on specific capabilities such as mathematical reasoning (Shi et al., 2023b), machine translation (Zhu et al., 2023; Agrawal et al., 2023), or code-switching (Zhang et al., 2023).

In this work, we take a multidimensional approach (Ruder et al., 2022) that unifies these strands of research and comprehensively evaluate the multilingual in-context learning abilities of LLMs. We focus on dissecting the actual impact of in-context demonstrations, which is crucial for understanding model behaviour. Our research covers various models, tasks, and languages, and we seek to answer the following research questions:

1. Does multilingual performance benefit from demonstrations? (§4)
2. Does demonstration quality matter? (§5)
3. What is the interplay between demonstrations and templates? (§6)
4. How do the answers to these questions vary across languages and models? (§4, §5, §6)

* Corresponding author.
[1] We release our code publicly at https://github.com/uds-lsv/multilingual-icl-analysis.
[Figure 1: example prompts. The classification panel shows a Chinese sentiment-analysis prompt with a task instruction, a labeled demonstration, and a test input (in-context learning). The generation panels show English and German extractive QA prompts, contrasting in-context learning with zero-shot learning.]

Figure 1: An overview of the components of multilingual in-context learning (§2) with a comparison to zero-shot learning. Sources of variation include tasks, languages, models, and the template, i.e., the task instruction, patterns for formatting inputs, and verbalized labels.

Specifically, we address our research questions by evaluating 5 LLMs, including base models that are only pre-trained on unlabeled text corpora (XGLM and Llama 2), and chat models that are further refined with instruction tuning and reinforcement learning (Llama 2-Chat, GPT-3.5, and GPT-4). We evaluate on 9 multilingual datasets that include both classification and generation tasks, covering 56 typologically different languages.

Our main findings are: (1) The effectiveness of demonstrations varies widely depending on the model, task, and language used. For base models, in-context learning barely outperforms zero-shot learning on many tasks. In general, in-context learning matters more for generation tasks with loosely-specified prompts; (2) Even with sophisticated demonstration selection methods, in-context learning is not always beneficial and can sometimes be worse than using no demonstrations at all; (3) Chat models are less sensitive to seeing correctly-labeled demonstrations than base models, suggesting that for the former, demonstrations primarily help the model understand the task format, while for the latter, demonstrations also impart task-specific knowledge; (4) Using a formatting-focused template can even eliminate the need for demonstrations with chat models. The relative significance of demonstrations versus prompt templates varies based on inherent model capabilities.

In sum, we suggest that the benefits of adding demonstrations may be overestimated. Future work on in-context learning should carefully compare their results with zero-shot learning and on multiple templates to faithfully represent its effectiveness. Given the vast variance across models, tasks, and languages, it is also important to cautiously frame claims about in-context learning.

2 Preliminaries

2.1 In-context learning

In-context learning (ICL) is a popular inference strategy where models solve[2] a task without any parameter updates (Brown et al., 2020). Instead, the model performs the task by conditioning on labeled demonstrations. Demonstrations are typically formatted using "pattern-verbalizer pairs," as this has been shown to be effective in eliciting good task performance (Schick and Schütze, 2021; Bach et al., 2022). Here, a pattern is used to format the input for the model, and a verbalizer maps the label to a textual representation. Additionally, for instruction-tuned LLMs, a task instruction is often added to provide information about the task beyond individual demonstrations (Mishra et al., 2022b; Wang et al., 2022; Ouyang et al., 2022).

Formally, given a test sample x_t, k demonstrations {(x_i, y_i)}_{i=1}^k, a pattern P, a verbalizer V, and a task instruction I, the model (parameterized by θ) makes its prediction as follows:

    y_t ∼ p_θ(y | I, {(P(x_i), V(y_i))}_{i=1}^k, P(x_t)).    (1)

[2] The extent to which models actually "solve" tasks is an open question as ICL, similar to fine-tuning, has generalization issues despite its impressive results (Mosbach et al., 2023). Regardless, we use the word "solve" in the rest of this paper for simplicity.
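To make Equation (1) concrete, here is a minimal Python sketch of how an in-context prompt can be assembled from a task instruction I, pattern-formatted demonstrations with verbalized labels, and the formatted test input. The instruction, pattern, and verbalizer below are illustrative placeholders, not the paper's actual templates (those are listed in its Appendix A.2).

```python
# Minimal sketch of Equation (1): assembling the prompt the model conditions on.
# The instruction, pattern, and verbalizer below are illustrative placeholders.

def pattern(x: str) -> str:
    """P(x): format a raw input with the prompt pattern."""
    return f"{x} What is the sentiment of this statement?"

def verbalize(y: str) -> str:
    """V(y): map a label to its textual form."""
    return {"pos": "positive", "neg": "negative", "neu": "neutral"}[y]

def build_prompt(instruction: str, demonstrations, test_input: str) -> str:
    """Concatenate I, the k formatted demonstrations (P(x_i), V(y_i)), and P(x_t)."""
    parts = [instruction]
    for x_i, y_i in demonstrations:
        parts.append(f"{pattern(x_i)}\nAnswer: {verbalize(y_i)}")
    parts.append(f"{pattern(test_input)}\nAnswer:")
    return "\n\n".join(parts)

demos = [("I loved this film.", "pos"), ("The service was awful.", "neg")]
prompt = build_prompt(
    "You are an NLP assistant for sentiment analysis. "
    "Give your answer as positive, negative or neutral.",
    demos,
    "My favourite team won the game.",
)
# The model then samples y_t ~ p_theta( . | prompt), as in Equation (1).
```

In the zero-shot case, the same construction is used with an empty demonstration list, which is the comparison point used throughout the paper.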
Taken together, the pattern, the verbalizer and the optional task instruction comprise the template with which demonstrations and the test sample are formatted as the input prompt for model inference. The effectiveness of demonstrations is thus linked with the template used to present them to the model.

2.2 Multilingual prompting

Previous studies highlight that the selection of demonstrations and prompt templates can significantly influence model performance (Liu et al., 2022; Fu et al., 2023b; Sclar et al., 2024). In multilingual in-context learning, the variation in input prompts is further complicated by the language of demonstrations, templates and test samples, all of which are important design choices.

For the template language, Lin et al. (2022) and Ahuja et al. (2023) found that English templates generally perform better than native-language templates, possibly due to the superior instruction-following abilities of existing LLMs on English compared to other languages. Following this, we use English templates in our study.

For the language of few-shot demonstrations and test samples, there are three popular settings. Given a test sample in a certain language, the most straightforward approach is to use demonstrations in the same language (referred to as in-language demonstrations). This setting directly measures the model's inherent ability to solve problems in that language. Another choice is to use English demonstrations regardless of the language of the test sample. This is a cross-lingual transfer setup, where the goal is to transfer knowledge from a pivot language to a target language via in-context learning. As highlighted in Shi et al. (2023b) and Ahuja et al. (2023), in-language demonstrations often outperform English demonstrations on diverse multilingual tasks. Yet another option is to translate the test sample into English – an approach called translate-test, where the demonstrations are also in English. While translate-test leads to strong performance (Ahuja et al., 2023), this approach heavily relies on a translation system for data processing and centers the English proficiency of LLMs. In this work, we are interested in dissecting the intrinsic multilingual capabilities of LLMs, therefore we choose to use in-language demonstrations.

All these design choices are represented visually in Figure 1, which gives an overview of multilingual in-context learning. Detailed setup information is provided in the next section.

3 Experimental setup

Models. We evaluate two types of LLMs: pre-trained base models and chat models. Our base models include XGLM (Lin et al., 2022) and Llama 2 (Touvron et al., 2023). Our chat models are Llama 2-Chat, GPT-3.5 (Ouyang et al., 2022) and GPT-4 (OpenAI et al., 2023). Specifically, we use xglm-7.5B, Llama-2-13b, and Llama-2-13b-chat on Huggingface (Wolf et al., 2020), and we access the gpt-3.5-turbo-16k and gpt-4-32k APIs via Microsoft Azure.[3]

Tasks and datasets. We experiment on a diverse range of multilingual classification and generation tasks, using 9 datasets covering 56 languages in total. Our dataset selection largely follows MEGA (Ahuja et al., 2023), but we add datasets for extremely under-represented African languages. Our classification tasks include natural language inference (NLI), paraphrase identification, commonsense reasoning and sentiment analysis, with the following datasets: XNLI (Conneau et al., 2018), IndicXNLI (Aggarwal et al., 2022), PAWS-X (Yang et al., 2019), XCOPA (Ponti et al., 2020), XStoryCloze (Lin et al., 2022) and AfriSenti (Muhammad et al., 2023). Our generation tasks are extractive question answering (QA) and machine translation (MT), for which we use XQuAD (Artetxe et al., 2020), TyDiQA-GoldP (Clark et al., 2020), and MAFAND (Adelani et al., 2022). See Appendix A.1 for more details.

In-context learning. For each test sample, we select k ∈ {0, 2, 4, 8}[4] different demonstrations, which are randomly sampled unless otherwise specified. All demonstrations are in the same language as the test sample, and all templates are in English. We employ appropriate task-specific templates for different model types. All templates and data splits are shown in Appendix A.2.

Metrics. For classification tasks, we report the rank classification accuracy[5] for open-source base models (Muennighoff et al., 2023; Lin et al., 2022). For chat models, we measure the exact match between generated outputs[6] and verbalized labels (Ahuja et al., 2023). As for generation tasks, we use the F1 score for QA datasets and the ChrF++ score (Popović, 2017) for MAFAND. Implementation details for our evaluation are provided in Appendix A.3.

[3] We also experiment with BLOOMZ and mT0 (Muennighoff et al., 2023). Results in Appendix B.1 show that their zero-shot performance significantly surpasses few-shot performance, which we ascribe to their training scheme.
[4] For QA datasets, we select a maximum of 4 demonstrations due to context size limitations.
[5] The scoring function is the average of per-token log probabilities (ignoring the common prefix of different candidates). The candidate with the highest score is chosen as the prediction.
[6] We extract verbalized labels from the generated outputs using regular expressions before calculating the exact match.
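As a concrete illustration of these classification metrics, the sketch below implements (i) rank classification scoring as described in footnote [5], using the average per-token log probability of each candidate continuation under a causal LM, and (ii) the regular-expression label extraction used for the chat-model exact match (footnote [6]). The model name, label set, and regex are assumptions for illustration; the paper's released code may implement these details differently (e.g., tokenization at the prompt/candidate boundary).

```python
# Sketch of the two classification metrics, assuming a Hugging Face causal LM.
import re
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "facebook/xglm-7.5B"  # placeholder: any open-source base model
tok = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
model.eval()

def avg_logprob(prompt: str, candidate: str) -> float:
    """Average per-token log probability of `candidate` given `prompt`.
    The shared prompt prefix itself is not scored (footnote [5])."""
    prompt_len = tok(prompt, return_tensors="pt").input_ids.shape[1]
    full_ids = tok(prompt + candidate, return_tensors="pt").input_ids
    with torch.no_grad():
        log_probs = torch.log_softmax(model(full_ids).logits, dim=-1)
    cand_ids = full_ids[0, prompt_len:]
    # Logits at position t predict the token at position t + 1, hence the shift.
    token_lp = log_probs[0, prompt_len - 1:-1].gather(1, cand_ids.unsqueeze(1))
    return token_lp.mean().item()

def rank_classify(prompt: str, candidates: list) -> str:
    """Pick the candidate with the highest average log probability."""
    return max(candidates, key=lambda c: avg_logprob(prompt, c))

LABELS = ["positive", "negative", "neutral"]  # example verbalized label set

def exact_match_chat(generated: str, gold: str) -> bool:
    """Extract a verbalized label from free-form chat output with a regex
    (footnote [6]) and compare it to the gold label."""
    match = re.search(r"\b(" + "|".join(LABELS) + r")\b", generated.lower())
    return match is not None and match.group(1) == gold.lower()
```

For generation tasks, the standard SQuAD-style token-overlap F1 and the ChrF++ implementation of Popović (2017) are used instead of the functions above.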
[Figure 2: line plots of average performance vs. number of demonstrations (k = 0, 2, 4, 8) for XNLI, IndicXNLI, PAWS-X, XCOPA, XStoryCloze, AfriSenti (accuracy), XQuAD, TyDiQA (F1), and MAFAND en-xx / xx-en (ChrF++), with curves for XGLM, Llama 2, Llama 2-Chat, GPT-3.5, and GPT-4.]

Figure 2: Average performance across languages with different numbers of demonstrations. We average and report standard deviations over 3 seeds for all models except GPT-4. Note that the standard deviations are relatively small, possibly because of averaging over languages. en-xx: translating from English to another language, xx-en: translating from another language to English.

4 Do (more) demonstrations benefit multilingual performance?

In this section, we systematically compare ICL and zero-shot learning, as this question is under-explored in previous studies of multilingual ICL (Ahuja et al., 2023; Asai et al., 2023). We examine model performance on diverse multilingual tasks while varying the number of demonstrations, and show the results for classification tasks and generation tasks in Figure 2.

We begin with the overall trends across models and datasets. OpenAI's GPT-3.5 and GPT-4 models achieve the best multilingual in-context learning performance on all our datasets, which is unsurprising as they are currently the state-of-the-art on a large suite of NLP benchmarks. The next best models are Llama 2 and Llama 2-Chat, which demonstrate comparable or superior performance to the multilingual XGLM model despite being trained primarily on English corpora (Touvron et al., 2023). This indicates that their task-solving abilities can transfer across languages. Regardless of the model, however, performance on the AfriSenti and MAFAND datasets, particularly when translating English to African languages, lags significantly behind other tasks, showing that language discrepancies remain even in the best models.

[Figure 3: strip plot of per-language performance differences (%) between 4-shot and 0-shot for each task; legend marker: mean.]

Figure 3: Performance difference between 4-shot and 0-shot. Each marker represents the average performance across models for each language in a given task. MT denotes the MAFAND dataset.

An important pattern across datasets and models is that in-context learning does not always improve over zero-shot learning – in particular, it helps with generation tasks, but results on classification tasks are mixed. For the AfriSenti dataset, many models show noticeable improvements with ICL. However, with other tasks such as IndicXNLI, XNLI and PAWS-X, the same models, especially base models, perform much worse compared to the zero-shot setting. We also see marginal improvements in some cases, e.g., XGLM and Llama 2 on XCOPA. In comparison to chat models, the addition of demonstrations typically reduces the performance of base models across many tasks.
Model XNLI IndicXNLI PAWS-X XCOPA XStoryCloze AfriSenti XQuAD TyDiQA MT (en-xx) MT (xx-en)
XGLM 4.59 2.49 0.24▽ 0.03 0.97▽ 5.62 1.77 4.21 1.31 0.66
Llama 2 6.61 4.17 2.35 −0.11 0.33 4.17 1.32 0.54 2.15 1.35
Llama 2-Chat −0.28 −1.36 −1.71▽ 0.32 0.43 2.17 1.02 2.42 0.74 0.66
GPT-3.5 0.18 0.71 −2.07 0.86 −0.61 −0.66▽ −0.34 2.98 0.72 0.43
GPT-4 0.76 −0.19 0.07 −0.36 0.05 −0.68 −0.77 1.88 1.21 0.65

Table 1: Performance difference of 4-shot ICL with TOP-K vs. RANDOM selection. Positive numbers show that TOP-K is better than RANDOM (expected), and highlighted cells show where TOP-K is even worse than RANDOM selection. ▽: TOP-K performance is even worse than zero-shot learning. For RANDOM, we average over 3 seeds (except for GPT-4).

Model XNLI IndicXNLI PAWS-X XCOPA XStoryCloze AfriSenti XQuAD TyDiQA MT (en-xx) MT (xx-en)
XGLM 0.46 −0.05 0.44 0.51 0.62∗ 3.78∗ 24.56∗ 26.64∗ 3.18∗ 6.73∗
Llama 2 0.96∗ 0.43 1.16 0.61∗ 1.12∗ 2.27∗ 26.68∗ 29.20∗ 4.79∗ 8.34∗
Llama 2-Chat −0.34 0.04 1.48 0.03 −0.23 0.77∗ 5.94∗ 4.37∗ 1.13∗ 1.53∗
GPT-3.5 0.39 1.02 0.64 0.26 0.58∗ −0.62 5.46∗ 5.61∗ 1.39∗ 0.48∗
GPT-4 −0.86 −0.04 0.57 0.86 1.13 0.90 9.60 6.97 1.24 0.64

Table 2: Performance difference of 4-shot ICL with RANDOM vs. RANDOM-CORRUPTED demonstrations. Positive numbers show that RANDOM is better than RANDOM-CORRUPTED (expected), and highlighted cells show where corrupted labels perform even better than ground-truth labels. We average over 3 seeds (except for GPT-4). ∗: a significant difference (p = 0.05).

[Figure 4: per-language bars of the 4-shot minus 0-shot difference on PAWS-X for English, German, Japanese, French, Spanish, Chinese, and Korean; legend: XGLM, Llama 2, Llama 2-Chat, GPT-3.5, GPT-4.]

Figure 4: Performance difference between 4-shot and 0-shot for individual languages in PAWS-X. Error bars represent standard deviations calculated over 3 seeds.

When examining the cases where ICL improves performance, we see that improvements saturate quickly with 2 to 4 demonstrations. This aligns with Chen et al. (2023), who found that reducing the number of demonstrations to one does not significantly deteriorate chain-of-thought reasoning. Looking at the improvements over zero-shot performance (for all models and languages combined) across tasks in Figure 3, we observe that there are large fluctuations between individual languages that are not captured by the average. The PAWS-X dataset in particular shows an average degradation, but in fact some languages benefit from ICL while others degrade. For a more nuanced understanding of language-specific differences within a task, we zoom into this dataset in Figure 4 to inspect these language-specific differences.[7] We see that languages and models can behave very differently even on just one dataset, and a pattern which holds for one language with one model does not necessarily apply to a different language. For example, the ICL performance of Llama 2 outperforms its zero-shot performance by 2.3 points on Japanese and 1.3 points on Korean. However, demonstrations degrade performance for other languages, e.g., English performance degrades by 10.3 points. In sum, the effectiveness of demonstrations varies widely depending on the model, task, and language.

5 Does demonstration quality matter?

Our previous experiments evaluated ICL using randomly selected demonstrations. To ablate for the effects of demonstration quality, this section experiments with the choice of demonstrations as well as the importance of ground truth labels, i.e., the input-label mapping. Inspired by work on demonstration selection (Liu et al., 2022; Rubin et al., 2022) and input-label mapping (Min et al., 2022c; Yoo et al., 2022) in English, we compare the following three types of demonstrations:

• RANDOM: demonstrations are randomly selected from clean data
• TOP-K: the k most semantically similar[8] examples to a given test sample are selected (Liu et al., 2022)
• RANDOM-CORRUPTED: demonstrations are randomly selected but the labels are corrupted by replacement with random labels[9] (Min et al., 2022c)

[Figure 5: per-language bar charts comparing Top-k, Random, and Random-corrupted demonstrations; columns show AfriSenti (accuracy) and XQuAD (F1).]

Figure 5: Performance of 4-shot ICL using different types of demonstrations for individual languages on AfriSenti and XQuAD. The top row shows Llama 2 results, and the bottom row shows GPT-3.5 results.

Table 1 shows that top-k selection performs better than random selection in many cases, especially for the base models XGLM and Llama 2. For chat models, the largest improvements are on generation tasks. For example, GPT-3.5 achieves a 2.98-point improvement on TyDiQA. Nevertheless, top-k selection often degrades performance on many other tasks, e.g., GPT-3.5 is 2.07 points worse on PAWS-X compared to random selection. When compared to zero-shot performance, ICL with top-k selection is even worse in some cases, such as XGLM on PAWS-X and XStoryCloze. In cases where random selection performs worse than zero-shot, even top-k selection gives only marginal improvements (see detailed numbers in Table 5 in Appendix C.1). These findings indicate that sophisticated demonstration selection methods are not always beneficial and can sometimes be worse than using no demonstrations at all.

Exploring this further, in Table 2, we compare randomly selected demonstrations with ground truth labels and corrupted labels. We find that using corrupted labels does not hurt performance on multilingual classification tasks much, which is consistent with previous research on English (Min et al., 2022c). On generation tasks, however, all models perform worse with corrupted labels, but to vastly different extents. XGLM and Llama 2 perform significantly worse with corrupted labels, especially on the machine translation task, whereas chat models do not rely as much on correct labels. This might be explained by ICL helping the model understand the task format and activating prior knowledge acquired by the model, rather than the model learning the task from demonstrations. The observed model insensitivity to correct labels on certain tasks implies that random labels can serve as a strong baseline for demonstration generation before exploring more complex methods (Lyu et al., 2023; Wan et al., 2023).

To investigate how these patterns split up across languages, Figure 5 shows language-specific results on AfriSenti and XQuAD with Llama 2 and GPT-3.5.[10] On AfriSenti, top-k selection outperforms random selection with Llama 2 across most languages; however, in the case of Swahili and Tsonga, there is a performance drop of 3.2 and 1.2 points, respectively. With GPT-3.5, top-k selection does not help across most languages, but it does help with Mozambican Portuguese and Twi. Similarly, the impact of corrupted labels varies. Llama 2 is affected dramatically by corrupted labels on all languages in XQuAD, whereas GPT-3.5 is much less affected, although to varying degrees across different languages. We urge NLP practitioners to attend to these discrepancies when creating language-specific applications, and leave it to future work to explore where they come from.

[7] Plots for other datasets are provided in Appendix B.2.
[8] We quantify semantic similarity using LaBSE (Feng et al., 2022), a multilingual sentence embedding model trained on 109+ languages.
[9] For classification tasks, we randomly choose a label from the fixed label set. For generation tasks, we randomly choose a label from the label space of the entire demonstration data.
[10] See Appendix C.2 for other models and datasets.
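To illustrate the TOP-K and RANDOM-CORRUPTED conditions, here is a small sketch assuming LaBSE embeddings obtained via the sentence-transformers library (footnote [8]). The data structures, function names, and default k are illustrative rather than the paper's exact implementation.

```python
# Sketch of TOP-K demonstration selection and label corruption.
# Assumes sentence-transformers provides LaBSE; the pool is a list of
# (input_text, label) pairs in the same language as the test sample.
import random
import numpy as np
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("sentence-transformers/LaBSE")

def top_k_demonstrations(test_input, pool, k=4):
    """Return the k pool examples most semantically similar to the test input."""
    texts = [x for x, _ in pool]
    emb = encoder.encode([test_input] + texts, normalize_embeddings=True)
    sims = emb[1:] @ emb[0]          # cosine similarity to the test input
    best = np.argsort(-sims)[:k]     # indices of the k most similar examples
    return [pool[i] for i in best]

def corrupt_labels(demos, label_space, seed=0):
    """RANDOM-CORRUPTED: keep the inputs but replace every gold label with a
    label drawn at random from the given label space (footnote [9])."""
    rng = random.Random(seed)
    return [(x, rng.choice(label_space)) for x, _ in demos]
```

The RANDOM condition simply samples k pool examples uniformly, which is the default used in Section 4.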
6 Better templates further reduce the benefits of demonstrations

In-context learning performance depends not only on the demonstrations, which we have varied so far, but also on how they are formatted using templates. Previous work (Gonen et al., 2023; Mizrahi et al., 2024) has shown that modifying the template changes task performance. This section thus seeks to examine the interplay between template choice and demonstrations.

Template design. In the zero-shot setting, we observe that chat models tend to generate verbose responses (e.g., "Sure! I can help you with that") or explanations (e.g., "The reason is that ...") that pose a challenge for automatic evaluation. We observe a reduction in this behaviour with ICL, which leads us to question whether demonstrations are merely a means to format model responses. To see if we can achieve the same effect with minor template engineering, we augment the original templates with instructions that focus on output formatting. We call these formatting-focused templates; they are shown in Table 9.

In this section, we focus on XCOPA, AfriSenti, XQuAD, and TyDiQA, as these are the classification and generation tasks that seem to benefit most from in-context demonstrations (see Section 4).

[Figure 6: bar charts of 0-shot and 4-shot performance (accuracy for XCOPA and AfriSenti, F1 for XQuAD and TyDiQA) for Llama 2-Chat, GPT-3.5, and GPT-4 with original and formatting-focused templates.]

Figure 6: Effect of using different templates on 0-shot and 4-shot performance. Formatting-focused templates (with hatching) improve 0-shot performance over original templates (solid colours), and reduce the gap between 0-shot and 4-shot performance. Few-shot results are averaged across 3 seeds except for GPT-4.

However, as Figure 6 shows, the performance gap between zero-shot and in-context learning diminishes with formatting-focused templates. The gap reduction is more substantial for the QA datasets (i.e., the generation tasks) than for XCOPA and AfriSenti (i.e., the classification tasks). We speculate that it is simpler for the model to generate label words for classification tasks with a pre-defined label space than to answer questions in a way that is easy to evaluate automatically. In the latter case, formatting-focused templates can teach output styling, largely eliminating the benefits of demonstrations.

Compared to GPT-3.5 and GPT-4, Llama 2-Chat performs worse in both zero-shot and few-shot settings, and formatting-focused templates have a less pronounced impact. On QA datasets, GPT-3.5 and GPT-4 even achieve better zero-shot performance with formatting-focused templates than ICL with original templates, a pattern that is not observed with Llama 2-Chat. This suggests that the relative significance of demonstrations and templates varies based on the inherent abilities of models at solving tasks and following instructions.

Model          Demo. label   XQuAD (O)   XQuAD (F)   TyDiQA (O)   TyDiQA (F)
Llama 2-Chat   Original      38.9±0.1    43.8±0.7    40.0±0.3     40.6±0.8
Llama 2-Chat   Corrupted     33.0±0.4    38.6±0.3    35.6±0.1     36.3±0.5
Llama 2-Chat   ∆              5.9         5.2         4.4          4.3
GPT-3.5        Original      68.2±0.4    72.2±0.4    64.8±0.5     70.5±0.5
GPT-3.5        Corrupted     62.7±0.2    69.9±0.2    59.2±0.3     67.1±0.7
GPT-3.5        ∆              5.5         2.3         5.6          3.4
GPT-4          Original      73.2        79.3        72.8         78.3
GPT-4          Corrupted     63.6        79.8        65.8         77.6
GPT-4          ∆              9.6        −0.5         7.0          0.7

Table 3: Effect of using different templates on 4-shot performance with RANDOM and RANDOM-CORRUPTED demonstrations. O: original templates, F: formatting-focused templates. When using formatting-focused templates (F) over the original templates (O), the performance gap (∆) between original and corrupted labels decreases. We average and report standard deviations over 3 seeds for all models except GPT-4.

With our new formatting-focused templates, we revisit the impact of the input-label mapping discussed in Section 5. As Table 3 shows, all models perform worse with corrupted labels, but formatting-focused templates largely mitigate this degradation. Notably, GPT-4 using corrupted labels performs on par with ground-truth labels. This strengthens our finding that the correct input-label mapping is not that important, while also highlighting the crucial role that templates play in in-context learning.
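The paper's formatting-focused templates are given in its Table 9, which is not reproduced in this excerpt. As a purely hypothetical illustration of the idea, a QA instruction might be augmented with an explicit output-formatting constraint as follows:

```python
# Hypothetical example only: the actual formatting-focused templates are in
# the paper's Table 9. The point is that the original instruction is kept and
# an output-formatting constraint is appended.
ORIGINAL_INSTRUCTION = (
    "You are an NLP assistant for question answering in English. "
    "The answer should be directly extracted from the passage."
)

FORMATTING_INSTRUCTION = (
    ORIGINAL_INSTRUCTION
    + " Respond with the extracted answer span only, without any explanation."
)
```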
[Figure 7: per-language 0-shot vs. 4-shot F1 on XQuAD for Llama 2-Chat, GPT-3.5, and GPT-4, covering English, German, Russian, Spanish, Chinese, Turkish, Arabic, Greek, Thai, Hindi, Vietnamese, and Romanian.]

Figure 7: Effect of using different templates on 0-shot and 4-shot XQuAD performance. Formatting-focused templates (with hatching) improve 0-shot performance over original templates (solid colours), and reduce the gap between 0-shot and 4-shot performance. Few-shot results are averaged across 3 seeds except for GPT-4.

Figure 7 shows the language-specific effects of formatting-focused templates on XQuAD (results for other tasks are in Appendix D.1). For Llama 2-Chat, demonstrations remain essential even with a formatting-focused template for most languages, but not Greek and Hindi. GPT-3.5 and GPT-4 also show variance across languages. Moreover, for most languages, zero-shot learning with minor template engineering can match and even exceed in-context learning performance, aligning with previous work on GPT-3 (Reynolds and McDonell, 2021). The fact that we can achieve the same effects through template engineering or demonstrations reinforces our hypothesis that models are not actually learning tasks on the fly. Instead, some combination of demonstrations and templates serves to activate prior knowledge of a task and encourage a consistent output format for automatic evaluation.

7 Discussion

Our systematic study provides strong evidence that the importance of in-context demonstrations on existing multilingual datasets might be overestimated, as it highly depends on the model, task, and language used. For strong instruction-following models, the effect of demonstrations is superficial and can be eliminated with minor template engineering. These findings open up new questions, which we discuss below.

Understanding the failures of ICL. There has been a surge of research interest in understanding the underlying mechanisms of ICL (Xie et al., 2022; Von Oswald et al., 2023; Wang et al., 2023b; Hendel et al., 2023), motivated by its successes. Our results show that ICL is not always effective, and that its performance changes depending on multiple factors including the choice of model, task and language. The failures of ICL need as much scrutiny as its successes for a more fundamental understanding of the learning mechanisms of LLMs.

Optimizing demonstrations or templates. With the increasing popularity of research on demonstration selection (Liu et al., 2022; Rubin et al., 2022; Li et al., 2023b) and prompt engineering (Mishra et al., 2022a; White et al., 2023; Khattab et al., 2023), it is important to understand the interplay of the two. We show that good demonstrations help base models perform better on certain tasks, but that formatting-focused prompting has a much bigger impact on chat models. These results show that the impact of demonstrations cannot be fairly evaluated in isolation from the choice of prompt. These findings have implications both for researchers interested in fairly evaluating ICL, and for practitioners deciding whether to spend time optimizing demonstrations, templates, or both.

Evaluating multilingual ICL. Compared to the extensive research on ICL in English (Zhao et al., 2021; Dong et al., 2022; Min et al., 2022b; Mosbach et al., 2023), multilingual ICL remains under-explored. There is no widely accepted setup to robustly evaluate the effectiveness of ICL across languages, since the choice of multilingual models and tasks is limited. Based on our findings, we have some recommendations for the nascent field of multilingual ICL. First, critical evaluation is important. We need to compare ICL strategies to zero-shot learning, and ablate them with multiple templates. Second, as there is so much variance across models, tasks and languages, it is important to carefully scope claims about ICL. Last but not least, every language is different, so granular per-language analysis is a must in multilingual research.

8 Related Work

Multilingual in-context learning. Most multilingual in-context learning studies focus on benchmarking LLMs on diverse tasks and comparing them with smaller fine-tuned models (Ahuja et al., 2023; Asai et al., 2023; Zhang et al., 2023; Zhu et al., 2023). As these works focus on benchmarking, their analysis of the role of demonstrations is limited. Ahuja et al. (2023) explore different prompting strategies by adjusting the language of templates and demonstrations. Zhang et al. (2023) find that demonstrations sometimes do not contribute to or even degrade model performance on code-switching. Zhu et al. (2023) look at machine translation and analyze the effects of template and demonstration selection with XGLM. In the context of cross-lingual transfer, Shi et al. (2022), Tanwar et al. (2023), and Agrawal et al. (2023) investigate demonstration selection for specific applications. In contrast, we take a much broader perspective and investigate the actual impact of demonstrations across a wide range of models, tasks and languages.

English-centric demonstration analysis. Most of the current demonstration analysis literature focuses on English: Lu et al. (2022) analyze the sensitivity of ICL to the order of demonstrations, Min et al. (2022c) and Yoo et al. (2022) explore whether the ground truth labels matter for classification tasks, and Wei et al. (2023) investigate the sensitivity of various model families to different input-label mappings. Similarly, Pan et al. (2023) disentangle task recognition and task learning by manipulating the label space. Beyond this, Shi et al. (2023a) and Wang et al. (2023a) modify the validity of chain-of-thought (CoT) reasoning steps in demonstrations and explore the impact of this modification on mathematical reasoning. Also focusing on CoT, Chen et al. (2023) investigate how varying the number of demonstrations affects performance.

9 Conclusion

In this paper, we conduct an in-depth multidimensional analysis of the impact of demonstrations in multilingual in-context learning. We find that the use of demonstrations does not always provide benefits compared to zero-shot learning, and that there is a large variance in performance across models, datasets and languages. While the quality of demonstrations influences the performance of base LLMs on certain tasks, the impact is significantly reduced for LLMs tuned with alignment techniques. We also examine the interplay between demonstrations and templates, finding that a carefully crafted template can further decrease the benefits of demonstrations. Our granular analysis contributes novel insights with nuance and paves the way for a more thoughtful multilingual ICL evaluation.

Limitations

Data contamination. Since LLMs are trained with a vast amount of data scraped from the internet, this might result in data contamination, i.e., when the training data includes test datasets. Ahuja et al. (2023) suspect that many multilingual datasets appear in the training data of GPT-4, which might lead to an overestimation of the model's capabilities. In the context of our work, our prompt might just be reminding LLMs of a task they have already seen, whereas on an unseen task, the impact of demonstrations might be different. We do not examine the impact of potential data contamination in our paper and leave an exploration of this to future work.

Other demonstration choices. In this work, we choose to use demonstrations that are in the same language as the test sample, due to our focus on evaluating the inherent multilingual abilities of LLMs, as explained in Section 2.2. However, using English demonstrations for cross-lingual transfer or translating test samples into English has its own practical value for NLP applications. Additionally, it is worth exploring selecting demonstrations from a mixture of languages. Expanding our study to more setups would provide additional insights into multilingual and cross-lingual LLM abilities.

Other prompting methods. In Section 6, we only experiment with manually augmented templates to illustrate how the choice of template can reduce the effectiveness of demonstrations. There is a broad literature on prompt engineering and prompt sensitivity (White et al., 2023; Gonen et al., 2023), suggesting that it is plausible that another prompt could reduce the gap between few-shot and zero-shot performance even further. Chain-of-thought (CoT) prompting is another approach with promising multilingual abilities (Shi et al., 2023b; Huang et al., 2023) that might affect our findings. Our manually augmented templates are intended only as a starting point for further analysis, which we leave to future work.

Beyond automatic evaluation. When examining model responses, we noticed some cases where a correct answer as evaluated by a human was not fully captured by automatic evaluation metrics. Human evaluation is time-consuming, expensive, and hard to source for the wide range of languages that we explore in our work. Another option is LLM
evaluation, which is becoming increasingly popular (Fu et al., 2023a; Chan et al., 2024), but is also an expensive approach. More importantly, we have no guarantees about LLMs' multilingual capabilities. As a trade-off between cost and evaluation quality, we stick to automatic evaluation in our work for all tasks and languages.

Acknowledgments

We thank Matan Eyal for his valuable feedback. Our use of Microsoft Azure is sponsored by the Microsoft Accelerating Foundation Models Research (AFMR) program. Miaoran Zhang and Marius Mosbach received funding from the DFG (German Research Foundation) under project 232722074, SFB 1102. Vagrant Gautam and Jesujoba O. Alabi were supported by the BMBF's (German Federal Ministry of Education and Research) SLIK project under the grant 01IS22015C.

References

David Adelani, Jesujoba Alabi, Angela Fan, Julia Kreutzer, Xiaoyu Shen, Machel Reid, Dana Ruiter, Dietrich Klakow, Peter Nabende, Ernie Chang, Tajuddeen Gwadabe, Freshia Sackey, Bonaventure F. P. Dossou, Chris Emezue, Colin Leong, Michael Beukman, Shamsuddeen Muhammad, Guyo Jarso, Oreen Yousuf, Andre Niyongabo Rubungo, Gilles Hacheme, Eric Peter Wairagala, Muhammad Umair Nasir, Benjamin Ajibade, Tunde Ajayi, Yvonne Gitau, Jade Abbott, Mohamed Ahmed, Millicent Ochieng, Anuoluwapo Aremu, Perez Ogayo, Jonathan Mukiibi, Fatoumata Ouoba Kabore, Godson Kalipe, Derguene Mbaye, Allahsera Auguste Tapo, Victoire Memdjokam Koagne, Edwin Munkoh-Buabeng, Valencia Wagner, Idris Abdulmumin, Ayodele Awokoya, Happy Buzaaba, Blessing Sibanda, Andiswa Bukula, and Sam Manthalu. 2022. A few thousand translations go a long way! Leveraging pre-trained models for African news translation. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 3053–3070, Seattle, United States. Association for Computational Linguistics.

Divyanshu Aggarwal, Vivek Gupta, and Anoop Kunchukuttan. 2022. IndicXNLI: Evaluating multilingual inference for Indian languages. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 10994–11006, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics.

Sweta Agrawal, Chunting Zhou, Mike Lewis, Luke Zettlemoyer, and Marjan Ghazvininejad. 2023. In-context examples selection for machine translation. In Findings of the Association for Computational Linguistics: ACL 2023, pages 8857–8873, Toronto, Canada. Association for Computational Linguistics.

Kabir Ahuja, Harshita Diddee, Rishav Hada, Millicent Ochieng, Krithika Ramesh, Prachi Jain, Akshay Nambi, Tanuja Ganu, Sameer Segal, Mohamed Ahmed, Kalika Bali, and Sunayana Sitaram. 2023. MEGA: Multilingual evaluation of generative AI. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 4232–4267, Singapore. Association for Computational Linguistics.

Mikel Artetxe, Sebastian Ruder, and Dani Yogatama. 2020. On the cross-lingual transferability of monolingual representations. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 4623–4637, Online. Association for Computational Linguistics.

Akari Asai, Sneha Kudugunta, Xinyan Velocity Yu, Terra Blevins, Hila Gonen, Machel Reid, Yulia Tsvetkov, Sebastian Ruder, and Hannaneh Hajishirzi. 2023. BUFFET: Benchmarking large language models for few-shot cross-lingual transfer. arXiv.

Stephen Bach, Victor Sanh, Zheng Xin Yong, Albert Webson, Colin Raffel, Nihal V. Nayak, Abheesht Sharma, Taewoon Kim, M Saiful Bari, Thibault Fevry, Zaid Alyafeai, Manan Dey, Andrea Santilli, Zhiqing Sun, Srulik Ben-david, Canwen Xu, Gunjan Chhablani, Han Wang, Jason Fries, Maged Al-shaibani, Shanya Sharma, Urmish Thakker, Khalid Almubarak, Xiangru Tang, Dragomir Radev, Mike Tian-jian Jiang, and Alexander Rush. 2022. PromptSource: An integrated development environment and repository for natural language prompts. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics: System Demonstrations, pages 93–104, Dublin, Ireland. Association for Computational Linguistics.

Hritik Bansal, Karthik Gopalakrishnan, Saket Dingliwal, Sravan Bodapati, Katrin Kirchhoff, and Dan Roth. 2023. Rethinking the role of scale for in-context learning: An interpretability-based case study at 66 billion scale. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 11833–11856, Toronto, Canada. Association for Computational Linguistics.

Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. 2020. Language models are few-shot learners. Advances in Neural Information Processing Systems, 33:1877–1901.

Chi-Min Chan, Weize Chen, Yusheng Su, Jianxuan Yu, Wei Xue, Shanghang Zhang, Jie Fu, and Zhiyuan Liu. 2024. ChatEval: Towards better LLM-based evaluators through multi-agent debate. In The Twelfth International Conference on Learning Representations.

Jiuhai Chen, Lichang Chen, Chen Zhu, and Tianyi Zhou. 2023. How many demonstrations do you need for in-context learning? In Findings of the Association for Computational Linguistics: EMNLP 2023, pages 11149–11159, Singapore. Association for Computational Linguistics.

Yanda Chen, Ruiqi Zhong, Sheng Zha, George Karypis, and He He. 2022. Meta-learning via language model in-context tuning. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 719–730, Dublin, Ireland. Association for Computational Linguistics.

Jonathan H. Clark, Eunsol Choi, Michael Collins, Dan Garrette, Tom Kwiatkowski, Vitaly Nikolaev, and Jennimaria Palomaki. 2020. TyDi QA: A benchmark for information-seeking question answering in typologically diverse languages. Transactions of the Association for Computational Linguistics, 8:454–470.

Alexis Conneau, Ruty Rinott, Guillaume Lample, Adina Williams, Samuel Bowman, Holger Schwenk, and Veselin Stoyanov. 2018. XNLI: Evaluating cross-lingual sentence representations. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 2475–2485, Brussels, Belgium. Association for Computational Linguistics.

Qingxiu Dong, Lei Li, Damai Dai, Ce Zheng, Zhiyong Wu, Baobao Chang, Xu Sun, Jingjing Xu, and Zhifang Sui. 2022. A survey for in-context learning. arXiv.

Fangxiaoyu Feng, Yinfei Yang, Daniel Cer, Naveen Arivazhagan, and Wei Wang. 2022. Language-agnostic BERT sentence embedding. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 878–891, Dublin, Ireland. Association for Computational Linguistics.

Jinlan Fu, See-Kiong Ng, Zhengbao Jiang, and Pengfei Liu. 2023a. GPTScore: Evaluate as you desire. arXiv.

Yao Fu, Hao Peng, Ashish Sabharwal, Peter Clark, and Tushar Khot. 2023b. Complexity-based prompting for multi-step reasoning. In The Eleventh International Conference on Learning Representations.

Hila Gonen, Srini Iyer, Terra Blevins, Noah Smith, and Luke Zettlemoyer. 2023. Demystifying prompts in language models via perplexity estimation. In Findings of the Association for Computational Linguistics: EMNLP 2023, pages 10136–10148, Singapore. Association for Computational Linguistics.

Roee Hendel, Mor Geva, and Amir Globerson. 2023. In-context learning creates task vectors. In Findings of the Association for Computational Linguistics: EMNLP 2023, pages 9318–9333, Singapore. Association for Computational Linguistics.

Haoyang Huang, Tianyi Tang, Dongdong Zhang, Xin Zhao, Ting Song, Yan Xia, and Furu Wei. 2023. Not all languages are created equal in LLMs: Improving multilingual capability by cross-lingual-thought prompting. In Findings of the Association for Computational Linguistics: EMNLP 2023, pages 12365–12394, Singapore. Association for Computational Linguistics.

Omar Khattab, Arnav Singhvi, Paridhi Maheshwari, Zhiyuan Zhang, Keshav Santhanam, Sri Vardhamanan, Saiful Haq, Ashutosh Sharma, Thomas T. Joshi, Hanna Moazam, Heather Miller, Matei Zaharia, and Christopher Potts. 2023. DSPy: Compiling declarative language model calls into self-improving pipelines. arXiv.

Viet Lai, Nghia Ngo, Amir Pouran Ben Veyseh, Hieu Man, Franck Dernoncourt, Trung Bui, and Thien Nguyen. 2023. ChatGPT beyond English: Towards a comprehensive evaluation of large language models in multilingual learning. In Findings of the Association for Computational Linguistics: EMNLP 2023, pages 13171–13189, Singapore. Association for Computational Linguistics.

Cheng Li, Jindong Wang, Yixuan Zhang, Kaijie Zhu, Wenxin Hou, Jianxun Lian, Fang Luo, Qiang Yang, and Xing Xie. 2023a. Large language models understand and can be enhanced by emotional stimuli. arXiv.

Xiaonan Li, Kai Lv, Hang Yan, Tianyang Lin, Wei Zhu, Yuan Ni, Guotong Xie, Xiaoling Wang, and Xipeng Qiu. 2023b. Unified demonstration retriever for in-context learning. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 4644–4668, Toronto, Canada. Association for Computational Linguistics.

Xi Victoria Lin, Todor Mihaylov, Mikel Artetxe, Tianlu Wang, Shuohui Chen, Daniel Simig, Myle Ott, Naman Goyal, Shruti Bhosale, Jingfei Du, Ramakanth Pasunuru, Sam Shleifer, Punit Singh Koura, Vishrav Chaudhary, Brian O'Horo, Jeff Wang, Luke Zettlemoyer, Zornitsa Kozareva, Mona Diab, Veselin Stoyanov, and Xian Li. 2022. Few-shot learning with multilingual generative language models. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 9019–9052, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics.

Jiachang Liu, Dinghan Shen, Yizhe Zhang, Bill Dolan, Lawrence Carin, and Weizhu Chen. 2022. What makes good in-context examples for GPT-3? In Proceedings of Deep Learning Inside Out (DeeLIO 2022): The 3rd Workshop on Knowledge Extraction and Integration for Deep Learning Architectures, pages 100–114, Dublin, Ireland and Online. Association for Computational Linguistics.

Yao Lu, Max Bartolo, Alastair Moore, Sebastian Riedel, and Pontus Stenetorp. 2022. Fantastically ordered prompts and where to find them: Overcoming few-shot prompt order sensitivity. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 8086–8098, Dublin, Ireland. Association for Computational Linguistics.

Xinxi Lyu, Sewon Min, Iz Beltagy, Luke Zettlemoyer, and Hannaneh Hajishirzi. 2023. Z-ICL: Zero-shot in-context learning with pseudo-demonstrations. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2304–2317, Toronto, Canada. Association for Computational Linguistics.

Sewon Min, Mike Lewis, Hannaneh Hajishirzi, and Luke Zettlemoyer. 2022a. Noisy channel language model prompting for few-shot text classification. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 5316–5330, Dublin, Ireland. Association for Computational Linguistics.

Sewon Min, Mike Lewis, Luke Zettlemoyer, and Hannaneh Hajishirzi. 2022b. MetaICL: Learning to learn in context. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 2791–2809, Seattle, United States. Association for Computational Linguistics.

Sewon Min, Xinxi Lyu, Ari Holtzman, Mikel Artetxe, Mike Lewis, Hannaneh Hajishirzi, and Luke Zettlemoyer. 2022c. Rethinking the role of demonstrations: What makes in-context learning work? In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 11048–11064, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics.

Swaroop Mishra, Daniel Khashabi, Chitta Baral, Yejin Choi, and Hannaneh Hajishirzi. 2022a. Reframing instructional prompts to GPTk's language. In Findings of the Association for Computational Linguistics: ACL 2022, pages 589–612, Dublin, Ireland. Association for Computational Linguistics.

Swaroop Mishra, Daniel Khashabi, Chitta Baral, and Hannaneh Hajishirzi. 2022b. Cross-task generalization via natural language crowdsourcing instructions. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 3470–3487, Dublin, Ireland. Association for Computational Linguistics.

Moran Mizrahi, Guy Kaplan, Dan Malkin, Rotem Dror, Dafna Shahaf, and Gabriel Stanovsky. 2024. State of what art? A call for multi-prompt LLM evaluation. arXiv.

Marius Mosbach, Tiago Pimentel, Shauli Ravfogel, Dietrich Klakow, and Yanai Elazar. 2023. Few-shot fine-tuning vs. in-context learning: A fair comparison and evaluation. In Findings of the Association for Computational Linguistics: ACL 2023, pages 12284–12314, Toronto, Canada. Association for Computational Linguistics.

Niklas Muennighoff, Thomas Wang, Lintang Sutawika, Adam Roberts, Stella Biderman, Teven Le Scao, M Saiful Bari, Sheng Shen, Zheng Xin Yong, Hailey Schoelkopf, Xiangru Tang, Dragomir Radev, Alham Fikri Aji, Khalid Almubarak, Samuel Albanie, Zaid Alyafeai, Albert Webson, Edward Raff, and Colin Raffel. 2023. Crosslingual generalization through multitask finetuning. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 15991–16111, Toronto, Canada. Association for Computational Linguistics.

Shamsuddeen Muhammad, Idris Abdulmumin, Abinew Ayele, Nedjma Ousidhoum, David Adelani, Seid Yimam, Ibrahim Ahmad, Meriem Beloucif, Saif Mohammad, Sebastian Ruder, Oumaima Hourrane, Alipio Jorge, Pavel Brazdil, Felermino Ali, Davis David, Salomey Osei, Bello Shehu-Bello, Falalu Lawan, Tajuddeen Gwadabe, Samuel Rutunda, Tadesse Belay, Wendimu Messelle, Hailu Balcha, Sisay Chala, Hagos Gebremichael, Bernard Opoku, and Stephen Arthur. 2023. AfriSenti: A Twitter sentiment analysis benchmark for African languages. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 13968–13981, Singapore. Association for Computational Linguistics.

Jessica Ojo, Kelechi Ogueji, Pontus Stenetorp, and David I. Adelani. 2023. How good are large language models on African languages? arXiv.

OpenAI, Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, et al. 2023. GPT-4 technical report. arXiv.

Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. 2022. Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35:27730–27744.

Jane Pan, Tianyu Gao, Howard Chen, and Danqi Chen. 2023. What in-context learning "learns" in-context: Disentangling task recognition and task learning. In Findings of the Association for Computational Linguistics: ACL 2023, pages 8298–8319, Toronto, Canada. Association for Computational Linguistics.

Edoardo Maria Ponti, Goran Glavaš, Olga Majewska, Qianchu Liu, Ivan Vulić, and Anna Korhonen. 2020. XCOPA: A multilingual dataset for causal commonsense reasoning. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 2362–2376, Online. Association for Computational Linguistics.

Maja Popović. 2017. chrF++: words helping character n-grams. In Proceedings of the Second Conference on Machine Translation, pages 612–618, Copenhagen, Denmark. Association for Computational Linguistics.

Laria Reynolds and Kyle McDonell. 2021. Prompt programming for large language models: Beyond the few-shot paradigm. In Extended Abstracts of the 2021 CHI Conference on Human Factors in Computing Systems, pages 1–7.

Ohad Rubin, Jonathan Herzig, and Jonathan Berant. 2022. Learning to retrieve prompts for in-context learning. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 2655–2671, Seattle, United States. Association for Computational Linguistics.

Sebastian Ruder, Ivan Vulić, and Anders Søgaard. 2022. Square one bias in NLP: Towards a multi-dimensional exploration of the research manifold. In Findings of the Association for Computational Linguistics: ACL 2022, pages 2340–2354, Dublin, Ireland. Association for Computational Linguistics.

Timo Schick and Hinrich Schütze. 2021. It's not just size that matters: Small language models are also few-shot learners. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 2339–2352, Online. Association for Computational Linguistics.

Melanie Sclar, Yejin Choi, Yulia Tsvetkov, and Alane Suhr. 2024. Quantifying language models' sensitivity to spurious features in prompt design or: How I learned to start worrying about prompt formatting. In The Twelfth International Conference on Learning Representations.

Freda Shi, Xinyun Chen, Kanishka Misra, Nathan Scales, David Dohan, Ed H Chi, Nathanael Schärli, and Denny Zhou. 2023a. Large language models can be easily distracted by irrelevant context. In International Conference on Machine Learning, pages 31210–31227. PMLR.

Freda Shi, Mirac Suzgun, Markus Freitag, Xuezhi Wang, Suraj Srivats, Soroush Vosoughi, Hyung Won Chung, Yi Tay, Sebastian Ruder, Denny Zhou, Dipanjan Das, and Jason Wei. 2023b. Language models are multilingual chain-of-thought reasoners. In The Eleventh International Conference on Learning Representations.

Peng Shi, Rui Zhang, He Bai, and Jimmy Lin. 2022. XRICL: Cross-lingual retrieval-augmented in-context learning for cross-lingual text-to-SQL semantic parsing. In Findings of the Association for Computational Linguistics: EMNLP 2022, pages 5248–5259, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics.

Chenglei Si, Zhe Gan, Zhengyuan Yang, Shuohang Wang, Jianfeng Wang, Jordan Lee Boyd-Graber, and Lijuan Wang. 2023. Prompting GPT-3 to be reliable. In The Eleventh International Conference on Learning Representations.

Eshaan Tanwar, Subhabrata Dutta, Manish Borthakur, and Tanmoy Chakraborty. 2023. Multilingual LLMs are better cross-lingual in-context learners with alignment. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 6292–6307, Toronto, Canada. Association for Computational Linguistics.

Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. 2023. Llama 2: Open foundation and fine-tuned chat models. arXiv.

Johannes Von Oswald, Eyvind Niklasson, Ettore Randazzo, Joao Sacramento, Alexander Mordvintsev, Andrey Zhmoginov, and Max Vladymyrov. 2023. Transformers learn in-context by gradient descent. In Proceedings of the 40th International Conference on Machine Learning, volume 202 of Proceedings of Machine Learning Research, pages 35151–35174. PMLR.

Xingchen Wan, Ruoxi Sun, Hootan Nakhost, Hanjun Dai, Julian Eisenschlos, Sercan Arik, and Tomas Pfister. 2023. Universal self-adaptive prompting. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 7437–7462, Singapore. Association for Computational Linguistics.

Boshi Wang, Sewon Min, Xiang Deng, Jiaming Shen, You Wu, Luke Zettlemoyer, and Huan Sun. 2023a. Towards understanding chain-of-thought prompting: An empirical study of what matters. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2717–2739, Toronto, Canada. Association for Computational Linguistics.

Lean Wang, Lei Li, Damai Dai, Deli Chen, Hao Zhou, Fandong Meng, Jie Zhou, and Xu Sun. 2023b. Label words are anchors: An information flow perspective for understanding in-context learning. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 9840–9855, Singapore. Association for Computational Linguistics.

Xinyi Wang, Wanrong Zhu, Michael Saxon, Mark Steyvers, and William Yang Wang. 2023c. Large language models are latent variable models: Explaining and finding good demonstrations for in-context learning. In Thirty-seventh Conference on Neural Information Processing Systems.

Yizhong Wang, Swaroop Mishra, Pegah Alipoormolabashi, Yeganeh Kordi, Amirreza Mirzaei, Atharva Naik, Arjun Ashok, Arut Selvan Dhanasekaran, Anjana Arunkumar, David Stap, Eshaan Pathak, Giannis Karamanolakis, Haizhi Lai, Ishan Purohit, Ishani Mondal, Jacob Anderson, Kirby Kuznia, Krima Doshi, Kuntal Kumar Pal, Maitreya Patel, Mehrad Moradshahi, Mihir Parmar, Mirali Purohit, Neeraj Varshney, Phani Rohitha Kaza, Pulkit Verma, Ravsehaj Singh Puri, Rushang Karia, Savan Doshi, Shailaja Keyur Sampat, Siddhartha Mishra, Sujan

Yinfei Yang, Yuan Zhang, Chris Tar, and Jason Baldridge. 2019. PAWS-X: A cross-lingual adversarial dataset for paraphrase identification. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 3687–3692, Hong Kong, China. Association for Computational Linguistics.

Kang Min Yoo, Junyeob Kim, Hyuhng Joon Kim, Hyunsoo Cho, Hwiyeol Jo, Sang-Woo Lee, Sang-goo Lee,
Reddy A, Sumanta Patro, Tanay Dixit, and Xudong and Taeuk Kim. 2022. Ground-truth labels matter: A
Shen. 2022. Super-NaturalInstructions: Generaliza- deeper look into input-label demonstrations. In Pro-
tion via declarative instructions on 1600+ NLP tasks. ceedings of the 2022 Conference on Empirical Meth-
In Proceedings of the 2022 Conference on Empiri- ods in Natural Language Processing, pages 2422–
cal Methods in Natural Language Processing, pages 2437, Abu Dhabi, United Arab Emirates. Association
5085–5109, Abu Dhabi, United Arab Emirates. As- for Computational Linguistics.
sociation for Computational Linguistics.
Ruochen Zhang, Samuel Cahyawijaya, Jan Chris-
Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten tian Blaise Cruz, Genta Winata, and Alham Aji.
Bosma, brian ichter, Fei Xia, Ed Chi, Quoc V Le, 2023. Multilingual large language models are not
and Denny Zhou. 2022. Chain-of-thought prompt- (yet) code-switchers. In Proceedings of the 2023
ing elicits reasoning in large language models. In Conference on Empirical Methods in Natural Lan-
Advances in Neural Information Processing Systems, guage Processing, pages 12567–12582, Singapore.
volume 35, pages 24824–24837. Curran Associates, Association for Computational Linguistics.
Inc.
Zihao Zhao, Eric Wallace, Shi Feng, Dan Klein, and
Jerry Wei, Jason Wei, Yi Tay, Dustin Tran, Albert Sameer Singh. 2021. Calibrate before use: Improv-
Webson, Yifeng Lu, Xinyun Chen, Hanxiao Liu, ing few-shot performance of language models. In In-
Da Huang, Denny Zhou, et al. 2023. Larger language ternational Conference on Machine Learning, pages
models do in-context learning differently. arXiv. 12697–12706. PMLR.
Jules White, Quchen Fu, Sam Hays, Michael Sandborn, Wenhao Zhu, Hongyi Liu, Qingxiu Dong, Jingjing Xu,
Carlos Olea, Henry Gilbert, Ashraf Elnashar, Jesse Shujian Huang, Lingpeng Kong, Jiajun Chen, and
Spencer-Smith, and Douglas C. Schmidt. 2023. A Lei Li. 2023. Multilingual machine translation with
prompt pattern catalog to enhance prompt engineer- large language models: Empirical results and analy-
ing with chatgpt. arXiv. sis. arXiv.
Thomas Wolf, Lysandre Debut, Victor Sanh, Julien
Chaumond, Clement Delangue, Anthony Moi, Pier-
ric Cistac, Tim Rault, Remi Louf, Morgan Funtow-
icz, Joe Davison, Sam Shleifer, Patrick von Platen,
Clara Ma, Yacine Jernite, Julien Plu, Canwen Xu,
Teven Le Scao, Sylvain Gugger, Mariama Drame,
Quentin Lhoest, and Alexander Rush. 2020. Trans-
formers: State-of-the-art natural language processing.
In Proceedings of the 2020 Conference on Empirical
Methods in Natural Language Processing: System
Demonstrations, pages 38–45, Online. Association
for Computational Linguistics.
Zhenyu Wu, Yaoxiang Wang, Jiacheng Ye, Zhiyong
Wu, Jiangtao Feng, Jingjing Xu, and Yu Qiao. 2023.
OpenICL: An open-source framework for in-context
learning. In Proceedings of the 61st Annual Meet-
ing of the Association for Computational Linguistics
(Volume 3: System Demonstrations), pages 489–498,
Toronto, Canada. Association for Computational Lin-
guistics.
Sang Michael Xie, Aditi Raghunathan, Percy Liang,
and Tengyu Ma. 2022. An explanation of in-context
learning as implicit bayesian inference. In Interna-
tional Conference on Learning Representations.
Yinfei Yang, Yuan Zhang, Chris Tar, and Jason
Baldridge. 2019. PAWS-X: A cross-lingual adversar-
ial dataset for paraphrase identification. In Proceed-
ings of the 2019 Conference on Empirical Methods
A Experimental details
A.1 Tasks and datasets
We conduct experiments on 9 multilingual datasets with a wide coverage of tasks and languages, as shown
in Table 4. All datasets are public research datasets and our experiments are consistent with their intended
use, i.e., NLP evaluation. For the machine translation dataset MAFAND, English serves as the pivot
language and there are two translation directions: en-xx (i.e., translating from English to another language)
and xx-en (i.e., translating from another language to English). Since the training data of the OpenAI APIs we use is not publicly disclosed and extends only up to September 2021, we also report each dataset's release date in the table as a rough indicator of how severe dataset contamination may be.

Dataset | Task | Languages | #Lang. | Release Date
XNLI | natural language inference | English, German, Russian, French, Spanish, Chinese, Vietnamese, Turkish, Arabic, Greek, Thai, Bulgarian, Hindi, Urdu, Swahili | 15 | 2019.09
IndicXNLI | natural language inference | Hindi, Bengali, Tamil, Marathi, Malayalam, Telugu, Kannada, Punjabi, Oriya, Assamese, Gujarati | 11 | 2022.04
PAWS-X | paraphrase identification | English, German, Japanese, French, Spanish, Chinese, Korean | 7 | 2019.08
XCOPA | commonsense reasoning | Chinese, Italian, Vietnamese, Indonesian, Turkish, Thai, Estonian, Tamil, Swahili, Haitian, Quechua | 11 | 2020.04
XStoryCloze | commonsense reasoning | English, Russian, Spanish, Chinese, Indonesian, Arabic, Hindi, Basque, Telugu, Burmese, Swahili | 11 | 2023.05
AfriSenti | sentiment analysis | Swahili, Amharic, Hausa, Kinyarwanda, Yoruba, Tigrinya, Igbo, Oromo, Moroccan Arabic, Algerian Arabic, Nigerian Pidgin, Mozambican Portuguese, Tsonga, Twi | 14 | 2023.05
XQuAD | extractive QA | English, German, Russian, Spanish, Chinese, Vietnamese, Turkish, Arabic, Greek, Romanian, Thai, Hindi | 12 | 2019.10
TyDiQA-GoldP | extractive QA | English, Russian, Indonesian, Korean, Arabic, Finnish, Bengali, Telugu, Swahili | 9 | 2020.02
MAFAND | machine translation | Amharic, Hausa, Kinyarwanda, Luganda, Luo, Chichewa, Nigerian Pidgin, Shona, Swahili, Setswana, Twi, Xhosa, Yoruba, Zulu | 14 | 2022.06

Table 4: Multilingual benchmarking datasets.

A.2 In-context learning


We sample few-shot demonstrations from the validation set and evaluate on the test set. For datasets without a test split (XStoryCloze and TyDiQA), we sample demonstrations from the train set and evaluate on the validation set. Since XQuAD only has a validation split, we use it both for demonstration sampling and for evaluation, ensuring that the test sample itself never appears among its own demonstrations. For chat models (Llama 2-Chat, GPT-3.5, and GPT-4), we limit the number of test samples to at most 200 to reduce inference costs while keeping the comparison fair.
We use GPT-3 style prompting templates for XGLM and Llama 2, as shown in Table 6. The templates for BLOOMZ and mT0 are shown in Table 7. For Llama 2-Chat, GPT-3.5, and GPT-4, the default templates are shown in Table 8, and the task instructions are used to assign a system role to the model. Inspired by Lai et al. (2023) and Li et al. (2023a), who show that emotional stimuli can enhance LLM understanding, we design formatting-focused templates (discussed in Section 6) that push the models to produce well-formatted outputs which are easier to evaluate automatically; these templates are shown in Table 9.
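
To make the prompt construction concrete, the following is a minimal sketch (not our exact implementation) of how a k-shot prompt can be assembled from the GPT-3-style NLI pattern and verbalizer in Table 6; the label order in the verbalizer and the candidate-scoring strategy are assumptions about a typical setup.

```python
# Minimal sketch of k-shot prompt construction for the GPT-3-style NLI template
# from Table 6: "{premise}, right? {label}, {hypothesis}" with verbalizer Yes/Also/No.
# Assumed label order: 0 = entailment, 1 = neutral, 2 = contradiction.
VERBALIZER = ["Yes", "Also", "No"]

def render(premise: str, hypothesis: str, label_word: str) -> str:
    """Fill the pattern with one example and its verbalized label."""
    return f"{premise}, right? {label_word}, {hypothesis}"

def build_candidate_prompts(demos, test_premise, test_hypothesis):
    """demos: list of (premise, hypothesis, label_id) sampled from the validation set.
    Returns one prompt per candidate label; the label whose prompt the model
    assigns the highest likelihood is taken as the prediction."""
    context = "\n".join(render(p, h, VERBALIZER[y]) for p, h, y in demos)
    return [
        context + "\n" + render(test_premise, test_hypothesis, word)
        for word in VERBALIZER
    ]

# Example usage with a single (hypothetical) demonstration:
demos = [("A man is playing a guitar.", "A person is making music.", 0)]
candidates = build_candidate_prompts(demos, "A dog runs in the park.", "The dog is asleep.")
```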

A.3 Implementation
Our codebase is adapted from OpenICL (Wu et al., 2023). We use int8bit model quantization11 for all
models except OpenAI models. Experiments are conducted using a single NVIDIA A100-80GB GPU.
As models have a maximum context length, we preserve complete demonstrations that can fit within the
context window. We employ greedy decoding for model generation. For chat models, the maximum
new token is set to 50, while for machine translation, it is set to 100. For other models, the maximum
11
In our preliminary experiments, we found that int8 quantization led to a performance degradation of 1-2% on a few
classification datasets with Llama 2 and XGLM. Since this degradation is consistent across different setups, we believe that it
would not affect our overall findings.
new token is set to 20, while for machine translation, it is set to 50. We use three seeds (0, 33, 42) in our
experiments, and the single-seed results for BLOOMZ and mT0 are obtained with the seed 0.
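
As an illustration of this setup, the sketch below loads one of the open models in 8-bit and generates with greedy decoding under the token limits above. It is a hedged example rather than our exact code: the model identifier and the BitsAndBytesConfig-based quantization call are assumptions about a typical transformers-based pipeline.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

# Illustrative model id; our experiments also cover XGLM, BLOOMZ, mT0, and chat models.
model_id = "meta-llama/Llama-2-7b-hf"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),  # int8 quantization
    device_map="auto",
)

def generate(prompt: str, max_new_tokens: int = 20) -> str:
    """Greedy decoding; 20 new tokens for classification/QA and 50 for MT
    (50 and 100, respectively, for the chat models)."""
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    with torch.no_grad():
        output = model.generate(**inputs, do_sample=False, max_new_tokens=max_new_tokens)
    return tokenizer.decode(output[0, inputs["input_ids"].shape[1]:], skip_special_tokens=True)
```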

B More results for varying numbers of demonstrations


In this section, we provide supplemental results for Section 4.

B.1 Results for BLOOMZ and mT0


In addition to the 5 models (base models and chat models) we discussed in the main content, we also
experiment with two instruction-tuned models: BLOOMZ and mT0 (Muennighoff et al., 2023). The
results for varying numbers of random demonstrations are shown in Figure 8. In line with findings
from Asai et al. (2023), we observe significant performance degradation when using demonstrations
compared to zero-shot learning in all cases. This decline can be attributed to their training scheme, where
models are fine-tuned on a large collection of existing datasets in a zero-shot manner. In contrast, several
studies (Chen et al., 2022; Wang et al., 2022) focus on enhancing the in-context learning ability of LLMs
by incorporating demonstrations into their training process. This suggests that we should be careful in
model selection for in-context learning research and take the model training process into consideration.
[Figure 8 shows one panel per dataset (XNLI, IndicXNLI, PAWS-X, XCOPA, XStoryCloze, AfriSenti, XQuAD, TyDiQA, MAFAND en-xx and xx-en), plotting accuracy, F1, or ChrF++ against the number of demonstrations (k-shot: 0, 2, 4, 8) for BLOOMZ and mT0.]

Figure 8: Average performance across languages for BLOOMZ and mT0 with different numbers of demonstrations.
The results are obtained with a single random seed. Note that PAWS-X, XQuAD and TyDiQA are included in the
instruction-tuning datasets of BLOOMZ and mT0.

B.2 Results for individual languages


The language-specific results for each task are shown in Figure 9. The languages are ordered by their data ratio in the CommonCrawl corpus,12 from high-resource to low-resource. We observe large variation in model performance across languages. For instance, there is a large performance disparity between English and Urdu on XNLI, and in XCOPA, performance on Quechua is significantly worse than on the other languages.

C More results for ablating the quality of demonstrations


In this section, we provide supplemental results for Section 5.

C.1 Performance of different types of demonstrations


In Table 5, we show model performance with the three types of demonstrations, together with the zero-shot performance for comparison. We note that top-k selection is not always the optimal choice, despite the considerable effort it takes to optimize demonstrations. For QA, the ability of XGLM and Llama 2 to solve the task almost collapses with corrupted labels. For the chat models, however, demonstrations with corrupted labels achieve performance comparable to ground-truth labels and still improve substantially over the zero-shot baseline.
12
http://commoncrawl.org
Model Demonstration XNLI IndicXNLI PAWS-X XCOPA XStoryCloze AfriSenti XQuAD TyDiQA MT (en-xx) MT (xx-en)

XGLM
Zero-shot 45.87 38.27 54.79 57.51 65.19 32.71 18.16 26.01 0.79 1.89
Top-k 45.99 38.85 51.72 58.76 63.99 44.30 27.54 34.32 9.08 15.05
Random 41.40±0.50 36.36±0.33 51.48±0.33 58.73±0.43 63.02±0.09 38.68±0.39 25.77±0.06 30.11±0.36 7.77±0.09 14.39±0.02
Random-corrupted 40.94±0.42 36.41±0.35 51.04±0.28 58.21±0.30 62.40±0.21 34.90±0.42 1.21±0.03 3.47±0.07 4.59±0.01 7.66±0.04

Llama 2
Zero-shot 44.25 37.66 59.21 56.02 65.17 32.71 15.33 16.81 10.06 11.27
Top-k 47.10 40.15 59.35 57.69 66.16 47.25 32.37 35.36 17.29 22.92
Random 40.49±0.35 35.98±0.24 57.00±0.29 57.80±0.32 65.83±0.08 43.08±0.02 31.05±0.28 34.82±0.21 15.14±0.01 21.57±0.02
Random-corrupted 39.53±0.20 35.55±0.44 55.85±0.92 57.19±0.06 64.71±0.25 40.81±0.36 4.36±0.25 5.62±0.27 10.35±0.04 13.23±0.04

Llama 2-Chat
Zero-shot 36.10 32.32 64.64 44.55 57.77 31.18 18.82 20.33 18.83 25.46
Top-k 47.53 35.73 59.36 63.55 68.82 45.75 39.94 42.38 21.50 29.02
Random 47.81±0.85 37.09±2.57 61.07±1.2 63.23±0.91 68.39±0.14 43.58±0.23 38.92±0.09 39.96±0.30 20.76±0.23 28.36±0.12
Random-corrupted 48.15±1.22 37.05±3.09 59.59±1.13 63.20±0.33 68.62±0.93 42.81±0.11 32.98±0.39 35.59±0.08 19.63±0.04 26.82±0.11

GPT-3.5
Zero-shot 63.23 48.23 66.57 73.50 84.55 53.32 45.25 48.52 27.39 37.77
Top-k 63.27 50.45 69.29 80.77 85.23 51.86 67.82 67.76 29.20 39.99
Random 63.09±0.88 49.74±1.17 71.36±0.75 79.91±0.75 85.84±0.30 52.52±0.21 68.16±0.36 64.78±0.47 28.48±0.01 39.56±0.03
Random-corrupted 62.70±1.05 48.73±0.51 70.71±0.66 79.65±0.74 85.26±0.07 53.14±0.47 62.70±0.19 59.17±0.27 27.09±0.18 39.08±0.08

GPT-4
Zero-shot 70.30 65.41 74.50 88.82 96.05 58.46 44.03 46.97 32.73 45.28
Top-k 76.53 67.45 76.14 91.23 96.73 61.68 72.44 74.65 35.06 48.34
Random 75.77 67.64 76.07 91.59 96.68 62.36 73.21 72.77 33.85 47.69
Random-corrupted 76.63 67.68 75.50 90.73 95.55 61.46 63.61 65.80 32.61 47.05

Table 5: Performance of different types of demonstrations. For Random and Random-corrupted, we report the mean and standard deviation across 3 seeds, except for GPT-4. Best results for each model and dataset are boldfaced.

Overall, the base models are more sensitive to the type of demonstrations than the chat models.
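
To make the three demonstration settings explicit, the sketch below shows one way they could be constructed. It is a simplified rendition rather than our exact pipeline; in particular, the cosine-similarity retriever for top-k and the uniform label replacement for corruption are assumptions about one reasonable implementation.

```python
import random
import numpy as np

def random_demos(pool, k, seed):
    """Random: sample k labeled examples from the demonstration pool."""
    return random.Random(seed).sample(pool, k)

def corrupt_labels(demos, label_set, seed):
    """Random-corrupted: keep the inputs but replace each gold label with a
    label drawn uniformly at random from the label set."""
    rng = random.Random(seed)
    return [(x, rng.choice(label_set)) for x, _ in demos]

def topk_demos(pool, pool_embeddings, query_embedding, k):
    """Top-k: select the k pool examples most similar to the test input,
    here by cosine similarity of (pre-computed) sentence embeddings."""
    sims = pool_embeddings @ query_embedding / (
        np.linalg.norm(pool_embeddings, axis=1) * np.linalg.norm(query_embedding) + 1e-9
    )
    return [pool[i] for i in np.argsort(-sims)[:k]]
```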

C.2 Results for individual languages


In Figure 10, we show the language-specific results for each task, which reveal how the effect of different types of demonstrations varies across languages.

D The interplay between demonstrations and templates


In this section, we provide supplemental results for Section 6.

D.1 Results for individual languages


We examine the effect of templates and show language-specific results for XCOPA, AfriSenti, XQuAD, and TyDiQA in Figure 11. In a few cases, we find that formatting-focused templates lead to a decline in performance compared to the original templates (e.g., Igbo and Mozambican Portuguese in AfriSenti with GPT-3.5). We attribute this to the models' sensitivity to prompts, which highlights the potential of automatic prompt engineering. Still, in most settings the formatting-focused templates largely narrow the performance gap between 0-shot and 4-shot.
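
Because the formatting-focused templates constrain the chat models to answer in a fixed pattern (e.g., "Choice number: 1" for XCOPA; see Table 9), evaluation reduces to extracting that pattern from the generation. The small regex-based parser below is a sketch of how such outputs might be scored; the exact post-processing in our pipeline may differ.

```python
import re

def parse_choice(generation: str):
    """Extract the predicted option from a formatting-focused XCOPA answer such as
    'Choice number: 2'. Returns None if the model ignored the requested format."""
    match = re.search(r"Choice number:\s*([12])", generation)
    return int(match.group(1)) if match else None

# A well-formatted reply parses cleanly; a free-form reply does not.
assert parse_choice("Choice number: 2") == 2
assert parse_choice("I think the second option is more plausible.") is None
```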

Task Pattern Verbalizer

NLI
pattern: {premise}, right? {label}, {hypothesis}
verbalizer: Yes || Also || No

PAWS-X
pattern: {sentence1}, right? {label}, {sentence2}
verbalizer: No || Yes

XCOPA
pattern: {premise} {% if question == "cause" %}because{% else %} so{% endif %} {label}
verbalizer: {choice1} || {choice2}

XStoryCloze
pattern: {input_sentence_1} {input_sentence_2} {input_sentence_3} {input_sentence_4} {label}
verbalizer: {sentence_quiz_1} || {sentence_quiz_2}

AfriSenti
pattern: {tweet} The sentiment of the previous sentence is {label}
verbalizer: positive || neutral || negative

QA
pattern: {context}\nQ:{question}\nA:{answer}
verbalizer: {answer}

MT
pattern: {source_sentence} = {target_sentence}
verbalizer: {target_sentence}

Table 6: Prompting templates for XGLM and Llama 2 following Brown et al. (2020) and Lin et al. (2022).
Task Pattern Verbalizer

NLI
pattern: {premise} Based on the previous passage, is it true that {hypothesis}? Yes, No, or Maybe? {label}
verbalizer: Yes || Maybe || No

PAWS-X
pattern: Sentence 1: {sentence1}\nSentence 2: {sentence2}\nQuestion: Can we rewrite Sentence 1 to Sentence 2? Yes or No? {label}
verbalizer: No || Yes

XCOPA
pattern: {premise} {% if question == "cause" %}This happened because...{% else %} As a consequence...{% endif %}\nHelp me pick the more plausible option:\n- {choice1}\n- {choice2}\n{label}
verbalizer: {choice1} || {choice2}

XStoryCloze
pattern: {input_sentence_1} {input_sentence_2} {input_sentence_3} {input_sentence_4}\nWhat is a possible continuation for the story given the following options?\n- {sentence_quiz_1}\n- {sentence_quiz_2}\n{label}
verbalizer: {sentence_quiz_1} || {sentence_quiz_2}

AfriSenti
pattern: {tweet} Would you rate the previous sentence as positive, neutral or negative? {label}
verbalizer: positive || neutral || negative

QA
pattern: {context}\nQ:{question}\nReferring to the passage above, the correct answer to the given question is:{answer}
verbalizer: {answer}

MT
pattern: Translate the following {src_language} text to {tgt_language}:\n{src_sentence}\n{tgt_sentence}
verbalizer: {tgt_sentence}

Table 7: Prompting templates for BLOOMZ and mT0 following Muennighoff et al. (2023) and Bach et al. (2022).
[Figure 9 panels: (a) XNLI, (b) IndicXNLI, (c) PAWS-X, (d) XCOPA, (e) XStoryCloze, (f) AfriSenti, (g) XQuAD, (h) TyDiQA, (i) MAFAND (en-xx), (j) MAFAND (xx-en). Each panel plots per-language accuracy, F1, or ChrF++ against the number of demonstrations (0, 2, 4, 8) for XGLM, Llama 2, Llama 2-Chat, GPT-3.5, and GPT-4.]

Figure 9: Language-specific performance for both classification and generation tasks with different numbers of
demonstrations. We average and report standard deviations over 3 seeds for all models except GPT-4.
[Figure 10 panels: (a) XNLI, (b) IndicXNLI, (c) PAWS-X, (d) XCOPA, (e) XStoryCloze, (f) AfriSenti, (g) XQuAD, (h) TyDiQA, (i) MAFAND (en-xx), (j) MAFAND (xx-en). Each panel shows per-language 4-shot performance (accuracy, F1, or ChrF++) under the Zero, Top-k, Random, and Random-corrupted settings, with one row of bars per model (XGLM, Llama 2, Llama 2-Chat, GPT-3.5, GPT-4).]

Figure 10: Language-specific performance of 4-shot ICL using different types of demonstrations. We average and
report standard deviations over 3 seeds for all models except GPT-4.
[Figure 11 panels: (a) XCOPA, (b) AfriSenti, (c) XQuAD, (d) TyDiQA. Each panel compares, per language and per model (Llama 2-Chat, GPT-3.5, GPT-4), 0-shot and 4-shot performance with the original template versus the formatting-focused template.]

Figure 11: Effect of using different templates on 0-shot and 4-shot performance for XCOPA, AfriSenti, XQuAD, and TyDiQA. Few-shot results are averaged across 3 seeds except for GPT-4.
Task Template

NLI task instruction: You are an NLP assistant whose purpose is to solve Natural Language Inference
(NLI) problems in <EVALUATION_LANGUAGE>. NLI is the task of determining the inference relation
between two (short, ordered) texts: entailment, contradiction, or neutral. Answer as concisely as
possible in the same format as the examples below:
pattern: {premise}\nQuestion: {hypothesis}\nTrue, False, or Neither?
verbalizer: True || Neither || False

PAWS-X task instruction: You are an NLP assistant whose purpose is to perform Paraphrase Identification in
<EVALUATION_LANGUAGE>. The goal of Paraphrase Identification is to determine whether a pair
of sentences have the same meaning. Answer as concisely as possible in the same format as the
examples below:
pattern: {sentence1}\nQuestion: {sentence2}\nTrue or False?
verbalizer: False || True

XCOPA task instruction: You are an NLP assistant whose purpose is to perform open-domain commonsense
causal reasoning in <EVALUATION_LANGUAGE>. You will be provided a premise and two alternatives,
where the task is to select the alternative that more plausibly has a causal relation with the premise.
Answer as concisely as possible in the same format as the examples below:
pattern:
Premise: {premise}\nWhat is the {question}? Pick the more plausible option:\n
1: {choice1}\n2: {choice2}\n
You should tell me the choice number in this format ’Choice number:’
verbalizer: Choice number: 1 || Choice number: 2

XStoryCloze task instruction: You are an NLP assistant whose purpose is to perform open-domain commonsense
causal reasoning in <EVALUATION_LANGUAGE>. You will be provided a four-sentence story and two
continuations, where the task is to select the correct ending. Answer as concisely as possible in the same
format as the examples below:
pattern:
Story: {input_sentence_1} {input_sentence_2} {input_sentence_3} {input_sentence_4}\n
What is a possible continuation for the story? Pick the more plausible option:\n
1: {sentence_quiz1}\n2: {sentence_quiz2}\n
You should tell me the choice number in this format ’Choice number:’
verbalizer: Choice number: 1 || Choice number: 2

AfriSenti task instruction: You are an NLP assistant whose purpose is to perform Sentiment Analysis in
<EVALUATION_LANGUAGE>. Sentiment Analysis is the task of determining the sentiment,
opinion or emotion expressed in a textual data. Give your answer as a single word, "positive", "neutral"
or "negative".
pattern: Does this statement “{tweet}” have a {positive neutral or negative} sentiment? Labels only
verbalizer: positive || neutral || negative

QA task instruction: You are an NLP assistant whose purpose is to solve reading comprehension
problems in <EVALUATION_LANGUAGE>. You will be provided questions on a set of passages and
you will need to provide the answer as it appears in the passage. The answer should be in the same
language as the question and the passage.
pattern:
{context}\nQ: {question}\nReferring to the passage above, the correct answer to the given question is:
verbalizer: {answer}

MT pattern: Translate the following {src_language} text to {tgt_language}: {src_sentence}


verbalizer: {tgt_sentence}

Table 8: Prompting templates for chat models following Ahuja et al. (2023) and Ojo et al. (2023). We add language
identifiers in task instructions as it is an effective strategy for improving multilingual prompting (Huang et al., 2023).
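
To show how these templates are turned into chat requests, the sketch below builds a message list with the task instruction as the system role and the few-shot demonstrations as alternating user/assistant turns before the test input. The message layout and the openai client call are assumptions about a typical setup, not a verbatim copy of our code.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def build_messages(task_instruction, demos, test_input):
    """demos: list of (rendered_pattern, verbalized_answer) pairs."""
    messages = [{"role": "system", "content": task_instruction}]
    for pattern, answer in demos:
        messages.append({"role": "user", "content": pattern})
        messages.append({"role": "assistant", "content": answer})
    messages.append({"role": "user", "content": test_input})
    return messages

def query_chat_model(task_instruction, demos, test_input, model="gpt-3.5-turbo"):
    response = client.chat.completions.create(
        model=model,  # illustrative model name
        messages=build_messages(task_instruction, demos, test_input),
        temperature=0,  # deterministic decoding
        max_tokens=50,  # 50 new tokens for classification/QA, 100 for MT
    )
    return response.choices[0].message.content
```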
Task Template

XCOPA task instruction: You are an NLP assistant whose purpose is to perform open-domain commonsense
causal reasoning in <EVALUATION_LANGUAGE>. You will be provided a premise and two alternatives,
where the task is to select the alternative that more plausibly has a causal relation with the premise.
Answer as concisely as possible in the same format as the examples below:
pattern:
Premise: {premise}\nWhat is the {question}? Pick the more plausible option:\n
1: {choice1}\n2: {choice2}\n
This is very important: Do not repeat the question and no explanation.
You should tell me the choice number in this format ’Choice number:’
verbalizer: Choice number: 1 || Choice number: 2

AfriSenti task instruction: You are an NLP assistant whose purpose is to perform Sentiment Analysis in
<EVALUATION_LANGUAGE>. Sentiment Analysis is the task of determining the sentiment,
opinion or emotion expressed in a textual data. Give your answer as a single word, "positive", "neutral"
or "negative".
pattern: Does this statement “{tweet}” have a {positive neutral or negative} sentiment?
This is very important: Do not repeat the question and no explanation. Labels only
verbalizer: positive || neutral || negative

QA task instruction: You are an NLP assistant whose purpose is to solve reading comprehension
problems in <EVALUATION_LANGUAGE>. Answer the question from the given passage. Your answer
should be directly extracted from the passage and be a single entity, name, or number, not a sentence.
pattern:
{context}\nQ: {question}\nThis is very important: Your answer should be directly extracted from the
passage and be a single entity, name, or number, not a sentence.
verbalizer: {answer}

Table 9: Formatting-focused templates for chat models. We augmented the original templates in Table 8 with
formatting-focused instructions.
