Multilingual ICL: A Multidimensional Analysis
Miaoran Zhang1  Vagrant Gautam1  Mingyang Wang2,3,4  Jesujoba O. Alabi1
Xiaoyu Shen5  Dietrich Klakow1  Marius Mosbach6

1 Saarland University, Saarland Informatics Campus  2 Bosch Center for AI  3 LMU Munich
4 Munich Center for Machine Learning (MCML)  5 Eastern Institute of Technology, Ningbo  6 Mila, McGill University

{mzhang,vgautam,jalabi,dietrich.klakow}@lsv.uni-saarland.de
[email protected]  [email protected]  [email protected]
Abstract

In-context learning is a popular inference strategy where large language models solve a task using only a few labeled demonstrations without needing any parameter updates. Although there have been extensive studies on English in-context learning, multilingual in-context learning remains under-explored, and we lack an in-depth understanding of the role of demonstrations in this context. To address this gap, we conduct a multidimensional analysis of multilingual in-context learning, experimenting with 5 models from different model families, 9 datasets covering classification and generation tasks, and 56 typologically diverse languages. Our results reveal that the effectiveness of demonstrations varies significantly across models, tasks, and languages. We also find that strong instruction-following models including Llama 2-Chat, GPT-3.5, and GPT-4 are largely insensitive to the quality of demonstrations. Instead, a carefully crafted template often eliminates the benefits of demonstrations for some tasks and languages altogether. These findings show that the importance of demonstrations might be overestimated. Our work highlights the need for granular evaluation across multiple axes towards a better understanding of in-context learning.1

1 Introduction

An intriguing property of large language models (LLMs) is their ability to perform in-context learning (Brown et al., 2020), i.e., solve a task conditioned on a few demonstrations at inference time, without updating the model parameters. It has been shown to be an efficient alternative to fine-tuning when adapting models to diverse tasks and domains (Dong et al., 2022; Min et al., 2022b; Si et al., 2023, inter alia). In light of the success of in-context learning, there has been increased interest in better understanding the factors that influence its success, such as demonstration selection (Liu et al., 2022; Rubin et al., 2022; Wang et al., 2023c), prompt design (Min et al., 2022a; Wei et al., 2022), and, more generally, in understanding how and why in-context learning works (Xie et al., 2022; Bansal et al., 2023; Hendel et al., 2023; Pan et al., 2023; Wang et al., 2023b).

However, most recent work on in-context learning predominantly focuses on English, and the exploration of multilingual in-context learning generally lags behind. This is problematic, as results that apply to English might not hold for other languages, especially those that are less represented in LLM training data. While there have been a few studies on in-context learning that go beyond English, they either focus on benchmarking LLMs on multilingual tasks without in-depth exploration, e.g., MEGA (Ahuja et al., 2023) and BUFFET (Asai et al., 2023), or zoom in on specific capabilities such as mathematical reasoning (Shi et al., 2023b), machine translation (Zhu et al., 2023; Agrawal et al., 2023), or code-switching (Zhang et al., 2023).

In this work, we take a multidimensional approach (Ruder et al., 2022) that unifies these strands of research and comprehensively evaluate the multilingual in-context learning abilities of LLMs. We focus on dissecting the actual impact of in-context demonstrations, which is crucial for understanding model behaviour. Our research covers various models, tasks, and languages, and we seek to answer the following research questions:

1. Does multilingual performance benefit from demonstrations? (§4)
2. Does demonstration quality matter? (§5)
3. What is the interplay between demonstrations and templates? (§6)
4. How do the answers to these questions vary across languages and models? (§4, §5, §6)

∗ Corresponding author.
1 We release our code publicly at https://github.com/uds-lsv/multilingual-icl-analysis.
Figure 1: An overview of the components of multilingual in-context learning (§2) with a comparison to zero-shot learning. Sources of variation include tasks, languages, models, and the template, i.e., the task instruction, patterns for formatting inputs, and verbalized labels.
Specifically, we address our research questions by evaluating 5 LLMs, including base models that are only pre-trained on unlabeled text corpora (XGLM and Llama 2), and chat models that are further refined with instruction tuning and reinforcement learning (Llama 2-Chat, GPT-3.5, and GPT-4). We evaluate on 9 multilingual datasets that include both classification and generation tasks, covering 56 typologically different languages.

Our main findings are: (1) The effectiveness of demonstrations varies widely depending on the model, task, and language used. For base models, in-context learning barely outperforms zero-shot learning on many tasks. In general, in-context learning matters more for generation tasks with loosely-specified prompts; (2) Even with sophisticated demonstration selection methods, in-context learning is not always beneficial and can sometimes be worse than using no demonstrations at all; (3) Chat models are less sensitive to seeing correctly-labeled demonstrations than base models, suggesting that for the former, demonstrations primarily help the model understand the task format, while for the latter, demonstrations also impart task-specific knowledge; (4) Using a formatting-focused template can even eliminate the need for demonstrations with chat models. The relative significance of demonstrations versus prompt templates varies based on inherent model capabilities.

In sum, we suggest that the benefits of adding demonstrations may be overestimated. Future work on in-context learning should carefully compare their results with zero-shot learning and use multiple templates to faithfully represent its effectiveness. Given the vast variance across models, tasks, and languages, it is also important to cautiously frame claims about in-context learning.

2 Preliminaries

2.1 In-context learning

In-context learning (ICL) is a popular inference strategy where models solve2 a task without any parameter updates (Brown et al., 2020). Instead, the model performs the task by conditioning on labeled demonstrations. Demonstrations are typically formatted using "pattern-verbalizer pairs," as this has been shown to be effective in eliciting good task performance (Schick and Schütze, 2021; Bach et al., 2022). Here, a pattern is used to format the input for the model, and a verbalizer maps the label to a textual representation. Additionally, for instruction-tuned LLMs, a task instruction is often added to provide information about the task beyond individual demonstrations (Mishra et al., 2022b; Wang et al., 2022; Ouyang et al., 2022).

2 The extent to which models actually "solve" tasks is an open question, as ICL, similar to fine-tuning, has generalization issues despite its impressive results (Mosbach et al., 2023). Regardless, we use the word "solve" in the rest of this paper for simplicity.

Formally, given a test sample $x_t$, $k$ demonstrations $\{(x_i, y_i)\}_{i=1}^{k}$, a pattern $\mathcal{P}$, a verbalizer $\mathcal{V}$ and a task instruction $\mathcal{I}$, the model (parameterized by $\theta$) makes its prediction as follows:

$y_t \sim p_\theta\big(y \mid \mathcal{I}, \{(\mathcal{P}(x_i), \mathcal{V}(y_i))\}_{i=1}^{k}, \mathcal{P}(x_t)\big)$.  (1)
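To make Equation 1 concrete, the following minimal sketch assembles a prompt from a task instruction, pattern-verbalizer-formatted demonstrations, and a test input. It is an illustration in Python rather than the authors' released code; the XNLI-style pattern and verbalizer are taken from the prompt formats listed in the appendix tables, and the field names are assumptions.

```python
# Minimal sketch of prompt assembly following Equation 1 (illustrative, not the authors' code).

def pattern(example: dict) -> str:
    # P(x): format the raw input fields into a textual prompt segment (XNLI-style).
    return (f"{example['premise']} Based on the previous passage, is it true that "
            f"{example['hypothesis']}? Yes, No, or Maybe?")

def verbalizer(label: int) -> str:
    # V(y): map the label id to a textual label.
    return ["Yes", "Maybe", "No"][label]

def build_prompt(instruction: str, demonstrations: list[dict], test_example: dict) -> str:
    # I, then k formatted (P(x_i), V(y_i)) pairs, then P(x_t) for the test sample.
    parts = [instruction] if instruction else []
    for demo in demonstrations:
        parts.append(f"{pattern(demo)} {verbalizer(demo['label'])}")
    parts.append(pattern(test_example))  # the model continues from here, i.e., predicts y_t
    return "\n\n".join(parts)
```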
Taken together, the pattern, the verbalizer and the optional task instruction comprise the template with which demonstrations and the test sample are formatted as the input prompt for model inference. The effectiveness of demonstrations is thus linked with the template used to present them to the model.

2.2 Multilingual prompting

Previous studies highlight that the selection of demonstrations and prompt templates can significantly influence model performance (Liu et al., 2022; Fu et al., 2023b; Sclar et al., 2024). In multilingual in-context learning, the variation in input prompts is further complicated by the language of demonstrations, templates and test samples, all of which are important design choices.

For the template language, Lin et al. (2022) and Ahuja et al. (2023) found that English templates generally perform better than native-language templates, possibly due to the superior instruction-following abilities of existing LLMs in English compared to other languages. Following this, we use English templates in our study.

For the language of few-shot demonstrations and test samples, there are three popular settings. Given a test sample in a certain language, the most straightforward approach is to use demonstrations in the same language (referred to as in-language demonstrations). This setting directly measures the model's inherent ability to solve problems in that language. Another choice is to use English demonstrations regardless of the language of the test sample. This is a cross-lingual transfer setup, where the goal is to transfer knowledge from a pivot language to a target language via in-context learning. As highlighted in Shi et al. (2023b) and Ahuja et al. (2023), in-language demonstrations often outperform English demonstrations on diverse multilingual tasks. Yet another option is to translate the test sample into English, an approach called translate-test, where the demonstrations are also in English. While translate-test leads to strong performance (Ahuja et al., 2023), this approach heavily relies on a translation system for data processing and centers the English proficiency of LLMs. In this work, we are interested in dissecting the intrinsic multilingual capabilities of LLMs, therefore we choose to use in-language demonstrations.

All these design choices are represented visually in Figure 1, which gives an overview of multilingual in-context learning. Detailed setup information is provided in the next section.

3 Experimental setup

Models. We evaluate two types of LLMs: pre-trained base models and chat models. Our base models include XGLM (Lin et al., 2022) and Llama 2 (Touvron et al., 2023). Our chat models are Llama 2-Chat, GPT-3.5 (Ouyang et al., 2022) and GPT-4 (OpenAI et al., 2023). Specifically, we use xglm-7.5B, Llama-2-13b, and Llama-2-13b-chat on Huggingface (Wolf et al., 2020), and we access the gpt-3.5-turbo-16k and gpt-4-32k APIs via Microsoft Azure.3

3 We also experiment with BLOOMZ and mT0 (Muennighoff et al., 2023). Results in Appendix B.1 show that their zero-shot performance significantly surpasses few-shot performance, which we ascribe to their training scheme.

Tasks and datasets. We experiment on a diverse range of multilingual classification and generation tasks, using 9 datasets covering 56 languages in total. Our dataset selection largely follows MEGA (Ahuja et al., 2023), but we add datasets for extremely under-represented African languages. Our classification tasks include natural language inference (NLI), paraphrase identification, commonsense reasoning and sentiment analysis, with the following datasets: XNLI (Conneau et al., 2018), IndicXNLI (Aggarwal et al., 2022), PAWS-X (Yang et al., 2019), XCOPA (Ponti et al., 2020), XStoryCloze (Lin et al., 2022) and AfriSenti (Muhammad et al., 2023). Our generation tasks are extractive question answering (QA) and machine translation (MT), for which we use XQuAD (Artetxe et al., 2020), TyDiQA-GoldP (Clark et al., 2020), and MAFAND (Adelani et al., 2022). See Appendix A.1 for more details.

In-context learning. For each test sample, we select k ∈ {0, 2, 4, 8}4 different demonstrations, which are randomly sampled unless otherwise specified. All demonstrations are in the same language as the test sample, and all templates are in English. We employ appropriate task-specific templates for different model types. All templates and data splits are shown in Appendix A.2.

4 For QA datasets, we select a maximum of 4 demonstrations due to context size limitations.

Metrics. For classification tasks, we report the rank classification accuracy5 for open-source base models (Muennighoff et al., 2023; Lin et al., 2022).

5 The scoring function is the average of per-token log probabilities (ignoring the common prefix of different candidates). The candidate with the highest score is chosen as the prediction.
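Footnote 5 describes rank classification for the open-source base models: each candidate label is scored by its average per-token log probability, and the best-scoring candidate is chosen. The sketch below is a simplified illustration of that idea (it omits the shared-prefix handling and tokenizer edge cases mentioned in the footnote) and is not the authors' implementation.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Simplified rank classification: score each candidate continuation by its average
# per-token log probability given the prompt, then pick the highest-scoring candidate.
# Note: splitting prompt vs. candidate tokens by length is an approximation, since
# tokenizers may merge tokens across the boundary.
def rank_classify(model, tokenizer, prompt: str, candidates: list[str]) -> str:
    scores = []
    for cand in candidates:
        prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids
        full_ids = tokenizer(prompt + cand, return_tensors="pt").input_ids
        with torch.no_grad():
            logits = model(full_ids).logits
        log_probs = torch.log_softmax(logits[:, :-1], dim=-1)
        # Positions whose next-token predictions correspond to candidate tokens.
        cand_positions = range(prompt_ids.shape[1] - 1, full_ids.shape[1] - 1)
        token_lps = [log_probs[0, pos, full_ids[0, pos + 1]] for pos in cand_positions]
        scores.append(sum(token_lps) / len(token_lps))
    return candidates[scores.index(max(scores))]
```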
Figure 2: Average performance across languages with different numbers of demonstrations. We average and report standard deviations over 3 seeds for all models except GPT-4. Note that the standard deviations are relatively small, possibly because of averaging over languages. en-xx: translating from English to another language, xx-en: translating from another language to English.
For chat models, we extract the predicted labels from the generated outputs and compute exact-match accuracy.6 For the generation tasks, we use F1 for the QA datasets and the ChrF++ score (Popović, 2017) for MAFAND. Implementation details for our evaluation are provided in Appendix A.3.

6 We extract verbalized labels from the generated outputs using regular expressions before calculating the exact match.

Figure 3: Performance difference between 4-shot and 0-shot. Each marker represents the average performance across models for each language in a given task. MT denotes the MAFAND dataset.

4 Do (more) demonstrations benefit multilingual performance?

In this section, we systematically compare ICL and zero-shot learning as this question is under-explored in previous studies of multilingual ICL (Ahuja et al., 2023; Asai et al., 2023). We examine model performance on diverse multilingual tasks while varying the number of demonstrations, and show the results for classification tasks and generation tasks in Figure 2.

We begin with the overall trends across models and datasets. OpenAI's GPT-3.5 and GPT-4 models achieve the best multilingual in-context learning performance on all our datasets, which is unsurprising as they are currently the state of the art on a large suite of NLP benchmarks. The next best models are Llama 2 and Llama 2-Chat, which demonstrate comparable or superior performance to the multilingual XGLM model despite being trained primarily on English corpora (Touvron et al., 2023). This indicates that their task-solving abilities can transfer across languages. Regardless of the model, however, performance on the AfriSenti and MAFAND datasets, particularly when translating English to African languages, lags significantly behind other tasks, showing that language discrepancies remain even in the best models.

An important pattern across datasets and models is that in-context learning does not always improve over zero-shot learning: in particular, it helps with generation tasks, but results on classification tasks are mixed. For the AfriSenti dataset, many models show noticeable improvements with ICL. However, with other tasks such as IndicXNLI, XNLI and PAWS-X, the same models, especially base models, perform much worse compared to the zero-shot setting. We also see marginal improvements in some cases, e.g., XGLM and Llama 2 on XCOPA. Compared to chat models, the addition of demonstrations more often reduces the performance of base models across many tasks.
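As a hypothetical illustration of the label-extraction step in footnote 6, the snippet below pulls the first verbalized label out of a chat model's generation with a regular expression before computing exact match; the label set shown is the AfriSenti one, and the exact regular expressions used in the paper are not reproduced here.

```python
import re

LABELS = ["positive", "neutral", "negative"]  # example label space (AfriSenti-style)

def extract_label(generation: str):
    # Return the first verbalized label mentioned in the model output, ignoring case.
    match = re.search(r"\b(" + "|".join(LABELS) + r")\b", generation, flags=re.IGNORECASE)
    return match.group(1).lower() if match else None

def exact_match_accuracy(generations: list[str], gold_labels: list[str]) -> float:
    # Exact match between the extracted label and the gold label.
    hits = [extract_label(g) == y for g, y in zip(generations, gold_labels)]
    return sum(hits) / len(hits)
```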
Model XNLI IndicXNLI PAWS-X XCOPA XStoryCloze AfriSenti XQuAD TyDiQA MT (en-xx) MT (xx-en)
XGLM 4.59 2.49 0.24▽ 0.03 0.97▽ 5.62 1.77 4.21 1.31 0.66
Llama 2 6.61 4.17 2.35 −0.11 0.33 4.17 1.32 0.54 2.15 1.35
Llama 2-Chat −0.28 −1.36 −1.71▽ 0.32 0.43 2.17 1.02 2.42 0.74 0.66
GPT-3.5 0.18 0.71 −2.07 0.86 −0.61 −0.66▽ −0.34 2.98 0.72 0.43
GPT-4 0.76 −0.19 0.07 −0.36 0.05 −0.68 −0.77 1.88 1.21 0.65
Table 1: Performance difference of 4-shot ICL with TOP-K vs. RANDOM selection. Positive numbers show that TOP-K is better than RANDOM (expected), and highlighted cells show where TOP-K is even worse than RANDOM selection. ▽: TOP-K performance is even worse than zero-shot learning. For RANDOM, we average over 3 seeds (except for GPT-4).
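Table 1 contrasts randomly sampled demonstrations with TOP-K selection. The excerpt above does not specify the retriever, so the sketch below only illustrates the general recipe: embed the candidate pool and pick the k examples most similar to the test input. The sentence-embedding model named here is an assumption, not necessarily what the paper used.

```python
import numpy as np
from sentence_transformers import SentenceTransformer

# Illustrative top-k demonstration retrieval via cosine similarity over sentence embeddings.
# The embedding model below is an assumption for this sketch.
encoder = SentenceTransformer("sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2")

def select_top_k(test_input: str, candidate_inputs: list[str], k: int = 4) -> list[int]:
    embs = encoder.encode(candidate_inputs + [test_input], normalize_embeddings=True)
    sims = embs[:-1] @ embs[-1]           # cosine similarity of each candidate to the test input
    return list(np.argsort(-sims)[:k])    # indices of the k most similar candidates
```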
Model XNLI IndicXNLI PAWS-X XCOPA XStoryCloze AfriSenti XQuAD TyDiQA MT (en-xx) MT (xx-en)
XGLM 0.46 −0.05 0.44 0.51 0.62∗ 3.78∗ 24.56∗ 26.64∗ 3.18∗ 6.73∗
Llama 2 0.96∗ 0.43 1.16 0.61∗ 1.12∗ 2.27∗ 26.68∗ 29.20∗ 4.79∗ 8.34∗
Llama 2-Chat −0.34 0.04 1.48 0.03 −0.23 0.77∗ 5.94∗ 4.37∗ 1.13∗ 1.53∗
GPT-3.5 0.39 1.02 0.64 0.26 0.58∗ −0.62 5.46∗ 5.61∗ 1.39∗ 0.48∗
GPT-4 −0.86 −0.04 0.57 0.86 1.13 0.90 9.60 6.97 1.24 0.64
Table 2: Performance difference of 4-shot ICL with RANDOM vs. RANDOM-CORRUPTED demonstrations. Positive numbers show that RANDOM is better than RANDOM-CORRUPTED (expected), and highlighted cells show where corrupted labels perform even better than ground-truth labels. We average over 3 seeds (except for GPT-4). ∗: a significant difference (p = 0.05).
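The RANDOM-CORRUPTED condition in Table 2 pairs demonstration inputs with incorrect labels. A minimal sketch of such a corruption step is shown below; whether the paper samples the wrong label uniformly from the remaining classes, as done here, is an assumption.

```python
import random

def corrupt_labels(demonstrations: list[dict], label_space: list[str], seed: int = 0) -> list[dict]:
    # Replace each demonstration's gold label with a different label drawn at random.
    rng = random.Random(seed)
    corrupted = []
    for demo in demonstrations:
        wrong = rng.choice([label for label in label_space if label != demo["label"]])
        corrupted.append({**demo, "label": wrong})
    return corrupted
```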
Figure 5: Performance of 4-shot ICL using different types of demonstrations for individual languages on AfriSenti and XQuAD. The top row shows Llama 2 results, and the bottom row shows GPT-3.5 results.

In-context learning performance depends not only on the demonstrations, which we have varied so far, but also on the template; we next examine the interplay between templates and demonstrations.
We focus on the classification and generation tasks that seem to benefit most from in-context demonstrations (see Section 4). However, as Figure 6 shows, the performance gap between zero-shot and in-context learning diminishes with formatting-focused templates. The gap reduction is more substantial for QA datasets (i.e., the generation tasks) than for XCOPA and AfriSenti (i.e., the classification tasks). We speculate that it is simpler for the model to generate label words for classification tasks with a pre-defined label space than to answer questions in a way that is easy to evaluate automatically. In the latter case, formatting-focused templates can teach output styling, largely eliminating the benefits of demonstrations.

Compared to GPT-3.5 and GPT-4, Llama 2-Chat performs worse in both zero-shot and few-shot settings, and formatting-focused templates have a less pronounced impact. On QA datasets, GPT-3.5 and GPT-4 even achieve better zero-shot performance with formatting-focused templates than ICL with original templates, a pattern that is not observed with Llama 2-Chat. This suggests that the relative significance of demonstrations and templates varies based on the inherent abilities of models at solving tasks and following instructions.

GPT-4  Original   73.2  79.3  72.8  78.3
       Corrupted  63.6  79.8  65.8  77.6
       ∆           9.6  -0.5   7.0   0.7

Table 3: Effect of using different templates on 4-shot performance with RANDOM and RANDOM-CORRUPTED demonstrations. When using formatting-focused templates (F) over the original templates (O), the performance gap (∆) between original and corrupted labels decreases. We average and report standard deviations over 3 seeds for all models except GPT-4.

With our new formatting-focused templates, we revisit the impact of the input-label mapping discussed in Section 5. As Table 3 shows, all models perform worse with corrupted labels, but formatting-focused templates largely mitigate this degradation. Notably, GPT-4 using corrupted labels performs on par with ground-truth labels. This strengthens our finding that the correct input-label mapping is not that important, while also highlighting the crucial role that templates play in in-context learning.

Figure 7 shows the language-specific effects of formatting-focused templates on XQuAD (results for other tasks are in Appendix D.1). For Llama 2-Chat, demonstrations remain essential even with formatting-focused templates.
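Tables 8 and 9 in the appendix show that the formatting-focused templates are the original chat templates augmented with explicit output-format instructions (e.g., "Do not repeat the question and no explanation"). The snippet below sketches that augmentation for the AfriSenti pattern; it illustrates the idea and is not the paper's prompt-construction code.

```python
# Augment an original pattern with a formatting-focused instruction (cf. Tables 8 and 9).
FORMAT_HINT = "This is very important: Do not repeat the question and no explanation."

def make_formatting_focused(pattern: str, label_hint: str = "Labels only") -> str:
    # Example original AfriSenti pattern:
    #   'Does this statement "{tweet}" have a {positive neutral or negative} sentiment? Labels only'
    # The formatting-focused version inserts the explicit output-format constraint before the label hint.
    if pattern.endswith(label_hint):
        return pattern[: -len(label_hint)].rstrip() + "\n" + FORMAT_HINT + " " + label_hint
    return pattern + "\n" + FORMAT_HINT
```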
the underlying mechanisms of ICL (Xie et al., 2022; Von Oswald et al., 2023; Wang et al., 2023b; Hendel et al., 2023), motivated by its successes. Our

Optimizing demonstrations or templates. With the increasing popularity of research on demonstra-
Sebastian Ruder, Ivan Vulić, and Anders Søgaard. 2022. Square one bias in NLP: Towards a multi-dimensional exploration of the research manifold. In Findings of the Association for Computational Linguistics: ACL 2022, pages 2340–2354, Dublin, Ireland. Association for Computational Linguistics.

Timo Schick and Hinrich Schütze. 2021. It's not just size that matters: Small language models are also few-shot learners. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 2339–2352, Online. Association for Computational Linguistics.

Melanie Sclar, Yejin Choi, Yulia Tsvetkov, and Alane Suhr. 2024. Quantifying language models' sensitivity to spurious features in prompt design or: How I learned to start worrying about prompt formatting. In The Twelfth International Conference on Learning Representations.

Freda Shi, Xinyun Chen, Kanishka Misra, Nathan Scales, David Dohan, Ed H Chi, Nathanael Schärli, and Denny Zhou. 2023a. Large language models can be easily distracted by irrelevant context. In International Conference on Machine Learning, pages 31210–31227. PMLR.

Freda Shi, Mirac Suzgun, Markus Freitag, Xuezhi Wang, Suraj Srivats, Soroush Vosoughi, Hyung Won Chung, Yi Tay, Sebastian Ruder, Denny Zhou, Dipanjan Das, and Jason Wei. 2023b. Language models are multilingual chain-of-thought reasoners. In The Eleventh International Conference on Learning Representations.

Peng Shi, Rui Zhang, He Bai, and Jimmy Lin. 2022. XRICL: Cross-lingual retrieval-augmented in-context learning for cross-lingual text-to-SQL semantic parsing. In Findings of the Association for Computational Linguistics: EMNLP 2022, pages 5248–5259, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics.

Johannes Von Oswald, Eyvind Niklasson, Ettore Randazzo, Joao Sacramento, Alexander Mordvintsev, Andrey Zhmoginov, and Max Vladymyrov. 2023. Transformers learn in-context by gradient descent. In Proceedings of the 40th International Conference on Machine Learning, volume 202 of Proceedings of Machine Learning Research, pages 35151–35174. PMLR.

Xingchen Wan, Ruoxi Sun, Hootan Nakhost, Hanjun Dai, Julian Eisenschlos, Sercan Arik, and Tomas Pfister. 2023. Universal self-adaptive prompting. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 7437–7462, Singapore. Association for Computational Linguistics.

Boshi Wang, Sewon Min, Xiang Deng, Jiaming Shen, You Wu, Luke Zettlemoyer, and Huan Sun. 2023a. Towards understanding chain-of-thought prompting: An empirical study of what matters. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2717–2739, Toronto, Canada. Association for Computational Linguistics.

Lean Wang, Lei Li, Damai Dai, Deli Chen, Hao Zhou, Fandong Meng, Jie Zhou, and Xu Sun. 2023b. Label words are anchors: An information flow perspective for understanding in-context learning. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 9840–9855, Singapore. Association for Computational Linguistics.

Xinyi Wang, Wanrong Zhu, Michael Saxon, Mark Steyvers, and William Yang Wang. 2023c. Large language models are latent variable models: Explaining and finding good demonstrations for in-context learning. In Thirty-seventh Conference on Neural Information Processing Systems.

Yizhong Wang, Swaroop Mishra, Pegah Alipoormolabashi, Yeganeh Kordi, Amirreza Mirzaei, Atharva Naik, Arjun Ashok, Arut Selvan Dhanasekaran, Anjana Arunkumar, David Stap, Eshaan Pathak, Giannis Karamanolakis, Haizhi Lai, Ishan Purohit, Ishani Mondal, Jacob Anderson, Kirby Kuznia, Krima Doshi, Kuntal Kumar Pal, Maitreya Patel, Mehrad Moradshahi, Mihir Parmar, Mirali Purohit, Neeraj Varshney, Phani Rohitha Kaza, Pulkit Verma, Ravsehaj Singh Puri, Rushang Karia, Savan Doshi, Shailaja Keyur Sampat, Siddhartha Mishra, Sujan Reddy A, Sumanta Patro, Tanay Dixit, and Xudong Shen. 2022. Super-NaturalInstructions: Generalization via declarative instructions on 1600+ NLP tasks. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 5085–5109, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics.

Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed Chi, Quoc V Le, and Denny Zhou. 2022. Chain-of-thought prompting elicits reasoning in large language models. In Advances in Neural Information Processing Systems, volume 35, pages 24824–24837. Curran Associates, Inc.

Jerry Wei, Jason Wei, Yi Tay, Dustin Tran, Albert Webson, Yifeng Lu, Xinyun Chen, Hanxiao Liu, Da Huang, Denny Zhou, et al. 2023. Larger language models do in-context learning differently. arXiv.

Jules White, Quchen Fu, Sam Hays, Michael Sandborn, Carlos Olea, Henry Gilbert, Ashraf Elnashar, Jesse Spencer-Smith, and Douglas C. Schmidt. 2023. A prompt pattern catalog to enhance prompt engineering with ChatGPT. arXiv.

Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Remi Louf, Morgan Funtowicz, Joe Davison, Sam Shleifer, Patrick von Platen, Clara Ma, Yacine Jernite, Julien Plu, Canwen Xu, Teven Le Scao, Sylvain Gugger, Mariama Drame, Quentin Lhoest, and Alexander Rush. 2020. Transformers: State-of-the-art natural language processing. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 38–45, Online. Association for Computational Linguistics.

Zhenyu Wu, Yaoxiang Wang, Jiacheng Ye, Zhiyong Wu, Jiangtao Feng, Jingjing Xu, and Yu Qiao. 2023. OpenICL: An open-source framework for in-context learning. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 3: System Demonstrations), pages 489–498, Toronto, Canada. Association for Computational Linguistics.

Sang Michael Xie, Aditi Raghunathan, Percy Liang, and Tengyu Ma. 2022. An explanation of in-context learning as implicit Bayesian inference. In International Conference on Learning Representations.

Yinfei Yang, Yuan Zhang, Chris Tar, and Jason Baldridge. 2019. PAWS-X: A cross-lingual adversarial dataset for paraphrase identification. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 3687–3692, Hong Kong, China. Association for Computational Linguistics.

Kang Min Yoo, Junyeob Kim, Hyuhng Joon Kim, Hyunsoo Cho, Hwiyeol Jo, Sang-Woo Lee, Sang-goo Lee, and Taeuk Kim. 2022. Ground-truth labels matter: A deeper look into input-label demonstrations. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 2422–2437, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics.

Ruochen Zhang, Samuel Cahyawijaya, Jan Christian Blaise Cruz, Genta Winata, and Alham Aji. 2023. Multilingual large language models are not (yet) code-switchers. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 12567–12582, Singapore. Association for Computational Linguistics.

Zihao Zhao, Eric Wallace, Shi Feng, Dan Klein, and Sameer Singh. 2021. Calibrate before use: Improving few-shot performance of language models. In International Conference on Machine Learning, pages 12697–12706. PMLR.

Wenhao Zhu, Hongyi Liu, Qingxiu Dong, Jingjing Xu, Shujian Huang, Lingpeng Kong, Jiajun Chen, and Lei Li. 2023. Multilingual machine translation with large language models: Empirical results and analysis. arXiv.
A Experimental details
A.1 Tasks and datasets
We conduct experiments on 9 multilingual datasets with a wide coverage of tasks and languages, as shown
in Table 4. All datasets are public research datasets and our experiments are consistent with their intended
use, i.e., NLP evaluation. For the machine translation dataset MAFAND, English serves as the pivot
language and there are two translation directions: en-xx (i.e., translating from English to another language)
and xx-en (i.e., translating from another language to English). As the (black-box) training data of the OpenAI APIs we used extends up to September 2021, we include each dataset's release date in the table, which can serve as a clue to the severity of dataset contamination.
Dataset | Task | Languages | # Languages | Release date
XNLI | natural language inference | English, German, Russian, French, Spanish, Chinese, Vietnamese, Turkish, Arabic, Greek, Thai, Bulgarian, Hindi, Urdu, Swahili | 15 | 2019.09
IndicXNLI | natural language inference | Hindi, Bengali, Tamil, Marathi, Malayalam, Telugu, Kannada, Punjabi, Oriya, Assamese, Gujarati | 11 | 2022.04
PAWS-X | paraphrase identification | English, German, Japanese, French, Spanish, Chinese, Korean | 7 | 2019.08
XCOPA | commonsense reasoning | Chinese, Italian, Vietnamese, Indonesian, Turkish, Thai, Estonian, Tamil, Swahili, Haitian, Quechua | 11 | 2020.04
XStoryCloze | commonsense reasoning | English, Russian, Spanish, Chinese, Indonesian, Arabic, Hindi, Basque, Telugu, Burmese, Swahili | 11 | 2023.05
AfriSenti | sentiment analysis | Swahili, Amharic, Hausa, Kinyarwanda, Yoruba, Tigrinya, Igbo, Oromo, Moroccan Arabic, Algerian Arabic, Nigerian Pidgin, Mozambican Portuguese, Tsonga, Twi | 14 | 2023.05
XQuAD | extractive QA | English, German, Russian, Spanish, Chinese, Vietnamese, Turkish, Arabic, Greek, Romanian, Thai, Hindi | 12 | 2019.10
TyDiQA-GoldP | extractive QA | English, Russian, Indonesian, Korean, Arabic, Finnish, Bengali, Telugu, Swahili | 9 | 2020.02
MAFAND | machine translation | Amharic, Hausa, Kinyarwanda, Luganda, Luo, Chichewa, Nigerian Pidgin, Shona, Swahili, Setswana, Twi, Xhosa, Yoruba, Zulu | 14 | 2022.06
A.3 Implementation

Our codebase is adapted from OpenICL (Wu et al., 2023). We use 8-bit (int8) model quantization11 for all models except OpenAI models. Experiments are conducted using a single NVIDIA A100-80GB GPU. As models have a maximum context length, we keep only as many complete demonstrations as fit within the context window. We employ greedy decoding for model generation. For chat models, the maximum number of new tokens is set to 50, while for machine translation, it is set to 100. For other models, the maximum number of new tokens is set to 20, while for machine translation, it is set to 50. We use three seeds (0, 33, 42) in our experiments, and the single-seed results for BLOOMZ and mT0 are obtained with the seed 0.

11 In our preliminary experiments, we found that int8 quantization led to a performance degradation of 1-2% on a few classification datasets with Llama 2 and XGLM. Since this degradation is consistent across different setups, we believe that it would not affect our overall findings.
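A minimal sketch of the decoding setup described in A.3 (8-bit loading, greedy decoding, task-dependent maximum new tokens) is given below. The model name and token limits follow the text above, while the remaining details (padding, batching, prompt truncation) are simplified assumptions.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-2-13b-chat-hf"  # one of the open chat models used in the paper
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, load_in_8bit=True, device_map="auto")

def generate(prompt: str, is_mt: bool = False) -> str:
    # Greedy decoding; chat models use 50 new tokens (100 for machine translation), per A.3.
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    output = model.generate(**inputs, do_sample=False, max_new_tokens=100 if is_mt else 50)
    # Strip the prompt tokens and return only the newly generated continuation.
    return tokenizer.decode(output[0, inputs.input_ids.shape[1]:], skip_special_tokens=True)
```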
Figure 8: Average performance across languages for BLOOMZ and mT0 with different numbers of demonstrations. The results are obtained with a single random seed. Note that PAWS-X, XQuAD and TyDiQA are included in the instruction-tuning datasets of BLOOMZ and mT0.
Table 5: Performance of different types of demonstrations. For RANDOM and RANDOM-CORRUPTED, we report the mean and standard deviation across 3 seeds except for GPT-4. Best results for each model and dataset are boldfaced.

Overall, the base models are more sensitive to the type of demonstrations than chat models.
Table 6: Prompting templates for XGLM and Llama 2 following Brown et al. (2020) and Lin et al. (2022).
NLI
pattern: {premise} Based on the previous passage, is it true that {hypothesis}? Yes, No, or Maybe? {label}
verbalizer: Yes || Maybe || No

PAWS-X
pattern: Sentence 1: {sentence1}\nSentence 2: {sentence2}\nQuestion: Can we rewrite Sentence 1 to Sentence 2? Yes or No? {label}
verbalizer: No || Yes

XCOPA
pattern: {premise} {% if question == "cause" %}This happened because...{% else %} As a consequence...{% endif %}\nHelp me pick the more plausible option:\n- {choice1}\n- {choice2}\n{label}
verbalizer: {choice1} || {choice2}

XStoryCloze
pattern: {input_sentence_1} {input_sentence_2} {input_sentence_3} {input_sentence_4}\nWhat is a possible continuation for the story given the following options?\n- {sentence_quiz_1}\n- {sentence_quiz_2}\n{label}
verbalizer: {sentence_quiz_1} || {sentence_quiz_2}

AfriSenti
pattern: {tweet} Would you rate the previous sentence as positive, neutral or negative? {label}
verbalizer: positive || neutral || negative

QA
pattern: {context}\nQ:{question}\nReferring to the passage above, the correct answer to the given question is:{answer}
verbalizer: {answer}

MT
pattern: Translate the following {src_language} text to {tgt_language}:\n{src_sentence}\n{tgt_sentence}
verbalizer: {tgt_sentence}
Table 7: Prompting templates for BLOOMZ and mT0 following Muennighoff et al. (2023) and Bach et al. (2022).
Figure 9: Language-specific performance for both classification and generation tasks with different numbers of demonstrations. We average and report standard deviations over 3 seeds for all models except GPT-4.
Figure 10: Language-specific performance of 4-shot ICL using different types of demonstrations. We average and report standard deviations over 3 seeds for all models except GPT-4.
Figure 11: Effect of using different templates on 0-shot and 4-shot performance for XCOPA, AfriSenti, and TyDiQA. Few-shot results are averaged across 3 seeds except for GPT-4.
Task Template
NLI task instruction: You are an NLP assistant whose purpose is to solve Natural Language Inference
(NLI) problems in <EVALUATION_LANGUAGE>. NLI is the task of determining the inference relation
between two (short, ordered) texts: entailment, contradiction, or neutral. Answer as concisely as
possible in the same format as the examples below:
pattern: {premise}\nQuestion: {hypothesis}\nTrue, False, or Neither?
verbalizer: True || Neither || False
PAWS-X task instruction: You are an NLP assistant whose purpose is to perform Paraphrase Identification in
<EVALUATION_LANGUAGE>. The goal of Paraphrase Identification is to determine whether a pair
of sentences have the same meaning. Answer as concisely as possible in the same format as the
examples below:
pattern: {sentence1}\nQuestion: {sentence2}\nTrue or False?
verbalizer: False || True
XCOPA task instruction: You are an NLP assistant whose purpose is to perform open-domain commonsense
causal reasoning in <EVALUATION_LANGUAGE>. You will be provided a premise and two alternatives,
where the task is to select the alternative that more plausibly has a causal relation with the premise.
Answer as concisely as possible in the same format as the examples below:
pattern:
Premise: {premise}\nWhat is the {question}? Pick the more plausible option:\n
1: {choice1}\n2: {choice2}\n
You should tell me the choice number in this format ’Choice number:’
verbalizer: Choice number: 1 || Choice number: 2
XStoryCloze task instruction: You are an NLP assistant whose purpose is to perform open-domain commonsense
causal reasoning in <EVALUATION_LANGUAGE>. You will be provided a four-sentence story and two
continuations, where the task is to select the correct ending. Answer as concisely as possible in the same
format as the examples below:
pattern:
Story: {input_sentence_1} {input_sentence_2} {input_sentence_3} {input_sentence_4}\n
What is a possible continuation for the story? Pick the more plausible option:\n
1: {sentence_quiz1}\n2: {sentence_quiz2}\n
You should tell me the choice number in this format ’Choice number:’
verbalizer: Choice number: 1 || Choice number: 2
AfriSenti task instruction: You are an NLP assistant whose purpose is to perform Sentiment Analysis in
<EVALUATION_LANGUAGE>. Sentiment Analysis is the task of determining the sentiment,
opinion or emotion expressed in a textual data. Give your answer as a single word, "positive", "neutral"
or "negative".
pattern: Does this statement “{tweet}” have a {positive neutral or negative} sentiment? Labels only
verbalizer: positive || neutral || negative
QA task instruction: You are an NLP assistant whose purpose is to solve reading comprehension
problems in <EVALUATION_LANGUAGE>. You will be provided questions on a set of passages and
you will need to provide the answer as it appears in the passage. The answer should be in the same
language as the question and the passage.
pattern:
{context}\nQ: {question}\nReferring to the passage above, the correct answer to the given question is:
verbalizer: {answer}
Table 8: Prompting templates for chat models following Ahuja et al. (2023) and Ojo et al. (2023). We add language
identifiers in task instructions as it is an effective strategy for improving multilingual prompting (Huang et al., 2023).
Task Template
XCOPA task instruction: You are an NLP assistant whose purpose is to perform open-domain commonsense
causal reasoning in <EVALUATION_LANGUAGE>. You will be provided a premise and two alternatives,
where the task is to select the alternative that more plausibly has a causal relation with the premise.
Answer as concisely as possible in the same format as the examples below:
pattern:
Premise: {premise}\nWhat is the {question}? Pick the more plausible option:\n
1: {choice1}\n2: {choice2}\n
This is very important: Do not repeat the question and no explanation.
You should tell me the choice number in this format ’Choice number:’
verbalizer: Choice number: 1 || Choice number: 2
AfriSenti task instruction: You are an NLP assistant whose purpose is to perform Sentiment Analysis in
<EVALUATION_LANGUAGE>. Sentiment Analysis is the task of determining the sentiment,
opinion or emotion expressed in a textual data. Give your answer as a single word, "positive", "neutral"
or "negative".
pattern: Does this statement “{tweet}” have a {positive neutral or negative} sentiment?
This is very important: Do not repeat the question and no explanation. Labels only
verbalizer: positive || neutral || negative
QA task instruction: You are an NLP assistant whose purpose is to solve reading comprehension
problems in <EVALUATION_LANGUAGE>. Answer the question from the given passage. Your answer
should be directly extracted from the passage and be a single entity, name, or number, not a sentence.
pattern:
{context}\nQ: {question}\nThis is very important: Your answer should be directly extracted from the
passage and be a single entity, name, or number, not a sentence.
verbalizer: {answer}
Table 9: Formatting-focused templates for chat models. We augmented the original templates in Table 8 with
formatting-focused instructions.