Multilingual Machine Translation With Large Language Models
Abstract
Large language models (LLMs) have demonstrated remarkable potential in handling multilingual machine translation (MMT). In this paper, we systematically investigate the advantages and challenges of LLMs for MMT.
Language Family  Direction  XGLM-7.5B  OPT-175B  Falcon-7B  LLaMA2-7B  LLaMA2-7B-Chat  ChatGPT  GPT-4  M2M-12B  NLLB-1.3B  Google
Indo-Euro-Germanic (8)  X⇒Eng  18.54 / 70.09  34.65 / 83.71  27.37 / 67.40  37.28 / 84.73  34.82 / 84.25  45.83 / 89.05  48.51 / 89.48  42.72 / 87.74  46.54 / 88.18  51.16 / 89.36
Indo-Euro-Germanic (8)  Eng⇒X  9.16 / 50.21  18.89 / 71.97  13.19 / 52.93  22.78 / 76.05  19.44 / 73.63  36.34 / 87.83  40.64 / 88.50  37.30 / 86.47  38.47 / 87.31  45.27 / 89.05
Indo-Euro-Romance (8)  X⇒Eng  31.11 / 79.67  38.93 / 87.75  34.06 / 84.40  41.10 / 88.10  37.84 / 87.80  45.68 / 89.61  47.29 / 89.74  42.33 / 88.31  46.33 / 88.99  35.69 / 89.66
Indo-Euro-Romance (8)  Eng⇒X  21.95 / 69.08  24.30 / 79.07  20.02 / 70.36  27.81 / 82.05  25.50 / 79.67  41.35 / 89.00  44.47 / 88.94  42.98 / 87.56  43.48 / 88.12  37.10 / 88.77
Indo-Euro-Slavic (12)  X⇒Eng  13.20 / 64.24  20.83 / 74.80  13.15 / 57.34  34.00 / 84.90  30.94 / 83.90  39.27 / 87.74  41.19 / 88.15  35.87 / 85.97  39.23 / 87.08  43.61 / 88.18
Indo-Euro-Slavic (12)  Eng⇒X  6.40 / 43.28  8.18 / 54.45  4.34 / 35.73  20.24 / 76.30  16.14 / 69.75  32.61 / 87.90  36.06 / 89.15  35.01 / 86.43  36.56 / 88.74  42.75 / 90.05
Indo-Euro-Indo-Aryan (10)  X⇒Eng  8.68 / 63.93  1.20 / 49.37  1.40 / 45.22  6.68 / 62.63  4.29 / 60.29  25.32 / 84.14  37.30 / 87.79  17.53 / 69.66  40.75 / 88.80  45.66 / 89.43
Indo-Euro-Indo-Aryan (10)  Eng⇒X  4.76 / 40.99  0.14 / 31.85  0.13 / 25.84  1.61 / 35.92  1.24 / 34.74  16.50 / 68.43  21.35 / 73.75  14.44 / 65.32  34.04 / 82.55  39.04 / 82.78
Indo-Euro-Other (11)  X⇒Eng  7.32 / 55.29  7.80 / 59.60  7.04 / 51.59  14.27 / 69.87  11.46 / 67.64  29.54 / 84.52  37.29 / 86.76  22.38 / 77.47  36.16 / 86.81  41.68 / 88.29
Indo-Euro-Other (11)  Eng⇒X  4.51 / 40.60  3.10 / 40.04  3.38 / 34.64  5.00 / 44.09  4.83 / 43.73  22.81 / 77.33  28.45 / 80.94  19.71 / 74.90  31.65 / 85.82  38.54 / 87.44
Austronesian (6)  X⇒Eng  16.19 / 78.80  25.60 / 78.03  18.62 / 75.36  26.70 / 80.21  24.39 / 80.39  39.95 / 87.29  46.81 / 88.65  31.84 / 84.76  45.41 / 87.85  50.68 / 88.89
Austronesian (6)  Eng⇒X  10.01 / 73.14  10.68 / 64.97  8.56 / 60.89  14.59 / 74.80  13.29 / 74.88  30.17 / 86.36  34.66 / 87.68  27.03 / 86.83  37.17 / 88.82  40.74 / 89.34
Atlantic-Congo (14)  X⇒Eng  6.67 / 62.00  9.17 / 57.59  6.98 / 0.56  8.76 / 57.72  9.01 / 57.86  19.86 / 79.63  28.27 / 83.42  10.55 / 76.43  32.20 / 84.00  23.55 / 85.44
Atlantic-Congo (14)  Eng⇒X  2.52 / 54.93  1.60 / 34.15  1.89 / 0.34  2.45 / 34.17  3.09 / 38.13  8.91 / 75.26  13.70 / 77.79  6.53 / 75.79  21.99 / 79.95  16.77 / 80.89
Afro-Asiatic (6)  X⇒Eng  6.70 / 54.51  5.93 / 52.90  4.87 / 38.62  10.41 / 57.72  8.65 / 58.27  20.84 / 70.39  30.48 / 78.76  10.00 / 66.98  32.69 / 82.99  36.14 / 84.47
Afro-Asiatic (6)  Eng⇒X  2.07 / 41.48  1.40 / 41.86  1.40 / 27.64  3.22 / 43.04  3.07 / 43.39  13.57 / 67.60  19.36 / 75.56  7.83 / 68.86  26.08 / 82.84  31.00 / 83.78
Turkic (5)  X⇒Eng  7.43 / 61.69  7.89 / 62.47  4.15 / 33.11  9.51 / 65.95  8.88 / 66.15  24.64 / 84.04  31.73 / 86.90  10.25 / 58.52  32.92 / 87.51  37.78 / 88.53
Turkic (5)  Eng⇒X  3.48 / 40.32  2.58 / 44.80  1.75 / 20.00  3.28 / 39.65  3.09 / 41.97  17.13 / 74.77  20.96 / 78.50  10.87 / 68.21  30.17 / 88.47  36.54 / 89.38
Dravidian (4)  X⇒Eng  8.04 / 61.95  0.89 / 44.01  1.18 / 24.29  2.65 / 53.17  1.52 / 52.95  20.26 / 82.00  33.10 / 86.91  10.26 / 63.77  39.07 / 88.42  43.17 / 89.10
Dravidian (4)  Eng⇒X  5.30 / 48.15  0.02 / 32.51  0.03 / 15.31  0.56 / 34.03  0.58 / 35.65  12.34 / 64.74  18.60 / 75.15  6.85 / 62.25  37.33 / 86.32  44.16 / 87.75
Sino-Tibetan (3)  X⇒Eng  9.35 / 58.60  9.32 / 65.32  16.59 / 72.34  18.35 / 74.45  16.88 / 74.20  21.36 / 78.52  27.74 / 84.48  11.09 / 71.35  30.88 / 86.50  35.68 / 87.66
Sino-Tibetan (3)  Eng⇒X  10.14 / 74.16  2.57 / 54.73  10.74 / 66.74  12.24 / 65.99  9.06 / 65.07  19.92 / 76.04  22.81 / 81.11  10.42 / 73.82  16.85 / 80.74  32.40 / 88.52
Other (14)  X⇒Eng  9.71 / 60.43  10.10 / 60.78  5.37 / 47.38  16.00 / 71.15  14.25 / 70.35  25.59 / 82.48  32.62 / 86.21  25.53 / 81.53  35.06 / 86.86  36.95 / 87.93
Other (14)  Eng⇒X  8.42 / 51.57  3.82 / 46.85  1.73 / 29.73  8.19 / 53.20  7.14 / 52.12  20.26 / 74.31  24.04 / 79.59  23.29 / 77.80  28.54 / 85.84  34.34 / 87.82
Table 1: Average translation performance (BLEU / COMET) of LLMs on different language families. The number in brackets indicates the number of evaluated languages in each language family. Bold text denotes the highest BLEU or COMET score across all models. Underlined text denotes the highest BLEU or COMET score across LLMs.
ICL strategy For each model, we report its translation performance with eight randomly-picked translation pairs from the corresponding development set as in-context exemplars and "<X>=<Y>" as the in-context template, where "<X>" and "<Y>" are placeholders for the source and target sentences. We use the line-break as the concatenation symbol. According to our experimental analysis, this ICL strategy serves as a simple but strong recipe. All implementation is based on OpenICL³ (Wu et al., 2023).
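To make this recipe concrete, the following is a minimal sketch in plain Python (the paper's implementation is built on OpenICL; the function and variable names here are our own illustration):

```python
import random

def build_prompt(dev_pairs, src_sentence, k=8, seed=0):
    """Assemble an ICL prompt from k randomly-picked development-set pairs.

    dev_pairs: list of (source, target) sentence tuples.
    """
    rng = random.Random(seed)
    exemplars = rng.sample(dev_pairs, k)
    # Each exemplar instantiates the "<X>=<Y>" template.
    lines = [f"{src}={tgt}" for src, tgt in exemplars]
    # The test source fills "<X>"; the model is expected to continue after "=".
    lines.append(f"{src_sentence}=")
    # Line-break is the concatenation symbol.
    return "\n".join(lines)
```

The model's continuation, typically read off up to the next line-break, would then serve as the translation.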
2 We evaluate LLMs on the first 100 sentences of each direction's test set in the benchmarking experiments, considering the prohibitive API cost of evaluating massive languages. In the analysis experiments, we use the full test set.
3 https://github.com/Shark-NLP/OpenICL
4 https://translate.google.com/
5 https://github.com/mjpost/sacrebleu
6 We compute the score with the wmt22-comet-da model.
7 We compute the score with SEScore-2 (Xu et al., 2022a).
8 Evaluating with SEScore leads to similar findings; we therefore report those results in Appendix A. Detailed results for each language are reported in Appendix B.

4 Benchmarking LLMs for Massively Multilingual Machine Translation

In this section, we report results on multilingual machine translation and introduce our main findings about LLMs' translation ability.

The multilingual translation capabilities of LLMs are continually improving. Table 1 presents evaluation results⁸ grouped by language family.
[Figure 2: two bar-chart panels, "X->Eng" and "Eng->X"; y-axis: BLEU; legend: GPT4, ChatGPT, NLLB, Google; per-language tick labels omitted.]
Figure 2: Translation performance (BLEU) of GPT-4, ChatGPT, NLLB and Google Translator on our evaluated languages. "X->Eng" and "Eng->X" denote translating to English and translating from English respectively. In each subfigure, languages are sorted according to the BLEU scores of GPT-4.
Monolingual pre-trained LLMs present impressive multilingual translation ability, indicating the possibility of aligning multiple languages even with unsupervised data (Garcia et al., 2023). More encouragingly, the multilingual translation capabilities of LLMs are continually improving. The most recent LLMs are reaching new performance heights; for example, LLaMA2-7B outperforms previously released open-source LLMs, and GPT-4 surpasses ChatGPT. Overall, GPT-4 is the best translator among the evaluated LLMs, achieving the highest average BLEU and COMET scores in most directions.

LLMs' capability is unbalanced across languages. In Table 1, we observe a similar trend for all evaluated LLMs: they perform better at translating into English than at translating into non-English languages. LLMs' capability on non-English languages is also unbalanced. For languages that are similar to English, e.g., Indo-European-Germanic languages, LLMs achieve impressive results. For languages that are dissimilar to English, e.g., Sino-Tibetan languages, LLMs often produce noticeably weaker results.

Table 2 presents another clue: we evaluate GPT-4 on French-centric and Chinese-centric translation. Compared to English-centric translation, GPT-4 faces a greater challenge in non-English-centric translation, which again indicates that its capability is unbalanced across languages.

Language Family  X⇒Eng  X⇒Fra  X⇒Zho  Eng⇒X  Fra⇒X  Zho⇒X
Indo-Euro-Germanic (8)  48.51  44.23  27.97  40.64  32.34  24.13
Indo-Euro-Romance (8)  47.29  45.16  27.31  44.47  36.05  27.12
Indo-Euro-Slavic (12)  41.19  40.32  25.67  36.06  30.88  23.33
Table 2: Average translation performance (BLEU) of GPT-4 on English-centric, French-centric and Chinese-centric translation.
Table 3: Translation performance (BLEU) when using different templates for in-context learning. The number of in-context exemplars is fixed at eight in this experiment. "<X>" and "<Y>" denote the placeholders for the source and target sentences respectively. "[SRC]" and "[TGT]" represent the placeholders for the source and target language names in English. Bold text denotes the highest score in each column.
Table 4: Translation performance of XGLM when using different contents as in-context exemplars. The "Consistency" column denotes whether the source and target sentences are semantically consistent. The "Granularity" column denotes whether the exemplar is a sentence-level pair. The "Diversity" column denotes whether the exemplars in the context are different from each other.
References

Yoshua Bengio, Réjean Ducharme, and Pascal Vincent. 2000. A neural probabilistic language model. Advances in Neural Information Processing Systems (NeurIPS).

Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. 2020. Language models are few-shot learners. Advances in Neural Information Processing Systems (NeurIPS).

Marta R Costa-jussà, James Cross, Onur Çelebi, Maha Elbayad, Kenneth Heafield, Kevin Heffernan, Elahe Kalbassi, Janice Lam, Daniel Licht, Jean Maillard, et al. 2022. No language left behind: Scaling human-centered machine translation. arXiv preprint arXiv:2207.04672.

Qingxiu Dong, Lei Li, Damai Dai, Ce Zheng, Zhiyong Wu, Baobao Chang, Xu Sun, Jingjing Xu, Lei Li, and Zhifang Sui. 2022. A survey for in-context learning. arXiv preprint arXiv:2301.00234.

Aparna Elangovan, Jiayuan He, and Karin Verspoor. 2021. Memorization vs. generalization: Quantifying data leakage in NLP performance evaluation. In Proceedings of the Conference of the European Chapter of the Association for Computational Linguistics (EACL).

Xavier Garcia, Yamini Bansal, Colin Cherry, George Foster, Maxim Krikun, Melvin Johnson, and Orhan Firat. 2023. The unreasonable effectiveness of few-shot learning for machine translation. In International Conference on Machine Learning (ICML).

Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Trevor Cai, Eliza Rutherford, Diego de Las Casas, Lisa Anne Hendricks, Johannes Welbl, Aidan Clark, et al. 2022. An empirical analysis of compute-optimal large language model training. Advances in Neural Information Processing Systems (NeurIPS).

Wenxiang Jiao, Wenxuan Wang, JT Huang, Xing Wang, and ZP Tu. 2023. Is ChatGPT a good translator? Yes with GPT-4 as the engine. arXiv preprint arXiv:2301.08745.

Melvin Johnson, Mike Schuster, Quoc V. Le, Maxim Krikun, Yonghui Wu, Zhifeng Chen, Nikhil Thorat, Fernanda Viégas, Martin Wattenberg, Greg Corrado, Macduff Hughes, and Jeffrey Dean. 2017. Google's multilingual neural machine translation system: Enabling zero-shot translation. Transactions of the Association for Computational Linguistics (TACL).

Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. 2020. Scaling laws for neural language models. arXiv preprint arXiv:2001.08361.

Urvashi Khandelwal, Omer Levy, Dan Jurafsky, Luke Zettlemoyer, and Mike Lewis. 2020. Generalization through memorization: Nearest neighbor language models. In International Conference on Learning Representations (ICLR).

Mukai Li, Shansan Gong, Jiangtao Feng, Yiheng Xu, Jun Zhang, Zhiyong Wu, and Lingpeng Kong. 2023. In-context learning with many demonstration examples. arXiv preprint arXiv:2302.04931.

Xi Victoria Lin, Todor Mihaylov, Mikel Artetxe, Tianlu Wang, Shuohui Chen, Daniel Simig, Myle Ott, Naman Goyal, Shruti Bhosale, Jingfei Du, Ramakanth Pasunuru, Sam Shleifer, Punit Singh Koura, Vishrav Chaudhary, Brian O'Horo, Jeff Wang, Luke Zettlemoyer, Zornitsa Kozareva, Mona Diab, Veselin Stoyanov, and Xian Li. 2022. Few-shot learning with multilingual generative language models. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP).

Tomáš Mikolov, Martin Karafiát, Lukáš Burget, Jan Černocký, and Sanjeev Khudanpur. 2010. Recurrent neural network based language model. In Interspeech.

Yasmin Moslem, Rejwanul Haque, and Andy Way. 2023. Adaptive machine translation with large language models. arXiv preprint arXiv:2301.13294.

OpenAI. 2022. ChatGPT. https://openai.com/blog/chatgpt.

OpenAI. 2023. GPT-4 technical report.

Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. 2019. Language models are unsupervised multitask learners.

Ricardo Rei, Craig Stewart, Ana C Farinha, and Alon Lavie. 2020. COMET: A neural framework for MT evaluation. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP).

Xiaozhe Ren, Pingyi Zhou, Xinfan Meng, Xinjing Huang, Yadao Wang, Weichao Wang, Pengfei Li, Xiaoda Zhang, Alexander Podolskiy, Grigory Arshinov, Andrey Bout, Irina Piontkovskaya, Jiansheng Wei, Xin Jiang, Teng Su, Qun Liu, and Jun Yao. 2023. PanGu-Σ: Towards trillion parameter language model with sparse heterogeneous computing. arXiv preprint arXiv:2303.10845.

Teven Le Scao, Angela Fan, Christopher Akiki, Ellie Pavlick, Suzana Ilić, Daniel Hesslow, Roman Castagné, Alexandra Sasha Luccioni, François Yvon, Matthias Gallé, et al. 2022. BLOOM: A 176B-parameter open-access multilingual language model. arXiv preprint arXiv:2211.05100.

Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. 2023. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems (NeurIPS).

David Vilar, Markus Freitag, Colin Cherry, Jiaming Luo, Viresh Ratnakar, and George Foster. 2022. Prompting PaLM for translation: Assessing strategies and performance. arXiv preprint arXiv:2211.09102.

Jason Wei, Maarten Bosma, Vincent Zhao, Kelvin Guu, Adams Wei Yu, Brian Lester, Nan Du, Andrew M Dai, and Quoc V Le. 2022a. Finetuned language models are zero-shot learners. In International Conference on Learning Representations (ICLR).

Jason Wei, Yi Tay, Rishi Bommasani, Colin Raffel, Barret Zoph, Sebastian Borgeaud, Dani Yogatama, Maarten Bosma, Denny Zhou, Donald Metzler, et al. 2022b. Emergent abilities of large language models. arXiv preprint arXiv:2206.07682.

Jerry W. Wei, Jason Wei, Yi Tay, Dustin Tran, Albert Webson, Yifeng Lu, Xinyun Chen, Hanxiao Liu, Da Huang, Denny Zhou, and Tengyu Ma. 2023. Larger language models do in-context learning differently. CoRR, abs/2303.03846.

Zhenyu Wu, YaoXiang Wang, Jiacheng Ye, Jiangtao Feng, Jingjing Xu, Yu Qiao, and Zhiyong Wu. 2023. OpenICL: An open-source framework for in-context learning. arXiv preprint arXiv:2303.02913.

Wenda Xu, Xian Qian, Mingxuan Wang, Lei Li, and William Yang Wang. 2022a. SEScore2: Retrieval augmented pretraining for text generation evaluation. arXiv preprint arXiv:2212.09305.

Wenda Xu, Yi-Lin Tuan, Yujie Lu, Michael Saxon, Lei Li, and William Yang Wang. 2022b. Not all errors are equal: Learning text generation metrics using stratified error synthesis. In Findings of the Association for Computational Linguistics: EMNLP 2022.

Fei Yuan, Yinquan Lu, Wenhao Zhu, Lingpeng Kong, Lei Li, Yu Qiao, and Jingjing Xu. 2023. Lego-MT: Towards detachable models in massively multilingual machine translation. In Findings of the Association for Computational Linguistics: ACL 2023.

Biao Zhang, Barry Haddow, and Alexandra Birch. 2023. Prompting large language model for machine translation: A case study. arXiv preprint arXiv:2301.07069.

Susan Zhang, Stephen Roller, Naman Goyal, Mikel Artetxe, Moya Chen, Shuohui Chen, Christopher Dewan, Mona Diab, Xian Li, Xi Victoria Lin, et al. 2022. OPT: Open pre-trained transformer language models. arXiv preprint arXiv:2205.01068.
A Evaluating LLMs' Translation Performance with SEScore

Table 6 presents the average SEScore of LLMs on different language families. Currently, SEScore mainly supports evaluating English translation. Thus we evaluate LLMs' performance on translating other languages to English.

B Detailed Results on Each Language

We report detailed results of our evaluated models in Table 7 (BLEU), Table 8 (COMET), Table 9 (SEScore) and Figure 8. Note that BLEU supports all translation directions, whereas COMET and SEScore only support a subset of these directions.

C List of Languages

We evaluate 102 languages in this paper. Table 10 lists the name, ISO code and language family of these languages.

D Cross-lingual Exemplars

In Figure 7, we show an example of using cross-lingual in-context exemplars (Russian-English exemplars for Chinese-English translation).

[Input]
Этот фильм с участием Райана Гослинга и Эммы Стоун получил номинации во всех главных категориях.=The movie, featuring Ryan Gosling and Emma Stone, received nominations in all major categories.
"Теперь у нас есть четырёхмесячные мыши, у которых больше нет диабета", — добавил он.="We now have 4-month-old mice that are non-diabetic that used to be diabetic," he added.
Гослинг и Стоун получили номинации на лучшего актера и актрису соответственно.=Gosling and Stone received nominations for Best Actor and Actress respectively.
Находка также позволяет ознакомиться с эволюцией перьев у птиц.=The find also grants insight into the evolution of feathers in birds.
Канцелярия губернатора сообщила, что 19 из раненных были офицерами полиции.=The governor's office said nineteen of the injured were police officers.
Стандарт 802.11n работает на обоих частотах – 2.4 ГГц и 5.0 ГГц.=The 802.11n standard operates on both the 2.4Ghz and 5.0Ghz frequencies.
Он сказал, что создал дверной звонок, работающий от WiFi.=He built a WiFi door bell, he said.
В конце 2017 года Симинофф появился на торговом телеканале QVC.=In late 2017, Siminoff appeared on shopping television channel QVC.
伊拉克研究小组于格林尼治时间 (GMT) 今天 12 点提交了报告。=
[Output]
The Iraqi research team submitted a report at Greenwich time (GMT) today at 12 noon.

Figure 7: An example of using cross-lingual in-context exemplars.
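The prompt in Figure 7 follows the same "<X>=<Y>" template and line-break concatenation described in Section 4; only the exemplars' language pair differs from the test direction. A self-contained sketch of our own (one exemplar shown for brevity; the paper uses eight):

```python
# Cross-lingual ICL prompt, as in Figure 7: Russian-English exemplars
# followed by a Chinese test source. Illustration only, not the paper's code.
exemplars = [
    ("Находка также позволяет ознакомиться с эволюцией перьев у птиц.",
     "The find also grants insight into the evolution of feathers in birds."),
]
src = "伊拉克研究小组于格林尼治时间 (GMT) 今天 12 点提交了报告。"
# Same "<X>=<Y>" template and line-break concatenation as monolingual ICL.
prompt = "\n".join([f"{x}={y}" for x, y in exemplars] + [f"{src}="])
print(prompt)
```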
Translation Performance (SEScore)
Language Family  Direction  XGLM-7.5B  OPT-175B  Falcon-7B  LLaMA2-7B  LLaMA2-7B-Chat  ChatGPT  GPT-4  M2M-12B  NLLB-1.3B  Google
Indo-Euro-Germanic (8)  X⇒Eng  -11.78  -6.00  -8.34  -5.41  -5.90  -2.52  -2.16  -3.15  -2.78  -1.85
Indo-Euro-Romance (8)  X⇒Eng  -6.54  -4.01  -5.57  -3.72  -4.14  -2.30  -2.08  -3.08  -2.54  -2.12
Indo-Euro-Slavic (12)  X⇒Eng  -14.29  -10.31  -13.46  -5.11  -5.75  -3.55  -3.17  -4.21  -3.70  -2.80
Indo-Euro-Indo-Aryan (10)  X⇒Eng  -16.45  -22.15  -21.65  -17.15  -19.46  -7.64  -4.69  -11.77  -3.53  -2.80
Indo-Euro-Other (11)  X⇒Eng  -18.36  -17.81  -18.09  -13.61  -15.42  -6.74  -4.62  -7.57  -3.75  -4.40
Austronesian (6)  X⇒Eng  -14.06  -10.08  -12.30  -9.61  -10.48  -4.48  -3.03  -5.37  -3.47  -2.56
Atlantic-Congo (14)  X⇒Eng  -19.42  -17.61  -18.44  -17.59  -18.48  -12.38  -9.34  -14.16  -6.88  -5.75
Afro-Asiatic (6)  X⇒Eng  -18.85  -18.91  -19.17  -16.61  -17.66  -12.16  -8.28  -14.41  -4.46  -3.49
Turkic (5)  X⇒Eng  -17.15  -16.99  -18.66  -15.50  -16.47  -7.63  -5.50  -15.29  -4.89  -3.93
Dravidian (4)  X⇒Eng  -16.52  -22.58  -21.91  -20.18  -21.96  -9.26  -5.35  -13.69  -3.76  -3.07
Sino-Tibetan (3)  X⇒Eng  -19.41  -15.20  -12.37  -11.33  -12.01  -10.43  -6.79  -11.93  -5.50  -4.30
Other (14)  X⇒Eng  -16.74  -16.56  -18.70  -13.05  -14.17  -8.51  -6.07  -6.91  -4.94  -3.80
Table 6: Average SEScore of LLMs on different language families. The number in brackets indicates the number of evaluated languages in each language family. Bold text denotes the highest SEScore across all models. Underlined text denotes the highest SEScore across LLMs.
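For reference, the BLEU and COMET scores reported in the main tables (footnotes 5 and 6) can be reproduced with public tooling. A minimal sketch, assuming the sacrebleu and unbabel-comet packages and toy data of our own (SEScore-2 is omitted here):

```python
# Our illustration, not the paper's evaluation script.
import sacrebleu
from comet import download_model, load_from_checkpoint

sources = ["Das Haus ist klein."]      # hypothetical example segments
hypotheses = ["The house is small."]
references = ["The house is tiny."]

# Corpus-level BLEU; sacrebleu takes a list of reference streams.
bleu = sacrebleu.corpus_bleu(hypotheses, [references])
print(f"BLEU = {bleu.score:.2f}")

# COMET scores each segment from (source, hypothesis, reference) triples.
comet = load_from_checkpoint(download_model("Unbabel/wmt22-comet-da"))
data = [{"src": s, "mt": h, "ref": r}
        for s, h, r in zip(sources, hypotheses, references)]
result = comet.predict(data, batch_size=8, gpus=0)
print(f"COMET = {result.system_score:.4f}")  # the tables appear to scale this by 100
```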
[Figure 8: small-multiple panels per language family (y-axis: BLEU), covering both X->En and En->X directions; legend: XGLM, OPT, Falcon, LLaMA2, LLaMA2-Chat, ChatGPT, GPT4; per-language tick labels omitted.]
Figure 8: Comparison results between our evaluated LLMs on different language families.
Language ISO 639-1 ISO 639-2/T Language family Language ISO 639-1 ISO 639-2/T Language family
Afrikaans af afr Indo-European-Germanic Latvian lv lav Indo-European-Other
Amharic am amh Afro-Asiatic Lingala ln lin Atlantic-Congo
Arabic ar ara Afro-Asiatic Lithuanian lt lit Indo-European-Other
Armenian hy hye Indo-European-Other Luo luo luo Other
Assamese as asm Indo-European-Indo-Aryan Luxembourgish lb ltz Indo-European-Germanic
Asturian ast ast Indo-European-Romance Macedonian mk mkd Indo-European-Slavic
Azerbaijani az azj Turkic Malay ms msa Austronesian
Belarusian be bel Indo-European-Slavic Malayalam ml mal Dravidian
Bengali bn ben Indo-European-Indo-Aryan Maltese mt mlt Afro-Asiatic
Bosnian bs bos Indo-European-Slavic Maori mi mri Austronesian
Bulgarian bg bul Indo-European-Slavic Marathi mr mar Indo-European-Indo-Aryan
Burmese my mya Sino-Tibetan Mongolian mn mon Other
Catalan ca cat Indo-European-Romance Nepali ne npi Indo-European-Indo-Aryan
Cebuano ceb ceb Austronesian Northern Sotho ns nso Atlantic-Congo
Chinese (Simpl) zh zho_simpl Sino-Tibetan Norwegian no nob Indo-European-Germanic
Chinese (Trad) zhtrad zho_trad Sino-Tibetan Nyanja ny nya Atlantic-Congo
Croatian hr hrv Indo-European-Slavic Occitan oc oci Indo-European-Romance
Czech cs ces Indo-European-Slavic Oriya or ory Indo-European-Indo-Aryan
Danish da dan Indo-European-Germanic Oromo om orm Afro-Asiatic
Dutch nl nld Indo-European-Germanic Pashto ps pus Indo-European-Other
English en eng Indo-European-Germanic Persian fa fas Indo-European-Other
Estonian et est Other Polish pl pol Indo-European-Slavic
Tagalog tl tgl Austronesian Portuguese pt por Indo-European-Romance
Finnish fi fin Other Punjabi pa pan Indo-European-Indo-Aryan
French fr fra Indo-European-Romance Romanian ro ron Indo-European-Romance
Fulah ff ful Afro-Asiatic Russian ru rus Indo-European-Slavic
Galician gl glg Indo-European-Romance Serbian sr srp Indo-European-Slavic
Luganda lg lug Atlantic-Congo Shona sn sna Atlantic-Congo
Georgian ka kat Other Sindhi sd snd Indo-European-Indo-Aryan
German de deu Indo-European-Germanic Slovak sk slk Indo-European-Slavic
Greek el ell Indo-European-Other Slovenian sl slv Indo-European-Slavic
Gujarati gu guj Indo-European-Indo-Aryan Somali so som Afro-Asiatic
Hausa ha hau Other Kurdish ku ckb Indo-European-Other
Hebrew he heb Other Spanish es spa Indo-European-Romance
Hindi hi hin Indo-European-Indo-Aryan Swahili sw swh Atlantic-Congo
Hungarian hu hun Other Swedish sv swe Indo-European-Germanic
Icelandic is isl Indo-European-Germanic Tajik tg tgk Indo-European-Other
Igbo ig ibo Atlantic-Congo Tamil ta tam Dravidian
Indonesian id ind Austronesian Telugu te tel Dravidian
Irish ga gle Indo-European-Other Thai th tha Other
Italian it ita Indo-European-Other Turkish tr tur Turkic
Japanese ja jpn Other Ukrainian uk ukr Indo-European-Slavic
Javanese jv jav Austronesian Umbundu umb umb Atlantic-Congo
Kabuverdianu kea kea Atlantic-Congo Urdu ur urd Indo-European-Indo-Aryan
Kamba kam kam Atlantic-Congo Uzbek uz uzb Turkic
Kannada kn kan Dravidian Vietnamese vi vie Other
Kazakh kk kaz Turkic Welsh cy cym Indo-European-Other
Khmer km khm Other Wolof wo wol Atlantic-Congo
Korean ko kor Other Xhosa xh xho Atlantic-Congo
Kyrgyz ky kir Turkic Yoruba yo yor Atlantic-Congo
Lao lo lao Other Zulu zu zul Atlantic-Congo
Table 10: For each language, we list its language name, ISO code and language family.