MiLMo: Minority Multilingual Pre-Trained Language Model
Junjie Deng1,2,&, Hanru Shi1,2,&, Xinhe Yu1,2, Wugedele Bao3, Yuan Sun1,2,*, Xiaobing Zhao1,2,*
1 Minzu University of China, China
2 National Language Resource Monitoring & Research Center, Minority Languages Branch
3 Hohhot Minzu College
* Corresponding authors: Yuan Sun, Xiaobing Zhao
& These authors contributed equally to this work and should be considered co-first authors
junjie deng [email protected], [email protected], [email protected], nmzxb [email protected]
Supported by the National Natural Science Foundation (No. 61972436).

Abstract—Pre-trained language models are trained on large-scale unsupervised data, can be fine-tuned on only small-scale labeled datasets, and still achieve good results. Multilingual pre-trained language models can be trained on multiple languages, so a single model can understand several languages at the same time. At present, research on pre-trained models mainly focuses on resource-rich languages; there is relatively little research on low-resource languages such as minority languages, and publicly available multilingual pre-trained language models do not work well for minority languages. Therefore, this paper constructs a multilingual pre-trained model named MiLMo that performs better on minority language tasks, covering Mongolian, Tibetan, Uyghur, Kazakh and Korean. To address the scarcity of minority language datasets and to verify the effectiveness of MiLMo, this paper also constructs a minority multilingual text classification dataset named MiTC and trains a word2vec model for each language. By comparing the word2vec models with the pre-trained model on the text classification task, this paper provides an optimal scheme for downstream task research on minority languages. The final experimental results show that the pre-trained model outperforms the word2vec models and achieves the best results in minority multilingual text classification. The multilingual pre-trained model MiLMo, the multilingual word2vec models and the multilingual text classification dataset MiTC are published at http://milmo.cmli-nlp.com/.

Index Terms—Multilingual, Pre-trained language model, Datasets, Word2vec

I. INTRODUCTION

With the development of deep learning, various neural networks are widely used in downstream natural language processing tasks and have achieved good performance. These downstream tasks usually rely on large-scale labeled training datasets, but labeled datasets often require significant human and material resources. The emergence of pre-trained language models [1]–[5] has solved this problem well. A pre-trained model is trained on large-scale unsupervised data to obtain a general model; in downstream tasks, it can achieve good performance by fine-tuning on only small-scale labeled data, which is crucial for research on low-resource languages.

BERT [1] is the most influential of the various pre-trained language models and has achieved the best results in a variety of downstream tasks. However, BERT still has some problems, and a large number of BERT-variant pre-trained language models [3], [13], [14] have emerged to solve them. This research, however, is mainly focused on resource-rich languages such as English, and there is still little research on low-resource languages. Moreover, these pre-trained models can only be trained on a single language, and there is no shared knowledge between different language models. To solve these problems, multilingual pre-trained language models [15]–[17], [34] have come into being; they can process multiple languages at the same time. A multilingual pre-trained model is trained on an unlabeled multilingual corpus, can project multiple languages into the same semantic space, has the ability of cross-lingual transfer, and can perform zero-shot learning.

At present, pre-trained language models are well developed for large-scale languages. However, for minority languages, since it is difficult to obtain corpus resources and there are relatively few related studies, various publicly available multilingual pre-trained models do not work well, which seriously affects the informatization of minority languages. Moreover, the existing multilingual pre-trained models only include a few minority languages. Although cross-lingual transfer can be applied to minority languages, the effect is not ideal. For example, the F1 of mBERT [15] on the Tibetan News Classification Corpus (TNCC) [24] is 5.5% [32], and the F1 of XLM-R-base on TNCC is 21.1% [32]. To further promote the development of natural language processing for minority languages, this paper collects and sorts relevant documents from the Internet, relevant books, the National People's Congress (NPC) and the Chinese People's Political Consultative Conference (CPPCC) sessions, and government work reports, and trains a multilingual pre-trained model named MiLMo on these data. The main contributions of this paper are as follows:

• This paper constructs a pre-trained model MiLMo covering five minority languages, including Mongolian, Tibetan, Uyghur, Kazakh and Korean, to provide support for various downstream tasks of minority languages.
• This paper trains a word2vec representation for five languages, including Mongolian, Tibetan, Uyghur, Kazakh and Korean. By comparing the word2vec representations with the pre-trained model on the downstream task of text classification, this paper provides the best scheme for research on downstream tasks of minority languages. The experimental results show that the MiLMo model outperforms the word2vec representations.

• To solve the problem of scarce minority language datasets, this paper constructs a classification dataset MiTC containing five languages, including Mongolian, Tibetan, Uyghur, Kazakh and Korean, and publishes the word2vec representations, the multilingual pre-trained model MiLMo and the multilingual classification dataset MiTC at http://milmo.cmli-nlp.com/.

II. RELATED WORK

Word representation converts words in natural language into vectors, which is of great significance to natural language tasks. Early word vectors can capture the semantics of words, but they are context independent and cannot solve the problem of polysemy [6]–[8]. To solve this problem, researchers have studied context-sensitive word representation. ELMo [9] is the first model to apply context-sensitive word representation successfully. It uses a language model to learn word vector representations on a large-scale unsupervised corpus; the word representations of the network layers corresponding to the words are then extracted from the pre-trained network as new features for the downstream task, so the word vectors reflect the contextual information of the data. However, this model splices two unidirectional LSTMs, and its ability for feature extraction and fusion is weak.

In 2017, Google proposed the Transformer block [10] for the machine translation task, a new encoder-decoder architecture that uses the attention mechanism to encode each position and can be parallelized. Transformer solves the problem that RNNs need a lot of storage resources to remember the whole sequence and cannot be parallelized. On this basis, OpenAI proposed GPT, which uses a 12-layer Transformer block. GPT is first pre-trained on an unlabeled dataset to obtain a language model and then handles various downstream tasks by fine-tuning. GPT-1 [4] can achieve good results after fine-tuning, but does not work well on tasks without fine-tuning. To train a word vector model with stronger generalization ability, GPT-2 [11] uses more network parameters and larger datasets. GPT-2 treats supervised tasks as sub-tasks of the language model and verifies that word vector models trained with massive data and parameters can be directly transferred to other tasks without fine-tuning on labeled data. To further improve the effect of unsupervised learning, GPT-3 [12] further increases the training data and parameters and achieves better results in a variety of downstream tasks.

However, GPT is a unidirectional language model. BERT proposes MLM pre-training to generate deep bidirectional language representations. After BERT, various BERT variants have emerged, such as ALBERT [13], SpanBERT [14] and RoBERTa [3], and these models have achieved better results. However, these studies are mainly monolingual and focus on large-scale languages such as English; there is still little research on low-resource languages. To solve this problem, multilingual pre-trained models [15]–[18] have begun to emerge. Facebook AI Research proposes XLM [16], which uses Byte Pair Encoding (BPE) to preprocess the training data and expands the shared vocabulary between different languages by dividing text into sub-words. They propose three pre-training tasks, CLM, MLM and TLM, which achieve good results in cross-lingual classification tasks. After that, Facebook AI Research puts forward XLM-R [17], which increases the number of languages in the training dataset on the basis of XLM and RoBERTa, and upsamples low-resource languages during vocabulary construction and training to generate a larger shared vocabulary and improve the ability of the model. mBERT [15] selects the 104 largest languages in Wikipedia as training data and is trained with the MLM and NSP tasks; it uses the same model and weights to process all target languages, and the shared parameters give it cross-lingual transfer capabilities. The above cross-lingual models have achieved great success in large-scale languages such as English. However, due to the scarcity of minority language corpora and the complexity of their grammar rules, these multilingual models cannot deal with them well. Therefore, research on downstream tasks in minority languages is still at an early stage, which seriously hinders the development of minority-language natural language processing. To address this, the HIT·iFLYTEK Language Cognitive Computing Lab releases the minority-language pre-trained model CINO [32], which provides the ability to understand Tibetan, Mongolian, Uyghur, Kazakh, Korean, Zhuang, Cantonese, Chinese and dialects. To further promote the development of minority-language pre-trained models, this paper trains a multilingual pre-trained model for ethnic minority languages. The experimental results show that our model can effectively promote research on downstream tasks of ethnic minority languages.

TABLE I
THE AMOUNT OF DATA FOR EACH MINORITY LANGUAGE

Language      Amount of data
Mongolian     788 MB
Tibetan       1.5 GB
Uyghur        397 MB
Kazakh        620 MB
Korean        994 MB

III. MODEL DETAILS

A. Word2vec's Data Preprocessing

This paper obtains the training data of five minority languages, including Mongolian, Tibetan, Uyghur, Kazakh and Korean, from the Internet, relevant books, the NPC and CPPCC sessions, government work reports and other relevant documents. We delete non-text information such as pictures, links and symbols, and discard articles with a text length of less than 20 to clean the data. The final data statistics are shown in Table I.
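The cleaning step can be illustrated with a short, generic sketch. The file layout, the helper names and the reading of "text length less than 20" as 20 characters are our assumptions for illustration, not details of the released pipeline:

```python
import re
import unicodedata
from pathlib import Path

URL_RE = re.compile(r"https?://\S+|www\.\S+")

def clean_article(text: str) -> str:
    """Strip links, control characters and symbol noise from one raw article."""
    text = URL_RE.sub(" ", text)
    # Keep whitespace plus letters, digits and punctuation of any script;
    # drop symbols (Unicode category S*) and control characters (C*).
    kept = [ch for ch in text
            if ch.isspace() or unicodedata.category(ch)[0] not in ("S", "C")]
    return re.sub(r"\s+", " ", "".join(kept)).strip()

def clean_corpus(src_dir: str, dst_path: str, min_len: int = 20) -> None:
    """Write one cleaned article per line, discarding articles shorter than min_len
    (interpreted here as 20 characters)."""
    with open(dst_path, "w", encoding="utf-8") as out:
        for path in sorted(Path(src_dir).glob("*.txt")):
            article = clean_article(path.read_text(encoding="utf-8"))
            if len(article) >= min_len:
                out.write(article + "\n")

# Example with illustrative paths: clean_corpus("raw/tibetan", "clean/tibetan.txt")
```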
TABLE II
THE RESULTS OF WORD SEGMENTATION IN EACH MINORITY LANGUAGE
(segmented example sentences are given here by their English glosses)

Language               Example (English gloss)
Mongolian              (High-quality development of education in the city.)
Tibetan (syllable)     (The ranks of Tibetan party members have also changed from small to large, from weak to strong.)
Tibetan (word)         (The ranks of Tibetan party members have also changed from small to large, from weak to strong.)
Uyghur                 (Open up new ways for villagers to increase income and become rich.)
Kazakh                 (Now they don't see him as part of the family anymore.)
Korean                 (Many parks in Harbin are in full bloom.)
Before training the model, we need to segment the data. The above five minority languages all use alphabetic writing. Mongolian is written from top to bottom and left to right, and Mongolian words are separated by spaces [20], so this paper directly uses spaces for word separation. The smallest unit of a Tibetan word is the syllable, and a syllable contains from one to seven characters. Syllables carry rich semantic information, so this paper segments Tibetan sentences at the syllable level and the word level respectively [33]. The morphological structure of Uyghur words is complex. Modern Uyghur has 32 letters, and each word is spelled with letters; the end of a word realizes its grammatical function by attaching different affixes, so the same root can evolve into different word forms without great differences in meaning. Uyghur is written from right to left, words are separated by spaces, and each Uyghur word can be used as a feature item [22], so this paper uses spaces to segment Uyghur words. In Kazakh, words are also separated by spaces [21]. In Korean, the space cannot be used as a direct word segmentation mark; the morpheme is the smallest linguistic unit with semantics, so sentences must be segmented into morphemes. In this paper, the Korean processing toolkit KoNLPy [19] proposed by Park et al. is used to segment the Korean corpus, and the resulting morphemes are used as input features. The word segmentation results for the above five languages are shown in Table II. This paper uses the skip-gram model to train 300-dimensional word representations.

B. Pre-trained Model MiLMo

1) MiLMo's Design: XLM is a cross-lingual pre-trained model proposed by Facebook AI Research. It can be trained on multiple languages, allowing the model to learn more cross-lingual information, so that information learned from other languages can be applied to low-resource languages. XLM proposes three pre-training tasks: Causal Language Modeling (CLM), Masked Language Modeling (MLM) and Translation Language Modeling (TLM). CLM is a Transformer language model that predicts the probability of the next word of a given sentence, P(w_t | w_1, ..., w_{t-1}, Θ). MLM is a masking task in which the model masks the tokens in the input sentence with a certain probability and then predicts the masked tokens. TLM is an extension of MLM that concatenates parallel corpora from different languages as the model input, masks some of the tokens with a certain probability, and then predicts them. CLM and MLM only need unsupervised training on monolingual datasets, while TLM requires supervised learning on a parallel corpus. This paper uses the MLM objective. The input of the model is a sequence of 256 tokens. During training, tokens are masked with a probability of 15%; of the selected tokens, 80% are replaced with [mask], 10% are replaced with a token randomly drawn from the vocabulary, and 10% remain unchanged (a sketch of this masking scheme is given after the vocabulary-construction list below). The model parameters are shown in Table III. The MiLMo model is trained using 12 layers of Transformer blocks.

2) Shared sub-word vocabulary: BPE [23] is a data compression algorithm. By iteratively merging high-frequency character pairs, variable-length sub-words can be generated within a fixed-size vocabulary. The process of building a vocabulary is as follows (a toy sketch follows the list):

• Dividing the words in a sentence into individual characters and using all the characters to build an initial vocabulary.
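To make the vocabulary-construction step above concrete, here is a minimal, toy BPE sketch in the spirit of [23]. It is our illustration, not the implementation used to build MiLMo's shared vocabulary, which is learned over the concatenated corpora of all five languages:

```python
from collections import Counter

def get_pair_counts(vocab):
    """Count adjacent symbol pairs over a {tuple_of_symbols: frequency} vocabulary."""
    pairs = Counter()
    for symbols, freq in vocab.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(pair, vocab):
    """Replace every occurrence of `pair` with the merged symbol in all words."""
    merged = "".join(pair)
    new_vocab = {}
    for symbols, freq in vocab.items():
        out, i = [], 0
        while i < len(symbols):
            if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == pair:
                out.append(merged)
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        new_vocab[tuple(out)] = freq
    return new_vocab

def learn_bpe(word_freqs, num_merges):
    """Start from single characters (the initial vocabulary) and iteratively
    merge the most frequent adjacent pair, as described in the list above."""
    vocab = {tuple(word): freq for word, freq in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        pairs = get_pair_counts(vocab)
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        vocab = merge_pair(best, vocab)
        merges.append(best)
    return merges

# Toy usage; real word frequencies would come from the cleaned multilingual corpus.
print(learn_bpe({"lower": 5, "low": 7, "newest": 3, "widest": 2}, num_merges=10))
```

In practice, a library such as fastBPE or HuggingFace tokenizers would be used to learn the shared sub-word vocabulary at corpus scale.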
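The 15% / 80-10-10 masking scheme described in Section III-B.1 can also be sketched in a few lines. This is a generic MLM masking routine written against plain token-id tensors, not code taken from the released MiLMo training pipeline; the special-token ids and vocabulary size in the usage comment are placeholders:

```python
import torch

def mask_tokens(input_ids, mask_id, vocab_size, special_ids, mlm_prob=0.15):
    """Select 15% of the tokens; of those, replace 80% with [mask], 10% with a
    random vocabulary token, and leave 10% unchanged."""
    input_ids = input_ids.clone()
    labels = input_ids.clone()

    # Choose which positions are predicted (never special tokens such as padding).
    prob = torch.full(input_ids.shape, mlm_prob)
    for sid in special_ids:
        prob[input_ids == sid] = 0.0
    selected = torch.bernoulli(prob).bool()
    labels[~selected] = -100  # ignored by the cross-entropy loss

    # 80% of the selected positions -> [mask]
    to_mask = torch.bernoulli(torch.full(input_ids.shape, 0.8)).bool() & selected
    input_ids[to_mask] = mask_id

    # 10% of the selected positions -> random token (half of the remaining 20%)
    to_random = torch.bernoulli(torch.full(input_ids.shape, 0.5)).bool() & selected & ~to_mask
    input_ids[to_random] = torch.randint(vocab_size, input_ids.shape)[to_random]

    # The remaining 10% keep their original token.
    return input_ids, labels

# Example with a 256-token input, the sequence length used for MiLMo:
# ids = torch.randint(5, 30000, (1, 256))
# masked, labels = mask_tokens(ids, mask_id=4, vocab_size=30000, special_ids=[0, 1, 2, 3])
```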
Figure 1. Distribution of text types in the MiTC dataset

the WCM dataset from Harbin Institute of Technology. This dataset is based on the Wikipedia corpus of minority languages and its classification system labels, and it includes Mongolian, Tibetan, Uyghur, Cantonese, Korean, Kazakh and Chinese. It covers ten categories: art, geography, history, nature, natural science, characters, technology, education, economy and health. The total number of samples of the five languages in WCM is 17,199, including Mongolian, Tibetan, Uyghur, Kazakh and Korean. While training the pre-trained model for minority languages, this paper also constructs a multilingual text classification dataset MiTC, which contains 82,662 samples. The data of each language in WCM and MiTC is shown in Table IV.

From Table IV, we can see that the MiTC dataset is rich in all five languages, and the number of samples in Tibetan, Uyghur, Kazakh and Korean is much larger than in WCM. After analyzing WCM, we find that the number of samples within the same language is not balanced. For example, Tibetan contains eight categories, of which the "education" category accounts for only a small proportion; Uyghur contains six categories with 300 samples in total, but the "geography" category alone has 256 samples. This unbalanced sample distribution leads to low model accuracy, and the performance of the model cannot be evaluated properly. To better evaluate the model, this paper performs data balancing on the MiTC dataset by keeping the data volume of each category relatively balanced within the same language. The category distribution of each language in the MiTC dataset is shown in Figure 1.

B. Classification based on Word2vec

This paper trains word2vec representations for the five minority languages. Word2vec alleviates the "curse of dimensionality" and document vector sparsity: it transforms each word into a low-dimensional real-valued vector based on the contextual semantic information of the document, and the more similar the meanings of two words are, the closer they are in the word vector space. This paper uses skip-gram to train 300-dimensional word representations and uses them for the text classification tasks. We used TextCNN, TextRNN, TextRNN_Attention, TextRCNN, FastText, DPCNN and Transformer models to conduct classification experiments on MiTC, where TextRNN_Attention is a bidirectional LSTM network based on the attention mechanism. The F1 of the classification results is shown in Table V.

From Table V, we can see that DPCNN achieves the best result of 49.15% on the Mongolian dataset. On the Tibetan syllable-level data, TextCNN achieves the best F1 of 39.60%; on the Tibetan word-level data, DPCNN reaches the best result of 34.51%. On the Uyghur data, TextRNN achieves the best result of 42.19%. On Kazakh, DPCNN achieves the best result of 30.13%, and TextCNN achieves the best result of 53.17% on Korean. Overall, the DPCNN model performs best across the datasets, while Transformer performs poorly. The main reason is that the complex network structure of Transformer needs more training data, while the classification datasets constructed in this paper are relatively small.

C. Classification based on MiLMo

In this paper, we use the trained XLM-style model, MiLMo, for the downstream experiment of text classification. We first use the MiLMo model to encode the classification text and obtain the representation vector E = (e_1, e_2, ..., e_256), which contains multilingual common information and text information. We then use a "linear + softmax" structure as the text classification layer: the vector representation obtained from the encoding layer is fed to a linear layer to obtain s = Linear(E), and the softmax function is used to calculate the probability of the text categories, p = softmax(s). After that, cross entropy is used to calculate the loss value of text classification, loss = -Σ y log p, where y is the ground truth and p is the prediction. This paper uses the MiLMo model to conduct experiments on the multilingual classification datasets: the datasets are preprocessed by BPE, the MiLMo model encodes the processed data, and the "linear + softmax" structure performs the classification. The experimental results are shown in Table VI, where word2vec_Best is the best classification result of the word2vec models in each of the five minority languages.
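The "linear + softmax" classification layer described above can be written, in outline, as a small PyTorch module. The encoder call, the hidden size and the mean-pooling choice are our placeholders, since the exact MiLMo fine-tuning code is not reproduced here:

```python
import torch
import torch.nn as nn

class MiLMoClassifier(nn.Module):
    """Encode a 256-token sequence with a pre-trained encoder, then apply the
    linear + softmax classification layer described in the text."""
    def __init__(self, encoder, hidden_size, num_classes):
        super().__init__()
        self.encoder = encoder                      # pre-trained XLM-style encoder (placeholder)
        self.linear = nn.Linear(hidden_size, num_classes)

    def forward(self, input_ids, attention_mask=None):
        # E = (e_1, ..., e_256): one hidden vector per token (assumed encoder API).
        hidden = self.encoder(input_ids, attention_mask)   # (batch, 256, hidden_size)
        pooled = hidden.mean(dim=1)                        # simple pooling over tokens (our choice)
        s = self.linear(pooled)                            # s = Linear(E)
        return torch.softmax(s, dim=-1)                    # p = softmax(s)

# loss = -Σ y log p: in practice nn.CrossEntropyLoss is applied to the logits `s`,
# since it combines log-softmax and the negative log-likelihood in one call.
```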
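For the word2vec baselines of the "Classification based on Word2vec" subsection above, a minimal skip-gram training sketch with gensim is shown below; space-separated languages are split on whitespace, and Korean is segmented into morphemes with KoNLPy. The paths, the window and min_count values, and the choice of the Okt analyzer (the paper only names KoNLPy) are illustrative assumptions; only the 300 dimensions and the skip-gram setting come from the paper:

```python
from gensim.models import Word2Vec
from konlpy.tag import Okt   # one of KoNLPy's morphological analyzers (assumed choice)

def load_sentences(path, korean=False):
    """Yield token lists: whitespace tokens for Mongolian/Tibetan/Uyghur/Kazakh,
    morphemes for Korean."""
    okt = Okt() if korean else None
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            yield okt.morphs(line) if korean else line.split()

# Skip-gram (sg=1) with 300-dimensional vectors, as in the paper.
sentences = list(load_sentences("clean/korean.txt", korean=True))
model = Word2Vec(sentences, vector_size=300, sg=1, window=5, min_count=5, workers=4)
model.save("word2vec_korean.model")
```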
Model Mongolian Tibetan syll Tibetan word Uyghur Kazakh Korean
TextCNN 55.52% 30.63% 43.01% 69.44% 42.08% 30.44%
TextRNN 55.99% 17.65% 37.61% 69.44% 39.02% 18.01%
TextRNN Att 31.85% 25.62% 26.37% 69.44% 25.18% 7.41%
TextRCNN 55.53% 19.07% 46.98% 69.44% 32.35% 17.53%
FastText 31.85% 17.34% 27.23% 69.44% 10.56% 16.77%
DPCNN 56.69% 30.01% 60.98% 67.92% 50.81% 23.69%
Transformer 31.85% 25.24% 45.35% 69.44% 15.46% 11.70%
CINO-base-v2 74.44% - 75.04% 69.44% 72.82% 73.08%
MiLMo-base 91.62% - 88.15% 92.81% 82.05% 73.34%
TABLE VII
TEXT CLASSIFICATION RESULTS ON WCM (F1)
From Table VI, we can see that text classification based on the pre-trained model reaches the highest F1 of 85.98% in Korean and 71.34% in Kazakh, and the classification results in all five languages are higher than those of word2vec.

To further explore the effectiveness of the multilingual pre-trained model proposed in this paper, we extract Mongolian, Tibetan, Uyghur, Kazakh and Korean from WCM for text classification experiments and compare CINO-base-v2, word2vec and the MiLMo model. CINO-base-v2 is a multilingual pre-trained model for ethnic minorities released by the HIT·iFLYTEK Language Cognitive Computing Lab. The final experimental results are shown in Table VII and Figure 2.

From Table VII and Figure 2, we can see that our MiLMo model achieves the best results on all five datasets. The classification results of CINO-base-v2 on Tibetan, Korean, Mongolian and Kazakh are higher than those of word2vec but lower than those of MiLMo. In Uyghur, due to the unbalanced distribution of WCM, the articles are mainly concentrated in the "geography" category; the classification F1 of CINO-base-v2 and of word2vec is the same, 69.44%, while the classification F1 of MiLMo reaches 92.81%, which shows that our model can still achieve good results on small-scale datasets. At present, the MiLMo architecture trained in this paper only includes 12 layers of Transformer blocks. In future research, we will release a MiLMo-large model based on 24 layers of Transformer blocks, which should further improve the results on downstream tasks.

V. CONCLUSION

Multilingual pre-trained models provide support for various resource-rich languages. However, due to the rich morphology of minority languages, their different grammatical rules, and the difficulty of data acquisition, natural language processing tasks for minority languages are still at an initial stage. To solve these problems, this paper takes Mongolian, Tibetan, Uyghur, Kazakh and Korean as examples. We obtain relevant data from books, documents of the NPC and CPPCC sessions and government work reports, construct a multilingual dataset after data cleaning, and build a multilingual pre-trained model MiLMo for ethnic minorities. To verify the effectiveness of the pre-trained model, this paper trains word2vec on the five minority languages and uses both the word2vec word representations and the pre-trained model for text classification experiments. The experimental results show that the classification performance of the pre-trained model on the five languages is better than that of word2vec.
REFERENCES

[1] Devlin J, Chang M W, Lee K, et al. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2019, Minneapolis, MN, USA, June 2-7, 2019, Volume 1 (Long and Short Papers), 2019, pp. 4171-4186.
[2] Raffel C, Shazeer N, Roberts A, et al. Exploring the limits of transfer learning with a unified text-to-text transformer. The Journal of Machine Learning Research, 2020, 21(1): 5485-5551.
[3] Liu Y, Ott M, Goyal N, et al. RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692, 2019.
[4] Radford A, Narasimhan K, Salimans T, et al. Improving language understanding by generative pre-training. 2018.
[5] Yang Z, Dai Z, Yang Y, et al. XLNet: Generalized autoregressive pretraining for language understanding. Advances in Neural Information Processing Systems, 2019, 32.
[6] Mikolov T, Grave E, Bojanowski P, et al. Advances in pre-training distributed word representations. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018), 2018.
[7] Mikolov T, Sutskever I, Chen K, et al. Distributed representations of words and phrases and their compositionality. Advances in Neural Information Processing Systems, 2013, 26.
[8] Pennington J, Socher R, Manning C D. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2014: 1532-1543.
[9] Peters M E, Neumann M, Iyyer M, et al. Deep contextualized word representations. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), 2018, pages 2227-2237.
[10] Vaswani A, Shazeer N, Parmar N, et al. Attention is all you need. Advances in Neural Information Processing Systems, 2017, 30.
[11] Radford A, Wu J, Child R, et al. Language models are unsupervised multitask learners. OpenAI blog, 2019, 1(8): 9.
[12] Brown T, Mann B, Ryder N, et al. Language models are few-shot learners. Advances in Neural Information Processing Systems, 2020, 33: 1877-1901.
[13] Lan Z, Chen M, Goodman S, et al. ALBERT: A lite BERT for self-supervised learning of language representations. In International Conference on Learning Representations, 2019.
[14] Joshi M, Chen D, Liu Y, et al. SpanBERT: Improving pre-training by representing and predicting spans. Transactions of the Association for Computational Linguistics, 2020, 8: 64-77.
[15] Pires T, Schlinger E, Garrette D. How multilingual is multilingual BERT? In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, 2019, pp. 4996-5001.
[16] Conneau A, Lample G. Cross-lingual language model pretraining. Advances in Neural Information Processing Systems, 2019, 32.
[17] Conneau A, Khandelwal K, Goyal N, et al. Unsupervised cross-lingual representation learning at scale. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, 2020, pages 8440-8451.
[18] Ouyang X, Wang S, Pang C, et al. ERNIE-M: Enhanced multilingual representation by aligning cross-lingual semantics with monolingual corpora. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, 2021, pages 27-38.
[19] Park E L, Cho S. KoNLPy: Korean natural language processing in Python. In Annual Conference on Human and Language Technology, 2014: 133-136.
[20] Jian-dong Z, Guang-lai G, Fei-long B. Research on history-based Mongolian automatic POS tagging. Journal of Chinese Information Processing, 2013, 27(5).
[21] Alimjan A, Jumahun H, Sun T, et al. An approach based on SV-NN for Kazakh language text classification. Journal of Northeast Normal University (Natural Science Edition), 2018, pp. 58-65.
[22] Alimjan A, Turgun I, Hasan O, et al. Machine learning based Uyghur language text categorization. Computer Engineering and Applications, 2012, 48(5): 110-112.
[23] Sennrich R, Haddow B, Birch A. Neural machine translation of rare words with subword units. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2016, pages 1715-1725.
[24] Qun N, Li X, Qiu X, et al. End-to-end neural text classification for Tibetan. In Chinese Computational Linguistics and Natural Language Processing Based on Naturally Annotated Big Data: 16th China National Conference, CCL 2017, and 5th International Symposium, NLP-NABD 2017, Nanjing, China, October 13-15, 2017, Proceedings. Springer International Publishing, 2017: 472-480.
[25] Park J, Kim M, Oh Y, et al. An empirical study of topic classification for Korean newspaper headlines. Hum. Lang. Technol., 2021: 287-292.
[26] Chen Y. Convolutional neural network for sentence classification. University of Waterloo, 2015.
[27] Liu P, Qiu X, Huang X. Recurrent neural network for text classification with multi-task learning. In Proceedings of the Twenty-Fifth International Joint Conference on Artificial Intelligence (IJCAI'16), AAAI Press, 2016, pages 2873-2879.
[28] Zhou P, Shi W, Tian J, et al. Attention-based bidirectional long short-term memory networks for relation classification. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), 2016: 207-212.
[29] Lai S, Xu L, Liu K, et al. Recurrent convolutional neural networks for text classification. In Proceedings of the AAAI Conference on Artificial Intelligence, 2015, 29(1).
[30] Joulin A, Grave E, Bojanowski P, et al. Bag of tricks for efficient text classification. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers, 2017, pages 427-431.
[31] Johnson R, Zhang T. Deep pyramid convolutional neural networks for text categorization. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2017: 562-570.
[32] Yang Z, Xu Z, Cui Y, et al. CINO: A Chinese minority pre-trained language model. In Proceedings of the 29th International Conference on Computational Linguistics, 2022, pages 3937-3949.
[33] Long C, Liu H, Nuo M, et al. Tibetan POS tagging based on syllable tagging. Journal of Chinese Information Processing, 2015, pp. 211-216.
[34] Xue L, Constant N, Roberts A, et al. mT5: A massively multilingual pre-trained text-to-text transformer. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2021, pages 483-498.