MiLMo: Minority Multilingual Pre-Trained Language Model
Junjie Deng1,2,&, Hanru Shi1,2,&, Xinhe Yu1,2, Wugedele Bao3, Yuan Sun1,2,*, Xiaobing Zhao1,2,*
1 Minzu University of China, China
2 National Language Resource Monitoring & Research Center, Minority Languages Branch
3 Hohhot Minzu College
* Corresponding authors: Yuan Sun, Xiaobing Zhao
& These authors contributed equally to this work and should be considered co-first authors
junjie deng [email protected], [email protected], [email protected], nmzxb [email protected]
Supported by the National Natural Science Foundation (No. 61972436).

Abstract—Pre-trained language models are trained on large-scale unsupervised data, can be fine-tuned on only small-scale labeled datasets, and still achieve good results. Multilingual pre-trained language models can be trained on multiple languages, so a single model can understand several languages at the same time. At present, research on pre-trained models mainly focuses on resource-rich languages; there is relatively little research on low-resource languages such as minority languages, and publicly available multilingual pre-trained language models do not work well for minority languages. Therefore, this paper constructs a multilingual pre-trained model named MiLMo that performs better on minority language tasks, covering Mongolian, Tibetan, Uyghur, Kazakh and Korean. To address the scarcity of minority language datasets and to verify the effectiveness of MiLMo, this paper also constructs a minority multilingual text classification dataset named MiTC and trains a word2vec model for each language. By comparing the word2vec models with the pre-trained model on the text classification task, this paper provides an optimal scheme for downstream task research on minority languages. The final experimental results show that the pre-trained model outperforms the word2vec models and achieves the best results in minority multilingual text classification. The multilingual pre-trained model MiLMo, the multilingual word2vec models and the multilingual text classification dataset MiTC are published at http://milmo.cmli-nlp.com/.

Index Terms—Multilingual, Pre-trained language model, Datasets, Word2vec

I. INTRODUCTION

With the development of deep learning, various neural networks are widely used in downstream natural language processing tasks and have achieved good performance. These downstream tasks usually rely on large-scale labeled training datasets, but labeled datasets often require significant human and material resources. The emergence of pre-trained language models [1]–[5] has solved this problem well. A pre-trained model is trained on large-scale unsupervised data to obtain a general model; in downstream tasks, it can achieve good performance by fine-tuning on only small-scale labeled data, which is crucial for research on low-resource languages.

BERT [1] is the most influential of the various pre-trained language models and has achieved the best results in a variety of downstream tasks. However, BERT still has some problems, and a large number of BERT-variant pre-trained language models [3], [13], [14] have emerged to solve them. This research, however, is mainly focused on resource-rich languages such as English, and there is still little research on low-resource languages. Moreover, these pre-trained models can only be trained on a single language, and there is no shared knowledge between different language models. To solve these problems, multilingual pre-trained language models [15]–[17], [34] have come into being; they can process multiple languages at the same time. A multilingual pre-trained model is trained on an unlabeled multilingual corpus, can project multiple languages into the same semantic space, has the ability of cross-lingual transfer, and can perform zero-shot learning.

At present, pre-trained language models are well developed for large-scale languages. However, for minority languages, since it is difficult to obtain corpus resources and there are relatively few related studies, various publicly available multilingual pre-trained models do not work well, which seriously affects the informatization of minority languages. Moreover, the existing multilingual pre-trained models only include a few minority languages. Although cross-lingual transfer can be applied to minority languages, the effect is not ideal. For example, the F1 of mBERT [15] on the Tibetan News Classification Corpus (TNCC) [24] is 5.5% [32], and the F1 of XLM-R-base on TNCC is 21.1% [32]. To further promote the development of natural language processing for minority languages, this paper collects and sorts relevant documents from the Internet, relevant books, the National People's Congress (NPC) and the Chinese People's Political Consultative Conference (CPPCC) sessions, and government work reports, and trains a multilingual pre-trained model named MiLMo on these data. The main contributions of this paper are as follows:

• This paper constructs a pre-trained model MiLMo covering five minority languages, including Mongolian, Tibetan, Uyghur, Kazakh and Korean, to provide support for various downstream tasks of minority languages.
• This paper trains a word2vec representation for five languages, including Mongolian, Tibetan, Uyghur, Kazakh and Korean. By comparing the word2vec representations with the pre-trained model on the downstream task of text classification, this paper provides the best scheme for research on downstream tasks of minority languages. The experimental results show that the MiLMo model outperforms the word2vec representations.

• To solve the problem of scarce minority language datasets, this paper constructs a classification dataset MiTC containing five languages, including Mongolian, Tibetan, Uyghur, Kazakh and Korean, and publishes the word2vec representations, the multilingual pre-trained model MiLMo and the multilingual classification dataset MiTC at http://milmo.cmli-nlp.com/.

II. RELATED WORK

Word representation converts words in natural language into vectors, which is of great significance to natural language tasks. Early word vectors can capture the semantics of words, but they are context independent and cannot solve the problem of polysemy [6]–[8]. To solve this problem, researchers have studied context-sensitive word representation. ELMo [9] is the first model to apply context-sensitive word representation successfully. It uses a language model to learn word vector representations on a large-scale unsupervised corpus; the word representations of the network layers corresponding to the words are then extracted from the pre-trained network as new features for the downstream task, so the word vectors reflect the contextual information of the data. However, this model splices two unidirectional LSTMs, and its ability for feature extraction and fusion is weak.

In 2017, Google proposed the Transformer block [10] for the machine translation task, a new encoder-decoder architecture that uses the attention mechanism to encode each position and can be parallelized. Transformer solves the problem that RNNs need a lot of storage resources to remember the whole sequence and cannot be parallelized. On this basis, OpenAI proposed GPT, which uses a 12-layer Transformer block. GPT is first pre-trained on an unlabeled dataset to obtain a language model and then handles various downstream tasks by fine-tuning. GPT-1 [4] can achieve good results after fine-tuning, but does not work well on tasks without fine-tuning. To train a word vector model with stronger generalization ability, GPT-2 [11] uses more network parameters and larger datasets. GPT-2 treats supervised tasks as sub-tasks of the language model and verifies that word vector models trained with massive data and parameters can be directly transferred to other tasks without fine-tuning on labeled data. To further improve the effect of unsupervised learning, GPT-3 [12] further increases the training data and parameters and achieves better results in a variety of downstream tasks.

However, GPT is a unidirectional language model. BERT proposes MLM pre-training to generate deep bidirectional language representations. After BERT, various BERT variants have emerged, such as ALBERT [13], SpanBERT [14] and RoBERTa [3], and these models have achieved better results. However, these studies are mainly monolingual and focus on large-scale languages such as English; there is still little research on low-resource languages. To solve this problem, multilingual pre-trained models [15]–[18] have begun to emerge. Facebook AI Research proposes XLM [16], which uses Byte Pair Encoding (BPE) to preprocess the training data and expands the shared vocabulary between different languages by dividing text into sub-words. They propose three pre-training tasks, CLM, MLM and TLM, which achieve good results in cross-lingual classification tasks. After that, Facebook AI Research puts forward XLM-R [17], which increases the number of languages in the training dataset on the basis of XLM and RoBERTa, and upsamples low-resource languages during vocabulary construction and training to generate a larger shared vocabulary and improve the ability of the model. mBERT [15] selects the 104 largest languages in Wikipedia as training data and is trained with the MLM and NSP tasks; it uses the same model and weights to process all target languages, and the shared parameters give it cross-lingual transfer capabilities. The above cross-lingual models have achieved great success in large-scale languages such as English. However, due to the scarcity of minority language corpora and the complexity of their grammar rules, these multilingual models cannot deal with them well. Therefore, research on downstream tasks in minority languages is still at an early stage, which seriously hinders the development of minority-language natural language processing. To address this, the HIT·iFLYTEK Language Cognitive Computing Lab releases the minority-language pre-trained model CINO [32], which provides the ability to understand Tibetan, Mongolian, Uyghur, Kazakh, Korean, Zhuang, Cantonese, Chinese and dialects. To further promote the development of minority-language pre-trained models, this paper trains a multilingual pre-trained model for ethnic minority languages. The experimental results show that our model can effectively promote research on downstream tasks of ethnic minority languages.

TABLE I
THE AMOUNT OF DATA FOR EACH MINORITY LANGUAGE

Language      Amount of data
Mongolian     788 MB
Tibetan       1.5 GB
Uyghur        397 MB
Kazakh        620 MB
Korean        994 MB

III. MODEL DETAILS

A. Word2vec's Data Preprocessing

This paper obtains the training data of five minority languages, including Mongolian, Tibetan, Uyghur, Kazakh and Korean, from the Internet, relevant books, the NPC and CPPCC sessions, government work reports and other relevant documents. We delete non-text information such as pictures, links and symbols, and discard articles with a text length of less than 20 to clean the data. The final data statistics are shown in Table I.
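The cleaning step can be illustrated with a short, generic sketch. The file layout, the helper names and the reading of "text length less than 20" as 20 characters are our assumptions for illustration, not details of the released pipeline:

```python
import re
import unicodedata
from pathlib import Path

URL_RE = re.compile(r"https?://\S+|www\.\S+")

def clean_article(text: str) -> str:
    """Strip links, control characters and symbol noise from one raw article."""
    text = URL_RE.sub(" ", text)
    # Keep whitespace plus letters, digits and punctuation of any script;
    # drop symbols (Unicode category S*) and control characters (C*).
    kept = [ch for ch in text
            if ch.isspace() or unicodedata.category(ch)[0] not in ("S", "C")]
    return re.sub(r"\s+", " ", "".join(kept)).strip()

def clean_corpus(src_dir: str, dst_path: str, min_len: int = 20) -> None:
    """Write one cleaned article per line, discarding articles shorter than min_len
    (interpreted here as 20 characters)."""
    with open(dst_path, "w", encoding="utf-8") as out:
        for path in sorted(Path(src_dir).glob("*.txt")):
            article = clean_article(path.read_text(encoding="utf-8"))
            if len(article) >= min_len:
                out.write(article + "\n")

# Example with illustrative paths: clean_corpus("raw/tibetan", "clean/tibetan.txt")
```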
TABLE II
THE RESULTS OF WORD SEGMENTATION IN EACH MINORITY LANGUAGE
(segmented example sentences are given here by their English glosses)

Language               Example (English gloss)
Mongolian              (High-quality development of education in the city.)
Tibetan (syllable)     (The ranks of Tibetan party members have also changed from small to large, from weak to strong.)
Tibetan (word)         (The ranks of Tibetan party members have also changed from small to large, from weak to strong.)
Uyghur                 (Open up new ways for villagers to increase income and become rich.)
Kazakh                 (Now they don't see him as part of the family anymore.)
Korean                 (Many parks in Harbin are in full bloom.)
Before training the model, we need to segment the data. The above five minority languages all use alphabetic writing. Mongolian is written from top to bottom and left to right, and Mongolian words are separated by spaces [20], so this paper directly uses spaces for word separation. The smallest unit of a Tibetan word is the syllable, and a syllable contains from one to seven characters. Syllables carry rich semantic information, so this paper segments Tibetan sentences at the syllable level and the word level respectively [33]. The morphological structure of Uyghur words is complex. Modern Uyghur has 32 letters, and each word is spelled with letters; the end of a word realizes its grammatical function by attaching different affixes, so the same root can evolve into different word forms without great differences in meaning. Uyghur is written from right to left, words are separated by spaces, and each Uyghur word can be used as a feature item [22], so this paper uses spaces to segment Uyghur words. In Kazakh, words are also separated by spaces [21]. In Korean, the space cannot be used as a direct word segmentation mark; the morpheme is the smallest linguistic unit with semantics, so sentences must be segmented into morphemes. In this paper, the Korean processing toolkit KoNLPy [19] proposed by Park et al. is used to segment the Korean corpus, and the resulting morphemes are used as input features. The word segmentation results for the above five languages are shown in Table II. This paper uses the skip-gram model to train 300-dimensional word representations.

B. Pre-trained Model MiLMo

1) MiLMo's Design: XLM is a cross-lingual pre-trained model proposed by Facebook AI Research. It can be trained on multiple languages, allowing the model to learn more cross-lingual information, so that information learned from other languages can be applied to low-resource languages. XLM proposes three pre-training tasks: Causal Language Modeling (CLM), Masked Language Modeling (MLM) and Translation Language Modeling (TLM). CLM is a Transformer language model that predicts the probability of the next word of a given sentence, P(w_t | w_1, ..., w_{t-1}, Θ). MLM is a masking task in which the model masks the tokens in the input sentence with a certain probability and then predicts the masked tokens. TLM is an extension of MLM that concatenates parallel corpora from different languages as the model input, masks some of the tokens with a certain probability, and then predicts them. CLM and MLM only need unsupervised training on monolingual datasets, while TLM requires supervised learning on a parallel corpus. This paper uses the MLM objective. The input of the model is a sequence of 256 tokens. During training, tokens are masked with a probability of 15%; of the selected tokens, 80% are replaced with [mask], 10% are replaced with a token randomly drawn from the vocabulary, and 10% remain unchanged (a sketch of this masking scheme is given after the vocabulary-construction list below). The model parameters are shown in Table III. The MiLMo model is trained using 12 layers of Transformer blocks.

2) Shared sub-word vocabulary: BPE [23] is a data compression algorithm. By iteratively merging high-frequency character pairs, variable-length sub-words can be generated within a fixed-size vocabulary. The process of building a vocabulary is as follows (a toy sketch follows the list):

• Dividing the words in a sentence into individual characters and using all the characters to build an initial vocabulary.
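To make the vocabulary-construction step above concrete, here is a minimal, toy BPE sketch in the spirit of [23]. It is our illustration, not the implementation used to build MiLMo's shared vocabulary, which is learned over the concatenated corpora of all five languages:

```python
from collections import Counter

def get_pair_counts(vocab):
    """Count adjacent symbol pairs over a {tuple_of_symbols: frequency} vocabulary."""
    pairs = Counter()
    for symbols, freq in vocab.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(pair, vocab):
    """Replace every occurrence of `pair` with the merged symbol in all words."""
    merged = "".join(pair)
    new_vocab = {}
    for symbols, freq in vocab.items():
        out, i = [], 0
        while i < len(symbols):
            if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == pair:
                out.append(merged)
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        new_vocab[tuple(out)] = freq
    return new_vocab

def learn_bpe(word_freqs, num_merges):
    """Start from single characters (the initial vocabulary) and iteratively
    merge the most frequent adjacent pair, as described in the list above."""
    vocab = {tuple(word): freq for word, freq in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        pairs = get_pair_counts(vocab)
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        vocab = merge_pair(best, vocab)
        merges.append(best)
    return merges

# Toy usage; real word frequencies would come from the cleaned multilingual corpus.
print(learn_bpe({"lower": 5, "low": 7, "newest": 3, "widest": 2}, num_merges=10))
```

In practice, a library such as fastBPE or HuggingFace tokenizers would be used to learn the shared sub-word vocabulary at corpus scale.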
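The 15% / 80-10-10 masking scheme described in Section III-B.1 can also be sketched in a few lines. This is a generic MLM masking routine written against plain token-id tensors, not code taken from the released MiLMo training pipeline; the special-token ids and vocabulary size in the usage comment are placeholders:

```python
import torch

def mask_tokens(input_ids, mask_id, vocab_size, special_ids, mlm_prob=0.15):
    """Select 15% of the tokens; of those, replace 80% with [mask], 10% with a
    random vocabulary token, and leave 10% unchanged."""
    input_ids = input_ids.clone()
    labels = input_ids.clone()

    # Choose which positions are predicted (never special tokens such as padding).
    prob = torch.full(input_ids.shape, mlm_prob)
    for sid in special_ids:
        prob[input_ids == sid] = 0.0
    selected = torch.bernoulli(prob).bool()
    labels[~selected] = -100  # ignored by the cross-entropy loss

    # 80% of the selected positions -> [mask]
    to_mask = torch.bernoulli(torch.full(input_ids.shape, 0.8)).bool() & selected
    input_ids[to_mask] = mask_id

    # 10% of the selected positions -> random token (half of the remaining 20%)
    to_random = torch.bernoulli(torch.full(input_ids.shape, 0.5)).bool() & selected & ~to_mask
    input_ids[to_random] = torch.randint(vocab_size, input_ids.shape)[to_random]

    # The remaining 10% keep their original token.
    return input_ids, labels

# Example with a 256-token input, the sequence length used for MiLMo:
# ids = torch.randint(5, 30000, (1, 256))
# masked, labels = mask_tokens(ids, mask_id=4, vocab_size=30000, special_ids=[0, 1, 2, 3])
```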
Figure 1. Distribution of text types in the MiTC dataset

the WCM dataset from Harbin Institute of Technology. This dataset is based on the Wikipedia corpus of minority languages and its classification system labels, and it includes Mongolian, Tibetan, Uyghur, Cantonese, Korean, Kazakh and Chinese. It covers ten categories: art, geography, history, nature, natural science, characters, technology, education, economy and health. The total number of samples of the five languages in WCM is 17,199, including Mongolian, Tibetan, Uyghur, Kazakh and Korean. While training the pre-trained model for minority languages, this paper also constructs a multilingual text classification dataset MiTC, which contains 82,662 samples. The data of each language in WCM and MiTC is shown in Table IV.

From Table IV, we can see that the MiTC dataset is rich in all five languages, and the number of samples in Tibetan, Uyghur, Kazakh and Korean is much larger than in WCM. After analyzing WCM, we find that the number of samples within the same language is not balanced. For example, Tibetan contains eight categories, of which the "education" category accounts for only a small proportion; Uyghur contains six categories with 300 samples in total, but the "geography" category alone has 256 samples. This unbalanced sample distribution leads to low model accuracy, and the performance of the model cannot be evaluated properly. To better evaluate the model, this paper performs data balancing on the MiTC dataset by keeping the data volume of each category relatively balanced within the same language. The category distribution of each language in the MiTC dataset is shown in Figure 1.

B. Classification based on Word2vec

This paper trains word2vec representations for the five minority languages. Word2vec alleviates the "curse of dimensionality" and document vector sparsity: it transforms each word into a low-dimensional real-valued vector based on the contextual semantic information of the document, and the more similar the meanings of two words are, the closer they are in the word vector space. This paper uses skip-gram to train 300-dimensional word representations and uses them for the text classification tasks. We used TextCNN, TextRNN, TextRNN_Attention, TextRCNN, FastText, DPCNN and Transformer models to conduct classification experiments on MiTC, where TextRNN_Attention is a bidirectional LSTM network based on the attention mechanism. The F1 of the classification results is shown in Table V.

From Table V, we can see that DPCNN achieves the best result of 49.15% on the Mongolian dataset. On the Tibetan syllable-level data, TextCNN achieves the best F1 of 39.60%; on the Tibetan word-level data, DPCNN reaches the best result of 34.51%. On the Uyghur data, TextRNN achieves the best result of 42.19%. On Kazakh, DPCNN achieves the best result of 30.13%, and TextCNN achieves the best result of 53.17% on Korean. Overall, the DPCNN model performs best across the datasets, while Transformer performs poorly. The main reason is that the complex network structure of Transformer needs more training data, while the classification datasets constructed in this paper are relatively small.

C. Classification based on MiLMo

In this paper, we use the trained XLM-style model, MiLMo, for the downstream experiment of text classification. We first use the MiLMo model to encode the classification text and obtain the representation vector E = (e_1, e_2, ..., e_256), which contains multilingual common information and text information. We then use a "linear + softmax" structure as the text classification layer: the vector representation obtained from the encoding layer is fed to a linear layer to obtain s = Linear(E), and the softmax function is used to calculate the probability of the text categories, p = softmax(s). After that, cross entropy is used to calculate the loss value of text classification, loss = -Σ y log p, where y is the ground truth and p is the prediction. This paper uses the MiLMo model to conduct experiments on the multilingual classification datasets: the datasets are preprocessed by BPE, the MiLMo model encodes the processed data, and the "linear + softmax" structure performs the classification. The experimental results are shown in Table VI, where word2vec_Best is the best classification result of the word2vec models in each of the five minority languages.
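The "linear + softmax" classification layer described above can be written, in outline, as a small PyTorch module. The encoder call, the hidden size and the mean-pooling choice are our placeholders, since the exact MiLMo fine-tuning code is not reproduced here:

```python
import torch
import torch.nn as nn

class MiLMoClassifier(nn.Module):
    """Encode a 256-token sequence with a pre-trained encoder, then apply the
    linear + softmax classification layer described in the text."""
    def __init__(self, encoder, hidden_size, num_classes):
        super().__init__()
        self.encoder = encoder                      # pre-trained XLM-style encoder (placeholder)
        self.linear = nn.Linear(hidden_size, num_classes)

    def forward(self, input_ids, attention_mask=None):
        # E = (e_1, ..., e_256): one hidden vector per token (assumed encoder API).
        hidden = self.encoder(input_ids, attention_mask)   # (batch, 256, hidden_size)
        pooled = hidden.mean(dim=1)                        # simple pooling over tokens (our choice)
        s = self.linear(pooled)                            # s = Linear(E)
        return torch.softmax(s, dim=-1)                    # p = softmax(s)

# loss = -Σ y log p: in practice nn.CrossEntropyLoss is applied to the logits `s`,
# since it combines log-softmax and the negative log-likelihood in one call.
```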
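For the word2vec baselines of the "Classification based on Word2vec" subsection above, a minimal skip-gram training sketch with gensim is shown below; space-separated languages are split on whitespace, and Korean is segmented into morphemes with KoNLPy. The paths, the window and min_count values, and the choice of the Okt analyzer (the paper only names KoNLPy) are illustrative assumptions; only the 300 dimensions and the skip-gram setting come from the paper:

```python
from gensim.models import Word2Vec
from konlpy.tag import Okt   # one of KoNLPy's morphological analyzers (assumed choice)

def load_sentences(path, korean=False):
    """Yield token lists: whitespace tokens for Mongolian/Tibetan/Uyghur/Kazakh,
    morphemes for Korean."""
    okt = Okt() if korean else None
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            yield okt.morphs(line) if korean else line.split()

# Skip-gram (sg=1) with 300-dimensional vectors, as in the paper.
sentences = list(load_sentences("clean/korean.txt", korean=True))
model = Word2Vec(sentences, vector_size=300, sg=1, window=5, min_count=5, workers=4)
model.save("word2vec_korean.model")
```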
Model Mongolian Tibetan syll Tibetan word Uyghur Kazakh Korean
TextCNN 55.52% 30.63% 43.01% 69.44% 42.08% 30.44%
TextRNN 55.99% 17.65% 37.61% 69.44% 39.02% 18.01%
TextRNN Att 31.85% 25.62% 26.37% 69.44% 25.18% 7.41%
TextRCNN 55.53% 19.07% 46.98% 69.44% 32.35% 17.53%
FastText 31.85% 17.34% 27.23% 69.44% 10.56% 16.77%
DPCNN 56.69% 30.01% 60.98% 67.92% 50.81% 23.69%
Transformer 31.85% 25.24% 45.35% 69.44% 15.46% 11.70%
CINO-base-v2 74.44% - 75.04% 69.44% 72.82% 73.08%
MiLMo-base 91.62% - 88.15% 92.81% 82.05% 73.34%
TABLE VII
TEXT CLASSIFICATION RESULTS ON WCM (F1)
From Table VI, we can see that text classification based on the pre-trained model reaches the highest F1 of 85.98% in Korean and 71.34% in Kazakh, and the classification results in all five languages are higher than those of word2vec.

To further explore the effectiveness of the multilingual pre-trained model proposed in this paper, we extract Mongolian, Tibetan, Uyghur, Kazakh and Korean from WCM for text classification experiments and compare CINO-base-v2, word2vec and the MiLMo model. CINO-base-v2 is a multilingual pre-trained model for ethnic minorities released by the HIT·iFLYTEK Language Cognitive Computing Lab. The final experimental results are shown in Table VII and Figure 2.

From Table VII and Figure 2, we can see that our MiLMo model achieves the best results on all five datasets. The classification results of CINO-base-v2 on Tibetan, Korean, Mongolian and Kazakh are higher than those of word2vec but lower than those of MiLMo. In Uyghur, due to the unbalanced distribution of WCM, the articles are mainly concentrated in the "geography" category; the classification F1 of CINO-base-v2 and of word2vec is the same, 69.44%, while the classification F1 of MiLMo reaches 92.81%, which shows that our model can still achieve good results on small-scale datasets. At present, the MiLMo architecture trained in this paper only includes 12 layers of Transformer blocks. In future research, we will release a MiLMo-large model based on 24 layers of Transformer blocks, which should further improve the results on downstream tasks.

V. CONCLUSION

Multilingual pre-trained models provide support for various resource-rich languages. However, due to the rich morphology of minority languages, their different grammatical rules, and the difficulty of data acquisition, natural language processing tasks for minority languages are still at an initial stage. To solve these problems, this paper takes Mongolian, Tibetan, Uyghur, Kazakh and Korean as examples. We obtain relevant data from books, documents of the NPC and CPPCC sessions and government work reports, construct a multilingual dataset after data cleaning, and build a multilingual pre-trained model MiLMo for ethnic minorities. To verify the effectiveness of the pre-trained model, this paper trains word2vec on the five minority languages and uses both the word2vec word representations and the pre-trained model for text classification experiments. The experimental results show that the classification performance of the pre-trained model on the five languages is better than that of word2vec.
REFERENCES

[1] Devlin J, Chang M W, Lee K, et al. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2019, Minneapolis, MN, USA, June 2-7, 2019, Volume 1 (Long and Short Papers), 2019, pp. 4171-4186.
[2] Raffel C, Shazeer N, Roberts A, et al. Exploring the limits of transfer learning with a unified text-to-text transformer. The Journal of Machine Learning Research, 2020, 21(1): 5485-5551.
[3] Liu Y, Ott M, Goyal N, et al. RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692, 2019.
[4] Radford A, Narasimhan K, Salimans T, et al. Improving language understanding by generative pre-training. 2018.
[5] Yang Z, Dai Z, Yang Y, et al. XLNet: Generalized autoregressive pretraining for language understanding. Advances in Neural Information Processing Systems, 2019, 32.
[6] Mikolov T, Grave E, Bojanowski P, et al. Advances in pre-training distributed word representations. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018), 2018.
[7] Mikolov T, Sutskever I, Chen K, et al. Distributed representations of words and phrases and their compositionality. Advances in Neural Information Processing Systems, 2013, 26.
[8] Pennington J, Socher R, Manning C D. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2014: 1532-1543.
[9] Peters M E, Neumann M, Iyyer M, et al. Deep contextualized word representations. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), 2018, pages 2227-2237.
[10] Vaswani A, Shazeer N, Parmar N, et al. Attention is all you need. Advances in Neural Information Processing Systems, 2017, 30.
[11] Radford A, Wu J, Child R, et al. Language models are unsupervised multitask learners. OpenAI blog, 2019, 1(8): 9.
[12] Brown T, Mann B, Ryder N, et al. Language models are few-shot learners. Advances in Neural Information Processing Systems, 2020, 33: 1877-1901.
[13] Lan Z, Chen M, Goodman S, et al. ALBERT: A lite BERT for self-supervised learning of language representations. In International Conference on Learning Representations, 2019.
[14] Joshi M, Chen D, Liu Y, et al. SpanBERT: Improving pre-training by representing and predicting spans. Transactions of the Association for Computational Linguistics, 2020, 8: 64-77.
[15] Pires T, Schlinger E, Garrette D. How multilingual is multilingual BERT? In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, 2019, pp. 4996-5001.
[16] Conneau A, Lample G. Cross-lingual language model pretraining. Advances in Neural Information Processing Systems, 2019, 32.
[17] Conneau A, Khandelwal K, Goyal N, et al. Unsupervised cross-lingual representation learning at scale. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, 2020, pages 8440-8451.
[18] Ouyang X, Wang S, Pang C, et al. ERNIE-M: Enhanced multilingual representation by aligning cross-lingual semantics with monolingual corpora. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, 2021, pages 27-38.
[19] Park E L, Cho S. KoNLPy: Korean natural language processing in Python. In Annual Conference on Human and Language Technology, 2014: 133-136.
[20] Jian-dong Z, Guang-lai G, Fei-long B. Research on history-based Mongolian automatic POS tagging. Journal of Chinese Information Processing, 2013, 27(5).
[21] Alimjan A, Jumahun H, Sun T, et al. An approach based on SV-NN for Kazakh language text classification. Journal of Northeast Normal University (Natural Science Edition), 2018, pp. 58-65.
[22] Alimjan A, Turgun I, Hasan O, et al. Machine learning based Uyghur language text categorization. Computer Engineering and Applications, 2012, 48(5): 110-112.
[23] Sennrich R, Haddow B, Birch A. Neural machine translation of rare words with subword units. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2016, pages 1715-1725.
[24] Qun N, Li X, Qiu X, et al. End-to-end neural text classification for Tibetan. In Chinese Computational Linguistics and Natural Language Processing Based on Naturally Annotated Big Data: 16th China National Conference, CCL 2017, and 5th International Symposium, NLP-NABD 2017, Nanjing, China, October 13-15, 2017, Proceedings. Springer International Publishing, 2017: 472-480.
[25] Park J, Kim M, Oh Y, et al. An empirical study of topic classification for Korean newspaper headlines. Hum. Lang. Technol., 2021: 287-292.
[26] Chen Y. Convolutional neural network for sentence classification. University of Waterloo, 2015.
[27] Liu P, Qiu X, Huang X. Recurrent neural network for text classification with multi-task learning. In Proceedings of the Twenty-Fifth International Joint Conference on Artificial Intelligence (IJCAI'16), AAAI Press, 2016, pages 2873-2879.
[28] Zhou P, Shi W, Tian J, et al. Attention-based bidirectional long short-term memory networks for relation classification. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), 2016: 207-212.
[29] Lai S, Xu L, Liu K, et al. Recurrent convolutional neural networks for text classification. In Proceedings of the AAAI Conference on Artificial Intelligence, 2015, 29(1).
[30] Joulin A, Grave E, Bojanowski P, et al. Bag of tricks for efficient text classification. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers, 2017, pages 427-431.
[31] Johnson R, Zhang T. Deep pyramid convolutional neural networks for text categorization. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2017: 562-570.
[32] Yang Z, Xu Z, Cui Y, et al. CINO: A Chinese minority pre-trained language model. In Proceedings of the 29th International Conference on Computational Linguistics, 2022, pages 3937-3949.
[33] Long C, Liu H, Nuo M, et al. Tibetan POS tagging based on syllable tagging. Journal of Chinese Information Processing, 2015, pp. 211-216.
[34] Xue L, Constant N, Roberts A, et al. mT5: A massively multilingual pre-trained text-to-text transformer. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2021, pages 483-498.