A Pipeline For Large Raw Text Preprocessing and Model Training of Language Models at Scale
Master Thesis
Author:
Jordi Armengol Estapé
Advisor:
Marta Ruiz Costa-Jussà
Computer Science (CS) Department - UPC
Co-Advisor:
Maite Melero Nogues
Text Mining Unit - Barcelona Supercomputing Center
January 2021
First of all, coinciding with the last phase of the development of this master thesis, I was diagnosed
with COVID-19. I thank my family for their support when I was recovering, especially my parents
and brother.
The resources used in this work have been partially funded by the State Secretariat for Digital-
ization and Artificial Intelligence (SEDIA) to carry out specialised technical support activities
in supercomputing within the framework of the Plan TL1 signed on 14 December 2018 and
the MT4All CEF project2 for developing resources and models for the unsupervised training of
machine translation systems for low-resource languages.
This master thesis would not have been possible without the direction of Marta Ruiz Costa-Jussà,
an expert in Natural Language Processing, especially machine translation, who has recently
been awarded an ERC starting grant. The supervision and insights of Maite Melero, established
researcher at the Text Mining Unit of the Barcelona Supercomputing Center (TeMU-BSC), have
been key to success as well. Finally, the advice of Marta Villegas, co-leader of TeMU-BSC, has
been priceless.
I thank and acknowledge the help of my colleagues at TeMU-BSC, especially Ona de Gibert
and Casimiro Pio Carrino, research engineers in charge of the data gathering process (thus, data
gathering itself is out of the scope of this thesis) for the Catalan, biomedical Spanish, and the
machine translation pairs in the MT4All project. They also helped with the testing and certain
extensions of the pipeline. Carlos Gerardo Rodríguez Penagos, a researcher at TeMU-BSC,
provided resources for evaluating the Catalan RoBERTa. In this thesis, I focus on my individual
contributions.
Regarding the data, the raw crawlings used for developing the corpora were conducted by the
Operations departments at the Barcelona Supercomputing Center and the Spanish National
Library. Albert Farrés helped with the profiling of the cleaning pipeline parallelization. Quim
Moré developed the script to store the Spanish National Library data in JSON files.
The industry partners of the MT4All project provided the seed URLs for running the crawlers
for their respective use cases. UPV-EHU developed the unsupervised machine translation system
that was used with the corpora resulting from these crawlings.
1. https://www.plantl.gob.es/
2. https://ec.europa.eu/inea/en/connecting-europe-facility/cef-telecom/2019-eu-ia-0031
Contents
1 Introduction
  1.1 Thesis structure
  1.2 How to read this thesis
2 Background
  2.1 Transformer
  2.2 Transformer language models
    2.2.1 Decoder-based models
    2.2.2 Encoder-based models
    2.2.3 Evaluation
  2.3 Unsupervised machine translation
    2.3.1 Off-line mapping
    2.3.2 Joint learning
  2.4 Tokenization, subwords, and vocabulary building
  2.5 Corpora generation and processing methodologies
3 Related work
  3.1 Domain-specific models
  3.2 Language-specific models
  3.3 Unlabelled corpora generation
  3.4 Summary and conclusions on the state of the art
4 Settings
  4.1 Data
    4.1.1 Catalan
    4.1.2 General-domain Spanish
    4.1.3 Biomedical Spanish
    4.1.4 MT4All
  4.2 Environment
5 Methods
  5.1 Overview
  5.2 Data gathering and storage
  5.3 Cleaning and formatting pipeline
    5.3.1 Design and features
    5.3.2 Data parser
    5.3.3 Encoding fixer
    5.3.4 Prefilterer
    5.3.5 Sentence splitter
    5.3.6 Sentence filter
    5.3.7 Normalizer
    5.3.8 Document filter
    5.3.9 Output formatter
    5.3.10 Implementation and performance
    5.3.11 Parameters
  5.4 Metadata and aggregation
  5.5 Model-ready pipeline
    5.5.1 Document-level statistics and sanity checks
    5.5.2 Decontamination
    5.5.3 Data splitting
    5.5.4 Tokenization
    5.5.5 Dictionary building and binarization
    5.5.6 Final sanity checks
  5.6 Training
    5.6.1 Considerations on parallelization
    5.6.2 Considerations on the training environment
    5.6.3 Effective batch size and gradient accumulation
    5.6.4 Deep learning backend and distributed training
    5.6.5 Launcher
    5.6.6 Architecture
    5.6.7 Hyperparameters and training objective
  5.7 Evaluation
7 Discussion
  7.1 Results analysis
  7.2 Limitations
  7.3 Impact statement
References
1 Introduction

Monolingual corpora have become central to the training of Natural Language Processing (NLP) models. Long before that, it was thought in this field that supervision was required for obtaining useful enough representations, even in the pre-training stage [1] [2]. More recently, with
the advent of Word2vec [3] and other algorithms for learning word embeddings, unsupervised
pre-training started to become competitive and even state-of-the art for a number of NLP tasks.
ELMo [4] introduced the concept of deep contextualized word embeddings, by taking the hidden
states of a recurrent architecture together with the word embeddings themselves as represen-
tations. It was the first language model explicitly pre-trained with the intent of transferring
knowledge to a number of downstream tasks.
The recently introduced Transformer architecture [5] has revolutionized the NLP scene. Initially
intended for machine translation, researchers have shown that it also serves as a powerful deep
learning backend for pre-training large language models in the seminal works of the encoder-based
system BERT [6] and the decoder-based architecture GPT [7] [8].
Researchers have proposed many improvements to BERT and GPT, both in terms of performance
and efficiency. On the one hand, alternative pre-training algorithms such as the one in ELECTRA
[9] enable a more efficient usage of data. On the other hand, neural-level modifications to the
Transformer such as the one in ALBERT [10] provide different trade-offs for computation time
and memory requirements.
In addition, there is a vibrant ecosystem of open pre-trained models and libraries to leverage
them. Today, transferring knowledge from most of these models (except the closed ones or those computationally infeasible for most institutions, such as GPT-3 [11]) is easier than ever, thanks to
the flourishing of a powerful and diverse ecosystem of libraries. We especially remark the usability
of Huggingface’s Transformers [12], a library with many implementations of Transformer-based
models, as well as an ecosystem of pre-trained weights for these models to which everyone is
welcome to contribute. These libraries can even be used for training models from scratch, but
it is not the typical use-case, and they do not solve the problem of obtaining enough data for
different languages and domains.
In parallel, some works have shown that monolingual corpora can effectively be leveraged for ma-
chine translation systems as well, in both semi-supervised and even fully unsupervised settings [13] [14]
[15]. Besides, and more related to the more general NLP models mentioned above, monolingual
corpora can be leveraged by cross-lingual models as well, such as XLM [16]. By cross-lingual
models we mean models that explicitly learn from data in different languages (instead of just
concatenating data in all the languages, as in the case of the multilingual BERT), which can
then transfer knowledge to machine translation. Thus, the development of monolingual corpora
for a wide range of languages and domains is of the utmost importance, together with the
training of the models themselves. Once the resources and infrastructure for training models
from monolingual corpora are deployed, we can transfer the knowledge to numerous downstream
tasks.
However, there are some aspects that, to the best of our knowledge, and from our point of view,
have been partially neglected, at least in comparison with the vast amount of efforts devoted
to the architectural improvements, English-centric resources, and tools for leveraging existing
pre-trained models. First, few works state explicit details on data collection and preprocessing
steps (unlike the neural architectures themselves, which are extensively documented and usually
open-sourced), in spite of the fact that data are arguably the most important part of these
models (without data, there would be no models). Obtaining large, model-ready English corpora,
especially if from the general domain, is straightforward. Doing so for low-resource languages
or domains, not so much. It can be challenging even for languages with hundreds of millions of
speakers. In addition, some methodologies, such as the criteria for decontaminating training data
from evaluation sentences, good practices for keeping track of the state of the corpora, or the
importance of maintaining document-level corpora, are not generally well-established3. Second,
practical details on training models from scratch are usually less understood and shared than
the ones related to the use of these models for transferring knowledge to downstream tasks. One
of the reasons this may be the case is that using pre-trained models for downstream tasks is
a common use-case for many companies, institutions, and individuals. Training from scratch
is not a possibility for most users. Nevertheless, we observe a middle ground between massive
language models only feasible for training by big companies and institutions4 and every single
researcher training his or her own model. At least in Europe, there is a network of publicly owned
(or funded) High-Performance Computing centers that could be able to train (and, in fact, has already done so with some degree of success5) relatively big models. Otherwise, apart from facing a reproducibility crisis6, we risk neglecting many languages and domains that big companies and institutions may not be sufficiently interested in.
In this work, we propose a complete cleaning and processing pipeline for building new corpora for
training this kind of model, which is especially relevant in the case of non-English models and low-resource domains. The cleaning pipeline is built from scratch with a high degree of linguistic sensitivity and the purpose of being generic, extensible, and big-data and HPC-ready. Once the cleaning process has ended and the data are cleaned and formatted, the corpora are organized, decontaminated from sentences in the target evaluation benchmarks, tokenized, and binarized. Regarding the training, we adapt existing libraries to the specifics of the HPC cluster and the data in use. Ideally, we would like to have an end-to-end pipeline, from data collection to model evaluation, although the scope of the project is limited, being a master thesis, and we leave some
steps as future work. Figure 1 shows the high-level overview of the desired architecture. Note
that in this work we especially focus on the data processing, and show the corresponding recipes
for effective model training, but having a general understanding of the whole process is key to
success. The proposed architecture is heterogeneous in the sense that different components run
on different kinds of machines.
Far from being a theoretical exercise or a set of experiments with toy data, we apply our processing pipeline to real-world use cases, and show how the generated corpora can be used to train
models from scratch at scale. Specifically, both the Spanish PlanTL and the MT4All CEF (for
reference, both mentioned in Acknowledgements) directly benefit from this work.
One may wonder why simply using existing pre-trained systems is not always a solution. First
of all, these systems have been trained with the data that were available at the time of their
development, and by available we don’t always mean literally existing, but being either model-
ready or easy to preprocess. For instance, multilingual BERT does include Catalan, but only the
3. For instance, see Section 4 in the recent article on the GPT-3 model [11].
4. And the rest of NLP researchers only being able to fine-tune these models.
5. See the case of the Finnish BERT [17], which was trained in a European HPC cluster. We will review it in Section 3.
6. https://www.wired.com/story/artificial-intelligence-confronts-reproducibility-crisis/
Wikipedia part. Instead, in this work we collect and preprocess a Catalan corpus orders of magnitude
larger. Apart from that, building a language or domain-specific model with an emphasis on
data collection and preprocessing has already been shown to outperform multilingual and general
domain models. We refer to the paradigmatic cases of the Finnish BERT [17] and BioBERT [18],
described in Section 3.
Figure 1: Desired end-to-end architecture overview: High-level overview of the proposed het-
erogeneous (i.e., running on different kinds of clusters) architecture. In this work, we especially
focus on the preprocessing components, which we build from scratch, and show the results can
be leveraged by the training component, which we adapt to the specifics of our data and HPC
cluster. Source: Own elaboration.
1.1 Thesis structure

The rest of this thesis is organized as follows. Section 2 provides the necessary background, which motivates the development of this system. In Section 3, we describe the related work,
and how our method fits in as one of the natural next steps. Section 4 describes the data on which we apply our system, and the environment in which we run it. Even if our method aims to be as generic as possible, these data serve both as the initial motivating example for building these tools and as a way to validate our architecture. In Section 5, we extensively describe our proposal,
with fine-grained details. In Section 6 we present our results, which are discussed in Section 7.
Finally, we draw conclusions in Section 8.
2 Background
For understanding the motivation of our proposal, we must first contextualize the NLP scenario
in which we believe our approach is a sensible next step. In this section, we describe the state-
of-the-art architecture in NLP and the implications of the scalability of models based on it, as
well as other concepts that are relevant to the development of the rest of the thesis.
2.1 Transformer
We can argue that the closest ancestors of the Transformer7 architecture were recurrent Seq2seq
[19] architectures with encoder-decoder attention. This architecture, as many other NLP models,
leveraged embedding layers (depicted in Figure 2), one for the encoder and another one for the
decoder. Figure 3 depicts the vanilla recurrent Seq2seq architecture.
The problem of the vanilla recurrent Seq2seq was that the model had to compress the source
sequence (of variable size) into a fixed size vector, and this caused an information bottleneck that
prevented the model from recalling the whole original sentence. This motivated the introduction
of an attention mechanism, known as Bahdanau attention (named after its author) [22].
By attention we mean the mechanism by which a model learns a set of attention weights. These
weights are akin to the other parameters of the neural network, but their value depends, dynam-
ically, on the specific input (thanks to being computed by other weights, which are fixed for all
inputs). Each attention weight can be thought as the relative importance of each component of
the input and is multiplied by the said respective component, outputting a context vector, c:
m
X
ci = aij sj
j=1
ai = softmax(fatt (hi , sj ))
where $s_1, s_2, \ldots, s_m$ are the hidden states with respect to which the attention is paid (i.e., the preceding ones), and $h_1, h_2, \ldots, h_n$ are the hidden states whose values will be determined depending on the context vector (i.e., the new ones). $f_{att}$ is the attention function, and it depends on the specific attention implementation. The original attention used in Seq2seq systems, Bahdanau attention, is also known as additive attention and is computed as follows:

$$f_{att}(h_i, s_j) = v_a^{\top} \tanh(W_a h_i + U_a s_j)$$

But there are other implementations, such as multiplicative variants [23]. For additional information on attention, we refer to [24]. The Transformer uses scaled dot-product attention, as we will see. So much for the implementation; attention can also be classified in terms of which states attention is paid to. The original attention was encoder-decoder attention, but apart from that, the Transformer also uses self-attention (i.e., attention with respect to
7
Apart from the original article, we recommend the following resource for getting a better understanding of
this model: http://jalammar.github.io/illustrated-transformer/
Figure 2: Embedding layer: Embedding layer used in virtually all neural NLP architectures,
including both the original Seq2seq and the Transformer. For each token, the embedding layer,
which works as a lookup table, retrieves the corresponding dense vector, a better representation
than just one-hot vectors. This representation is more compact and the distance correlates with
semantic distance. This layer is differentiable and, thus, the word embeddings can be learned
end-to-end. Source: Own elaboration, first appeared in [20].
the input itself), both in the encoder (encoder self-attention) and in the decoder (decoder self-
attention), unlike RNN-based architectures. Self-attention can only be multiplicative. Instead of
having the attention mechanism as a complementary component, the Transformer is built using
scaled-dot product attention as the fundamental building block.
The Transformer, like the recurrent Seq2seq architecture, is also composed of an encoder and a
decoder. In the encoder, the tokens are not input sequentially as in RNN architectures. Instead,
they can all be input at once (if running on a GPU), since there is no specific order in which the inputs must be passed through the model in each layer. First of all, the tokens are passed through a
conventional embedding table (as in Figure 2), which serves as a differentiable lookup table that
retrieves the word vector corresponding to each word (no different with respect to plain recurrent
Seq2seq). In the embedding layer there is one subtle, yet vital, difference that we will see later on.
Then, the encoder is composed of a stack of identical encoder blocks. Each of them is composed
of a scaled dot-product self-attention layer followed by a linear layer. In the former, self-attention
vectors are computed pair-wise (each token attends with respect to the rest, including itself),
which, for a given layer, can be done in parallel since there are no dependencies. Crucially, there
Figure 3: Original Seq2seq architecture: The word vector of each token in the source sequence is
input one by one into a recurrent encoder. Then, the recurrent decoder autoregressively generates
the target tokens (with teacher forcing, i.e., using the real target tokens instead of the predicted
ones, during training). Source: Own elaboration adapted from [21], first appeared in [20].
are residual connections, as in [25]. These connections, skipping one layer (going directly to the next one), allow the model to retain information from the previous layers. Both attention and linear layers
are normalized with layer normalization8 [27]. Without these details, the Transformer would not
train properly.
Each encoder block takes as input the outputs of the previous block (except for the first one,
which takes the vectors from the embedding layer). The embeddings generated from these
attentional layers are known as contextual embeddings, in the sense that there is one vector for
each token, but the value depends on the other ones.
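To make the block structure above more tangible, the following is a minimal PyTorch sketch of a single encoder block (and a stack of them), using the built-in nn.MultiheadAttention module; dropout, masking, and the exact normalization details of particular implementations are omitted, and the dimensions are only illustrative.

import torch
import torch.nn as nn

class EncoderBlock(nn.Module):
    """A single Transformer encoder block: self-attention and a position-wise
    feed-forward ("linear") layer, each with a residual connection followed by
    layer normalization."""

    def __init__(self, d_model=512, n_heads=8, d_ff=2048):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ff = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(),
                                nn.Linear(d_ff, d_model))
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x):
        # Every position attends to every other one; no sequential dependency.
        attn_out, _ = self.self_attn(x, x, x)
        x = self.norm1(x + attn_out)        # residual connection + layer norm
        x = self.norm2(x + self.ff(x))      # feed-forward, again with residual + norm
        return x

# A stack of identical blocks, fed with the (position-encoded) token embeddings,
# forms the encoder; each block consumes the outputs of the previous one.
encoder = nn.Sequential(*[EncoderBlock() for _ in range(6)])
contextual = encoder(torch.randn(2, 10, 512))   # (batch, sequence length, d_model)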
The decoder works similarly. It performs decoder self-attention with the target tokens that have
already been predicted (note that, during training, masking is required to prevent the decoder from seeing future target tokens that have not yet been predicted; otherwise, the model would not learn properly). At inference time, this cannot be done in parallel, and is instead run
autoregressively as in Seq2seq (i.e., for each time step, the decoder takes its own previous outputs
as input). The decoder is, thus, also composed of a stack of identical blocks, but each of them
has three layers (instead of two, as the encoder). Apart from the self-attention and linear layers,
it has an additional layer for performing encoder-decoder attention, to attend to the encoder
outputs (the ones of the last layer) as in recurrent Seq2seq with Bahdanau attention. After the
last decoder block, the Transformer decoder has a projection layer for actually predicting the
specific tokens that should be output.
We said that the embedding layer has one subtle but relevant difference with the vanilla one.
The Transformer itself can be seen as a fully-connected graph neural network9 for working with
sets, since unlike RNNs, it does not have an inherent notion of sequence order. For preventing
the model from degenerating into a bag of words-like model, positional embeddings are summed
to the input embeddings. In the original article, they tried both learned positional embeddings,
and sinusoidal positional embeddings, and the latter performed better:
8. Notice the difference with batch normalization [26]. In layer normalization, the normalization is independent from other samples in the batch.
9. https://thegradient.pub/transformers-are-graph-neural-networks/
$$PE[pos, 2i] = \sin\left(\frac{pos}{10000^{2i/d_{model}}}\right)$$
Newer variants of the Transformer do use learned positional embeddings, but with other imple-
mentations10 . Figure 4 shows an overview of the Transformer architecture.
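As an illustration, the sinusoidal positional embeddings can be computed as a fixed table that is summed to the token embeddings; the sketch below follows the formula above, using (as in the original article) the sine for even dimensions and the corresponding cosine for the odd ones. The maximum length and model dimension are arbitrary here.

import numpy as np

def sinusoidal_positional_embeddings(max_len, d_model=512):
    # One row per position, one column per embedding dimension.
    pe = np.zeros((max_len, d_model))
    positions = np.arange(max_len)[:, None]               # shape (max_len, 1)
    div = 10000 ** (np.arange(0, d_model, 2) / d_model)   # 10000^(2i / d_model)
    pe[:, 0::2] = np.sin(positions / div)                 # even dimensions
    pe[:, 1::2] = np.cos(positions / div)                 # odd dimensions
    return pe

# These fixed vectors are summed to the input embeddings before the first block.
pe = sinusoidal_positional_embeddings(max_len=1024)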
There is one low-level (but important) detail we have not gone through yet, the attention im-
plementation. As we said, it is called scaled dot product attention. It is a key-value attention
variant, in the sense that query vector and key vector pairs are used to compute a similarity
measure (specifically, the dot product). Each hidden vector $h_i$ is split into a key $k_i$ and a value $v_i$. In the Transformer, this is implemented as:

$$\operatorname{softmax}\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V = Z$$
The query, key and value vectors of a given token are projected with a linear layer that decreases
the dimensionality of the embedding of the token. A typical dimension would be 64, when the
embedding itself is of 512 elements (in the original Transformer; there are scaled up versions of
it). First of all, we compute the attention score by taking the dot product between the query vector and the key vector of the word we are scoring. The query vector is the one of the token we are interested in, and the key is the one of the token whose relative importance with respect to the query we are measuring. So, when computing the new contextual embedding for a given token, we will take the dot product between its query vector and each of the key vectors we are attending to. The score is then divided by the square root of the number of dimensions of the query, key, and value vectors (we said 64, so 8). This division is why it is called scaled dot product. We apply a softmax to the scaled scores of a given token with respect to the set of tokens it is attending to (so that they sum to 1). Each resulting weight is then multiplied by the value vector of the corresponding attended token, and the results are summed. These operations must be done for all token
pairs, and it is easy to see that they can be efficiently expressed as matrix operations in the
formula above. Notice the versatility of this mechanism. In self-attention layers, queries, keys,
and values come from the input itself. In the encoder-decoder attention layers, the exact same
mechanism is used, but this time the keys and values come from the output of the last encoder
layer, while queries come from the decoder itself. Note that this is only one attention head.
The Transformer uses multiple heads, all of them running in parallel. The motivation for this
multi-head attention is that each head can then specialize on different aspects (e.g., one head
could learn to focus on syntactic features). Figure 5 depicts the scaled dot product multi-head
attention. Figure 6 shows an attention map.
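The following is a minimal single-head sketch of the computation just described, directly mirroring the formula; in multi-head attention the same computation is repeated with different projection matrices, and the concatenated results are projected back to the embedding size. Dimensions are illustrative.

import math
import torch

def scaled_dot_product_attention(Q, K, V):
    # Pairwise dot products between queries and keys, scaled by sqrt(d_k).
    d_k = Q.size(-1)
    scores = Q @ K.transpose(-2, -1) / math.sqrt(d_k)
    weights = torch.softmax(scores, dim=-1)   # each row sums to 1
    return weights @ V                        # weighted sum of value vectors

# Self-attention: queries, keys, and values are projections of the same embeddings.
d_model, d_k = 512, 64
x = torch.randn(10, d_model)                                  # 10 token embeddings
W_q, W_k, W_v = (torch.randn(d_model, d_k) for _ in range(3))
Z = scaled_dot_product_attention(x @ W_q, x @ W_k, x @ W_v)   # shape (10, 64)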
The original Transformer improved the BLEU score on the WMT 2014 English-to-German trans-
lation task by 2 points and has ever since been the standard architecture in neural machine
translation. But with the advent of BERT and GPT-like models, its influence has affected the
whole NLP field. Thanks to being purely attentional, it outperforms RNNs for modeling long-
range dependencies. This comes, though, at the price of a quadratic cost ($O(n^2)$) with respect to the number of tokens. This has motivated the development of approximations of the attention mechanism of the Transformer (e.g., $O(n \log n)$) [29].
10. See relative positional embeddings [28].
Figure 4: Transformer architecture: Both the input tokens and the (right-shifted) target tokens
are input to their respective embedding layers. In both cases, positional embeddings are summed
to encode the position of each token, since the Transformer does not have an inherent notion
of sequentiality. The encoder is composed of a stack of identical blocks, each of them having a
multi-head attention layer and a linear layer. Both of them exhibit residual connections and layer
normalization, for easing the optimization. The decoder is built likewise, but with an additional
layer to compute the encoder-decoder attention. The decoder’s self-attention, unlike the encoder
self-attention, has to be masked to prevent the model from seeing target tokens before they
have been actually predicted. After the last decoder block, there is a linear projection to the
dimension of the target vocabulary size, followed by a softmax (since the output must be a
probability distribution over the vocabulary). Source: Own elaboration based on figure 1 in [30],
first appearing in [20].
Figure 5: Multi-head attention: Scaled dot-product attention is derived from V (values), K
(keys) and Q (queries). Each of these vectors is computed from a down-sampling projection
coming from the required token embedding. The query represents the token we are computing.
The key represents the token with respect to which we are computing the relative importance
(attention score). The scores, once scaled (and passed through a softmax), are then multiplied
by the value vector. In self-attention layers, queries, keys, and values come from the input itself.
In the encoder-decoder attention layers, the exact same mechanism is used, but this time the
keys and values come from the output of the last encoder layer, while queries come from the
decoder itself. These operations can be efficiently expressed as matrix operations. In the depicted
example, there are four heads. This means that the same computation is performed four times,
each of them independently (with different weights). Then, the results are concatenated and
passed through another linear layer. The motivation for having multiple heads is letting the
Transformer attend to different aspects at the same time; each head can specialize in detecting
different features, which is reminiscent of the convolutional filters. Source: Own elaboration
adapted from figure 2 in [30], originally appearing in [20].
Figure 6: Attention map in a sequence-to-sequence task: In machine translation tasks, the encoder-decoder attention mechanism acts as a soft aligner between source and target sequences. In this attention map, we observe that the English word (of the original sentence) that has the most importance when re-writing in Spanish the word "económico" (economic) is, indeed, "economic". In self-attentive settings, though, attention maps will reveal relations between the words of the sentence itself (e.g., adjectives contextualizing a noun, or a co-reference and the corresponding entity). Source: Own elaboration based on figure 1 in [30], first appearing in [20].
2.2 Transformer language models

The Transformer has proven especially well suited for pre-training large language models, for reasons that include the following:

• Scalability: Transformer language models have been shown to scale with data and number
of parameters to unprecedented sizes in machine learning [31]. Unsupervised pre-training
is more effective the more data are used and the bigger the model, so it benefits from
architectures that scale.
• Pre-training tasks friendliness and versatility: In deep learning, researchers use different
surrogate tasks for pre-training models without supervision. This is also known as self-
supervised learning 11 . Even if many of the pre-training objectives that are used with
Transformers could be used with other architectures, these attentional models happen
to fit well in those (e.g., masked language modeling, the main pre-training objective in
encoder-based models, could also be used with a bidirectional LSTM [32], but they would
hardly benefit from it as an effective pre-training strategy with enough scale, which is
difficult to obtain in the case of RNNs). For discriminative tasks, in token classification, one
can directly take the contextual embeddings of the token. In sentence classification, it is
trivial to inject a special token representing the sentence. Natural Language Understanding
(NLU) and Natural Language Generation (NLG) can easily be formulated as sequence-to-
sequence tasks, if required.
There were two seminal works as far as transformer language models are concerned, which have
been the base of many other variants. They were proposed almost in parallel.
11. See this talk by the deep learning pioneer Yann LeCun, who coined the term: https://www.youtube.com/watch?v=8TTK-Dd0H9U.
2.2.1 Decoder-based models
Decoder-based Transformer language models, or GPT-like models, basically take the Transformer
decoder and throw away the encoder. The last layer of the decoder block, the one for performing
encoder-decoder attention, is therefore removed, since the decoder no longer is conditioned on
the outputs of any encoder. This model is trained with the conventional language modeling task,
that is, autoregressively predicting the next token from the previous ones. The original work
that crafted this approach successfully applied it on Wikipedia [33].
Nevertheless, the series of works that made this approach famous were the ones of Generative
Pretraining Transformer (GPT), starting with the original GPT [7], which essentially scaled
up Transformer decoder language models, except for a few modifications, such as replacing the
Rectified Linear Units (ReLU) by Gaussian Error Linear Units [34]. They showed that this pre-training could transfer knowledge to a number of language understanding benchmarks. The
representations from the last layer can be used for different tasks, once fine-tuned. GPT-2 [8],
with as many as 1.5B parameters, and trained on even more data, was essentially a scaled up
version of GPT. One of the advantages of the GPT approach, apart from the extreme simplicity
of the training objective (pure language modeling), is that these models are a natural fit for
generative tasks (e.g., open question answering), and this is further exploited in GPT-2. Instead
of using the model as a feature extractor, in the article of GPT-2 there is more emphasis on
directly using its generated text. The most recent version, GPT-3 [11], which, again, is basically a
scaled up version of its ancestors12 , has as many as 175B parameters (more than 10x with respect
to any existing non-sparse language model at the time), and instead of fine-tuning, the authors
claim that the model has outstanding few-shot (or even zero-shot) capabilities. No weights are
modified for downstream tasks; instead, a few examples are shown as context in the inference
itself.
2.2.2 Encoder-based models
pre-training objective. Each example is composed of two sentences. Since the dataset is a
document-level corpus, these two sentences can be sampled such that 50% of the time the second
one is the sentence actually following the first one, and 50% of the time it is a random sentence.
This is predicted from the final representation of the special token [CLS], which is supposed to
learn to extract sentence-level information (i.e., a sentence embedding). The separation between
the two sentences is indicated with the special token [SEP], which can also be used in downstream
tasks involving two sentences, such as text entailment14. Segment embeddings are also summed to help BERT identify which token belongs to which sentence (sentence A vs. sentence B).
Figure 7 depicts a simplified schema of BERT’s pre-training strategy.
Figure 7: BERT pre-training objective: Two sentences are input to the model, a Transformer encoder, separated with the special token [SEP]. A random subset (with a proportion of 0.15) of the tokens is masked. BERT is trained to predict the masked tokens (the other ones need not be predicted). Apart from that, it has to predict whether the second sentence actually goes after the first one or is a random sentence, with a logistic regression on the last representation of the [CLS] token, which is injected together with the other ones. In this case, BERT must predict 1, since the second sentence is the one that actually goes after the first one, not a random one. In the example sentences, words are not split into subwords for the sake of clarity. Source: Own elaboration.
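For illustration, a much simplified sketch of how pre-training examples like the one in Figure 7 can be built follows; real implementations work on subwords, pack longer segments, and use additional tricks for the masked positions, all omitted here, and the helper assumes the chosen sentence is not the last one of its document.

import random

CLS, SEP, MASK = "[CLS]", "[SEP]", "[MASK]"

def make_example(doc, i, all_sentences, mask_prob=0.15):
    # Second sentence: 50% of the time the true next one (label 1),
    # 50% of the time a random sentence from the corpus (label 0).
    if random.random() < 0.5:
        sent_b, is_next = doc[i + 1], 1
    else:
        sent_b, is_next = random.choice(all_sentences), 0
    tokens = [CLS] + doc[i] + [SEP] + sent_b + [SEP]

    inputs, targets = [], []
    for tok in tokens:
        if tok not in (CLS, SEP) and random.random() < mask_prob:
            inputs.append(MASK)
            targets.append(tok)    # only masked positions have to be predicted
        else:
            inputs.append(tok)
            targets.append(None)
    return inputs, targets, is_next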
After having been pre-trained on large corpora, BERT can then be used in different downstream
tasks. As usual in deep learning, transfer learning can be applied by using the model as a
feature extractor (leaving the weights frozen), or by fine-tuning it (updating the weights), usually attaching a linear layer as the classifier or regressor. For token-level tasks (e.g., Named Entity Recognition and Classification, NERC), the representations of the last encoder block are used. Instead, for sentence-level tasks, the representations of the [CLS] token are used. The main BERT model can also be used for generation, but in a cumbersome way (by recursively demasking), and it is not
the intended use-case. The representations of BERT are said to be bidirectional since the model
14. I.e., detecting whether one sentence is logically coherent with the other, whether there is a contradiction, or, alternatively, whether the two sentences are completely unrelated.
uses the encoder (instead of the decoder, which is autoregressive), so each contextual token representation depends on both the tokens to its left and to its right, instead of only the ones to the left. In
addition, even in inference all contextual embeddings can be computed in parallel (for each
encoder block). On the other side of the coin of this trade-off, as we said, BERT is not especially
suitable for language generation tasks.
BERT improved the state-of-the-art in numerous NLU benchmarks. Not only that, but the
authors claimed BERT to be flexible enough to be transferable to the downstream task of choice
of the user, at least to some extent. The authors also released different versions of BERT, which
set a sort of standard for the models based on this model:
• Cased vs uncased: The authors released both cased and uncased models.
• English vs multilingual: For the multilingual BERT, the authors just concatenated data
from different languages. In this case, there is one additional process of transfer learning,
from languages with more data to low-resource languages, even to languages that were not
seen in training.
• Base vs Large: The BERT architecture can, of course, be instantiated in arbitrary sizes, as
long as the shapes fit (e.g., the embedding dimension, which in the Transformer is set to be
the same for all blocks, must be a multiple of the number of heads, since for merging back
the outputs of each head they must be concatenated and projected back to the embedding
size). However, the archetypal sizes established in BERT are typically used as a reference.
The base model has 12 layers, 12 attention heads, and an embedding size of 768 (resulting
in 110 million parameters). The large model has 24 layers, 16 attention heads, and an embedding size of 1024 (resulting in 340M parameters).
Since BERT, many works based on it have proposed different modifications. One of the most
widely used ones15 , RoBERTa (Robustly Optimized BERT Pretraining Approach) [36] is essen-
tially BERT but without the next sentence prediction task and trained for longer and with more
data. The next sentence prediction did not appear to help in their ablation studies, and thus was
removed from the training procedure. Training for longer and with more data, instead, resulted
in a more robust model that outperformed the original BERT in different NLU benchmarks.
RoBERTa also removed the segment embeddings, since not training with the next sentence pre-
diction objective made them redundant. In addition, RoBERTa used dynamic masking (instead
of BERT’s static masking), meaning that each sentence could be masked in different ways each
time it was passed through the model during training. In RoBERTa, the batch size used was even larger than in the case of BERT, which is thought to be beneficial. Note that other lan-
guage models might incorporate their own additional embeddings (e.g, XLM [37] sums language
embeddings).
There are many other works based on BERT16 . XLM [37] is an extension of BERT that makes
it explicitly multilingual17 such that cross-lingual representations are learned. It can be used for
15. In Section 3, we will see that many of the works that consisted of the development of a BERT-like model for a certain language or domain ended up choosing RoBERTa.
16. https://medium.com/@phylypo/a-survey-of-the-state-of-the-art-language-models-up-to-early-2020-aba824302c6
17. BERT had a multilingual version from the beginning, but all text is input to the model as in the monolingual version.
initializing an encoder of a machine translation system. ALBERT [38] is a lite version of BERT
that severely decreases memory usage by cross-layer parameter sharing (in a pseudo-recurrent
way, as in [39]) and a clever embedding matrix factorization. There is even BART [40], a model
which combines both BERT and GPT. In fact, it is exactly like the original Transformer, since
it has both an encoder and a decoder. It is trained to restore the original sentence, which is
corrupted (by different means, such as token permutation, document rotation...). The authors
refer to this pre-training objective as language denoising. Apart from the typical downstream
tasks such as token classification (to which the knowledge can be transferred using the last
contextual embedding of the last token in the decoder), since it is a sequence-to-sequence model,
it can be fine-tuned to perform summarization, or as a decoder in machine translation systems.
A multilingual version of this model, mBART [41], shows transfer learning capabilities from high-resource languages to low-resource ones, and serves as a powerful initialization for a
machine translation system, if fine-tuned.
Another line of research related to BERT-like models consists of compressing the models, which are usually huge in terms of parameters and required computation. One possibility is distillation [42], in which a smaller version of the original model is trained on the whole probability distribution output by the original model (instead of only the actual targets). The other possibility is applying either model quantization18 or pruning (for instance, removing some of the attention heads, as in [43]). Note, however, that in these cases the training algorithm is roughly equivalent to the original one; what changes is that, a posteriori, a compressed version of the model is generated. Instead, ELECTRA [9] modifies the training algorithm itself, rather than compressing the model once it has been trained. In ELECTRA, an auxiliary small BERT model is learned as usual. Then, the actual model is trained to predict whether a given token comes from the original input or has been replaced by a prediction of the auxiliary model. This pre-training objective turns out to be considerably more data and compute efficient than its alternatives, since the model receives signal from all tokens (instead of only the masked ones), and the projection layer just has to model a logistic regression. ELECTRA is reminiscent of, and could even be considered to some extent an instance of, contrastive learning19. Table 1 shows a
comparison of some of the aspects of several Transformer language models.
18. https://pytorch.org/docs/stable/quantization.html
19. To simplify, in contrastive learning, the model is trained to discriminate between instances being positive or negative examples of a certain property. See [44].
Model Architecture base Pre-training objective Convenient for...
BERT Transformer encoder MLM + NSP Discriminative tasks
RoBERTa Transformer encoder MLM Discriminative tasks
GPT Transformer decoder LM Generative tasks
BART Full Transformer LD Seq-to-seq tasks
ALBERT Universal Transformer encoder20 MLM + SOP Discriminative tasks
ELECTRA Transformer encoder MLD Discriminative tasks
Table 1: Comparison between some of the Transformer language models: MLM means Masked
Language Modeling; NSP stands for Next Sentence Prediction; LM means Language Modeling;
LD stands for Language Denoising, which in turn consists of different corruptions (e.g., token
permutation, document rotation,...); SOP means Sentence Order Prediction; MLD means Masked
Language Discrimination. Notice that most models are based on the Transformer encoder, while
the ecosystem of decoder or full Transformer models is considerably less populated. Note that
all models benefit from document-level data. First, because most pre-training objectives require
a notion of order in sentences. Second, because even the ones that only use masked language
modeling, benefit from being able to learn on longer context windows. Regarding the last column,
by convenient we mean that the model is especially adequate for the corresponding task, but it
does not necessarily mean that it cannot perform the other ones. For instance, since BART is
a full Transformer, and it is trained to restore the original sequence from a corrupted version
of it, it transfers well to sequence-to-sequence tasks such as summarization, although it is still
considerably competitive (even if not state-of-the-art) in token classification tasks.
2.2.3 Evaluation
Language modeling has an intrinsic metric that can be computed without the need of any anno-
tations, namely, perplexity:
$$\mathrm{PPL}(X) = \exp\left\{ -\frac{1}{t} \sum_{i=1}^{t} \log p_{\theta}(x_i \mid x_{<i}) \right\}$$
It measures the ability of the language model to predict words given the previous ones, which
is a direct measure of the quality of the language model. The lower the perplexity, the better.
Perplexity can be used for monitoring the training of the model, early stopping, or model selec-
tion, and even for the evaluation itself. Nevertheless, we must take into account some important
considerations. Perplexities of language models with different vocabularies (arising from differ-
ent tokenization or vocabulary seen in training) are not directly comparable. More precisely,
perplexities of different systems are not comparable if the denominator depends on the segmen-
tation. That is, perplexities per predicted token may not be comparable, but perplexities per
character (or per word, if an exogenous criterion for determining words is imposed) are [46]. If
the tokenization is imposed by the benchmark, perplexity can be used as a proper evaluation
metric, such as in the Wikitext-103 benchmark [47]. Second, for computing the perplexity of
fixed-size models correctly, one must apply a sliding-window strategy21 . For instance, GPT-2
20. Cross-layer parameter sharing as a sort of pseudo-recurrence, as in [45].
21. https://huggingface.co/transformers/perplexity.html
has a maximum context window of 1024. Then, we cannot directly compute $p_{\theta}(x_i \mid x_{<i})$ when $t$ is greater than 1024. Instead, we must break the sequence into sub-sequences whose length is at most the model's maximum context window.
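As a concrete sketch (with Huggingface's Transformers, and GPT-2 only as an example model), the following computes perplexity by breaking the token sequence into non-overlapping chunks no longer than the context window; the strided, overlapping window of the documentation referenced in footnote 21 gives a closer approximation, at a higher cost.

import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

model = GPT2LMHeadModel.from_pretrained("gpt2").eval()
tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")

def perplexity(text, max_len=1024):
    ids = tokenizer(text, return_tensors="pt").input_ids[0]
    total_nll, n_predictions = 0.0, 0
    with torch.no_grad():
        for start in range(0, len(ids), max_len):
            chunk = ids[start:start + max_len].unsqueeze(0)
            if chunk.size(1) < 2:      # nothing to predict in a single-token chunk
                continue
            # With labels == inputs, the model returns the mean token NLL of the chunk.
            loss = model(chunk, labels=chunk).loss
            n = chunk.size(1) - 1      # one prediction per token except the first
            total_nll += loss.item() * n
            n_predictions += n
    return torch.exp(torch.tensor(total_nll / n_predictions))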
Ideally, for a complete evaluation of a model, it should be passed through a series of bench-
marks for extrinsic evaluation. There are several supervised benchmarks that assess the Natural
Language Understanding capabilities, many of them grouped in the General Language Under-
standing Evaluation (GLUE [48]) 22 . GLUE is composed of different tasks, including Semantic
Text Similarity [49], Text Entailment23, and others. Since state-of-the-art models score incredibly high on this benchmark, a new, more challenging version of it, SuperGLUE [50], was recently introduced. The original GLUE dates back to 2018, so notice how fast the benchmark has become somewhat obsolete, a clear indication of the pace of NLP progress. Generative
and sequence-to-sequence models are also evaluated in tasks such as summarization or machine
translation. Encoder models can also be evaluated in terms of their usefulness as initialization
for an encoder in a machine translation system. On the other hand, SQUAD [51] is a question-
answering dataset of more than 100,000 questions, and it is also typically used for evaluating
these models.
In the case of multilingual language models, there are benchmarks for evaluating cross-lingual
representations [52]. Models for languages other than English obviously need benchmarks in the corresponding language, which are expensive and difficult to obtain. Recently, a GLUE-like
benchmark for French was released [53]. In the case of the Finnish BERT [54], for instance, they
used classical Finnish datasets (such as a Part-Of-Speech dataset). Another potential source of
benchmarks is translating English evaluation datasets (with the help of machine translation).
Likewise, models targeting specific domains will require specialized benchmarks. For instance,
BioBERT [18] was evaluated on biomedical NERC tasks. NukeBERT24 [55], a BERT model for
the nuclear physics domain, was evaluated on a specific benchmark for nuclear physics question
answering, NQUAD.
The methodology for evaluation is of vital importance. First of all, we note that there are no
standard guidelines for training data decontamination from sentences used in the evaluation,
as discussed in Section 4 in [11]. If we are training a computer vision model on ImageNet, we
will use the train set, and then evaluate on the standard test set. But if only the benchmark is
standard, and in the case of the pre-training data, one is encouraged (and for a good reason)
to use as much data as possible, with millions and millions of documents coming from massive
crawlings and book dumps, it is not unimaginable to conceive that some evaluation sentences
might leak. In the case of GPT-3, the authors admit that there was a bug in the script supposed
to decontaminate the pre-training data, and this affected at least some of the benchmarks used
in the article.
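To make the problem concrete, below is a minimal sketch of one possible decontamination criterion: dropping any training document that contains a (normalized) sentence also present in an evaluation benchmark. Actual criteria, such as the n-gram overlap used for GPT-3, are more involved; the normalization here is purely illustrative.

import re

def normalize(sentence):
    # Lowercase and strip punctuation and extra whitespace, so that trivially
    # different renderings of the same sentence still match.
    return re.sub(r"[^\w\s]", "", sentence.lower()).strip()

def decontaminate(train_docs, eval_sentences):
    # train_docs: list of documents, each one a list of sentences.
    contaminated = {normalize(s) for s in eval_sentences}
    return [doc for doc in train_docs
            if not any(normalize(s) in contaminated for s in doc)]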
Finally, regarding fairness in model evaluation, comparing models trained with different amounts
of data is perfectly fair as long as what is evaluated is the model itself, not the training algorithm
or the architecture. When comparing models in a given discriminative downstream task, the usual
approach is adding a linear layer (the classifier itself) and fine-tuning the model. Obviously, the
compared models must be evaluated using the same transfer learning strategy (e.g., not fine-
tuning the proposed model and then just using the baseline as feature extractor).
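As an illustration of this transfer strategy with Huggingface's Transformers (the checkpoint name and labels are placeholders), a pre-trained encoder is loaded with a freshly initialized linear classification head; freezing the encoder corresponds to the feature-extraction setting, while leaving it trainable corresponds to fine-tuning.

import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_name = "bert-base-cased"   # placeholder for any pre-trained encoder checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

# Feature-extraction variant: freeze the encoder and train only the classifier head.
for param in model.base_model.parameters():
    param.requires_grad = False

batch = tokenizer(["a first example sentence", "a second example sentence"],
                  padding=True, return_tensors="pt")
outputs = model(**batch, labels=torch.tensor([0, 1]))
loss, logits = outputs.loss, outputs.logits   # optimized as usual during training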
22. https://gluebenchmark.com/leaderboard
23. https://demo.allennlp.org/textual-entailment
24. Yes, that is a legitimate BERT.
2.3 Unsupervised machine translation
As we said, machine translation systems can also benefit from unsupervised signals (and, there-
fore, monolingual corpora). The mapping between the embedding spaces of the source and target languages can be learned either off-line or jointly.
2.3.1 Off-line mapping

In off-line mapping learning, we have two different spaces that have already been learned and
are fixed, and we want to learn a function that maps from one space into the other.
Vecmap [56–59] is an algorithm for learning cross-lingual word embedding mappings. It can run
with different degrees of supervision, including a fully unsupervised mode. For learning the mapping without supervision, Vecmap starts from the assumption that the two embedding spaces are
roughly isomorphic, which is not unrealistic taking into account that both spaces refer to nat-
ural languages. However, this assumption will hold less strongly in the case of very dissimilar
languages or domains.
Let $X$ and $Z$ be the embedding matrices in two different languages (in each row they have the embedding vector of a given word), without any alignment between them. The goal of Vecmap is learning $W_X$ and $W_Z$, such that $XW_X$ and $ZW_Z$ are in the same embedding space. In addition, Vecmap also learns a dictionary $D$, in which $D_{ij} = 1$ iff the $i$th word in $X$ is the translation of the $j$th word in $Z$. Unsupervised Vecmap consists of the following steps:
1. Normalization: The embedding matrices are length-normalized and mean-centered.

2. Initialization: First, we compute the similarity matrix of each embedding matrix (i.e., the Gram matrices $XX^{\top}$ and $ZZ^{\top}$). Because of the isomorphism assumption, one similarity matrix should be equal to the other if some (unknown) row and column permutations were applied. Since trying all the combinations is not feasible, Vecmap sorts the rows of each similarity matrix. If $X$ and $Z$ were strictly isomorphic, nearest neighbour matching would then be enough for building the mapping.
3. Robust self-learning: Starting from this initialization, Vecmap iteratively applies two steps until convergence (a simplified sketch of this loop is shown right after this list):

(a) Compute the orthogonal mapping that maximizes the similarities for the current dictionary D.

(b) Update D using nearest neighbour matching.
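A simplified numpy sketch of this self-learning loop follows (assuming length-normalized embedding matrices and a single orthogonal mapping W instead of the pair W_X, W_Z); the actual Vecmap implementation adds stochastic dictionary induction, frequency cutoffs, and better retrieval criteria, which are omitted here.

import numpy as np

def orthogonal_mapping(X, Z, pairs):
    # Orthogonal Procrustes solution for the current dictionary: the W that
    # maximizes the similarity between the paired rows of XW and Z.
    src = X[[i for i, _ in pairs]]
    tgt = Z[[j for _, j in pairs]]
    u, _, vt = np.linalg.svd(src.T @ tgt)
    return u @ vt

def induce_dictionary(X, Z, W):
    # Nearest-neighbour matching in the shared space (dot product, i.e., cosine
    # similarity for length-normalized embeddings).
    sims = (X @ W) @ Z.T
    return list(enumerate(sims.argmax(axis=1)))

def self_learning(X, Z, initial_pairs, n_iters=10):
    pairs = initial_pairs
    for _ in range(n_iters):
        W = orthogonal_mapping(X, Z, pairs)
        pairs = induce_dictionary(X, Z, W)
    return W, pairs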
Apart from the direct application of bilingual lexicon extraction, Vecmap can be used to initialize
unsupervised statistical machine translation systems, as in [14]. Then, the translations of the
statistical system can be used as a seed to train an unsupervised neural machine translation
system [60], and are further refined.
Remarkably, one of the direct applications of the corpora generated with some of the tools proposed in this work has been precisely leveraging Vecmap, within the MT4All project.
2.3.2 Joint learning

Despite the success of some unsupervised off-line mappings, it has been shown that jointly learning the cross-lingual mappings can be, in some cases, more effective than the off-line setting. The pre-trained embeddings may be optimal on their own, but mapping them a posteriori into a shared representation may lead to local minima. By back-translation [13] we mean the generation
and use of synthetic parallel sentences. Say that we have a system for translating (not necessarily
well) from language A to language B. We then translate monolingual sentences of language A
to language B. The synthetic sentences can be used to train a translation system from B to
A. Back-translation can be applied starting from a system pre-trained with supervision, to increase the training data, but purely unsupervised approaches are also possible by using iterative back-translation [15], alternately generating synthetic data in both directions while the system is iteratively improved.
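Schematically, iterative back-translation alternates the two directions as in the sketch below; translate and train_on_parallel are hypothetical placeholders for the underlying inference and training routines, and real systems interleave this with any authentic parallel data and other training signals they may have.

def iterative_back_translation(mono_a, mono_b, model_ab, model_ba, n_rounds=3):
    # model_ab translates A -> B, model_ba translates B -> A; both can start
    # from a weak (e.g., unsupervisedly initialized) system.
    for _ in range(n_rounds):
        # Synthetic pairs with real A on the target side train the B -> A system...
        synthetic_b = [model_ab.translate(s) for s in mono_a]
        model_ba.train_on_parallel(source=synthetic_b, target=mono_a)
        # ...and vice versa for the A -> B direction.
        synthetic_a = [model_ba.translate(s) for s in mono_b]
        model_ab.train_on_parallel(source=synthetic_a, target=mono_b)
    return model_ab, model_ba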
In the case of machine translation, zero-shot translations (i.e., translation of pairs unseen in training) can be obtained by training, with supervision, multilingual systems with shared representations
and language tags25 [61]. However, this approach still requires supervision for at least some pairs.
A recent work based on BART, mBART [41], applies the same model and training procedure of
BART but with multilingual (but not parallel) corpora, up-sampling the low-resource languages.
mBART can be used for unsupervised machine translation with iterative back-translation (so,
no supervision at all). For doing so, the system is constrained not to output subwords not be-
longing to the target language by masking the output probabilities of subwords with less than
1% occurrences. As we saw in Section 2.2.2, cross-lingual encoder language models such as XLM
can be used for initializing machine translation systems as well.
Regarding the evaluation, we know that in machine translation BLEU [62] is the most commonly
used metric. Some works in unsupervised MT tried to use an unsupervised BLEU (the one
obtained from translating into one language and then translating back to the original language,
and comparing with the original source) [15]. Nevertheless, this approach has the problem that
degenerate solutions (e.g. a system with the two directions, A → B and B → A, that always
translates to the same sentence) can obtain high BLEU scores. Thus, in practice, parallel sets
are used for evaluating unsupervised models.
Recently, though, some works have warned that assuming that no parallel corpora are available
but large monolingual corpora are may not be realistic in many real-life scenarios [63].
2.4 Tokenization, subwords, and vocabulary building
especially built for a given language or domain, such as Stanford's tokenizer for Arabic26.

The emergence of Transformers has coincided with the partial27 abandonment of these classical tokenizers. The reason why this happened is that the newer, subword-based tokenizers have been shown to generally improve the results, not because of any special limitation or need of the Transformer architecture. In fact, the Transformer architecture can work with any discrete sequence (be it characters, words, subwords, or even pixels or audio fragments [66]).
Subword tokenization was invented just a bit before Transformers. Interestingly enough, just like
Transformers, the original motivation of subword tokenization was precisely machine translation.
Specifically, the authors of the seminal work in this regard successfully aimed at improving the
performance of machine translation systems when encountering rare words [67]. The proposed
method, Byte Pair Encoding (BPE), is based on a compression algorithm with the same name [68].
The original algorithm compresses data by recursively replacing the most frequent pair of bytes
with a new symbol (an unused byte). In the case of machine translation, the idea was to do so
with characters instead of bytes. For instance, BPE could encode the word 'internationalization'
as 'inter## national## ization', with '#' being the symbol used to denote subword separation,
according to the relative frequencies in the training set. This might (or might not) correlate
with morphological features, but note that the method is purely based on character counts
and, therefore, language and domain-agnostic. Ever since, some variants of this algorithm have
been proposed. For instance, as in the BPE compression algorithm, researchers have proposed
to use bytes as the unit to encode, instead of characters [69].
Machine translation systems, like language models, suffer from the open vocabulary problem
(i.e., at inference time an arbitrary number of new words unseen in training may appear), but at
the same time, a character-level tokenization may be less efficient and more difficult to learn (due
to producing longer sequences). As a middle ground, taking the best of both worlds, BPE builds a
given word from the subwords in its vocabulary. Otherwise, it can still build the word character-by-
character, if no subwords correspond to it, mostly solving the open vocabulary problem
(instead of assigning a special <UNKNOWN> token to unseen words, or to words in the train set below
a certain frequency threshold).
BPE is learned from the training set. Once the vocabulary has been built from it, it is frozen and
the same tokenization is applied to both validation and test sets. Recovering the initial text is as
easy as replacing the BPE special characters (followed by a space) by an empty string.
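As an illustration, the following is a toy sketch of the BPE merge-learning loop, ignoring end-of-word markers, pre-tokenization, and efficiency concerns, which real implementations handle:

```python
from collections import Counter

def learn_bpe(word_freqs, num_merges):
    """Toy BPE: learn merge operations from a {word: frequency} dictionary."""
    vocab = {tuple(word): freq for word, freq in word_freqs.items()}  # start from characters
    merges = []
    for _ in range(num_merges):
        # Count all adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pairs[pair] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Apply the merge everywhere, replacing the pair with a single new symbol.
        new_vocab = {}
        for symbols, freq in vocab.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_vocab[tuple(merged)] = freq
        vocab = new_vocab
    return merges

word_counts = Counter("the quick brown fox jumps over the lazy dog the quick cat".split())
print(learn_bpe(word_counts, num_merges=10))
```

Note that the final vocabulary is simply the set of characters plus one new symbol per merge, which is why the number of merge operations directly controls the vocabulary size.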
BPE, though, introduces another hyperparameter: the number of merge operations. The more merges,
the larger the vocabulary28. There are studies on the effect of the vocabulary size [70], and one
of their main takeaways is that especially in low-resource scenarios, it may be sensible to set a
low number of BPE operations.
Vanilla BPE and some of its variants, such as BPE dropout [71] (a regularization technique
consisting of making the process of merging tokens or characters stochastic), still assume a pre-
tokenization by a classical tokenizer. Instead, more recent alternatives such as sentencepiece [72]
assume no previous tokenization and rely completely on statistics from the train corpus. The
authors of sentencepiece show that their purely language-agnostic approach outperforms other
alternatives.
26
https://nlp.stanford.edu/nlp/javadoc/javanlp/edu/stanford/nlp/international/arabic/process/
ArabicTokenizer.html
27
Note that they are still extensively used, in many cases in conjunction with subword tokenizers.
28
Specifically, |V| = #ops + |Characters|
Apart from a BPE-based algorithm, they experiment with a unigram-based alternative.
Wordpiece [73] is another frequency-based tokenizer by Google but, unlike the other systems,
it is not open-source and fewer details are known about it. Table 2 shows a comparison of some
of the aspects of several existing tokenizers.
Despite their name, most BPE tokenizers do not actually operate in the byte space. They
typically operate in the character or word (if the text has been pre-tokenized) space. In [74],
they introduced a Byte-level BPE (BBPE), which indeed operates in the byte space. This has
the effect of literally eliminating the out-of-vocabulary problem, since the vocabularies built with
this technique start from the 256 possible byte values and then, as in BPE, recursively apply
symbol merging while keeping the symbols from the previous operations. Thus, this vocabulary
can represent any Unicode string, even if byte by byte in the case of characters not seen in training.
In SentencePiece and Wordpiece, for instance, characters unseen in training will still be replaced
with the [UNK] (unknown) token.
Since tokenization has to be applied to large corpora, at least in the case of language models, some
implementations focused on efficiency have been proposed. For instance, FastBPE29 is a C++
implementation of the original BPE tokenizer. The tokenizers library, released by Huggingface,
provides a fast30 implementation of several subword tokenizers.
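For instance, a byte-level BPE vocabulary in the style of GPT-2/RoBERTa can be trained with the tokenizers library along the following lines; the corpus path, vocabulary size, and special tokens are illustrative choices, not the settings of any particular model discussed here:

```python
from tokenizers import ByteLevelBPETokenizer

tokenizer = ByteLevelBPETokenizer()
tokenizer.train(
    files=["corpus.txt"],              # cleaned, plain-text training corpus (assumed to exist)
    vocab_size=50_257,
    min_frequency=2,
    special_tokens=["<s>", "<pad>", "</s>", "<unk>", "<mask>"],
)
tokenizer.save_model("tokenizer_out")  # writes vocab.json and merges.txt

print(tokenizer.encode("internationalization").tokens)
```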
At the end of the day, the choice of the tokenization strategy to use is another set of hyperpa-
rameters when building language models, but it is a very relevant aspect that can hardly31 be
changed after training. For instance, one can fine-tune a pre-trained model on a given dataset if
the original one did not have enough data from a given language or domain, but if the vocabulary
was built with text from a very different language or domain, the model may struggle even after
fine-tuning.
The data used for building the vocabulary should be as representative as possible of the data
that will be seen during inference. This will be relevant in the comparison of two domain-specific
BERT-based systems, BioBERT and SciBERT, that we will see in Section 3.1. For the same
reason, in the multilingual model mBART [41] the documents of languages with less data were
over-sampled when building the vocabulary, mitigating a problem (that of underrepresented
languages in the vocabularies of multilingual language models) that had been known since at
least the publication of Multilingual BERT.
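This kind of over-sampling is commonly implemented as exponential smoothing of the empirical per-language probabilities; the following is a minimal sketch, where the smoothing exponent alpha is a hyperparameter and the exact formula varies across works (values between roughly 0.3 and 0.7 appear in the literature):

```python
def smoothed_language_probs(token_counts, alpha=0.5):
    """Exponentially smooth per-language sampling probabilities so that
    low-resource languages are over-sampled relative to their raw share."""
    total = sum(token_counts.values())
    raw = {lang: count / total for lang, count in token_counts.items()}
    unnormalized = {lang: p ** alpha for lang, p in raw.items()}
    norm = sum(unnormalized.values())
    return {lang: p / norm for lang, p in unnormalized.items()}

# With alpha < 1, Catalan's share grows well above its raw 1% of tokens.
print(smoothed_language_probs({"en": 1_000_000_000, "ca": 10_000_000}))
```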
Table 2 shows a comparison of some of the available tokenizers, while Table 3 gives more details
on the vocabulary building of some of the models we saw.
Model       Pre-tok    Tok              Vocab. size   Multi-lingual   Lang. oversampling
GPT         Spacy32    BPE              40,478        No              -
GPT-2       Spacy      Byte-level BPE   50,257        Yes             No
RoBERTa33   Spacy      Byte-level BPE   50,257        Yes             No
M-BERT      No         Byte-level BPE   110,000       Yes             No
mBART       No         SentencePiece    40,000        Yes             Yes
32
https://spacy.io/
33
Re-used GPT-2 vocabulary.
34
By document-level we mean that each instance is a sequence of ordered sentences, such as a small paragraph
or even a whole book.
35
https://fasttext.cc/blog/2017/10/02/blog-post.html
36
https://spark.apache.org/
37
https://www.open-mpi.org/
3 Related work
After having contextualized the required background, we now go through the specific works that
are most related with our proposal.
• Fine-tuning an existing model: Notice that we mean re-training with unlabelled data using the
same pre-training objective of the model (but starting from the pre-trained weights),
not fine-tuning on the desired downstream task. BioBERT [18] is the canonical example of
this approach. It is based on BERT-Base (not only the architecture itself, but the weights
as well), and further pre-trained on 18B tokens from the biomedical domain, extracted
from PubMed38 abstracts and full articles from PubMed Central (PMC39).
BioBERT clearly outperformed BERT in different biomedical text mining benchmarks,
from NERC to relation extraction.
• Building a model from scratch: If enough data are available, one might start the pre-
training from scratch, as in the case of SciBERT [80]. SciBERT is also based on BERT-
base, but trained from the ground up. SciBERT was trained on scientific articles from
Semantic Scholar40, with a total of 3.17B tokens, similar to the scale of the corpus of the
original BERT. SciBERT outperformed BioBERT in most benchmarks. The authors hy-
pothesize that one of the reasons why this may be the case is the vocabulary. Crucially,
since BioBERT starts from the pre-trained weights of BERT, it has to re-use the same vo-
cabulary. Instead, SciBERT builds the vocabulary from scratch with in-domain text, using
SentencePiece. The overlap between the new vocabulary and that of the original
BERT is 42%, meaning that the frequencies in the use of subwords are considerably
different.
There are other domain-specific BERTs, such as PatentBERT [81], which uses the same approach
as BioBERT but in the domain of patents. The general idea is that there may be a certain data
threshold beyond which further pre-training BERT on unlabelled in-domain data, which is
expensive, is worth the cost. Furthermore, there may be another threshold beyond which it is
worth pre-training from scratch (rather than fine-tuning). Figure 8 shows a schematic view of
the two approaches.
Another example of a domain-specific BERT is one specifically fine-tuned on the domain of
nuclear physics [55], a challenging, low-resource domain.
38
https://pubmed.ncbi.nlm.nih.gov/
39
https://www.ncbi.nlm.nih.gov/pmc/
40
https://www.semanticscholar.org/
Figure 8: Further in-domain pre-training vs. training from scratch: The two main approaches
for building domain-specific language models are the following: (i) Further in-domain pre-
training (above): take a language model pre-trained on the general domain, re-use its vocabulary,
and continue pre-training it on the in-domain (but unlabelled) corpus. Finally, for applying the
model to the desired downstream task, the user can fine-tune it using the corresponding annotated
dataset. This approach has the advantage of leveraging all the pre-trained knowledge of the
language model trained on the general domain, but the disadvantage of having a vocabulary not
adapted to the desired domain. The paradigmatic example is the BioBERT model. (ii) Training
from scratch: instead, if enough data are available, one can directly pre-train the model on
the in-domain (but unlabelled) data. This approach has the advantage of having a vocabulary
especially built for the domain of choice, but potentially less data. The paradigmatic example is
the SciBERT model. Source: Own elaboration.
They had to build a corpus by developing a text processing pipeline that retrieves tokens from
PDFs with Optical Character Recognition and performs sentence splitting and language detection
with NLTK41, but they only obtain 8 million tokens, several orders of magnitude below the
number of tokens used to train the original BERT. For this reason, they take the BioBERT
approach (fine-tuning), but they try to
41
https://www.nltk.org/
circumvent the vocabulary problem by adding around 100 domain words using BERT's
UNUSED tokens (BERT reserved around 100 tokens for future use).
Related to the issue of the vocabulary when using a model in a domain sufficiently different
from the one it was trained on, but in the case of languages, [75] showed that a BERT model
can be trained on a given language and its English vocabulary then replaced by the one of the
target language (by re-initializing the embedding layer). The model is re-trained with all the
weights frozen, except those of the embedding layer. We observe that this finding could have
implications for future domain adaptation techniques (apart from the cross-lingual transfer use
case).
As far as FlauBERT is concerned, the large version clearly outperformed its multilingual
counterpart, and even CamemBERT. They relied as well on Common Crawl or OSCAR, but
starting with more data (250GB of text) and then cleaning it aggressively, ending up with 41 GB.
Concatenated with other corpora, this resulted in a dataset of 71 GB. It could be the case that
the small improvements in performance of CamemBERT when increasing the data size from 4GB
to 130 GB were caused by the model not being big enough to take advantage of the increase, since
they used RoBERTa base instead of RoBERTa large (unlike FlauBERT). In this case, a pretok-
enizer, Moses, was used, followed by vanilla BPE, with a vocabulary size of 50k. FlauBERT was
trained in a French HPC center, not on GPUs from cloud providers.
In some other language-specific models, the authors leveraged the release of the language-specific
model itself to experiment, in addition, with other more general settings. For instance, an Italian
model named GilBERTo46 was released, using the same approach as CamemBERT, with OSCAR
as the source of the corpus and a SentencePiece vocabulary of 32k tokens. But another Italian
model, UmBERTo [83], introduced whole word masking, which had been implemented in an updated
version of the original repository of BERT. Both the original BERT and RoBERTa randomly
mask 15% of the tokens, which happen to be subwords. In UmBERTo, the authors applied
the masking to whole words (even if this implied masking more than one consecutive
subword), which seemed to improve training, forcing the model to extract more information
(instead of, perhaps, guessing from local correlations). Works using SentencePiece, without any
pre-tokenization, can still do whole word masking, by using whitespaces as delimiters.
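A minimal sketch of whole word masking over WordPiece-style subwords (where continuations are marked with "##") could look as follows; real implementations additionally cap the total fraction of masked tokens and apply BERT's 80/10/10 replacement scheme, which are omitted here:

```python
import random

def whole_word_mask(tokens, word_mask_prob=0.15, mask_token="[MASK]"):
    """Group subwords into whole words (continuations start with '##')
    and mask all subwords of each selected word together."""
    words, current = [], []
    for i, token in enumerate(tokens):
        if token.startswith("##") and current:
            current.append(i)
        else:
            if current:
                words.append(current)
            current = [i]
    if current:
        words.append(current)

    masked = list(tokens)
    for word in words:
        if random.random() < word_mask_prob:
            for i in word:
                masked[i] = mask_token
    return masked

print(whole_word_mask(["val", "##tio", "##vara", "##in", "##minister", "##i", "on"],
                      word_mask_prob=0.5))
```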
We are especially interested in the Finnish BERT47 [54]. It is one of the works that places the
most emphasis on data cleaning. Not only do the authors give importance to this factor, but they
document it and devote a large portion of their article to this matter; we will see more about it in
Section 3.3. Regarding the vocabulary, again, they observed that the multilingual BERT tokenizer
produced too many tokens per Finnish word (in contrast to the case of English words). This
tokenization is thought to make learning more difficult, especially since the subwords end up being
so short that they are less linguistically interpretable. The authors used a Finnish pretokenizer
and then applied BPE, showing that with the new vocabularies considerably fewer subwords per
word were generated. The Finnish BERT was trained in a Nordic HPC center.
Apart from documenting their cleaning of the Finnish text, which perhaps should not be literally
copied for other languages but can at least be used as a source of inspiration48, they went one step
further and published Wikibert [84], an open-source49 and generic pipeline for training
BERTs for many different languages. This pipeline is supposed to support all Indo-European
languages, and consists of the following steps:
3. Download the corresponding UDPipe50 model.
4. Sentence splitting.
7. Basic pre-tokenization.
9. Create TF records51 .
The authors provide numerous pre-trained models with this pipeline, which works mostly out-
of-the-box (yet, not including the training scripts themselves) for many languages, provided they
have enough Wikipedia entries and a UDPipe model available. For instance, Catalan is one of
the compatible languages because it fulfills the two requirements. However, while this system
is an excellent choice for obtaining reasonable baselines, the resulting models will not have seen
data from crawlings (just Wikipedia). Unrestricted web crawlings are larger than Wikipedia,
and in the work of CamemBERT it is claimed that web crawlings are also more diverse in terms
of genre and domains than Wikipedia, which generally implies better generalization. In addition,
processing Wikipedia is not as challenging as doing so with crawlings (so it is unclear whether
this pipeline's preprocessing techniques would generalize to unrestricted crawlings). Finally, the
system is intended to be used with BERT's original Tensorflow 1 repository, but many other
libraries and models have been introduced since then.
Other relevant examples of successful language-specific models for languages with fewer resources
than English or French are the Basque BERT (BERTEus) [85], the Korean BERT [86], the
Estonian BERT [87], or the Dutch BERT [88]. In the case of the Basque BERT, the authors made
an observation regarding multilingual tokenizers, consistent with other views we
have seen. The tokenizer of the multilingual BERT, when applied to Basque (a language seen in
training, but underrepresented in the training set), tends to split words into shorter subwords that
happen to be less interpretable. A vocabulary built from Basque corpora results in subwords
that are closer to being linguistically interpretable, due to the relative frequencies. Table 4 shows
some examples of these tokenization artifacts.
Regarding Spanish, a Spanish BERT, named BETO [89], was recently released, and it outperformed
its multilingual counterpart in a number of benchmarks. The architecture used, RoBERTa-base
but with the number of attention heads and the embedding size of RoBERTa-large, was trained
using whole word masking. As far as the vocabulary is concerned, they used 31k subwords built
with SentencePiece. Regarding the data, they used an aggregation of different Spanish corpora
with almost no preprocessing [90]. We will see more about this preprocessing in Section 3.3.
50
UDPipe is an open-source trainable pipeline for tokenization, tagging, lemmatization, and dependency parsing.
See https://github.com/ufal/udpipe
51
This pipeline assumes that the code of the original BERT, which is based on Tensorflow 1, will be used. The
original repository works with TF records, a data format optimized for this library.
Word                       M-BERT subword tok                 Lang-specific subword tok
etxerantz52                et #xer #ant #z                    etxera #ntz
medikuarenera53            medi #kua #rene #ra                mediku #aren #era
valtiovarainministeri54    valt #io #vara #in #minister #i    valtiovarain #ministeri
vaihtuu55                  vai #htuu                          vaihtuu
Table 4: Tokenization artifacts when text in underrepresented languages (the same happens with
domains with a sufficiently different lexicon) is tokenized: When a multilingual tokenizer is applied
to underrepresented languages, it is common to see that a considerable number of subwords per
word is generated, and these subwords tend to be less interpretable. It is hypothesized that they
are more difficult to learn (since sequences become artificially longer). Source: Collected from
the articles of the Basque and Finnish BERTs, respectively.
Table 5: Comparison between some of the available language-specific models. The same units
are used whenever possible (depending on how the authors report them). For example, some
works report training time in time units (normalized to days, "d"), while others provide
the number of training steps instead; the same happens with batch size (tokens, "tok",
or sentences, "sent") and data size (file size in gigabytes, "GB", or number of tokens, "tok").
Notice that BERT and RoBERTa use the same architecture, but a different pre-training task
(models based on BERT use the additional task of next sentence prediction). For the sake of
brevity, we denote both with the same field, architecture. Note as well that all NVIDIA V100
mentioned in this table, and in most of the works, are actually the 32GB version (instead of 16GB).
The original BERT provided cased and uncased models, as well as base and large architec-
tures. Most of the works, though, do not provide all 4 combinations, due to computational
constraints. For instance, in the case of the Finnish BERT, they provided uncased and cased
versions of the BERT-base architecture. Regarding model size, if a given domain or language
does not yet have a specific baseline or does not have large amounts of data, it may make sense to start
52
”To the house” (Basque).
53
”To the doctor” (Basque). Notice the interpretability of the subwords: mediku (”doctor”), aren (”’s”), era
(”to the”).
54
”Finance minister” (Finnish). Again, notice that M-BERT generates more subwords and that the language-specific ones are more meaningful.
55
”Exchange” (Finnish).
57
Whole Word Masking
58
They also trained a RoBERTa-base version, in this case 32 V100 for 410 hours.
59
Two versions with the same architecture: cased and uncased.
60
They use two batch sizes. The first one, for most of the training, with a maximum sequence length of 128.
Then, with a maximum sequence length of 512, the batch size is decreased to fit in memory. This was thought to
help training, but it is disputed and more recent models do not usually do it.
61
The same holds in the case of the Finnish BERT.
with the base architecture. Regarding casing, even though the uncased versions might outper-
form their cased counterparts in specific benchmarks, cased models should be the priority, since
they generally perform better as reported in the literature61 .
Table 5 shows basic comparative information between some of the available language-specific
models in terms of the models themselves, while Table 6 does so in terms of their vocabulary,
including some domain-specific ones.
Table 6: Comparison between the vocabulary building strategies of some of the language and
domain-specific models.
Encoder-based models are considerably more popular in the domain and language-specific scene,
and we observe a series of reasons why this is the case. First of all, they are more conve-
nient for fine-tuning for discriminative downstream tasks (i.e., token-level and sentence-level
classifications), which are the most common use cases. Second, the multilingual BERT is a well-
established baseline for numerous languages, against which language or domain-specific models
can be compared in NLU tasks. In the case of generative or sequence-to-sequence models, there
are no such well-established multilingual baselines and tasks. Finally, obtaining useful represen-
tations may be easier than generating coherent sentences, especially in languages or domains with
fewer resources. Nevertheless, we observe that it would be interesting to explore the development
of generative and sequence-to-sequence models for a variety of languages and domains. A recent
work introduced a French BART [91], which is competitive with RoBERTa-based French models
in discriminative tasks while gaining generative capabilities.
For a reference of the results of language-specific models, we recommend a recent study on the
matter [92].
Regarding OpenAI's GPT models, they followed a preprocessing-lightweight approach. Apart from
using FTFY for fixing encoding errors, it seems that they did not apply many other preprocessing
steps, for a number of reasons. Specifically, these models are huge and may need less cleaning. In
addition, they precisely wanted to prove that their models do not need extensive preprocessing.
However, OpenAI never open-sourced the preprocessing scripts. Furthermore, they never released
the crawling-based datasets they built, such as the dataset referred to as WebText in their articles.
In addition, the books corpus is no longer publicly available in the original repository, and the
books2 dataset they mention was created by OpenAI and never released63. In other words, we
are not really sure about the details of their preprocessing, and perhaps they applied additional
steps (e.g., in the WebText corpus, extracted from a crawling, we understand that they extract
the text itself from the HTML).
This encoding-related preprocessing can, indeed, fix certain problems of raw text, but it cannot
magically distinguish between natural and non-natural text, or between high-quality and low-
quality text. Furthermore, regardless of the quality, the user may be interested in a specific kind
of text (e.g., text in a specific language). In other words, FTFY and other encoding-related tools
are only one part of the story, albeit a crucial one.
As a successful example of an end-to-end corpus generation process, Paracrawl [94] consists of
a set of large-scale parallel corpora for European languages, targeting sentence-level machine
translation. Interestingly, apart from the datasets themselves, they have released their crawling
and cleaning pipeline, Bitextor. This pipeline comprises several steps, namely: 1. Crawling;
2. HTML preprocessing, normalization, and information augmentation; 3. Document alignment;
4. Sentence alignment within documents identified as parallel; 5. Filtering of noise, deduplication,
and output formatting.
and output formatting. Bitextor was designed with scalability in mind, being able to deal with
large amounts of data, and it is HPC-friendly (e.g., it integrates with the SLURM64 job manager,
installed in many supercomputers and clusters). Bicleaner, a tool for filtering parallel data, is
integrated in this pipeline. Bicleaner ranked among the best toolkits for cleaning parallel corpora
at WMT18 [95]. As we will see, Paracrawl will serve as inspiration for us, but their use-case (focus
on parallel corpora) is different from ours (monolingual/unlabelled corpora), even if at least
some of their components could be useful to our needs. We observe that: 1. Their focus on
machine translation influenced the whole design of the pipeline towards the goal of obtaining
parallel corpora, while a focus on monolingual text could probably lead to obtaining more text;
2. Their language identifiers65 did not work especially well in some of our use-cases (e.g., Catalan,
biomedical Spanish); 3. We needed more flexibility in terms of the applied filters and input and
output formats; 4. We needed document-level deduplication, and we do care about document
coherence.
Regarding monolingual corpora generation, the WaCky initiative [96], which dates back to 2009,
is worth mentioning. It was one of the first large-scale projects for building monolingual, un-
labelled corpora from the web. They produced 1B-token datasets for English, German, and
Italian, and shared the tools for doing so. At the time, this was considered to be large, but,
nowadays, this size would not be considered impressive taking into account that those languages
63
See https://github.com/pytorch/fairseq/issues/2947 in the case of RoBERTa and https://gist.
github.com/alvations/4d2278e5a5fbcf2e07f49315c4ec1110 in the case of GPT. We were particularly aware
that for certain languages and domains obtaining corpora for training language models could be problematic, but
it seems that even for English this can be the case as well.
64
https://slurm.schedmd.com/documentation.html
65
Namely, Cld2 and Cld3 https://github.com/google/cld3
are not low-resource. A more recent (2012) approach based on the WaCky initiative introduced
a software toolkit that applies basic cleaning, simple connected text detection (by removing boil-
erplate), and deduplication [97]. The pipeline used66 is an impressive piece of engineering, even
if the fact that it is based on Pascal might make it less attractive to other developers nowadays.
For removing boilerplate, they use a simple neural network with hand-engineered features such
as the ratio of text characters vs. markup characters, or the ratio of uppercase vs. lowercase
characters. They generate document-level corpora, removing near duplicate documents.
In addition, there is CommonCrawl67, a large-scale multilingual crawling. While it
can be used for extracting parallel sentences [98], it is perhaps better known as the source of the
OSCAR corpus. OSCAR (”Open Super-large Crawled ALMAnaCH coRpus”) [99] [100] is a
multilingual (yet not parallel) corpus precisely targeting models that benefit from unlabelled
text. Once the data are downloaded from the CommonCrawl repositories, the authors apply a
language identifier to organize the corpus into different files, one for each language. Specifically,
they use FastText's [101] [102] language identifier68. Thus, unlike Paracrawl, their scripts do
not discard sentences without a parallel counterpart. Like Paracrawl, though, they also open-
sourced their pipeline, GoClassy, which is considerably simpler than Paracrawl's Bitextor. It
does not include the scripts for downloading the data, but they do include utilities for language
identification, as mentioned, and sentence-level deduplication. Sentence-level OSCAR is publicly
available, but for obtaining the document-level version, one must directly contact the authors.
Some of the language-specific models that have been published have released their code for
cleaning the data. However, in most cases the code was quite specific to their respective use
case. We highlight the case of BETO [89], the first Spanish BERT, for which the authors aggregated
different existing Spanish corpora and concatenated them using a very simple preprocessing
script. As an example, we observed that this script did not split sentences correctly (e.g., dots
in acronyms were detected as sentence endings), and all corpora were concatenated sentence-by-
sentence, thus missing document-level boundaries.
As said in Section 3.2, we find the case of the Finnish BERT especially interesting, since the
relevant paper gives more details about text cleaning than usual. They aggregated text from
different sources, namely two news corpora, a corpus coming from a large forums website in
Finland, and unrestricted web crawls. The cleanup and filtering consisted of the following steps:
3. Aggressive filtering using language detection and hand-written heuristics (e.g., removing
documents with too high a ratio of digits, non-Finnish alphabet characters...).
Work           Data source    Data size before cleaning   Data size after cleaning (%)
Finnish BERT   Crawlings      13.5B tok                   3.3B tok (22.76%)
FlauBERT       CommonCrawl    215GB                       43.4GB (20.19%)
Table 7: Results of two of the cleaning strategies studied: We observe considerably aggressive
strategies when cleaning crawlings, resulting in large reductions in corpus size. Sizes are provided
in different units due to the different criteria used to report them in the original works, but relative
sizes are also provided to ease the comparison.
Table 7 shows statistics of the different cleaning strategies in some of the referenced works.
Finally, regarding the specific data we are targeting in this work, we highlight:
• General domain Spanish: We already saw that for training the Spanish BERT, BETO,
the authors collected and preprocessed data from many different sources, totalling around
3.3B tokens. The corpus is, though, sentence-level, and we identified some problems in the
sentence splitting.
• Biomedical Spanish: For the English biomedical domain, in the works of BioBERT and
SciBERT, large amounts of scientific articles (e.g., from PubMed) were extracted, ranging
between 3B and 18B tokens. In the case of Spanish, similar collections (albeit at a smaller
scale: 182M tokens in total) have been generated for learning domain word embeddings
[103], aggregating articles from Scielo69 and health articles in Wikipedia. In this case, the
problem is that this scale, while enough for word embeddings, could be too small for a
language model. In addition, domain documents collected from Wikipedia were not always
actually related to the biomedical domain (using Wikipedia tags can be problematic).
• General domain Catalan: For most languages, including Catalan, Multilingual BERT was
trained on the corresponding Wikipedia, with around 200M tokens. caWac [104], the
largest Catalan corpus published to date, is composed of more than 780M tokens
coming from a large-scale crawling of the top .cat domains. For the release, the authors
tokenized and tagged the corpus. CuCWeb [105], a predecessor of caWac, consisted of
a corpus of 166M tokens coming from a crawling. Table 8 shows how existing Catalan
crawlings compare.
• MT4All language pairs: The MT4All project targets the following language pairs: 1. Finnish,
Norwegian, Latvian ↔ English for the financial domain; 2. Ukrainian, Georgian, and
Kazakh ↔ English for the legal domain; 3. Norwegian, Spanish, German ↔ English for
customer support; 4. Spanish ↔ English for the biomedical domain; 5. Basque and
Catalan ↔ English for the general domain. While the biomedical and general domain use
cases are covered by the referenced literature (and, in the case of Basque, see [106] and the
aforementioned Basque BERT, BERTEus), the other domains are considerably specific.
The literature on generating corpora for those domains is limited, especially for languages
other than English. In this work, we will explore the Finnish, Norwegian, Latvian, Catalan,
Basque, and Spanish use cases.
69
https://scielo.org/es/
Work                  Source                  Size   Preprocessing      Document-level
CuCWeb (2006) [105]   Spanish websites (IP)   166M   LI, TOK, SP, TAG   X
caWac (2014) [104]    Top .cat domains        780M   LI, TOK, SP, TAG   X
OSCAR (2019) [99]     CommonCrawl             728M   LI, SP
Table 8: Existing Catalan crawlings. LI means Language Identification; TOK means tokeniza-
tion; SP means sentence splitting; TAG means tagging.
Work               Model    Data                               Parameters   Catalan vocabulary
mBERT [6]          BERT     Wikipedia (200M tokens)            110M
Wikibert ca [84]   BERT     Wikipedia (200M tokens)            110M         X
Calbert70          ALBERT   OSCAR (720M tokens, no cleaning)   12M          X
Table 9: Existing Catalan language models: mBERT (Google's Multilingual BERT) was trained
with large amounts of multilingual data, but for Catalan specifically it only saw around 200M
tokens, from the Catalan Wikipedia. Wikibert ca is a monolingual model trained only on the
Catalan Wikipedia. Finally, Calbert is an ALBERT model trained with the Catalan section of
OSCAR, with no preprocessing. Calbert was published as an experiment in the Github repository
cited in the table, but it does not have any accompanying paper or evaluation.
Finally, regarding the specific languages and domains we want to generate new corpora for
(and, ideally, models), we observe that while there are already corpora and models for Catalan,
general-domain Spanish, and biomedical Spanish, they are considerably smaller than those for
other languages. The same applies to the languages targeted by the already mentioned MT4All
project, namely Finnish, Norwegian, Latvian, and Basque. Specifically for the case of Catalan,
Table 9 shows existing Catalan language models.
70
https://github.com/codegram/calbert
4 Settings
In this section, we describe the settings we intend to use for this project. These settings
will serve both as initial motivation (even if the approach aims to be more generic rather than
just addressing these specific use cases) and as validation of the methods.
4.1 Data
Regarding the data, we will apply our proposed text processing pipeline to a heterogeneous set
of corpora in terms of languages, domains, scales, and targeted use-cases. Using our pipeline
with each of these data sets has its own motivation, as we will see, but at the same time, it will
show the performance of the pipeline under different scenarios.
4.1.1 Catalan
In the case of Catalan, we will make use of the following raw data sources:
• New crawlings: During 2020, three new Catalan crawlings were run at the Barcelona
Supercomputing Center (BSC). The first one targeted the top .cat, .ad (Andorra), and
.barcelona domains. The second one targeted websites from the Catalan Government, for
which the government, in collaboration with BSC, had to explicitly allow access to BSC
crawlers. Finally, a similar crawling, also with explicit access allowance, was done for the
Catalan News Agency website. Section 6.1.1 gives more details about these crawlings.
• Existing crawlings: In Section 3.3, the caWac corpus, a relatively large crawling, consisting
of 780M tokens, is presented. Instead of using the tokenized version that was published,
we contacted the authors in order to obtain the original raw version, to which we apply
our cleaning pipeline described in Section 5.3.
• Wikipedia: The Catalan Wikipedia consists of approximately 200M tokens at the time of
writing this section. It is naturally document-level, and using a lightweight preprocessing
one can easily obtain high-quality text.
• Other sources: The Catalan section of the OSCAR corpus, which is a mostly unprocessed
document-level web corpus consisting of around 720M tokens, and the publicly available
sentences from OpenSubtitles71 and the DOGC dataset72. In all cases, after inspection, it
became clear that they needed to be preprocessed before they could be used.
• New crawling: The Spanish National Library (Biblioteca Nacional Española, BNE73) provided
a massive web crawling of the top 4557 .es domains, with a depth of 5. The crawling was
conducted between 22/05/2019 and 13/10/2019. All in all, the crawling resulted in
616703 files, each line of which is a JSON object with the data and metadata of a specific
web page. The huge size of the raw data (around 45TB of raw WARC files, which are even
larger than the corresponding JSON files) is remarkable and makes it challenging to process.
The text was extracted from the WARC files with Selectolax74. The software used for the
crawling was the one used by the Internet Archive75. After inspecting the raw data, the need
for cleaning is clear, since it cannot even be assumed that all websites are in Spanish.
• BETO corpora aggregation: In Section 3.3, we mentioned that for building a Spanish
BERT, BETO, the authors already aggregated and preprocessed different sources for
general-domain Spanish, including Wikipedia, OSCAR, and others. All in all, the cor-
pus consists of almost 3B tokens, after a light preprocessing (by the authors of BETO).
• A new Spanish biomedical crawling: The Spanish Health Web Corpus (referred later as just
”Medical Crawler”), or ”Corpus Web Salud Español (CoWeSE)”, is a new crawling run at
BSC. The selected websites belong to at least one of the following categories: 1. Medical
communities, 2. Scientific communities, 3. Medical journals, 4. Research centres, 5. Phar-
maceutical companies, 6. Informative websites about health issues, 7. Patient associations,
8. Personal blogs from healthcare professionals, 9. Hospital websites, 10. Public health or-
ganizations. In total, it consists of more than 4500 websites, stored in one WARC file for
each of them. While mostly in Spanish, text in Catalan and Galician is also present.
• TeMU’s biomedical Spanish collection: The Text Mining Unit of Barcelona Supercomputing
Center76 has collected different corpora of biomedical articles and anonymized clinical
histories, including cardiology, covid-19, and radiology:
– CardioCC, CovidCC, and RadioCC: Corpora extracted from cardiology clinical cases
(150k tokens), covid-19 clinical cases (82k tokens), and clinical cases involving radiol-
ogy (177k tokens), respectively.
– Libros Casos Clinicos: A miscellaneous collection of clinical cases, totalling more than
1 million tokens.
– EMEA: EMEA is a corpus of biomedical text retrieved from the European Medicines
Agency77 (EMEA). It consists of around 13.8M tokens.
– Patents: A corpus generated with patents (related to the biomedical domain), com-
posed of more than 14M tokens.
73
http://www.bne.es/
74
https://pypi.org/project/selectolax/
75
https://github.com/internetarchive/heritrix3
76
https://temu.bsc.es/
77
https://www.ema.europa.eu/en
– BARR2 Background set: The background set of TeMU’s Biomedical Abbreviation
Recognition and Resolution task78 : Corpus extracted from biomedical literature in
Spanish used in an abbreviation recognition and resolution task. The text itself
is document-level and basically consists of abstracts from biomedical articles, with
28.87M tokens.
– REEC: ”Registro Español de Estudios Clı́nicos”79 (REEC) is the Spanish Registry of
Clinical Studies. The resulting corpus consists of 4.58M tokens.
– SciELO: A corpus made of biomedical articles, extracted from the SciELO reposi-
tory80 . It consists of a document-level corpus of 61.84M tokens.
– Mespen Medline: Corpus generated from medical articles extracted from Medline81 .
This repository is better known for storing papers in English, but it has a portion of
Spanish articles as well. In total, it has almost 110M tokens.
– PDFs general: A document-level, miscellaneous collection of biomedical Spanish text
extracted from PDFs. It consists of 129.12M tokens.
• Wikipedia Life Sciences: For this work we use a new crawling of the health-related articles
in the Spanish Wikipedia, run at BSC, with a more sophisticated strategy than the one
in [103] (traversing the nodes belonging to health-related categories up to a certain depth,
and discarding records belonging to a list of categories not related to, but potentially
conflated with, health topics).
4.1.4 MT4All
• Financial domain for Finnish, Norwegian, Latvian ↔ English: One option could be to select
financial domain text from large general domain corpora, although this could hardly be
effective for languages other than English (because they will not have enough data of this
domain in a general crawling). Instead, we will opt for running new crawlings on websites
specifically linked to the financial domain. These crawlings will be cleaned with the cleaning
pipeline described in Section 5.3.
78
https://temu.bsc.es/BARR2/
79
https://reec.aemps.es/reec/public/web.html
80
http://scielo.isciii.es/scielo.php
81
https://medlineplus.gov/
82
https://temu.bsc.es/
• Biomedical domain for Spanish ↔ English: For biomedical Spanish, we refer to Section
4.1.3. For biomedical English, the data collected in BioBERT [18] and SciBERT [80] can
be used out-of-the-box, so it is not really a concern.
• General domain: Basque, Catalan ↔ English. For Catalan, we refer to Section 4.1.1. For
English and Basque, we will apply the cleaning pipeline described in Section 5.3 to the
English and Basque sections of the OSCAR corpus, respectively.
4.2 Environment
As we said, the proposed architecture (meaning the general architecture of the system, not the
neural architecture) is heterogeneous in the sense that each component may leverage a different
kind of device.
Regarding data gathering, most of the raw data used for this project came from web crawlings.
I was not personally involved in this part of the process, but for the sake of completeness, let
us describe the hardware that was used. The crawlers run at BSC by the Operations
department83 were executed on a regular Linux virtual machine with 6 cores, parallelized with
MPI. We do not know the specifics of the hardware used in the BNE crawling, which was executed
on the BNE servers. The output of this crawling was transferred to BSC's storage facilities.
All data, regardless of their origin and state during the whole process, are stored at the large
storage facilities of BSC, which leverage IBM's General Parallel File System (GPFS). The data
partitions connected to the compute nodes are SSD-based.
As far as the text processing pipelines are concerned, they are run on a supercomputer, MareNos-
trum484. This supercomputer is based on Intel Xeon Platinum processors (Skylake). Regarding
the operating system, it runs SuSE Linux Enterprise Server. Each compute node has 48 cores
(2 sockets Intel Xeon Platinum 8160 CPU with 24 cores each @ 2.10GHz) and 96GB of main
memory.
Regarding training (and, in fact, also evaluation), the models themselves run on a GPU cluster:
specifically, a PowerPC cluster in which each node has 4 x NVIDIA V100 (Volta) GPUs with
16GB HBM285. Each node has 2 x IBM Power9 8335-GTH @ 2.4GHz (3.0GHz on turbo, 20
cores and 4 threads/core, for a total of 160 threads per node), 512GB of main memory, and 2
units of SSD of 1TB as local storage. The operating system is Red Hat Enterprise Linux Server
7.5, with CUDA 10.1 and the required drivers pre-installed. Notice that these GPUs have 16GB
of memory, unlike all the works using NVIDIA V100 we saw in Sections 2 and 3 (which used the
32GB version); this will pose some challenges (see Section 5.6.3).
In both MareNostrum4 and CTE-POWER, nodes are inter-connected with InfiniBand (IB)86, a
computer networking technology with low latency and high throughput.
83
Namely, the new Catalan crawlings, the biomedical crawling, and the MT4All crawlings.
84
https://www.bsc.es/support/MareNostrum4-ug.pdf
85
https://www.bsc.es/support/POWER_CTE-ug.pdf
86
https://www.mellanox.com/pdf/whitepapers/IB_Intro_WP_190.pdf
5 Methods
Once we have seen the required background, the current state of the art, and the settings we are
targeting, we will go through the specifics of our proposal. First of all, let us clarify what we
mean by document, document boundaries, or document-level corpora, since these concepts will
be of the utmost importance in this section and the rest of the thesis. We understand a document
to be a sequence of sentences. Note that a document is not equivalent to a file in a computer
system, but a unit of meaning composed of a number of sentences. A given file may contain one
or more documents. For instance, if we take one of the plain text files extracted from a Wikipedia
dump, each of the articles may be considered a document. Paragraphs themselves could also be
considered documents, but in this work we have kept as much context as possible. Document
boundaries may be denoted by markup (e.g., different items in an XML), or by conventions in a
plain text file (typically, documents are separated by one empty line). Finally, a document-level
corpus is one that preserves those boundaries. As we have seen, most algorithms for training
language models require document-level corpora. Even when they do not, document-level corpora
let the model learn long-term dependencies.
In this work, we propose a method for generating new corpora for training language models at
scale, even if the method is generic enough to generalize for creating datasets for all NLP models
requiring large unlabelled datasets (e.g., unsupervised or semi-supervised machine translation,
or word embedding algorithms). We then show how the generated corpora can be effectively
used for training language models from scratch.
5.1 Overview
The whole process comprises the following steps:
1. Data gathering and storage: It consists of the execution of new crawlings or, alternatively,
downloading or collecting existing corpora. In the case of challenging domains, the mere
fact of gathering existing datasets is a relevant process that cannot be overlooked. It is
of the utmost importance to list all the available corpora and to properly collect their
respective metadata.
2. Cleaning and formatting: We build from scratch a cleaning pipeline that takes raw text,
in a number of different formats, as input and cleans it. The process is model-agnostic, so as
not to lose information or be too specific to a given architecture, but opinionated enough
to enforce certain properties that we believe to be generally desirable (e.g., keeping document
boundaries whenever possible).
3. Model-ready preprocessing: Once corpora have been cleaned, they have to be specifically
preprocessed according to the needs of the targeted application and model. We consider
this process as part of the model building phase itself. Typically, this step will involve a
subword based tokenization, but it is not always the case.
4. Model training: Once the data is ready, the model can be trained. We will document the
specifics of training language models from scratch.
5. Evaluation: In the case of the generated corpora, it would be both costly and technically
complicated to intrinsically evaluate a specific cleaning process. We will, instead, provide
qualitative analysis of the results of the cleaning. Regarding models, language models
must be evaluated on downstream tasks since the pre-training metrics are not necessarily
informative of their usefulness.
Example                              Desired?    Reason
Removing non-natural language        Desired     Unnecessary noise, waste of compute, probably not present in inference
Discarding text in other languages   Desired     Not useful, wastes model capacity on unnecessary use cases
Spell checking                       Undesired   Not realistic in inference, can cause lack of robustness to typos
Terminology normalization            Undesired   Not realistic in inference, prevents the model from learning synonyms
Lower-casing/true-casing             Undesired   Loss of information, too opinionated
Tokenization                         Undesired   Loss of information, too opinionated
Removing boilerplate                 Desired     Document coherence
Removing duplicates                  Desired     Waste of compute and storage, potential risk of overfitting
Table 10: Examples of desired and undesired cleaning: The term text cleaning is ambiguous
(e.g., is normalizing text with spell checkers part of the cleaning?). We show specific examples
of text processing techniques we will apply in the cleaning and formatting phase. Some of the
methods declared as undesired in this table will be applied, but in the model-ready processing
pipeline, not in the cleaning pipeline. We consider preprocessing steps such as tokenization to be
part of the model building phase itself.
The cleaning pipeline has been designed from scratch with the goal of being as generic (while
targeting European languages), flexible, extendable, and memory- and compute-efficient as pos-
sible.
The cleaning pipeline takes as input raw text (in whichever format is required), and produces
output in the specified format. It is model agnostic in the sense that it does not prepare the
output for any specific model (e.g., it does not tokenize or add special tokens). Nevertheless,
it is strongly opinionated in the sense that it was conceived with the goal of keeping document
boundaries, which were considered to be of increasing importance, and several components are
designed with this goal in mind.
Figure 9: Cleaning pipeline overview.
Figure 9 gives an overview of its design.
In this subsection, we will go through the different components, while giving a global under-
standing of the whole process.
This component parses raw data. The parent DataParser class implements the heavy-lifting
utilities for efficiently guessing the encoding of a file (unless explicitly provided) and parsing it,
opening binary files if required, etc. For encoding guessing, it leverages the chardet module88.
We also experimented with UnicodeDammit89 and MagicLib90, but the chardet library seemed
more complete (in the case of text) and better documented. We lazily feed the contents of each
file to chardet's UniversalDetector, until a given confidence threshold on the encoding is reached,
or the process times out (in which case we assign UTF-8 by default). The parser also lists all the
files to process in the given directory, which will later be distributed to the different workers in
the parallel mode. In fact, the parser could be easily extended to read directly from any stream,
not necessarily based on local files (e.g., a live crawling). In practice, for large crawlings, it is
better to store the results of the raw crawling and then, as a separate step, apply the cleaning
pipeline (which allows, for instance, sampling from the raw data and testing different cleaning
strategies).
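A simplified sketch of this lazy encoding detection follows; the chunk size and byte budget are illustrative choices (in the pipeline, a confidence threshold and a timeout play this role):

```python
from chardet.universaldetector import UniversalDetector

def guess_encoding(path, max_bytes=1_000_000, default="utf-8"):
    """Lazily feed a file to chardet until it is confident (or a byte budget is hit)."""
    detector = UniversalDetector()
    fed = 0
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(64 * 1024), b""):
            detector.feed(chunk)
            fed += len(chunk)
            if detector.done or fed >= max_bytes:
                break
    detector.close()
    result = detector.result or {}
    return result.get("encoding") or default   # fall back to UTF-8 by default
```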
The data parser has the responsibility of keeping document boundaries, if the raw data provides
them (for instance, the Wikipedia data parser parses each article as an independent document),
or even of trying to generate them, if they can be inferred. For instance, in the case of web
crawlings, each page within a website is considered a document, taking only the content inside
paragraph tags to try to avoid boilerplate (e.g., copyright notices or footers), with the goal of
returning connected text. Document metadata is also parsed, as long as it is provided by the
corresponding input format.
By inheriting from the parent class, extending the program to parse different formats takes a few
lines of code. The child class must only implement the parse file method and declare the
targeted extensions (if any), such as "*.xml". As we said, the core of the implementation is
built around Python generators91, which allow the data parser to lazily read potentially large
files with a minimal memory footprint. There are several parsers implemented:
• BSC crawling JSON parser: Parses a JSON-based format used in some of the crawlings
(especially the BNE crawling) run or organized by BSC. Apart from the raw text, it
has other fields related to metadata, such as keywords of the original website, the title, or
the URL.
• Onion parser: Document-level format allowing to store metadata of each document. It is
used internally by the program. The name refers to the Onion deduplication tool93 .
• Sentence parser: Sentence-level, plain text file format. Sentences are provided line by line.
• Document parser: Similar to the Onion parser, but for external usage (e.g., without dedu-
plication marks).
• Textfile parser: Similar to Fairseq LM and sentence parser, but in this case, all text in the
same file is assumed to belong to the same document (instead of relying on empty lines as
document boundaries, or being a sentence-level format).
• WARC parser: The WARC94 (Web ARChive) format stores multiple resource records (data
objects), typically consisting of websites dumped from a crawling. The raw crawlings used
in this project were stored in this format.
• Wikipedia parser: Parses extracted Wikipedia dumps, keeping its metadata and considering
each article as a document.
The factory pattern95 is used to build these parsers in a simple way. In the first implementations,
the data parser was implemented as a streamer of documents. Later, while keeping the use of
generators, it was rethought as a mapper (mapping paths to streams of documents), to simplify
the parallel implementation.
To the best of our knowledge, no other text cleaning pipeline implements so many formats by
default, and the encoding guessing feature is not common in other cleaners we have seen. Our
focus on extendability and genericity is especially visible in this component.
The encoding fixing module leverages ftfy [93] to fix common encoding errors, also known as
mojibake. This component assumes that, encoding errors aside, the encoding of the text is
already UTF-8 (which is guaranteed, at least up to a point, by the data parser). In addition, it
applies Unicode NFKC normalization96, meaning that certain Unicode characters are normalized
with an equivalence table. For instance, the standalone ellipsis character ("…", U+2026) is
converted to three dots ("..."). We used ftfy for encoding fixing because it is extensively used
in the literature, and it worked very well in our tests.
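A minimal sketch of this step, assuming plain ftfy plus the standard library's Unicode normalization:

```python
import unicodedata

import ftfy

def fix_encoding(text: str) -> str:
    """Repair mojibake with ftfy, then apply Unicode NFKC normalization."""
    return unicodedata.normalize("NFKC", ftfy.fix_text(text))

print(fix_encoding("Esto es una prueba…"))   # the ellipsis becomes "..."
print(fix_encoding("CataluÃ±a"))             # mojibake repaired to "Cataluña"
```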
5.3.4 Prefilterer
especially treating tags such as <br>) will collapse separate sentences into a single one,
without even an empty space between the end of one sentence and the beginning of the next
one. In the example provided in the documentation of the BeautifulSoup497 library, which
can easily be used to extract pure text from markup, one can observe that parts of
the same sentence can end up separated into different lines (which is undesired), so in this case
it is better to manually build the regular expressions.
2. Early filtering of bad documents: Operations and filters applied to documents become in-
creasingly expensive in terms of computation time. One of the goals of the prefilterer is to
discard as fast as possible what we consider potentially bad documents. For instance,
documents that are too short (according to a configurable threshold) in terms of characters
or heuristically estimated tokens are discarded, as are documents in which there is presum-
ably little to no natural language (inferred from the percentages of different kinds of char-
acters, or the presence of certain kinds of strings). The presence of one of the desired
languages (as requested by the user) is also tested, first via the percentage of characters in
the alphabet of the language in question, and then via the fast language identifier, with a
low threshold98.
Regarding the language identifier, we experimented with three language identifiers mentioned in
Section 3.2:
• cld3
• LangId
• FastText's language identifier
To our surprise, Google's language identifier, the one used in Paracrawl, did not perform as
well as expected in our tests, at least in some of the languages and domains (Catalan, biomedical
Spanish). In this component, we ended up using the FastText one, which offered the best trade-
off in terms of performance and speed. We observed that this language identifier was quite
sensitive to the presence of URLs, so URLs are removed before the text is fed to it. Note that
language identification can also serve to discard non-natural text. Once some documents have
been discarded, further transformations and filters, more demanding in terms of computation,
are applied.
In practice, the prefilterer ends up doing a great deal of the heavy lifting of the cleaning pipeline. It comprises many hand-written heuristics99 that generally work well and that, in any case, can be explicitly deactivated or modulated via threshold and deactivation parameters.
It is worth noticing that, up to this point, the considered unit has been the document as a whole (in the case of sentence-level input formats or datasets, each document consists of a single sentence), but still as a raw sequence of characters, without any notion of sentence.
97 https://www.crummy.com/software/BeautifulSoup/bs4/doc/
98 See the cascade of language identifiers detailed in Section 5.3.10.
99 Many more than the ones summarized in this section, for the sake of brevity. For a full reference, see Appendix B.
5.3.5 Sentence splitter
Regarding sentence splitting, we investigated different libraries, with the following requirements:
• Work out of the box with as many languages as possible, at least with the European ones.
• Be as fast as possible.
The role of the cleaning pipeline is to generate corpora and, thus, it is supposed to be model-agnostic. Tokenization is part of the model building phase (e.g., a researcher might want to experiment with another tokenization without having to re-execute the cleaning process). Thus, we would rather avoid a sentence splitter that tokenizes, since we would have to detokenize again, which would be a waste of resources if other, lighter approaches work out of the box.
As we saw in Section 3, the WikiBERT pipeline used UDPipe, a trainable NLP pipeline capable of performing tokenization, dependency parsing, and other tasks. On the other hand, FreeLing [108–112] is a very powerful NLP pipeline written in Java, including tokenization, Part-Of-Speech tagging, lemmatization, and others. Its sentence splitting, which requires tokenization, is considerably effective in many languages. Nevertheless, we discarded both, even though they are the most powerful sentence splitters, since they do not fulfill our requirements.
On the other end of the spectrum, we find the light preprocessing script of BETO, the Spanish BERT we saw. We revisited their simple script and observed that, for sentence splitting, it failed in several cases. For instance, it splits sentences when encountering acronyms. Instead, as a middle ground between the two approaches, we use Sentence Splitter100, a heuristic, yet effective, sentence splitter. It does not require tokenization and it has no dependencies. Its rules, though simple, are considerably effective. It has explicit support for many languages, including Catalan and Spanish, but it works well with others that are not officially supported, such as Basque.
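A minimal usage sketch of this library follows (the example text is illustrative):

from sentence_splitter import SentenceSplitter

# The splitter is rule-based: it needs no tokenization and no model files,
# only a language code (e.g., 'ca' for Catalan or 'es' for Spanish).
splitter = SentenceSplitter(language='ca')

sentences = splitter.split("La reunió va acabar tard. Els resultats es publicaran demà.")
# -> ['La reunió va acabar tard.', 'Els resultats es publicaran demà.']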
5.3.6 Sentence filter
By sentence filtering we mean the application of sentence-level filters, even (and, perhaps, especially) in the case of document-level corpora. One could argue, with reason, that filtering individual sentences in a document-level corpus could damage the document coherence. At the same time, and especially in the case of crawlings, the presence of certain artifacts might damage the coherence of the document even more. On the one hand, the document might be polluted with placeholder text (messages about cookies, privacy, copyright, ...). On the other hand, non-natural text might appear (this is sometimes the case in Wikipedia). Recall that until this point the filters have been applied at the document level. The fact that the text of a document as a whole is accepted does not guarantee that all its sentences are necessarily natural text.
100 A Python module based on the original Perl scripts developed by Philipp Koehn and Josh Schroeder: https://github.com/mediacloud/sentence-splitter
Our proposal is, thus, to filter individual sentences, but with caution. We apply a cascade of language identifiers101 to all sentences, individually. Non-natural text, or placeholder sentences in a language different from the targeted one, will most likely be discarded in this step. Apart from that and other heuristics, sentences that are repeated within the same document can optionally be removed.
101 See Section 5.3.10.
5.3.7 Normalizer
• Documents coming from PDFs: Corpora coming from existing datasets had already been treated with PDF extraction tools. Nevertheless, the crawling turned out to contain some PDFs (which we were not aware of). These documents are distinguished from the other two cases by checking the consistent lack of punctuation at the end of the sentences (and the presence of punctuation marks in the middle of the sentence), and can be fixed with a regular expression. For example, a document with more than 10k lines (but considerably fewer actual sentences) was detected in the crawling of the Generalitat websites, containing lines such as the following:
The second and third line are actually part of the same sentence as the first one.
• Non-natural text: Fortunately, this case does not typically happen if the rest of the cleaning pipeline components have been applied. Nevertheless, when running the filter as a standalone script on the original Catalan OSCAR corpus, documents such as the following appeared:
in3 9AA in3 9AB in3 9AC in3 9AD in3 9AE in3 9AF in3 9AG in3 9AH...
in3 9BA in3 9BB in3 9BC in3 9BD in3 9BE in3 9BF in3 9BG in3 9BH...
in3 9EA in3 9EB in3 9EC in3 9ED in3 9EE in3 9EF in3 9EG in3 9EH...
Believe it or not, this document was present in the Catalan part of the OSCAR corpus. These kinds of documents are removed but, as we said, this case should never occur if the previous components of the cleaner have been applied.
• Documents consisting of identical sentence repetitions, except for a specific detail (e.g., a phone number, or a date):
From the data parsing to the normalization, each worker writes all the documents it generates from the files it is assigned to its own file, using the Onion format. Then, in the document filter component, which is a reducer instead of a mapper, all Onion files are concatenated, and global deduplication is applied as follows:
• Using Unix’s standard awk utility, we remove sentences that are globally repeated more than a certain number of times (a minimal sketch follows after this list). We have found that this heuristic helps to remove placeholder sentences (e.g., many pages happen to have the exact same copyright disclaimer), and if the placeholder is long enough the heuristic will rarely remove sentences it is not supposed to, especially if only applied to the beginning or the end of each document. We have tried to keep the implementation as Python-based as possible, but in this case the speed-up provided by awk (called from Python) was worth it.
• Using the Onion deduplication tool, which is based on n-gram repetition frequencies, we filter out documents with a certain level of duplication with respect to the others. The threshold is configurable. For a better understanding of Onion, we strongly recommend taking a look at the Ph.D. thesis that originated this tool [113].
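The following minimal sketch illustrates the awk-based step (file names and the threshold are illustrative, and the actual implementation operates on the sentence lines of the Onion files, optionally restricted to document boundaries, rather than on every line):

import subprocess

threshold = 10  # maximum number of global repetitions allowed for a sentence

# Two-pass awk: the first pass (NR == FNR) counts each line; the second pass
# prints only the lines whose global count does not exceed the threshold.
awk_program = "NR == FNR { count[$0]++; next } count[$0] <= t"

with open("filtered.onion", "w", encoding="utf-8") as out:
    subprocess.run(
        ["awk", "-v", f"t={threshold}", awk_program,
         "concatenated.onion", "concatenated.onion"],
        stdout=out,
        check=True,
    )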
Both awk and Onion, being implemented in C, are extremely fast. Once the deduplication has been applied, the output formatter, the next and final component, is in charge of converting the Onion format into the desired final format. The intermediate results before deduplication are also stored, in case the use case requires them. The program can also be run in deduplication-only mode, which is convenient when the user wants to deduplicate the result of an aggregation of corpora that, individually, have already been passed through the rest of the components.
• Fairseq LM: Prints in the aforementioned Fairseq LM format (see Section 5.3.2).
• Onion: Prints text in the aforementioned Onion format. This is used internally by the
program.
From the very beginning, this program targeted the BNE crawling. Since that crawling amounts to ≈45 TB of raw data, the program was implemented with HPC and memory efficiency in mind. At the same time, it has the requirement of being usable both for small and large corpora.
Lazy evaluation The program is implemented around Python generators102. In practice, this means that the program only loads contents into memory when it requires them, implying a tiny memory footprint.
Early exit One of the design principles of this pipeline is: the faster a filter, the earlier it is applied. In other words, we apply the faster filters first (unless there is some dependency), to exit early from a document that is not promising enough before applying the more expensive procedures. Sticking to this technique, as simple as it seems, can easily be overlooked when there are numerous filters and components, but it is considerably effective.
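Both principles can be illustrated with a short sketch (function names, file names, and thresholds are illustrative): the document stream is a chain of generators, and the cheapest checks reject a document as soon as possible.

from typing import Iterable, Iterator


def has_enough_alpha(doc: str, ratio: float = 0.5) -> bool:
    """Toy heuristic: require a minimum proportion of alphabetic characters."""
    return sum(c.isalpha() for c in doc) >= ratio * max(len(doc), 1)


def stream_documents(paths: Iterable[str]) -> Iterator[str]:
    """Lazily yield one document per file: nothing is read until requested."""
    for path in paths:
        with open(path, encoding="utf-8", errors="replace") as f:
            yield f.read()


def prefilter(documents: Iterator[str], min_chars: int = 200) -> Iterator[str]:
    for doc in documents:
        if len(doc) < min_chars:       # cheapest check first: early exit
            continue
        if not has_enough_alpha(doc):  # slightly more expensive heuristic
            continue
        yield doc                      # expensive components run only on survivors


# The whole chain is evaluated document by document (tiny memory footprint)
for document in prefilter(stream_documents(["shard-000.txt", "shard-001.txt"])):
    pass  # apply the more expensive filters and transformations here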
Sequential optimization Before parallelizing the program, the sequential execution was optimized. Once the first version had been built, it was profiled using PDB103. Parallel implementations can, in the best-case scenario, decrease execution time by a linear factor of N, assuming N workers and a completely parallelizable algorithm (which is hardly ever the case). Profiling revealed that the language identifier of choice was directly accountable for 50 to 60% of the execution time, depending on the run. Figures 10 and 11 give more details on the profiling.
Figure 10: Cleaning profiling (call graph): The filter by lang method is directly accountable for more than 50% of the execution time, as extracted by the Python debugger. While it is true that this could be partially sped up by not requesting the normalized probabilities (and just taking the argmax), the language identifier itself would still be considerably slow, and the normalized probabilities are needed by the program heuristics. In the figure, the call graph is cropped for clarity.
appears in the input text with a certain confidence threshold. First, we apply the identifiers with high recall (but low precision), with a low confidence threshold. Then, when most of the documents that we are confident will not be in the desired language have already been discarded, the slower (yet better-performing) language identifiers are applied, with a higher threshold. This idea was inspired by the well-known Viola-Jones algorithm for face detection [114]. In the case of language identification, to the best of our knowledge, the only work we have found using a similar strategy is [115]. In practice, we found that the best combination was using only two classifiers, specifically LangId as the slow yet precise language identifier, and FastText as the faster alternative with a decent enough recall. Other language identifiers did not perform well in our tests, as described in Section 5.3.4. This resulted in speed-ups of around 1.8x, as shown in Table 11. Figure 12 shows the proposed cascade of language identifiers.
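The following is a minimal sketch of the cascade (the model path, thresholds, and target language are illustrative; in the actual pipeline they are configurable parameters):

import re

import fasttext
from langid.langid import LanguageIdentifier, model as langid_model

URL_RE = re.compile(r"https?://\S+|www\.\S+")

# Fast, high-recall identifier (FastText) and slower, more precise one (LangId)
fast_identifier = fasttext.load_model("lid.176.bin")
slow_identifier = LanguageIdentifier.from_modelstring(langid_model, norm_probs=True)


def passes_language_cascade(text: str, lang: str = "ca",
                            fast_threshold: float = 0.3,
                            slow_threshold: float = 0.7) -> bool:
    # URLs confuse the identifiers, so they are removed beforehand
    text = URL_RE.sub(" ", text).replace("\n", " ")
    labels, probs = fast_identifier.predict(text, k=1)
    if labels[0] != f"__label__{lang}" or probs[0] < fast_threshold:
        return False  # cheap early rejection with a permissive threshold
    predicted, prob = slow_identifier.classify(text)
    return predicted == lang and prob >= slow_threshold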
Figure 11: Cleaning profiling (counts): Call count means the number of times the corresponding method has been called during the execution; time (in milliseconds) means the total time of that function, including the internal calls to other functions; own time means the time spent in the method itself (ignoring internal function calls).
Table 11: Speed-up obtained by the cascade of language identifiers: The best cascade of lan-
guage identifiers combination we found obtained a remarkable speed-up of 1.8x, while implying
a minimal degradation in terms of output size.
First parallelization strategy The first parallelization strategy correctly assumed that not all data sources (i.e., files) would imply the same amount of work. For instance, one file could be smaller, or be easily filtered. This could cause load balancing problems. For this reason, in the first parallelization strategy we tried, there were two kinds of parallel workers. The first ones, the streamers, were in charge of reading files in the background, providing more data on an as-needed basis. Mappers were applied in parallel to each document. Then, sequentially, the reducers were executed. The problem with this implementation was that coordinating the two kinds of workers implied a large overhead, and the implementation became unnecessarily complicated. Figure 13 shows the poor performance observed in the parallel profiling of this strategy.
104 https://www.ibm.com/analytics/hadoop/mapreduce
Figure 12: Proposed cascade of language identifiers: First, we apply the identifiers with high recall (but low precision), with a low confidence threshold. Then, when most of the documents that we are confident will not be in the desired language have already been discarded, the slower (yet better-performing) language identifiers are applied, with a higher threshold. This idea was inspired by the well-known Viola-Jones algorithm for face detection [114].
Second parallelization strategy For this reason, the second parallel implementation was simplified. Instead of having streamers as a separate kind of worker that had to coordinate with the mappers, they were treated as just another kind of mapper, one that maps file paths to streams of data. Figure 14 shows the improved performance of this strategy. Apart from obtaining a better speed-up, this implementation scaled considerably better with the number of cores. This parallel implementation, like the first one, was implemented with Python's multiprocessing module, and in our experiments it performed well with up to 48 CPUs (the maximum number of CPUs per node in MareNostrum 4). Since not all mappers can be serialized (due to having non-Python dependencies), they are initialized in each worker, instead of being initialized once and then copied.
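A minimal sketch of this second strategy with Python's multiprocessing follows (the Mappers class is a toy stand-in for the actual chain of pipeline components):

import multiprocessing as mp


class Mappers:
    """Toy stand-in for the real chain of mappers (parser, filters, ...)."""

    def parse(self, path):
        # A "streamer" is just another mapper: it maps a file path to a
        # stream (generator) of documents.
        with open(path, encoding="utf-8", errors="replace") as f:
            yield f.read()

    def apply(self, document):
        # Placeholder for the document-level mappers (encoding fixes, ...)
        return document if len(document) > 200 else None


_mappers = None  # per-worker state


def init_worker():
    # Some mappers cannot be pickled (non-Python dependencies), so they are
    # initialized inside each worker instead of being serialized and copied.
    global _mappers
    _mappers = Mappers()


def process_path(path):
    kept = 0
    for document in _mappers.parse(path):
        if _mappers.apply(document) is not None:
            kept += 1
    return kept


if __name__ == "__main__":
    paths = ["shard-000.warc", "shard-001.warc"]  # illustrative input shards
    with mp.Pool(processes=48, initializer=init_worker) as pool:
        kept_per_file = pool.map(process_path, paths, chunksize=1)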
Figure 13: Traces obtained from the first parallelization strategy: Many barriers, i.e., processes blocked waiting for others, can be appreciated. Most of the chart is blue (meaning idle, i.e., useless, execution time).
reinforcement learning capabilities, but we use it merely as a library for distributing jobs across nodes. Remarkably, the speed-up obtained when testing the distributed execution of the cleaning pipeline was considerably higher than expected. The reason is that Ray appears to be faster than Python's multiprocessing even in vanilla intra-node parallelism. Table 12 shows the speed-ups obtained by the distributed implementation.
Table 12: Speed-up obtained by the distributed implementation: We benchmarked the dis-
tributed implementation using the Catalan crawling, consisting of 312GB of WARC files.
Figure 14: Traces obtained from the second parallelization strategy: No barriers are appreciated. The performance is still not optimal, but the bottleneck is the access to disk, and the reduction does not leverage all the cores. The performance was further improved when using the Ray backend, even for intra-node parallelism.
Checkpointing and logging Each time a document is processed, its mapped output is written to disk in the internal Onion format. Since this format stores metadata, if the execution were interrupted for some reason, the state of the cleaning could be restored from these files, although in a cumbersome manner. For this reason, the cleaning pipeline implements explicit checkpointing, saving the state such that it can easily be recovered by just relaunching the pipeline in the same output directory. Two different checkpointing backends are provided, each with different trade-offs. In addition, the pipeline implements basic logging to monitor the state of the cleaning process, circumventing the potential bugs that logging within a parallel or distributed execution can induce (if the implementation is not careful enough). Both the logs and the arguments used in the execution are stored in the corresponding output directory (the latter in a JSON dictionary), for reproducibility and traceability.
Deployment Due to the Internet access restrictions in MareNostrum 4 and, furthermore, the non-Python dependencies, the cleaning pipeline is containerized, first in a Docker105 image, and then converted to Singularity106. Singularity, similarly to Docker, is a system for creating container images with the dependencies and environment variables of the program. Unlike Docker, though, it is especially intended for supercomputers. Generating the Singularity image is more complicated than the Docker one, since no Internet access can be assumed, and mounting volumes with write permissions (to be able to write to the host file system) can be problematic depending on the configuration of the host. The deployment is automated with scripts.
105 https://www.docker.com/
106 https://sylabs.io/docs/
5.3.11 Parameters
The cleaning pipeline is highly configurable and most of its behaviour can be modified via command-line (or configuration) arguments. Appendix B provides the whole list of parameters at the time of writing this section.
• Corpus-level: I have collaborated in the design of an internal system for organizing corpora and their metadata. However, it is out of the scope of this master thesis.
• Document-level: In the internal Onion format we saw, each document keeps the information of the original document (e.g., URL, id, keywords, ...). This means that, if it were necessary, we could easily sample from the cleaned corpora (e.g., retrieve documents with a given keyword).
For additional traceability, the specific version of the cleaning pipeline and the arguments used are logged and stored in JSON files. Besides, the cleaning pipeline registers all the operations applied to each document, although this information is not usually collected unless the program is run in debug mode.
As far as aggregation is concerned, with the goal of generating corpora as large and diverse as possible, we aggregate corpora within the same target category (e.g., "all Catalan corpora", or "all biomedical Spanish corpora"). For doing so, simple concatenation might be sub-optimal. Instead, we do the following:
• The concatenation itself is done preserving the maximum granularity available. For instance, if a sentence-level corpus is concatenated to a document-level aggregation, the sentences of the sentence-level corpus are concatenated with interleaved newline characters. Otherwise, document boundaries would lose their meaning and a model could not be trained with document-level units (or could, but wrongly). Note that if the aggregation is required to be sentence-level, it is trivial to make it so, but not the other way around.
• We allow for optional oversampling in case a small corpus has some interesting, underrepresented features, especially for vocabulary building, but it is not applied by default.
• Once concatenated, the aggregation is deduplicated again, using the cleaning pipeline itself (but only its deduplication component), to remove cross-corpora duplicates.
boundaries. Steps such as tokenization (or, for instance, tagging) are out of the scope of the cleaning pipeline itself, and are considered part of the model building phase. In this section, we describe the model-ready pipeline, a second text processing pipeline that takes as input the output of the cleaning process and outputs datasets that can directly be used for training language models. However, in this work, since we are generating new corpora, we run the components as a pipeline, since there are no existing intermediate results to leverage.
The implementation of this second pipeline was considerably easier than that of the pipeline described so far. The latter took months to develop, and had to be built and parallelized mostly from scratch. In addition, the input space of the cleaning pipeline is, indeed, a box of surprises. Here, instead, we leverage existing libraries (e.g., a tokenization library) that provide their own parallelization. Moreover, the input is considerably more uniform.
Another difference with respect to the cleaning pipeline is that now the need to run an individual component in isolation is considerably more common. Storing intermediate results of the cleaning process for executing individual cleaning or formatting components would typically be a waste of resources (except for the intermediate results just before deduplication), even though we saw that filters and components can be inhibited via command-line arguments. Instead, when making the data ready for the model, it might make sense to just run a pre-trained tokenizer, or to re-use an existing train-validation-test split. In other words, the intermediate results of each of the components can have an intrinsic value. For this reason, unlike the cleaning pipeline, the model-ready pipeline is implemented as a series of standalone Python scripts, which can be glued together via a provided parameterized Bash script. This pipeline, like the previous one, is also executed on the HPC CPU cluster.
Like the cleaning pipeline, this second pipeline will be open-sourced in the near future, but in
the meantime, we provide its code as supplementary material, attached to the delivery of this
thesis.
The first step of this pipeline extracts some document-level statistics. Specifically, it computes:
• The number of very long documents (i.e., more than 100 sentences).
Thanks to these simple statistics, I realized (and fixed) that the aggregation of Catalan corpora had been done incorrectly, since two sentence-level corpora had been concatenated to the rest without first separating the sentences as independent documents, which made the standard deviations of lines and tokens per document abnormally large. This would have had the harmful effect of considering the unrelated sentences of the sentence-level corpora as being semantically related (i.e., assuming that the second sentence should go after the first one, and so on).
107 Note that here the text has not yet been tokenized, so the number of tokens is approximated with a simple whitespace-based word count.
5.5.2 Decontamination
As far as decontamination is concerned, if we know for sure that the target evaluation benchmarks are not present in the language modeling corpus, we do not apply any further step: for instance, if the evaluation benchmark was generated specifically for a shared task and was not sourced from the web. Otherwise, we explicitly filter out these sentences. Many works skip this procedure and, in fact, it cannot necessarily be considered unfair evaluation (except perhaps in certain question-answering datasets, in which the answer might leak), since during the pre-training step the data does not have labels. Nevertheless, since the generated corpora are big enough (and the conflicting sentences represent a negligible percentage of them), we play it safe and filter them out, which at the very least makes the benchmark results more reliable: the sentences will never have been seen by the model, so we can better estimate its generalization capabilities.
The function that performs this decontamination does so heuristically, due to computational
constraints. It basically simplifies sentences by removing accents, punctuation signs, casing,
and spaces, and performs string-level comparisons. This might have the potential side-effect
of decreasing the coherence of the affected documents, but, again, the number of conflicting
sentences is negligible.
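A minimal sketch of this heuristic comparison follows (the function name and the example benchmark sentence are illustrative):

import string
import unicodedata


def normalize_for_matching(sentence: str) -> str:
    """Strip accents, punctuation, casing, and spaces for string-level matching."""
    decomposed = unicodedata.normalize("NFKD", sentence)
    no_accents = "".join(c for c in decomposed if not unicodedata.combining(c))
    no_punct = no_accents.translate(str.maketrans("", "", string.punctuation))
    return "".join(no_punct.lower().split())


# Illustrative benchmark sentence to be filtered out from the pre-training corpus
benchmark = {normalize_for_matching(s) for s in [
    "El Dr. Vila arribà ahir a Barcelona.",
]}


def is_contaminated(sentence: str) -> bool:
    return normalize_for_matching(sentence) in benchmark


# is_contaminated("el dr vila arriba ahir a barcelona")  ->  True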
Before training a model, as usual in machine learning, we generate the validation and test sets. For doing so, we have a script that works in a reproducible way. Because we have kept document boundaries whenever possible, and the generated corpora are considerably large, this process is not as trivial as randomly shuffling lines. Besides, for each corpus, we observed the statistics of document length (in terms of sentences and tokens) and set length thresholds accordingly, so that the hold-out subsets contain documents of reasonable length (avoiding an abnormally large document, especially in the validation set, which is periodically evaluated).
5.5.4 Tokenization
The user can specify a given number of placeholder tokens (in case one wants to reserve entries in
the vocabulary to introduce custom tokens when fine-tuning or performing domain adaptation),
although in this work we do not (apart from reserving the special tokens required by the model,
in the case of RoBERTa, <s>, <pad>, </s>, and <mask>).
Since we plan to use Fairseq (as described in Section 5.6), there is one aspect that must be taken into account. In Fairseq, unlike in Huggingface Tokenizers and SentencePiece, special tokens are not explicit in the dictionary. Thus, we explicitly request the tokenizers library not to include special tokens (except for the additional ones desired by the user). Otherwise, these special tokens would be duplicated, which is a known issue in models such as CamemBERT110.
110 See https://github.com/pytorch/fairseq/issues/1309.
Unlike the original RoBERTa and GPT-2, we prepend a whitespace to all sentences, to guarantee that the first word is tokenized the same way as words in the middle or at the end of the sentence (in Byte-level BPE, whitespaces are included in the tokenization). At inference time, this whitespace is automatically prepended by the tokenizer, so the user does not have to bother about it. This component has two sub-components that can be run in isolation, namely, the training of the tokenizer and the application of the learned tokenization.
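As a sketch of the tokenizer training sub-component (assuming the Huggingface tokenizers implementation of Byte-level BPE; the file paths, vocabulary size, and reserved tokens are illustrative):

import os

from tokenizers import ByteLevelBPETokenizer

# add_prefix_space has an effect similar to prepending a whitespace to every
# sentence, so that the first word is tokenized like any other word.
tokenizer = ByteLevelBPETokenizer(add_prefix_space=True)

tokenizer.train(
    files=["train.txt"],
    vocab_size=52000,
    min_frequency=2,
    # Special tokens such as <s>, <pad>, </s> or <mask> are intentionally NOT
    # listed here, since Fairseq handles them implicitly; only user-reserved
    # placeholder tokens would be added.
    special_tokens=[],
)

os.makedirs("tokenizer-ca", exist_ok=True)
tokenizer.save_model("tokenizer-ca")  # writes vocab.json and merges.txt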
In the dictionary building phase, each token in the vocabulary is assigned an integer. By binarization, we mean an offline preprocessing step in which plain text files are converted into binary, non-human-readable files by replacing each token with its corresponding index in the dictionary, for further efficiency. Many libraries do not implement binarization. In our case, for binarization purposes, we use Fairseq's [117] preprocessing utilities.
As can be deduced from the description of the previous component, and as we will see in the next section, we will be using Fairseq for training the models. Once the model-ready pipeline has been executed, it automatically launches a dummy training run with Fairseq, to check that the data can be correctly loaded by the model. This component can easily be extended to do the same with any training library of choice.
Once the model-ready pipeline has been executed, the actual training procedure can be launched.
5.6 Training
In this section, we describe how we propose to train language models, taking into account our
data and hardware. Some of the corpora generated with the tools proposed in this work have been
used for training the unsupervised neural machine translation systems in the MT4All project.
For this reason, we describe this corpora generation process targeting unsupervised machine
translation, showing that the data pipeline is generic and versatile enough to be used for different
languages, domains, and tasks (e.g., machine translation). Nevertheless, even though I collaborated to some extent, I was not in charge of training these models, and thus the building process of these translation models themselves remains out of the scope of this thesis. Instead, we focus on the training of language models (and not the machine translation models). However, many
of the procedures described in this section would still hold for large enough Transformer machine
translation systems.
The training procedure will have to be distributed among different devices, given the model and
data sizes. Let us start by clarifying the difference between two kinds of deep learning training
parallelizations:
• Data parallelism: Batches are distributed among devices. Note that, in this case, each device (typically, a GPU) has to instantiate a whole copy of the model. Each device computes its own passes on the samples it has received. Notice that by data parallelism we do not mean the parallelization of just the data loaders: since we are training (and not running inference), parameters must be updated. For doing so, after the devices have computed the forward and backward passes, the gradients are aggregated and the parameters are updated and synchronized across devices.
• Model parallelism: The model itself is distributed among devices. The parameters are not
synchronized across nodes since they are not shared, unless data parallelism is also applied
(such that a given copy of a given part of the model is present in more than one device).
After extensively reviewing the literature, we did not find any work that used a computing device (either Google TPUs or NVIDIA V100 GPUs) with less than 32GB of memory, at least for pre-training a language model from scratch. The reason is twofold. First of all, and most importantly, the batch size is a crucial hyperparameter in deep learning, and especially in unsupervised settings. In the case of language models, all the models we saw in Section 2 require large batch sizes to learn properly. In GPUs with 16GB of RAM, since the models themselves have a large number of parameters, once they have been allocated little space remains for the data itself. The second reason is that the larger the batch size, the faster the computation, which is especially important when training large models.
Still, we will see that 16GB GPUs can effectively be used for pre-training language models at
scale. We can consider several possibilities:
• Model parallelism: We can split a single, large model across different devices. PyTorch [118], for instance, supports model parallelism, and OpenAI trained the gigantic GPT-3 with this setting. Nevertheless, model parallelism has more synchronization overhead than data parallelism, and it is better suited for extremely large models that do not even fit in large GPUs or TPUs.
• Model optimizations: As we saw, models such as ALBERT use certain optimizations to be more memory-efficient. Nevertheless, especially for new domains and languages, we may want to run other, less memory-efficient models, either as a baseline111 or because a specific model with more demanding requirements is needed.
• Training optimizations: Several optimizations can be applied, such as using a floating point precision of 16 bits (instead of 32).
111 As we will see, we believe the reasonable thing to do in a language or domain without reference models is to first run a standard, widely-used model, such as BERT or RoBERTa, and then (not before) experiment with more exotic variants.
Notice that data parallelism (i.e., distributing the batches across compute devices) does not necessarily solve the problem of training large models on its own. First of all, the model itself must fit in each of the devices. Second, the remaining memory in each device must be enough to allow a large enough effective batch size and make the training on GPUs efficient (otherwise it would not compensate for the overhead of sending data from the CPU to the GPU). In our case, apart from using a precision of 16 bits, which improves efficiency, we will use gradient accumulation, which lets us keep a large effective batch size.
A very important (yet less widely known) concept in deep learning is the effective batch size, which is defined as:
ebs = bspcd × cd × uf
where ebs is the effective batch size, bspcd is the batch size per compute device (e.g., batch size
per GPU), cd is the number of compute devices (e.g, GPUs), and uf is the update frequency.
The concept of effective batch size is central to our work.
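For instance, in the training run described in Section 6.2, a device batch size of 8 sentences per GPU, 8 GPUs, and an update frequency of 32 yield ebs = 8 × 8 × 32 = 2048 sentences.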
The hyperparameter that affects the performance of a model (its results in terms of the desired
metric, not the efficiency) is the effective batch size, not the batch size per compute device.
The batch size per compute device only affects the speed of the training procedure, provided the
effective batch size is left constant. In other words, if we want to reproduce a specific experiment,
we have to replicate its effective batch size; if we want to make two different experiments directly comparable, the effective batch size should remain constant. This consideration is often
overlooked in the deep learning community. Since most books assume single-GPU settings, these
two concepts are usually conflated, but they must not be confused.
This equation means that we can keep the effective batch size constant while decreasing the batch size per device, provided we proportionally increase the number of compute devices and/or the update frequency. By update frequency we mean the frequency with which we apply updates to the model parameters (not backpropagation: backpropagation is still computed for every batch, otherwise this trick would be useless). The usual setting in deep learning is to set this hyperparameter to 1, which means that every time a batch of instances is passed through the model, the forward and backward passes are computed, and then the computed gradients are used to update the model parameters according to the algorithm of the optimizer of choice. If we set the update frequency to N > 1, we will wait until the model has seen (i.e., computed the forward and backward passes for) N batches before updating the parameters. This has the effect of simulating a larger batch size (i.e., increasing the effective batch size), and is known as gradient accumulation. For illustration, let us see a simple implementation of gradient accumulation in PyTorch112:
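(The following listing is a reconstruction following the pattern of the gist cited in footnote 112; the toy model, data, and hyperparameter values are illustrative.)

import torch
from torch import nn

# Toy setup (illustrative): a small linear model on random data
model = nn.Linear(10, 2)
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
data_loader = [(torch.randn(4, 10), torch.randint(0, 2, (4,))) for _ in range(64)]

accumulation_steps = 32  # update frequency (uf)

model.zero_grad()                                  # reset accumulated gradients
for i, (inputs, labels) in enumerate(data_loader):
    loss = criterion(model(inputs), labels)        # forward pass
    loss = loss / accumulation_steps               # keep the loss scale comparable
    loss.backward()                                # gradients are summed in the leaves
    if (i + 1) % accumulation_steps == 0:          # every uf batches...
        optimizer.step()                           # ...update the parameters
        model.zero_grad()                          # ...and reset the gradients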
Notice that, in PyTorch, calling backward() frees the computation graph unless stated otherwise, but the computed gradients are accumulated (summed) in the leaf tensors until they are explicitly zeroed. This has the effect of summing the gradients of the batches within each accumulation, making the updates based on more samples (and, thus, more representative).
• Google's original BERT repository113, which has received several updates since its release: it has been leveraged by language-specific models such as the Finnish BERT.
• NVIDIA's NeMo toolkit114: This library, based on PyTorch Lightning [119] and different NVIDIA modules, implements utilities and examples for training both speech and NLP models at scale.
• Huggingface Transformers115: A widely used library implementing many different Transformer models; originally focused on fine-tuning pre-trained models, it has more recently been extended to support pre-training from scratch (see below).
• Fairseq [120], FAIR's116 sequence modeling toolkit, based on PyTorch: Originally focused on machine translation, it now supports numerous sequence modeling tasks. It was the library originally used for training RoBERTa, XLM, and BART, and has been leveraged by language-specific models such as CamemBERT.
112 Adapted from https://gist.github.com/thomwolf/ac7a7da6b1888c2eeac8ac8b9b05d3d3
113 https://github.com/google-research/bert
114 See https://developer.nvidia.com/nvidia-nemo, https://github.com/NVIDIA/NeMo, and https://github.com/NVIDIA/DeepLearningExamples/tree/master/PyTorch/LanguageModeling/BERT
115 https://github.com/huggingface/transformers
116 Facebook Artificial Intelligence Research
We discarded Google's repository for technical reasons. It does not support training with GPUs with less than 32GB of memory, while the GPUs available in our environment are NVIDIA V100s with 16GB of memory. More precisely, it might support it, but the authors advise against using it in this setting, since the batch size would have to be too small to train properly. In addition, this repository only supports BERT (and no other models) and, being based on Tensorflow 1, it is not especially scalable.
Regarding NVIDIA’s NEMO, being developed by this company, one can expect state-of-the-
art performance in terms of GPU usage and distributed training. It is based on PyTorch.
Nevertheless, their community (e.g., Github issues) is less active than other libraries we have
investigated.
As far as Huggingface Transformers is concerned, it was originally not intended for pre-training models from scratch117. Nevertheless, it has more recently been extended to do so. Originally based on PyTorch, it now supports both Tensorflow and PyTorch training. Some of the strengths of this library are its extensive documentation and the implementation of many different models.
Table 13 shows how these libraries compare from our point of view (and is thus subjective). We decided to use Fairseq since we believed it to be a good compromise between available features and maturity in terms of having been used for pre-training language models at scale (i.e., it was the library used by FAIR to train the original RoBERTa, XLM, and BART models, and both BERT and GPT-2 have been successfully reproduced with it). In addition, both Fairseq and Huggingface provide utilities for interoperability between the two libraries118, so if we wanted to deploy a pre-trained Fairseq model to Huggingface's hub or use Huggingface's evaluation scripts with it, we could easily do so. Crucially in our case, Fairseq implements gradient accumulation (even if, e.g., Huggingface also has this feature) and 16-bit precision.
Regarding Fairseq's 16-bit precision (FP16) implementation119, this library follows NVIDIA's recommendation of using mixed precision (using FP16 to compute the forward/backward passes and the losses, but updating the model parameters in FP32). The reason why mixed precision is recommended is that pure FP16 might slightly degrade the quality of the model.
117 https://twitter.com/Thom_Wolf/status/1122466524860702729
118 E.g., see https://github.com/huggingface/transformers/blob/master/src/transformers/models/roberta/convert_roberta_original_pytorch_checkpoint_to_pytorch.py and https://github.com/pytorch/fairseq/blob/master/fairseq/models/huggingface/hf_gpt2.py
119 See https://github.com/pytorch/fairseq/issues/1047
Fairseq also implements vanilla (pure) FP16, which is more memory-efficient. We will be using the former, mixed precision, since it still allows for a large enough batch size in our case and obtains better performance. Apart from that, unlike most libraries, Fairseq implements binarization for further efficiency, as described in Section 5.5.5.
Regarding the distributed training, we will use multi-node distributed training with data parallelism. This means that there will be three levels of parallelism, specifically, between nodes, between GPUs within a single node, and within the GPU itself. Fairseq provides a distributed mode based on the c10d package, a distributed backend that comes built into PyTorch. Table 14 shows the results of one of the benchmarks we ran to check the performance of Fairseq in our environment, showing that executions scale properly with the number of GPUs.
Table 14: Fairseq RoBERTa baseline on the CTE-POWER cluster: We train RoBERTa-base for 4 epochs with the relatively small Wikitext-103 dataset [47] (100M tokens once pre-tokenized), and measure execution time. The most efficient runs are the ones with a single node, because the overhead of synchronization within a single node is smaller than that of inter-node synchronization. Nevertheless, the results show that executions scale properly with the number of GPUs.
5.6.5 Launcher
For automating the launch of the training jobs, we build an automated launcher that executes
the main Fairseq training script in the master node, and then connects to each of the requested
nodes and attaches them to the same training run.
5.6.6 Architecture
We leave the training of other architectures as future work. Having prepared the corpora in a roughly universal format for language modeling and having shown effective training in our environment, training other architectures will only involve changes in the parameters of the training script.
In this work, we will use RoBERTa, meaning that the model will be, architecture-wise, like the
original BERT (see Section 2).
Since we cannot afford to conduct hyperparameter searches, we base our hyperparameters on those of the original model. We do modify the batch size and update frequency for the reasons mentioned in Section 5.6.3. In Section 6, when we detail the specific applications and results, we will give more details about this matter. In addition, we will see how the training process evolved. Note that, being a RoBERTa model, we use dynamic masking (instead of static masking, as in BERT), and we do not use the next sentence prediction task.
5.7 Evaluation
Unfortunately, we cannot afford to extrinsically evaluate the cleaning pipeline by training the same model from scratch with different cleaning configurations, due to computational constraints. However, we will do a qualitative analysis of a representative subset of the discarded, transformed, or accepted sentences in Section 7. For evaluating the models, we convert the Fairseq models to Huggingface, and implement with this library standard benchmarks for language models, consisting of evaluating the transfer learning capabilities on downstream tasks. Unfortunately, precisely because we are targeting languages or domains that do not have as many resources as general-domain English, apart from the lack of training data there is usually a lack of established benchmarks. The development of new benchmarks for Catalan is an ongoing research line that goes beyond the scope of this master thesis. Nevertheless, we can find some benchmarks, such as a Catalan Named Entity Recognition and Classification (NERC) task. For evaluating, we fine-tune the models on the training set of the downstream task and then evaluate the fine-tuned models on the test sets. We apply the exact same settings for all the compared models. Again, in the case of Catalan, the most obvious baseline is Google's Multilingual BERT.
6 Applications and results
In this section, we describe the application of the proposed methodologies (described in Section
5), in terms of corpora and model generation.
6.1.1 Catalan
In the case of Catalan, we both generate three new datasets and clean existing ones with the cleaning pipeline introduced in this work. The three new datasets originate from:
• General crawling: A new crawling targeting the top 440 .cat domains, the top 30 .ad (Andorra) domains, and the top 5 .barcelona domains, with a depth of 5. The criterion for selecting these domains was to take all the websites with these TLDs120 within the worldwide top 1M websites (by visitors)121. The crawling was executed during March-April 2020.
• Catalan News Agency crawling: A crawling of the Catalan News Agency122 news, to which BSC was explicitly granted access (the website typically blocks crawlers).
• Generalitat crawling: A crawling of websites of the Generalitat de Catalunya (the Catalan government), listed as the Generalitat crawling in Table 17.
Interestingly, these new corpora are so recent that the term "coronavirus" appears frequently enough to have its own token in the vocabulary built for the model in Section 6.2.
Open Subtitles, caWaC, Wikipedia, and the Catalan section of the DOGC corpus are preprocessed with the new cleaning pipeline. All cleaned corpora are aggregated (and deduplicated again) into a new corpus that, to the best of our knowledge, is the biggest Catalan corpus ever generated for training language models, with as many as 1.7B tokens, deduplicated and mostly document-level. Table 17 shows statistics of the different corpora before and after cleaning. The resulting corpus has:
• 9,892,770 documents.
• 1,758,388,896 tokens.
• 1074.5 characters per document (standard deviation of 3558.4).
• A maximum of 23,217 sentences per document (coming from an amateur novel found in
the crawling).
• 58,234 documents with more than 100 sentences, 445 documents with more than 445 sen-
tences, and 1 document with more than 10,000 sentences. Very long documents were man-
ually inspected to verify the correctness of the concatenation and cleaning (e.g., sentence-
level corpora must be concatenated with an empty line between sentences).
For qualitatively evaluating the cleaning, we provide a random sample with the text before and
after cleaning123 . The sample consists of the original sampled sentences, the result after the
cleaning, and the recorded operations of the cleaning pipeline. In Section 7.1, we will refer to
this sample, when analyzing the results.
In Sections 3.3 and 4.1, we mentioned the collection and preprocessing of more than 3.3B tokens for BETO, a Spanish BERT developed at the University of Chile. After inspecting the corpus, we can confirm that the sentence splitting process presents some artifacts (e.g., with acronyms). However, at this point, we believe processing it again with our cleaning pipeline is not a priority, in order to save computation resources.
In the case of the BNE corpus, there was no alternative, since the dataset is new and had never
been preprocessed. Figure 15 shows the SLURM jobs of execution of the cleaning pipeline on
the raw data from the BNE corpus, on MareNostrum 4.
The dataset was divided into three parts, and the mapping part of the cleaning pipeline (i.e., all cleaning steps before document deduplication) was applied separately to each of them via three SLURM jobs of 50 nodes each (3 × 50 × 48 = 7200 CPUs); the fourth job in Figure 15 corresponds to the biomedical crawling dataset.
In 48 hours and with these 150 nodes124 , the cleaning pipeline processed 58.8% of the JSONs
of the dataset (213001/408692, 22001/75275, and 45001/132736, for each of the three parts,
respectively). When aggregating the three parts, we obtain the following statistics:
• File size (plain text, but including metadata): 1.5 TB + 141 GB + 950 GB = 2.591 TB
123 Available at https://docs.google.com/spreadsheets/d/10F792DSr2yg3OI7Tyj5c3p55WjkjVAKARaNDDgubYYs/edit?usp=sharing ("debug" tab)
124 Note that the time limit for jobs on MareNostrum 4 is precisely 48 hours, and for a single job one cannot request more than 50 nodes.
Figure 15: Cleaning pipeline execution on BNE: The dataset was divided into three parts, and the mapping part of the cleaning pipeline (i.e., all cleaning steps before document deduplication) was applied separately to each of them via three SLURM jobs of 50 nodes each (3 × 50 × 48 = 7200 CPUs); the fourth job in the figure corresponds to the biomedical crawling dataset.
Table 15: Comparison between BNE and other big Spanish corpora: The ≈60% of BNE corresponding to our cleaning execution is, once cleaned, clearly bigger than the existing Spanish corpora (although it has not yet been deduplicated). However, since it comes from so many different websites, it is reasonable to assume that deduplication will not shrink the corpus by an order of magnitude. Note that, in the case of BNE, the plain text size includes some metadata.
As a comparison, the Spanish section of the OSCAR corpus consists of 25.6B tokens (while
the partial results of BNE consist of 128.3B tokens), with around 150GB (of plain text). The
collection of the preprocessed corpora of the BETO work resulted in a corpus of around 3B tokens
(18GB of plain text). Furthermore, the dataset is document-level, allowing for the modeling of
long-range dependencies. The caveat, though, is that it has not yet been deduplicated, so the final
size will be actually smaller and it cannot be directly compared (since OSCAR is deduplicated).
Table 15 shows a simple comparison between the three aforementioned corpora. We provide the state before and after the cleaning of a random sample125. We will make references to this sample in Section 7.1.
125 Available at https://drive.google.com/drive/folders/1t3DZXPF0F6FEM3bJjgaKIbesh0W1ojWd?usp=sharing. The sample bne.zip contains the raw data sample, while the output.txt file contains the final, clean output.
After processing the data sources mentioned in Section 4.1, the resulting total corpus consists of
almost 1B tokens. We observe that the data coming from the new biomedical crawling dominates
the dataset (almost 750M tokens out of 972M, once deduplicated). While some of the other
used data sources may be of higher quality or relevance (e.g., scientific articles), the biomedical
crawling is essential to give the aggregated corpus enough scale and diversity for training language
models.
All cleaned corpora are aggregated (and deduplicated again) into a new corpus that, to the best of our knowledge, is the biggest biomedical Spanish corpus ever generated for training language models, with almost 1B tokens, deduplicated and mostly document-level. Table 18 shows statistics of the different corpora before and after cleaning. The resulting corpus has:
• 2,154,539 documents.
• 967,848,477 tokens.
• 43,583,833 sentences.
• A maximum of 130,571 sentences per document (coming from an amateur novel found in
the crawling).
• 59,767 documents with more than 100 sentences, 849 documents with more than 445 sentences, and 70 documents with more than 10,000 sentences. Very long documents were manually inspected to verify the correctness of the concatenation and cleaning (e.g., sentence-level corpora must be concatenated with an empty line between sentences).
In general, we observe more diversity in terms of document length than in the case of Catalan. This is because some of the aggregated corpora comprise (anonymized) clinical cases or articles. At the time of writing this thesis, this corpus is being used to train a domain-specific RoBERTa, which goes beyond the scope of this project. For qualitatively evaluating the cleaning, which is more delicate in this challenging domain126, we provide results on a sample of more than 400 sentences per biomedical corpus127. We will make references to this sample in Section 7. The sample consists of the original sampled sentences, the result after the cleaning, and the recorded operations of the cleaning pipeline.
6.1.4 MT4All
Table 19 shows statistics of the different corpora before and after cleaning. Note that this table
omits the case of Catalan and biomedical Spanish, which were already described in detail. The
generated corpora are document-level, but the project in which they are applied, MT4All, consists
of sentence-level machine translation. Thus, before feeding the text into the model, empty lines
(the document boundaries) are removed, and a sentence-level deduplication is applied. This
further preprocessing, along with tokenization, is applied by the model preprocessing itself (not
in the cleaning phase). Recall that the cleaning process is designed to be as model-agnostic
as possible, so these corpora can be used in the future for document-level machine translation
models, or models with other tokenization strategies.
These corpora, including the Catalan corpus (but not yet the biomedical Spanish one, which will
be used in a more advanced phase of the MT4All project) are used in the MT4All project for
building machine translation systems without supervision. In the case of Finnish, Latvian, and
Norwegian, the domain-specific corpora are used together with the respective general-domain
OSCAR corpora. Specifically, this is done using the methodology proposed in [107, 121, 122],
described in Section 2.3, using Monoses, Fairseq, Vecmap, and Moses128 . Table 16 shows the
preliminary results. The BLEU scores are still relatively low, especially in the case of the pairs
involving Basque and Finnish. It has to be noted that these results are obtained without any
supervision whatsoever. In Section 7.1 we will further discuss these results.
Direction          BLEU    Direction           BLEU
English→Basque      5.1    Basque→English      12.1
English→Catalan    25.3    Catalan→English     25.6
English→Latvian    17.5    Latvian→English     15.1
English→Finnish     9.1    Finnish→English      8.8
English→Norwegian  25.9    Norwegian→English   23.4
Table 16: Preliminary results of the MT4All project: The BLEU scores are still relatively low,
especially in the case of the pairs involving Basque and Finnish. In Section 7.1 we will discuss
these results.
126 E.g., a language identifier will assign lower probability to sentences with many domain-specific terms.
127 Available at https://docs.google.com/spreadsheets/d/1iuzZiIrca_B2XapD1pML4cW23Ewlu83lldm9W69dtmc/edit?usp=sharing
128 https://github.com/artetxem/monoses
Corpus | Original size (GB/tokens) | Final size (GB/tokens) | Final # sentences | Nodes/CPUs | Time | Cleaning
Catalan Open Subtitles | 0.02GB/3.52M tok | 0.02GB/3.52M tok | 0.61M | - | - | None
Catalan Oscar | 8.30GB/1,358M tok | 4.00GB/695.37M tok | 31.39M | 1/48 | 14h | SP, DEDUP
Catalan Web Corpus (caWaC) | 3.50GB/780M tok | 3.60GB/650.98M tok | 27.35M | 3/144 | 2h | LANG, Q, SP, DEDUP
Catalan Wikipedia | 1.20GB/198.36M tok | 0.98GB/167.47M tok | 6.86M | 3/144 | 1h | LANG, Q, SP, DEDUP
General crawling | 312.00GB/- | 2.60GB/434.82M tok | 19.45M | 3/144 | 4h | LANG, Q, SP, DEDUP
Generalitat crawling | 75.00GB/- | 0.25GB/39.12M tok | 1.57M | 1/48 | 4h | LANG, Q, SP, DEDUP
Catalan News Agency | 0.49GB/81.28M tok | 0.45GB/75.61M tok | 3.06M | 1/48 | 1h | SP, DEDUP
Parallel DOGC | 0.80GB/126.65M tok | 0.80GB/126.65M tok | 10.92M | - | - | None
TOTAL | 401.31GB/2,547.81M tok | 12.69GB/2,193.4M tok | 101.21M | 12/576 | 26h | -
All concat and dedup | 12.69GB/2,193.4M tok | 11.00GB/1,758.3M tok | 83.06M | 1/48 | 1h | Q, DEDUP
Table 17: Cleaned (or generated from scratch) Catalan corpora, where SP means sentence splitting; DEDUP means deduplication; Q means quality filter (including all the cleaning heuristics in the cleaning pipeline); and LANG means language identification filter. In the case of the Catalan Open Subtitles and Parallel DOGC corpora, the cleaning was applied after the concatenation (for this reason the table says Cleaning: None). In the case of the new crawlings, the original state consisted of WARC files, which explains the large original size (and the impossibility of counting tokens without first removing the HTML).
Corpus | Original size (GB/tokens) | Nodes/CPUs | Time | Final size (GB/tokens) | # sentences | Cleaning
CardioCC | 0.001GB/149,904 | 1/48 | 1h | 0.001GB/0.15M | 9,970 | SP, 1W S, DEDUP
RadioCC | 0.001GB/177,366 | 1/48 | 1h | 0.001GB/0.17M | 9,948 | SP, 1W S, DEDUP
Libros Casos Clinicos | 0.007GB/1,137,555 | 1/48 | 1h | 0.007GB/1,024,797 | 68,833 | SP, 1W S, DEDUP
Covid-CC | 0.001GB/82,201 | 1/48 | 1h | 0.001GB/82,091 | 3,896 | SP, 1W S, DEDUP
EMEA | 0.087GB/13,797,362 | 1/48 | 1h | 0.034GB/5,377,448 | 284,575 | SP, 1W S, DEDUP
Patents | 0.087GB/14,022,520 | 1/48 | 1h | 0.084GB/13,463,387 | 253,924 | SP, 1W S, DEDUP
Wikipedia Life Sciences | 0.126GB/18,771,176 | 1/48 | 1h | 0.088GB/13,890,501 | 832,027 | SP, 1W S, DEDUP
BARR2-Background | 0.188GB/28,868,022 | 1/48 | 1h | 0.159GB/24,516,442 | 1,029,600 | SP, 1W S, DEDUP
PUBMED | 0.31GB/1,957,479 | 1/48 | 1h | 0.013GB/1,858,966 | 103,674 | SP, 1W S, DEDUP
REEC | 0.048GB/4,581,755 | 1/48 | 1h | 0.028GB/4,283,453 | 220,726 | SP, 1W S, DEDUP
SciELO | 0.401GB/61,838,972 | 1/48 | 1h | 0.38GB/60,007,289 | 2,668,231 | SP, 1W S, DEDUP
Mespen Medline | 1.2GB/6,864,901 | 1/48 | 1h | -/4,166,077 | 322,619 | SP, 1W S, DEDUP
PDFs general | 15GB/109,124,996 | 1/48 | 1h | 0.631GB/97,146,139 | 5,252,481 | SP, 1W S, DEDUP
Medical crawler | 626GB/- | 50/2400 | 48h | 4.5GB/746,368,185 | 32,766,976 | LANG, Q, SP, 4W S, DEDUP
TOTAL | 643.457GB/261,373,209 | 63/3024 | 61h | 5.927GB/972,184,775.32 | 43,827,480 | -
All concat and dedup | 643.46GB/972,184,775.32 | 1/48 | 1h | 5.9GB/967,847,439 | 43,583,833 | CONCAT, DEDUP
Table 18: Generated or preprocessed biomedical Spanish corpora, where {N}W S means a filter requiring a minimum of N words per sentence.
Corpus | Original size | Final size | Final # sentences | Nodes/CPUs | Time | Cleaning
Basque Crawling | 2.1GB | 0.794GB/106.857M tok | 7.448M | 1/48 | 1h | LANG, Q, SP, DEDUP
English Finance Crawling | 636GB | 0.357GB/67.732M tok | 3.840M | 1/48 | 7h | LANG, Q, SP, DEDUP
Finnish Finance Crawling | 20GB | 0.348GB/44.613M tok | 3.669M | 1/48 | 4h | LANG, Q, SP, DEDUP
Latvian Finance Crawling | 1.8GB | 0.068GB/8.827M tok | 0.502M | 1/48 | <0.1h | LANG, Q, SP, DEDUP
Norwegian Finance Crawling | 75GB | 0.510GB/93.523M tok | 5.618M | 1/48 | 2h | LANG, Q, SP, DEDUP
Table 19: Cleaned (or generated from scratch) corpora for MT4All. See Table 17 for the meaning of the abbreviations used. The original size is not reported in terms of tokens because the inputs were WARC files.
6.2 Model generation
The generated corpora have a great potential to feed new language models. In this work, we
describe the use case of the Catalan corpus for building a language-specific RoBERTa.
Training We omit the test runs of the training procedure. We follow the methodology described in Section 5. Regarding the learning rate, we train the model for 115k updates, 10k of them being warm-up updates, with a peak learning rate of 0.0005. That is, because the training of large architectures, especially Transformers [123], is prone to instabilities, the initial learning rate is set to a tiny value and then gradually increased during the warm-up phase, which improves the stability and final performance of the learning process. Afterwards, the learning rate is slowly decreased following a polynomial decay scheduler. Figure 16 shows the evolution of the learning rate in the course of training. Figure 17 shows the scale of the loss during training. Fairseq automatically scales the loss when overflows are detected, which is not uncommon when using FP16. The scale is more unstable at the beginning, during the warm-up steps, when gradients span more orders of magnitude. Figure 18 shows the gradient norm during training.
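For illustration, the warm-up plus polynomial decay schedule can be sketched as a small Python function (a simplified approximation, not fairseq's actual implementation; the decay power and final value are assumptions left as parameters):

def learning_rate(step, peak_lr=5e-4, warmup_updates=10_000,
                  total_updates=115_000, power=1.0, end_lr=0.0):
    # Warm-up: grow linearly from ~0 to the peak learning rate.
    if step < warmup_updates:
        return peak_lr * step / warmup_updates
    # Decay: shrink polynomially from the peak towards end_lr.
    progress = (step - warmup_updates) / (total_updates - warmup_updates)
    return end_lr + (peak_lr - end_lr) * (1.0 - progress) ** power

# Example: learning_rate(0) is ~0, learning_rate(10_000) is 5e-4, learning_rate(115_000) is end_lr.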
We use a maximum of 512 positions, which means that the maximum context window observed in training consists of 512 tokens. In inference, a sliding window will be used to simulate larger contexts. We train the model with 8 NVIDIA V100 GPUs of 16GB for 192 hours. We use a device batch size of 8 sentences per GPU, with approximately 512 tokens per sample. Tokens coming from different documents are never merged together; if a document is longer than 512 tokens, it is divided into different samples accordingly. Together with an update frequency of 32, the effective batch size consists of 2048 sentences. Figure 19 shows the words per batch, which oscillates around 400k tokens. The number is not constant since documents are never merged and sentences are never cut, and sentences have different numbers of tokens.
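As a quick sanity check on these figures, the effective batch size follows directly from the numbers above; the sketch below only restates that arithmetic (the 512-token figure is an upper bound per sample, not a measured average):

gpus = 8                 # NVIDIA V100 16GB devices
sentences_per_gpu = 8    # device batch size
update_freq = 32         # gradient accumulation factor

effective_batch = gpus * sentences_per_gpu * update_freq   # 2048 sentences per update
max_tokens_per_update = effective_batch * 512              # ~1.05M tokens if every sample were full
print(effective_batch, max_tokens_per_update)

The observed ~400k words per batch in Figure 19 stays well below this upper bound precisely because documents are never merged and sentences are never cut, so most samples contain fewer than 512 tokens.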
We regularize with a dropout [124] of 0.1 (including attention weights) and a weight decay of 0.01. We use the Adam [125] optimizer with β1 = 0.9, β2 = 0.98, ε = 1e−6. We use a floating-point precision of 16 bits in the setting recommended by NVIDIA (as detailed in Section 5).
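In PyTorch terms, an equivalent optimizer configuration would look roughly as follows (a sketch only: fairseq's own Adam implementation and FP16 handling are used in practice, and model here is just a stand-in module):

import torch

model = torch.nn.Linear(8, 8)   # stand-in for the actual RoBERTa network
optimizer = torch.optim.Adam(
    model.parameters(),
    lr=5e-4,             # peak learning rate, reached after warm-up
    betas=(0.9, 0.98),
    eps=1e-6,
    weight_decay=0.01,   # the exact weight-decay semantics depend on the implementation
)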
The best checkpoint is selected according to the perplexity on the validation set. We keep all the checkpoints for potential further studies. Figure 20 shows the evolution of the perplexity during training. Figures 21 and 22 show the best validation loss recorded during the training procedure and the progress of the training and validation losses, respectively.
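For reference, the validation perplexity is simply the exponentiation of the validation cross-entropy,

PPL = b^CE

where b is the base in which the cross-entropy is reported (e.g., b = 2 if the loss is measured in bits, b = e if in nats), so Figures 20, 21, and 22 convey closely related information.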
Figure 16: RoBERTca learning rate: During the warmup steps, the learning rate is progressively
increased from a tiny value to the peak value (0.0005). Then, it decreases with a polynomial
decay.
Figure 17: RoBERTca loss scale: The loss scale is automatically adjusted when overflows are
detected, which is not especially uncommon when using FP16. These overflows are more common
at the beginning of training, when gradients are larger and the scale has not yet been adjusted.
Afterwards, the scale is more stable.
Figure 18: The gradient norm is considerably bigger at the beginning, during the warmup phase.
Then, the gradient norm is more stable.
Figure 19: RoBERTca Words Per Batch (WPB): The number is not constant since documents
are never merged and sentences are never cut, and sentences have different numbers of tokens.
Figure 20: RoBERTca perplexity: Perplexity consistently decreases in both the train and validation
sets. It is used for selecting the best checkpoint. The plot is difficult to visualize due to the large
changes in the scale of the metric during training.
Figure 21: RoBERTca best validation loss: The best cross-entropy obtained on the validation
set kept decreasing. When the training finishes, it seems to have converged.
Figure 22: RoBERTca loss curves: The model seems to have converged.
Evaluation As intrinsic evaluation, we conduct the following analysis:
• Vocabulary and tokenization: Table 20 shows the number of tokens per sentence in RoBERTca's test set, as per the tokenizers of the different evaluated models. RoBERTca produces the fewest tokens per sentence. This comes as no surprise, since RoBERTca (like Wikibert) has a vocabulary specifically built for Catalan, with more vocabulary entries and having seen more diverse data than Wikibert. Many Catalan words have their own token in RoBERTca's vocabulary (instead of being built from subwords), including "coronavirus" (the odd characters that appear when visualizing individual RoBERTca tokens are normal artifacts of displaying byte-level BPE tokens). For comparison, mBERT splits the sentence as follows:
mbert: ['L', "'", 'epi', '##d', '##èm', '##ia', 'de', 'corona', '##vir',
'##us', 'va', 'causar', 'una', 'gran', 'crisi', 'e', '##con', '##òmica', '.']
• Mask filling accuracy: On RoBERTca's test set, we compute the accuracy of filling in masked tokens (i.e., we mask one token and ask the model to retrieve the original one, as in training); a minimal sketch of this check is given after this list. The results are not directly comparable, since each model has a different tokenization, but the exercise serves as a sanity check. The results are shown in Table 21.
• Qualitative analysis: Out of curiosity, we attach some outputs of the model for inputs chosen by us (and not necessarily present directly in the dataset). The sentences are inevitably cherry-picked, without an established methodology or dataset for a proper evaluation, so we cannot extract conclusions from them. The results are nonetheless curious, and they give an intuition of the kind of outputs produced by the model. At the very least, we can affirm that there seems to be a certain sex bias, and that in the attached examples (except the sex bias one) the outputs of the model are reasonably correct in terms of pure semantics and linguistics. Figures 23, 24, 25, 26, and 27 show some prompts we believed to be interesting for the reader. Some of them are extracted from the test set, and some others are manually written for the sake of curiosity. For all the predictions of RoBERTca, mBERT, and Wikibert ca on the test set, we attach the logs129.
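The following is a minimal sketch of the mask-filling sanity check mentioned above, using the Hugging Face transformers library; the model path and the single example sentence are placeholders, and the actual evaluation iterates over the whole test set:

import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

model_dir = "path/to/robertca"   # placeholder: model converted to the Hugging Face format
tokenizer = AutoTokenizer.from_pretrained(model_dir)
model = AutoModelForMaskedLM.from_pretrained(model_dir)
model.eval()

sentences = ["L'epidèmia de coronavirus va causar una gran crisi econòmica."]

hits = total = 0
for sentence in sentences:
    ids = tokenizer(sentence, return_tensors="pt")["input_ids"][0]
    for pos in range(1, len(ids) - 1):           # skip the special tokens at the edges
        masked = ids.clone()
        original = masked[pos].item()
        masked[pos] = tokenizer.mask_token_id    # mask one token at a time, as in training
        with torch.no_grad():
            logits = model(masked.unsqueeze(0)).logits
        hits += int(logits[0, pos].argmax().item() == original)
        total += 1

print(f"Top-1 mask filling accuracy: {hits / total:.3f}")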
Figure 23: RoBERTca prompt 1: Our model seems to be aware of the fact that a year is
composed of 365 days, unlike mBERT and Wikibert. Obviously, we cannot extract conclusions
from a single prompt, but we find it interesting.
(a) Output for "Ell" (he). (b) Output for "Ella" (she).
Figure 24: RoBERTca prompt 2: Our model seems to have encoded a certain sex bias, induced
by the bias present in the data (especially news, we hypothesize). In particular, with the prompt
"He/She works as a <mask> at Bellvitge's Hospital", in the case of "He" the model predicts
"doctor", while in the case of "She" it predicts "nurse".
These analyses provide a sanity check, but they are limited, and we cannot extract further conclusions from them.
129 Logs available at https://drive.google.com/file/d/1_A9aIMagGURu0qN1J5AQHg7blU1T1TMF/view?usp=sharing
Figure 25: RoBERTca prompt 3: Our model is able to predict the "coronavirus" token (when
prompted with "The <mask> pandemic has caused a new economic crisis."), thanks to this token
being present in the training data.
Figure 26: RoBERTca prompt 4: When prompted with a sentence referring to a female named
"<mask> Villegas", the model successfully infers that it has to predict a female name (Montse,
Núria, Marta, ...).
Figure 27: RoBERTca prompt 5: In this specific example, RoBERTca shows a better understanding
of units of length than mBERT. The sentence is taken from the RoBERTca test set.
It describes a runway with a certain length and width, and the masked token is the one
corresponding to "meters", referring to the width. In this context, the only unit that makes
sense for a runway width is, precisely, meters.
As extrinsic evaluation, we assess the usefulness of RoBERTca's contextual representations when fine-tuned on downstream tasks, namely Named Entity Recognition, Part-Of-Speech tagging, Question Answering, Text Classification, and Semantic Textual Similarity. Table 22 shows the results on these tasks for RoBERTca and two reference models. As a Catalan-specific baseline, we take the Catalan Wikibert (Wikibert ca) [84]; as a multilingual baseline, we take the multilingual BERT (mBERT). They are described in Sections 2.2 and 3.2, respectively.
Table 22: RoBERTca evaluation: The preliminary results in Named Entity Recognition, Part-
Of-Speech, Question Answering, Text Classification, and Semantic Textual Similarity show that
RoBERTca is at least competitive with the existing baselines for Catalan (mBERT, the multi-
lingual BERT, and Wikibert ca, a BERT trained only with the Catalan Wikipedia).
For the first two tasks, we use the Ancora corpus131, consisting of around 11k training sentences in the case of NER and 13k in the case of POS. For question answering, we use 1. TeMU's Catalan translation of XQUAD [126] (a multilingual question answering dataset), consisting of around 1,000 annotated questions; 2. TeMU's 14k manually generated questions from Wikipedia articles. For Text Classification, we use TeMU's text classification dataset, automatically constructed from news (and the corresponding tags) extracted from the Catalan News Agency132. It consists of 219k sentences (the largest benchmark, by far). For Semantic Textual Similarity, again, we use a TeMU benchmark, this time consisting of 3k manually annotated pairs of sentences. The TeMU-BSC group is still in the process of developing these benchmarks (and others), so they are still a preliminary version and have not yet been published. The train-validation-test splits, the number of sentences, and the total number of benchmarks (e.g., a pronoun resolution benchmark is in the works) may differ in the final version.
130 The execution of the script crashed for an unknown reason.
131 http://clic.ub.edu/corpus/en
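For reference, fine-tuning RoBERTca on one of these benchmarks (here, the text classification one) can be sketched with the Hugging Face Trainer roughly as follows; the file names, column names, and label count are placeholders, and this is not the exact evaluation script used in this work:

from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

model_dir = "path/to/robertca"   # placeholder: model converted to the Hugging Face format
files = {"train": "textclass_train.json", "validation": "textclass_dev.json"}  # placeholders

dataset = load_dataset("json", data_files=files)   # assumed to expose "text" and "label" fields
tokenizer = AutoTokenizer.from_pretrained(model_dir)
model = AutoModelForSequenceClassification.from_pretrained(model_dir, num_labels=10)  # placeholder

dataset = dataset.map(lambda batch: tokenizer(batch["text"], truncation=True, max_length=512),
                      batched=True)

args = TrainingArguments(output_dir="textclass_out", num_train_epochs=3,
                         per_device_train_batch_size=16, evaluation_strategy="epoch")
trainer = Trainer(model=model, args=args, tokenizer=tokenizer,
                  train_dataset=dataset["train"], eval_dataset=dataset["validation"])
trainer.train()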
We note that our model is always better than Wikibert ca, having seen more data. In the comparison with the multilingual BERT, the latter is remarkably competitive. In fact, with these results, our conclusion is that it outperforms our model, which is nevertheless always close to its results. However, our model is more efficient, because it produces shorter sequences due to its tokenization, as shown before in Table 20. Interestingly, the text classification task is the only one in which both Catalan-specific models outperform the multilingual model, and for clear reasons: the number of training sentences (as noted, the text classification dataset is by far the largest of the used benchmarks) and the language-specific vocabulary (as described in Section 7.1). Also, this evaluation must be taken with a grain of salt because the set of evaluation benchmarks developed by TeMU-BSC is still a preliminary version. We hypothesize that the difference in performance will increase in favour of RoBERTca once we obtain new evaluation datasets and increase the number of training sentences, which is a work in progress. In Section 7.1, we will further analyze these results.
132 https://www.acn.cat/
7 Discussion
In this section, we will discuss the applications and results presented in Section 6. Regarding the generated corpora, we observe the following:
• Some sentences ending with "..." seem to be cut. This is an artifact present in the raw data, typically caused by collapsed text in the website that the crawler did not expand to retrieve the whole sentence.
• Quotes with more than one sentence (separated by a dot) are split into different sentences. This is probably the desired behaviour for documents as a whole, but it has the effect of leaving single sentences with unclosed quotes.
• We observe that the global sentence deduplication is indeed useful for removing boilerplate, while typically not removing relevant sentences.
• We observe that the text from the biomedical crawling has, generally, more artifacts than the one from the general domain. Still, it is of remarkable quality for text coming from a domain-specific crawling.
133 Appendix D.
Regarding the sentences that have been filtered out, we observe that, while it is true that some well-formed sentences were discarded, they were mostly uninteresting (e.g., short or repeated sentences). We believe that our bet on a mostly rule-based approach, aided by a few statistical models (essentially, the language identifiers), has proven effective. However, we note that the effort of writing these rules has been considerably time-consuming. In addition, we do not know whether it would adapt to unseen scenarios. If we found a way of introducing more machine learning components (apart from the language identifiers) in a generic enough way, that would be an interesting option to explore.
One of the objectives from the very beginning was keeping document boundaries whenever possible. This has been successful, and not only in appearance: when inspecting the resulting documents, sentences exhibit a clear coherence (i.e., one sentence follows from the previous ones, with few exceptions, mostly caused by non-detected boilerplate).
Regarding the MT4All machine translation systems, the BLEU scores are relatively low. We note that the system with the best performance is Catalan-English. Leaving aside the fact that it is a general-domain use case and arguably easier to model than Finnish, we note that Catalan is one of the use cases we put the most effort into, apart from Spanish (so far, not used in the MT4All project). The unsupervised machine translation approach used in MT4All (Section 2.3) was originally tested on general-domain corpora with relatively close languages (English, German, French). It was, thus, unclear how the system would behave in the settings of MT4All. Probably, in the cases in which general-domain data were used to increase the size of the domain-specific corpus, the latter should have been treated in a special way (instead of simply concatenated) to give more importance to domain-specific samples.
As far as RoBERTca is concerned, after observing its outputs we are confident that the model was trained correctly. When evaluated on downstream tasks, it is competitive with the existing baselines for Catalan. We note, though, that the benchmarks for evaluating Catalan language modeling are limited at this point, and perhaps too easy for a noticeable difference to be observed. Building more evaluation benchmarks for Catalan is an ongoing research line at TeMU-BSC. We have seen that the model's vocabulary comprises many Catalan words as whole tokens (instead of being composed of cumbersome subwords). We observe that even if this model has been trained with a document-level corpus, it typically does not see long documents at once, due to its 512-token context window. The high performance of mBERT on the benchmarks can, in addition, be explained by the fact that cross-lingual transfer learning is easier in the case of Catalan, because Romance languages are over-represented in Wikipedia (from which the multilingual data of mBERT are extracted).
In the RoBERTca results, we observe an interesting phenomenon. The only benchmark in which both language-specific models outperform the multilingual baseline, and with a remarkable difference, is the text classification one. We are unsure whether the corresponding dataset or task has any intrinsic property that makes it easier for the language-specific models than in the other cases. Nevertheless, there is one clear difference in terms of data size: the text classification task is, by far, the largest evaluation benchmark among the ones we have used. It seems that the multilingual BERT outperforms the language-specific models when fewer fine-tuning data are available. The multilingual BERT has seen considerably more pre-training data (even if in other languages) than the Catalan-specific models, so it performs better in very low-resource scenarios. In addition, Romance languages are a big portion of the multilingual BERT dataset, and they can transfer knowledge to the Catalan representations. When more training data are available in the fine-tuning datasets, the language-specific models are better. In this case, the fact that they have a language-specific vocabulary clearly compensates for the fact that they have seen less data during pre-training. The vocabulary is not useful per se; the corresponding representations need to be trained on a large enough dataset (which is why RoBERTca outperforms Wikibert ca in this benchmark and the rest of them). This observation about dataset size is roughly consistent with the results in the other benchmarks.
7.2 Limitations
One of the main limitations of this work is that, whilst we provide enough information to validate our approach (i.e., it works) and we try to justify our decisions, we do not provide enough evidence to show that our specific corpora generation approach is better than alternative ones, other than inspecting the resulting corpora. This is a common situation in the literature, but still a problem nonetheless. However, we are certain that the cleaning process has correctly formatted the text and has removed non-natural or out-of-domain (or out-of-language) text, which would otherwise have wasted GPU cycles and model capacity on relations that are not needed.
Apart from that, we have observed certain artifacts in the generated corpora, mostly coming from the original text. In the case of the MT4All machine translation systems, the BLEU scores are still relatively low; even if they are preliminary and more iterations are planned, they need to be improved. As for the generated RoBERTa, while it is competitive with the existing baselines for Catalan, we would have expected a larger difference, it being specific to Catalan and having been trained with a Catalan corpus orders of magnitude larger than those of existing systems (both mBERT and Wikibert ca used the Catalan Wikipedia). Nevertheless, the problem could lie in the benchmarks used for evaluation, which essentially correspond to relatively easy tasks with little margin for improvement. We expect the difference in performance with respect to the baselines to increase when the new evaluation benchmarks for Catalan become available. What we can affirm, though, is that the tokenizer generates considerably fewer tokens per sentence, being more efficient in this regard than the baselines.
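This last claim is easy to check in isolation; a minimal sketch with the Hugging Face tokenizers is shown below (the RoBERTca path is a placeholder, and Wikibert ca could be added analogously):

from transformers import AutoTokenizer

references = {
    "RoBERTca": "path/to/robertca",           # placeholder local path
    "mBERT": "bert-base-multilingual-cased",  # public multilingual BERT checkpoint
}

sentence = "L'epidèmia de coronavirus va causar una gran crisi econòmica."

for name, ref in references.items():
    tokenizer = AutoTokenizer.from_pretrained(ref)
    print(name, len(tokenizer.tokenize(sentence)))   # fewer tokens means shorter sequences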
the user must be well aware of this problem. We do observe potential misuses in the case of the BNE dataset. Its scale makes it feasible to train at least GPT-2-level models for Spanish. The problematic usages of GPT-2 in the case of English are well known134.
Regarding possible environmental concerns, BSC clusters are known to be environmentally friendly135, but, still, training RoBERTca has consumed a non-negligible amount of electricity, while not providing remarkable gains with respect to the multilingual baseline. However, training is only executed once, while inference is expected to be run numerous times. The RoBERTca tokenizer generates considerably fewer tokens per sentence than the mBERT or Wikibert ca counterparts, meaning that RoBERTca inference is more compute- and memory-efficient (because sequences are shorter). In the case of the generated corpora, BNE is especially concerning in terms of the computation needed. Nevertheless, the advantage of our model-agnostic approach is that once the data have been preprocessed, they need not be cleaned again.
134 https://openai.com/blog/better-language-models/
135 https://www.bsc.es/news/bsc-news/the-new-bsc-machine-europe%E2%80%99s-%E2%80%9Cgreenest%E2%80%9D-supercomputer
8 Conclusions and future work
To sum up, we have introduced a new pipeline for preprocessing corpora and training language models, which is, in turn, composed of different sub-pipelines, including a cleaning pipeline, a model-ready preprocessing pipeline, and the training and evaluation scripts.
Regarding the generated corpora, the corpus resulting from aggregating both existing (but preprocessed with our cleaning pipeline) and new datasets has resulted in the largest Catalan corpus to date, to the best of our knowledge. In the case of general-domain Spanish, by preprocessing (even if still partially) BNE, we have shown that the cleaning pipeline is suitable for large-scale corpora. The results are still preliminary and must be deduplicated, but we cannot rule out the possibility of having generated the largest Spanish corpus for training language models to date. We have further demonstrated the flexibility of the cleaning pipeline by applying it to generate, again, arguably the largest biomedical Spanish corpus for training language models, and the corpora for the rest of the MT4All pairs. This demonstrates its flexibility in terms of language, scale, domain, and targeted task.
We have learned the importance of dataset generation, a machine learning step which is often underrated (unlike model generation). We have seen that it can be a considerably long, time-consuming, and resource-consuming process.
Regarding model generation, we have shown that the generated corpora can effectively be used for building a language model from scratch. We have built the first Catalan RoBERTa, which is the Catalan language model that has seen the most language-specific data to date. The preliminary evaluation shows that it is competitive with the existing baselines, but without a significant difference. We hypothesize that RoBERTca will clearly outperform the baselines when more difficult and complete benchmarks are built, but this will have to be investigated.
As limitations, we observe the lack of proper evaluation metrics in the case of corpus cleaning. We suggest possible solutions to these problems as future work. All in all, we have shown that our proposed tools can be effectively used for generating corpora and training language models from scratch.
As future work, we suggest several research lines:
• Regarding the cleaning pipeline, we suggest introducing more machine learning components, but, unlike previous work, in a more generic way that can work out of the box for different languages and domains. One possibility could be training a model on the input-output pairs of the pipeline itself, and observing whether it generalizes to other languages and domains. The machine learning model, though, should be lightweight; otherwise it would not be feasible to apply it to large quantities of raw data.
• We suggest building benchmarks for the intrinsic quantitative evaluation of the cleaning process, inspired by the CleanEval shared task [127].
• In the same way that language identifiers are a common component of many preprocessing pipelines, including ours, domain classification systems could be used for organizing corpora from crawlings into domains (short of manually tagging each page, since we have observed that keywords from the websites are not especially reliable). This could ease the process of generating domain-specific corpora.
• As far as model generation is concerned, a Spanish biomedical RoBERTa using the data and tools from this work is being trained at the time of writing this thesis. But this is just the beginning: there are many architectures (with different trade-offs) to explore, for Catalan and Spanish (both general-domain and biomedical). Nevertheless, while we believe that a BERT-like model specific to languages or domains with enough data is generally required (as a baseline, and because in inference it can be more efficient thanks to tokenizing into shorter sequences), the training of further models will have to be justified in terms of potential utility, so as not to waste computational resources. The generated data could also be used for feeding multilingual and multi-domain models.
• One of the strengths of our approach is that we have taken document boundaries into account. RoBERTca already uses them, but it often does not see the whole document during training (due to having a maximum context window of 512 tokens). In the MT4All project, documents are not used. We would be especially keen on investigating document-level unsupervised machine translation, on the one hand, and language modeling with longer dependencies (for instance, using the Linformer architecture [128], a model with a linear approximation of attention), on the other. The document-level data are ready, and Linformer is implemented in Fairseq136.
• At TeMU-BSC, there is ongoing work on generating benchmarks for further evaluating models for these domains and languages.
Personally, I believe that I have applied many different skills that I learned in the Master in
Artificial Intelligence, ranging from data processing, to deep learning, without forgetting more
classical approaches in NLP, like the use of language identifiers and rule-based systems for clean-
ing corpora.
136 https://github.com/pytorch/fairseq/tree/master/examples/linformer
References
[1] S. Ruder, “Neural transfer learning for natural language processing,” Ph.D. dissertation,
National University of Ireland, Galway, 2019.
[2] S. Ruder, M. E. Peters, S. Swayamdipta, and T. Wolf, “Transfer learning in natural lan-
guage processing,” in Proceedings of the 2019 Conference of the North American Chapter
of the Association for Computational Linguistics: Tutorials, 2019, pp. 15–18.
[6] J. Devlin, M. Chang, K. Lee, and K. Toutanova, “BERT: pre-training of deep bidirectional
transformers for language understanding,” CoRR, vol. abs/1810.04805, 2018. [Online].
Available: http://arxiv.org/abs/1810.04805
[8] A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, and I. Sutskever, “Language models
are unsupervised multitask learners,” 2019.
[9] K. Clark, M.-T. Luong, Q. V. Le, and C. D. Manning, “Electra: Pre-training text encoders
as discriminators rather than generators,” 2020.
[14] M. Artetxe, G. Labaka, and E. Agirre, “Unsupervised statistical machine translation,” in
Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing.
Brussels, Belgium: Association for Computational Linguistics, Oct.-Nov. 2018, pp.
3632–3642. [Online]. Available: https://www.aclweb.org/anthology/D18-1399
[18] J. Lee, W. Yoon, S. Kim, D. Kim, S. Kim, C. H. So, and J. Kang, “Biobert: a
pre-trained biomedical language representation model for biomedical text mining,” CoRR,
vol. abs/1901.08746, 2019. [Online]. Available: http://arxiv.org/abs/1901.08746
[20] J. Armengol Estapé, “Neural machine translation and linked data,” Jul 2019. [Online].
Available: http://hdl.handle.net/2117/168617
[21] I. Sutskever, O. Vinyals, and Q. V. Le, “Sequence to Sequence Learning with Neural
Networks,” arXiv e-prints, p. arXiv:1409.3215, Sep 2014.
[22] D. Bahdanau, K. Cho, and Y. Bengio, “Neural machine translation by jointly learning to
align and translate,” CoRR, vol. abs/1409.0473, 2015.
[25] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,”
CoRR, vol. abs/1512.03385, 2015. [Online]. Available: http://arxiv.org/abs/1512.03385
[26] S. Ioffe and C. Szegedy, “Batch normalization: Accelerating deep network training by
reducing internal covariate shift,” CoRR, vol. abs/1502.03167, 2015. [Online]. Available:
http://arxiv.org/abs/1502.03167
[27] J. Lei Ba, J. R. Kiros, and G. E. Hinton, “Layer Normalization,” arXiv e-prints, p.
arXiv:1607.06450, Jul 2016.
[29] Y. Tay, M. Dehghani, D. Bahri, and D. Metzler, “Efficient transformers: A survey,” 2020.
[32] S. Hochreiter and J. Schmidhuber, “Long short-term memory,” Neural Comput., vol. 9,
no. 8, p. 1735–1780, Nov. 1997. [Online]. Available: https://doi.org/10.1162/neco.1997.9.8.1735
[34] D. Hendrycks and K. Gimpel, “Bridging nonlinearities and stochastic regularizers with
gaussian error linear units,” CoRR, vol. abs/1606.08415, 2016. [Online]. Available:
http://arxiv.org/abs/1606.08415
[35] G. Zhao, J. Lin, Z. Zhang, X. Ren, Q. Su, and X. Sun, “Explicit sparse transformer:
Concentrated attention through explicit selection,” CoRR, vol. abs/1912.11637, 2019.
[Online]. Available: http://arxiv.org/abs/1912.11637
[36] Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer,
and V. Stoyanov, “Roberta: A robustly optimized BERT pretraining approach,” CoRR,
vol. abs/1907.11692, 2019. [Online]. Available: http://arxiv.org/abs/1907.11692
[37] G. Lample and A. Conneau, “Cross-lingual language model pretraining,” CoRR, vol.
abs/1901.07291, 2019. [Online]. Available: http://arxiv.org/abs/1901.07291
[38] Z. Lan, M. Chen, S. Goodman, K. Gimpel, P. Sharma, and R. Soricut, “Albert: A lite
bert for self-supervised learning of language representations,” 2019.
[39] M. Dehghani, S. Gouws, O. Vinyals, J. Uszkoreit, and Lukasz Kaiser, “Universal trans-
formers,” 2018.
[41] Y. Liu, J. Gu, N. Goyal, X. Li, S. Edunov, M. Ghazvininejad, M. Lewis, and L. Zettlemoyer,
“Multilingual denoising pre-training for neural machine translation,” 2020.
[42] V. Sanh, L. Debut, J. Chaumond, and T. Wolf, “Distilbert, a distilled version of bert:
smaller, faster, cheaper and lighter,” 2019.
[43] E. Voita, D. Talbot, F. Moiseev, R. Sennrich, and I. Titov, “Analyzing multi-head self-
attention: Specialized heads do the heavy lifting, the rest can be pruned,” 2019.
[44] P. H. Le-Khac, G. Healy, and A. F. Smeaton, “Contrastive representation learning: A
framework and review,” IEEE Access, vol. 8, p. 193907–193934, 2020. [Online]. Available:
http://dx.doi.org/10.1109/ACCESS.2020.3031549
[46] S. J. Mielke, “Can you compare perplexity across different segmentations?” Apr 2019.
[Online]. Available: https://sjmielke.com/comparing-perplexities.htm
[47] S. Merity, C. Xiong, J. Bradbury, and R. Socher, “Pointer sentinel mixture models,”
CoRR, vol. abs/1609.07843, 2016. [Online]. Available: http://arxiv.org/abs/1609.07843
[49] G. Majumder, D. P. Pakray, A. Gelbukh, and D. Pinto, “Semantic textual similarity meth-
ods, tools, and applications: A survey,” Computacion y Sistemas, vol. 20, pp. 647–665, 12
2016.
[51] P. Rajpurkar, J. Zhang, K. Lopyrev, and P. Liang, “Squad: 100, 000+ questions for
machine comprehension of text,” CoRR, vol. abs/1606.05250, 2016. [Online]. Available:
http://arxiv.org/abs/1606.05250
[56] M. Artetxe, G. Labaka, and E. Agirre, “A robust self-learning method for fully unsu-
pervised cross-lingual mappings of word embeddings,” in Proceedings of the 56th Annual
Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2018,
pp. 789–798.
[57] ——, “Generalizing and improving bilingual word embedding mappings with a multi-step
framework of linear transformations,” in Proceedings of the Thirty-Second AAAI Confer-
ence on Artificial Intelligence, 2018, pp. 5012–5019.
[58] ——, “Learning bilingual word embeddings with (almost) no bilingual data,” in Proceedings
of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1:
Long Papers), 2017, pp. 451–462.
[59] ——, “Learning principled bilingual mappings of word embeddings while preserving mono-
lingual invariance,” in Proceedings of the 2016 Conference on Empirical Methods in Natural
Language Processing, 2016, pp. 2289–2294.
[60] ——, “An effective approach to unsupervised machine translation,” CoRR, vol.
abs/1902.01313, 2019. [Online]. Available: http://arxiv.org/abs/1902.01313
[62] K. Papineni, S. Roukos, T. Ward, and W.-J. Zhu, “Bleu: a method for automatic
evaluation of machine translation,” in Proceedings of the 40th Annual Meeting of
the Association for Computational Linguistics. Philadelphia, Pennsylvania, USA:
Association for Computational Linguistics, Jul. 2002, pp. 311–318. [Online]. Available:
https://www.aclweb.org/anthology/P02-1040
[63] M. Artetxe, S. Ruder, D. Yogatama, G. Labaka, and E. Agirre, “A call for more rigor in
unsupervised cross-lingual learning,” 2020.
[64] N. Indurkhya and F. J. Damerau, Handbook of Natural Language Processing, 2nd ed. Chap-
man & Hall/CRC, 2010.
[67] R. Sennrich, B. Haddow, and A. Birch, “Neural Machine Translation of Rare Words with
Subword Units,” arXiv e-prints, p. arXiv:1508.07909, Aug 2015.
[68] P. Gage, “A new algorithm for data compression,” C Users J., vol. 12, no. 2, pp. 23–38,
Feb. 1994. [Online]. Available: http://dl.acm.org/citation.cfm?id=177910.177914
[69] C. Wang, K. Cho, and J. Gu, “Neural machine translation with byte-level subwords,”
CoRR, vol. abs/1909.03341, 2019. [Online]. Available: http://arxiv.org/abs/1909.03341
[70] S. Ding, A. Renduchintala, and K. Duh, “A call for prudent choice of subword
merge operations,” CoRR, vol. abs/1905.10453, 2019. [Online]. Available: http://arxiv.org/abs/1905.10453
[71] I. Provilkov, D. Emelianenko, and E. Voita, “Bpe-dropout: Simple and effective
subword regularization,” CoRR, vol. abs/1910.13267, 2019. [Online]. Available:
http://arxiv.org/abs/1910.13267
[72] T. Kudo and J. Richardson, “Sentencepiece: A simple and language independent subword
tokenizer and detokenizer for neural text processing,” CoRR, vol. abs/1808.06226, 2018.
[Online]. Available: http://arxiv.org/abs/1808.06226
[74] A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, and I. Sutskever, “Language models
are unsupervised multitask learners,” 2019.
[76] W. de Vries and M. Nissim, “As good as new. how to successfully recycle english gpt-2 to
make models for other languages,” 2020.
[77] P. J. Ortiz Suárez, B. Sagot, and L. Romary, “Asynchronous Pipeline for Processing
Huge Corpora on Medium to Low Resource Infrastructures,” in 7th Workshop on the
Challenges in the Management of Large Corpora (CMLC-7), P. Bański, A. Barbaresi,
H. Biber, E. Breiteneder, S. Clematide, M. Kupietz, H. Lüngen, and C. Iliadi, Eds.
Cardiff, United Kingdom: Leibniz-Institut für Deutsche Sprache, Jul. 2019. [Online].
Available: https://hal.inria.fr/hal-02148693
[80] I. Beltagy, K. Lo, and A. Cohan, “Scibert: Pretrained language model for scientific text,”
in EMNLP, 2019.
[81] J.-S. Lee and J. Hsiang, “PatentBERT: Patent classification with fine-tuning a pre-trained
BERT model,” World Patent Information, vol. 61, no. 101965, 2020.
[83] B. Magnini, A. Cappelli, E. Pianta, M. Speranza, V. Bartalesi Lenzi, R. Sprugnoli, L. Ro-
mano, C. Girardi, and M. Negri, “Annotazione di contenuti concettuali in un corpus ital-
iano: I - cab,” in Proc.of SILFI 2006, 2006.
[84] S. Pyysalo, J. Kanerva, A. Virtanen, and F. Ginter, “Wikibert models: deep transfer
learning for many languages,” 2020.
[86] S. Lee, H. Jang, Y. Baik, S. Park, and H. Shin, “Kr-bert: A small-scale korean-specific
language model,” 2020.
[87] H. Tanvir, C. Kittask, and K. Sirts, “Estbert: A pretrained language-specific bert for
estonian,” 2020.
[88] W. de Vries, A. van Cranenburgh, A. Bisazza, T. Caselli, G. van Noord, and M. Nissim,
“Bertje: A dutch BERT model,” CoRR, vol. abs/1912.09582, 2019. [Online]. Available:
http://arxiv.org/abs/1912.09582
[89] J. Cañete, G. Chaperon, R. Fuentes, J.-H. Ho, H. Kang, and J. Pérez, “Spanish pre-trained
bert model and evaluation data,” in PML4DC at ICLR 2020, 2020.
[90] J. Cañete, “Compilation of large spanish unannotated corpora,” May 2019. [Online].
Available: https://doi.org/10.5281/zenodo.3247731
[92] D. Nozza, F. Bianchi, and D. Hovy, “What the [mask]? making sense of language-specific
bert models,” 2020.
[93] R. Speer, “ftfy,” Zenodo, 2019, version 5.5. [Online]. Available: https://doi.org/10.5281/zenodo.2591652
[95] P. Koehn, H. Khayrallah, K. Heafield, and M. Forcada, “Findings of the wmt 2018 shared
task on parallel corpus filtering,” 01 2018, pp. 726–739.
[96] M. Baroni, S. Bernardini, A. Ferraresi, and E. Zanchetta, “The wacky wide web:
A collection of very large linguistically processed web-crawled corpora,” Language
Resources and Evaluation, vol. 43, no. 3, pp. 209–226, 2009. [Online]. Available:
http://www.jstor.org/stable/27743614
[97] R. Schäfer and F. Bildhauer, “Building large corpora from the web using a new efficient
tool chain,” in LREC, 2012.
[98] J. Kudela, I. Holubová, and O. Bojar, “Extracting parallel paragraphs from
common crawl,” CoRR, vol. abs/1804.10413, 2018. [Online]. Available: http://arxiv.org/abs/1804.10413
[100] P. Ortiz Suárez, B. Sagot, and L. Romary, “Asynchronous pipelines for processing huge
corpora on medium to low resource infrastructures,” 07 2019.
[101] A. Joulin, E. Grave, P. Bojanowski, and T. Mikolov, “Bag of tricks for efficient text clas-
sification,” arXiv preprint arXiv:1607.01759, 2016.
[104] N. Ljubešić and A. Toral, “caWaC – a web corpus of Catalan and its application to
language modeling and machine translation,” in Proceedings of the Ninth International
Conference on Language Resources and Evaluation (LREC’14). Reykjavik, Iceland:
European Language Resources Association (ELRA), May 2014, pp. 1728–1732. [Online].
Available: http://www.lrec-conf.org/proceedings/lrec2014/pdf/841 Paper.pdf
[108] X. Carreras, I. Chao, L. Padró, and M. Padró, “Freeling: An open-source suite of language
analyzers,” in Proceedings of the 4th International Conference on Language Resources and
Evaluation (LREC’04), 2004.
[109] J. Atserias, B. Casas, E. Comelles, M. González, L. Padró, and M. Padró, “Freeling 1.3:
Syntactic and semantic services in an open-source nlp library,” in Proceedings of the fifth
international conference on Language Resources and Evaluation (LREC 2006). Genoa,
Italy: ELRA, May 2006.
[110] L. Padró, M. Collado, S. Reese, M. Lloberes, and I. Castellón, “Freeling 2.1: Five years
of open-source language processing tools,” in Proceedings of 7th Language Resources and
Evaluation Conference (LREC’10), La Valletta, Malta, May 2010.
[111] L. Padró, “Analizadores multilingües en freeling,” Linguamatica, vol. 3, no. 2, pp. 13–20,
December 2011.
[112] L. Padró and E. Stanilovsky, “Freeling 3.0: Towards wider multilinguality,” in Proceedings
of the Language Resources and Evaluation Conference (LREC 2012). Istanbul, Turkey:
ELRA, May 2012.
[113] J. Pomikálek, “Removing boilerplate and duplicate content from web corpora,” Ph.D.
dissertation, Masaryk university, Faculty of informatics, Brno, Czech Republic, 2011.
[114] P. Viola and M. Jones, “Rapid object detection using a boosted cascade of simple features,”
in Proceedings of the 2001 IEEE Computer Society Conference on Computer Vision and
Pattern Recognition. CVPR 2001, vol. 1, 2001, pp. I–I.
[115] D. Kosmajac and V. Keselj, “Slavic language identification using cascade classifier ap-
proach,” in 2018 17th International Symposium INFOTEH-JAHORINA (INFOTEH),
2018, pp. 1–6.
[117] M. Ott, S. Edunov, A. Baevski, A. Fan, S. Gross, N. Ng, D. Grangier, and M. Auli,
“fairseq: A fast, extensible toolkit for sequence modeling,” in Proceedings of NAACL-HLT
2019: Demonstrations, 2019.
Florence, Italy: Association for Computational Linguistics, July 2019, pp. 5002–5007.
[Online]. Available: https://www.aclweb.org/anthology/P19-1494
[123] M. Popel and O. Bojar, “Training tips for the transformer model,” CoRR, vol.
abs/1804.00247, 2018. [Online]. Available: http://arxiv.org/abs/1804.00247
[125] D. Kingma and J. Ba, “Adam: A method for stochastic optimization,” International Con-
ference on Learning Representations, 12 2014.
[126] M. Artetxe, S. Ruder, and D. Yogatama, “On the cross-lingual transferability of monolin-
gual representations,” CoRR, vol. abs/1910.11856, 2019.
[128] S. Wang, B. Li, M. Khabsa, H. Fang, and H. Ma, “Linformer: Self-attention with linear
complexity,” arXiv preprint arXiv:2006.04768, 2020.
A Source code and logs
The code will be open-source at TeMU's Github137. In the meantime, it is provided as a non-distributable ZIP file attached to the delivery of this thesis. The code is organized as follows:
• evaluation/: Utilities for loading the model in Huggingface and fine-tuning on the downstream tasks. The evaluation data are not provided in the delivery, but they will be published soon (if required, they can be sent on request).
The arguments used in these repositories are mostly described either in the thesis itself or in the remaining sections of this appendix.
The text logs of the training of RoBERTca are also provided, in the training/ directory.
[--seg-sentences]
[--char-length-filter-sentence CHAR_LENGTH_FILTER_SENTENCE]
[--word-length-filter-sentence WORD_LENGTH_FILTER_SENTENCE]
[--digits-filter-sentence DIGITS_FILTER_SENTENCE]
[--profanity-check]
[--fast-lang-filter-threshold FAST_LANG_FILTER_THRESHOLD]
[--slow-lang-filter-threshold SLOW_LANG_FILTER_THRESHOLD]
[--no-lang-filter-sentence]
[--no-lang-filter-sentence_src_tgt]
[--code-threshold CODE_THRESHOLD]
[--dictionary-filter-sen DICTIONARY_FILTER_SEN]
[--no-dedup-same-doc-sentences] [--no-src-tag-filter]
[--spell-check] [--terminology-norm TERMINOLOGY_NORM]
[--punctuation-norm]
[--document-deduplication-threshold DOCUMENT_DEDUPLICATION_THRESHOLD]
[--remove-glob-rep-sen REMOVE_GLOB_REP_SEN]
[--dedup-buffer DEDUP_BUFFER]
name
positional arguments:
name A name to identify the run
optional arguments:
-h, --help show this help message and exit
--input-path INPUT_PATH
Input data directory
--output-path OUTPUT_PATH
Output data directory
--input-format INPUT_FORMAT
Input data format
--output-format OUTPUT_FORMAT
Output data format
--checkpoint-backend {shelve,file}
Shelve is more convenient but file is more robust. For
distributed executions, we recommend file.
--components COMPONENTS [COMPONENTS ...]
Elements of the pipeline
--parallel Run the cleaner in parallel
--log-every-iter LOG_EVERY_ITER
Log the pipeline every N iterations (-1: silent)
--backend BACKEND Parallel backend (mp or ray)
--only-reduce Only document filter
--only-reduce-output Only document filter for output files
--debug Activate the debug error mode to compare the original
and cleaned sentences
--extensions EXTENSIONS [EXTENSIONS ...]
File extensions to work with (eg. json)
--encoding ENCODING Input encoding format (eg. utf-8). If set to auto, the
program tries to guess the encoding
--encoding-threshold ENCODING_THRESHOLD
Encoding threshold if --encoding auto (ignored
otherwise). If the encoding detector is not
above this threshold, it assigns utf-8.
--encoding-error-policy ENCODING_ERROR_POLICY
Encoding error policy (same options as open())
--url-doc URL_DOC Path to a url list (plain text, one url per line) that
should be filtered and processed
--warc-warn Enable warnings of WARC parser
--no-lang-filter-document
Avoid applying language filter on documents
--no-language-normalization
Avoid applying language-specific normalization
--no-replace-emails Avoid replacing email addresses with "[EMAIL]"
--no-remove-hashtags-mentions
Avoid removing hashtags and mentions.
--no-remove-tags Avoid removing XML/HTML tags
--no-space-normalization
Avoid normalizing white spaces
--no-replace-urls Avoid replacing URLs with "[URL]"
--char-length-filter-document CHAR_LENGTH_FILTER_DOCUMENT
Minimum char length per document. Set to 0 not to
apply any filter.
--no-head-filter Avoid filtering documents coming from a crawler (having
a "heads" attribute) with common HTTP errors.
--digits_filter DIGITS_FILTER
Maximum allowed proportion of digit characters
--remove-citations If used, remove citations in the common square
brackets format, e.g [34]
--lang-chars-filter LANG_CHARS_FILTER
Maximum allowed proportion of characters not belonging
to the alphabet of the language
--alphanum-filter ALPHANUM_FILTER
Maximum allowed proportion of non-alphanumeric
characters
--uppercase-filter UPPERCASE_FILTER
Maximum allowed proportion of uppercase characters
--alphabet-filter ALPHABET_FILTER [ALPHABET_FILTER ...]
Alphabets that should be present (eg. LATIN)
--lang-filter LANG_FILTER [LANG_FILTER ...]
List of languages that should be allowed when filtering
by lang. If not set, no filtering is applied.
--initial-lang-filter-threshold INITIAL_LANG_FILTER_THRESHOLD
If --lang-filter is set, minimum threshold for the
initial lang identifier
--dictionary-filter-doc DICTIONARY_FILTER_DOC
Path to dictionary (plain text, one term per line) of
terms that should not appear in a document
--seg-sentences Segment wrongfully concatenated sentences.
--char-length-filter-sentence CHAR_LENGTH_FILTER_SENTENCE
filter sentences shorter than a given minimum
character length
--word-length-filter-sentence WORD_LENGTH_FILTER_SENTENCE
filter sentences shorter than a given minimum word
length
--digits-filter-sentence DIGITS_FILTER_SENTENCE
Maximum allowed proportion of digit characters in the
sentence
--profanity-check filter sentences with sensitive content
--fast-lang-filter-threshold FAST_LANG_FILTER_THRESHOLD
If --lang-filter is set, minimum threshold for the
faster lang identifier
--slow-lang-filter-threshold SLOW_LANG_FILTER_THRESHOLD
If --lang-filter is set, minimum threshold for the
slower lang identifier
--no-lang-filter-sentence
Avoid applying language filter on sentences
--no-lang-filter-sentence_src_tgt
Avoid applying language filter on sentences with
"src=" pattern
--code-threshold CODE_THRESHOLD
Threshold (percentage) of code-like chars and tokens to
filter a sentence (-1 to deactivate)
--dictionary-filter-sen DICTIONARY_FILTER_SEN
Path to dictionary (plain text, one term per line) of
terms that should not appear in a sentence
--no-dedup-same-doc-sentences
Do not deduplicate sentences in the same document.
--no-src-tag-filter Do not remove sentences with the pattern "src=".
--spell-check Apply spell checking.
--terminology-norm TERMINOLOGY_NORM
Path to a terminology dictionary to apply
normalization
--punctuation-norm Apply punctuation normalization.
--document-deduplication-threshold DOCUMENT_DEDUPLICATION_THRESHOLD
Threshold for document de-duplication, expressed as
the percentage of sentence overlap between documents
--remove-glob-rep-sen REMOVE_GLOB_REP_SEN
Whether to remove corpus-level repeated sentences
(threshold of repetitions; -1 to deactivate)
--dedup-buffer DEDUP_BUFFER
Deduplication buffer size, in bytes (default:
100000000)
E Model usage
RoBERTca can be used as follows (in the example, the case of mask filling, but the code for
other tasks is very similar):
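A minimal sketch of this usage with the Hugging Face transformers library is shown below (a sketch only: the exact script may differ, the prompt is merely illustrative, and pretrained_directory stands for the local directory containing the converted model and tokenizer):

from transformers import pipeline

# pretrained_directory: local directory with the model and tokenizer in Hugging Face format.
pretrained_directory = "path/to/robertca"

fill_mask = pipeline("fill-mask",
                     model=pretrained_directory,
                     tokenizer=pretrained_directory)

for prediction in fill_mask("La pandèmia de <mask> ha causat una nova crisi econòmica."):
    print(prediction["token_str"], round(prediction["score"], 3))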
Instead of the pretrained directory of the model (pretrained_directory), the user will be able to simply input the name of the model once we upload it to Huggingface's hub, which will be done soon. In the meantime, the weights of the model can be provided on request.