
Master in Artificial Intelligence

Master Thesis

A pipeline for large raw text preprocessing and model training of language models at scale

Author:
Jordi Armengol Estapé

Advisor:
Marta Ruiz Costa-Jussà
Computer Science (CS) Department - UPC

Co-Advisor:
Maite Melero Nogues
Text Mining Unit - Barcelona Supercomputing Center

January 2021

FACULTAT D’INFORMÀTICA DE BARCELONA (FIB)
FACULTAT DE MATEMÀTIQUES (UB)
ESCOLA TÈCNICA SUPERIOR D’ENGINYERIA (URV)
Abstract

The advent of Transformer-based (i.e., based on self-attention architectures) language


models has revolutionized the entire field of Natural Language Processing (NLP). Once pre-
trained on large, unlabelled corpora, we can apply transfer learning to virtually all down-
stream tasks. The paradigmatic example is the BERT model. Recent works have proposed
alternative pre-training algorithms and neural architectures for improving the efficiency or
the performance of the models. Besides, the ecosystem of frameworks for using these models
has flourished.
Nevertheless, less attention has been paid to the practical issues of preparing new cor-
pora for pre-training language models and training them effectively from scratch in High-
Performance Computing (HPC) clusters. Preprocessing new corpora is critical for languages
and domains that do not have enough published resources. In contrast, the practical details
of training language models from scratch are less known than those for fine-tuning existing
models. Also, if the quality of the data is enhanced, language and domain-specific lan-
guage models have already been shown to outperform their multilingual and general-domain
counterparts, at least in some cases. This project consists of developing a preprocessing and
training pipeline for generating language models at scale, especially targeting under-resourced languages and domains.
The preprocessing pipeline’s crucial role consists of cleaning raw text and formatting it as
needed while preserving document-level coherency (if possible) to learn long-range dependen-
cies. Most of the existing data gathering and cleaning methods for NLP have focused more
on the quantity than the quality. Since our approach aims to be compatible with low-resource
languages and domains, the filtering should be as fine-grained as possible (or risk losing useful
data). Unlike other works, we put special emphasis on the generation of resources for training
these models.
Regarding training, learning from scratch large models presents several challenges, even
if leveraging existing libraries. Apart from adapting to the specifics of an HPC cluster and
a careful choice of hyperparameters, ideally, the training procedure should be relatively low-
resource-friendly.
We show our system’s application for generating new corpora in real-world use cases
and how these data can be effectively used for training models from scratch. We thoroughly
document the process. The corpus resulting from aggregating both existing (but preprocessed
with our proposed cleaning pipeline) and new datasets consists of the largest Catalan corpus
ready for language modeling ever compiled, to the best of our knowledge. In the case of
general-domain Spanish, by preprocessing (even if still partially) a big Spanish crawling,
we have shown that the cleaning pipeline is suitable for large-scale corpora. The obtained
corpus is arguably one of the largest ones for Spanish, at least before deduplication. We
further show the cleaning pipeline’s flexibility by applying it for generating a large biomedical
Spanish corpus for training language models and a variety of European languages (in this
case, targeting unsupervised machine translation). These use cases prove the preprocessing
methodology’s flexibility in terms of language, scale, domain, and targeted task.
To show the usefulness of the corpora generated with our approach, we focus on the case
of the Catalan language. We have built the first Catalan RoBERTa ever. To the best of
our knowledge, it is the language model that has seen the most Catalan data to date. The
preliminary evaluation shows that it is at least competitive with the existing baselines, while
needing considerably shorter sequences, because the vocabulary is composed of numerous
Catalan words and the tokenizer does not need to split many of them into subwords.
Acknowledgements

First of all, coinciding with the last phase of the development of this master thesis, I was diagnosed
with COVID-19. I thank my family for their support when I was recovering, especially my parents
and brother.
The resources used in this work have been partially funded by the State Secretariat for Digital-
ization and Artificial Intelligence (SEDIA) to carry out specialised technical support activities
in supercomputing within the framework of the Plan TL1 signed on 14 December 2018 and
the MT4All CEF project2 for developing resources and models for the unsupervised training of
machine translation systems for low-resource languages.
This master thesis would not have been possible without the direction of Marta Ruiz Costa-Jussà,
an expert in Natural Language Processing, especially machine translation, who has recently
been awarded an ERC starting grant. The supervision and insights of Maite Melero, established
researcher at the Text Mining Unit of the Barcelona Supercomputing Center (TeMU-BSC), have
been key to success as well. Finally, the advice of Marta Villegas, co-leader of TeMU-BSC, has
been priceless.
I thank and acknowledge the help of my colleagues at TeMU-BSC, especially Ona de Gibert
and Casimiro Pio Carrino, research engineers in charge of the data gathering process (thus, data
gathering itself is out of the scope of this thesis) for the Catalan, biomedical Spanish, and the
machine translation pairs in the MT4All project. They also helped with the testing and certain
extensions of the pipeline. Carlos Gerardo Rodríguez Penagos, a researcher at TeMU-BSC,
provided resources for evaluating the Catalan RoBERTa. In this thesis, I focus on my individual
contributions.
Regarding the data, the raw crawlings used for developing the corpora were conducted by the
Operations departments at the Barcelona Supercomputing Center and the Spanish National
Library. Albert Farrés helped with the profiling of the cleaning pipeline parallelization. Quim
Moré developed the script to store the Spanish National Library data in JSON files.
The industry partners of the MT4All project provided the seed URLs for running the crawlers
for their respective use cases. UPV-EHU developed the unsupervised machine translation system
that was used with the corpora resulting from these crawlings.

1 https://www.plantl.gob.es/
2 https://ec.europa.eu/inea/en/connecting-europe-facility/cef-telecom/2019-eu-ia-0031
List of Figures

1 Desired end-to-end architecture overview . . . . . . . . . . . . . . . . . . . . . . . 3


2 Embedding layer . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
3 Original Seq2seq architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
4 Transformer architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
5 Multi-head attention . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
6 Attention map . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
7 BERT pre-training objectives . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
8 Further in-domain pre-training vs. training from scratch . . . . . . . . . . . . . . 24
9 Cleaning pipeline overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
10 Cleaning profiling (call graph) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
11 Cleaning profiling (counts) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
12 Cascade of language identifiers . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
13 Traces obtained from the first parallelization strategy . . . . . . . . . . . . . . . 53
14 Traces obtained from the second parallelization strategy . . . . . . . . . . . . . . 54
15 Cleaning pipeline execution on BNE . . . . . . . . . . . . . . . . . . . . . . . . . 67
16 RoBERTca learning rate . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74
17 RoBERTca loss scale . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74
18 RoBERTca gradient norm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75
19 RoBERTca Words Per Batch (WPB) . . . . . . . . . . . . . . . . . . . . . . . . . 75
20 RoBERTca perplexity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76
21 RoBERTca best validation loss . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76
22 RoBERTca loss curves . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77
23 RoBERTca prompt 1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79
24 RoBERTca prompt 2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79
25 RoBERTca prompt 3 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80
26 RoBERTca prompt 4 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81
27 RoBERTca prompt 5 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81
List of Tables

1 Comparison between some of the Transformer language models . . . . . . . . . . 16


2 Comparison between some of the available tokenizers . . . . . . . . . . . . . . . . 21
3 Comparison between some of the available tokenizers . . . . . . . . . . . . . . . . 22
4 Tokenization artifacts . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
5 Comparison between some of the available language-specific models . . . . . . . . 28
6 Comparison between some language and domain-specific vocabularies . . . . . . 29
7 Results of cleaning strategies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
8 Existing Catalan crawlings . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
9 Existing Catalan language models . . . . . . . . . . . . . . . . . . . . . . . . . . 34
10 Examples of desired and undesired cleaning . . . . . . . . . . . . . . . . . . . . . 41
11 Speed-up obtained by the cascade of language identifiers . . . . . . . . . . . . . . 51
12 Speed-ups obtained by the distributed implementation . . . . . . . . . . . . . . . 53
13 Comparison between libraries for training language models . . . . . . . . . . . . 62
14 Fairseq RoBERTa baseline on CTE-POWER . . . . . . . . . . . . . . . . . . . . 63
15 Comparison between BNE and other big Spanish corpora . . . . . . . . . . . . . 67
16 Preliminary results of the MT4All project . . . . . . . . . . . . . . . . . . . . . . 69
17 Cleaned (or generated from scratch) Catalan corpora . . . . . . . . . . . . . . . . 70
18 Cleaned (or generated from scratch) biomedical Spanish corpora . . . . . . . . . 71
19 Cleaned (or generated from scratch) corpora for MT4All . . . . . . . . . . . . . . 72
20 Tokens per sentence . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78
21 Mask filling comparison . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78
22 RoBERTca evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82
Contents

1 Introduction 1
1.1 Thesis structure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.2 How to read this thesis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4

2 Background 5
2.1 Transformer . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
2.2 Transformer language models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
2.2.1 Decoder-based models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
2.2.2 Encoder-based models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
2.2.3 Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
2.3 Unsupervised machine translation . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
2.3.1 Off-line mapping . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
2.3.2 Joint learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
2.4 Tokenization, subwords, and vocabulary building . . . . . . . . . . . . . . . . . . 19
2.5 Corpora generation and processing methodologies . . . . . . . . . . . . . . . . . . 22

3 Related work 23
3.1 Domain-specific models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
3.2 Language-specific models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
3.3 Unlabelled corpora generation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
3.4 Summary and conclusions on the state of the art . . . . . . . . . . . . . . . . . . 33

4 Settings 35
4.1 Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
4.1.1 Catalan . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
4.1.2 General-domain Spanish . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
4.1.3 Biomedical Spanish . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
4.1.4 MT4All . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
4.2 Environment . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38

5 Methods 39
5.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
5.2 Data gathering and storage . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
5.3 Cleaning and formatting pipeline . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
5.3.1 Design and features . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
5.3.2 Data parser . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
5.3.3 Encoding fixer . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
5.3.4 Prefilterer . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
5.3.5 Sentence splitter . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
5.3.6 Sentence filter . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
5.3.7 Normalizer . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
5.3.8 Document filter . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
5.3.9 Output formatter . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
5.3.10 Implementation and performance . . . . . . . . . . . . . . . . . . . . . . . 49
5.3.11 Parameters . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
5.4 Metadata and aggregation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
5.5 Model-ready pipeline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
5.5.1 Document-level statistics and sanity checks . . . . . . . . . . . . . . . . . 56
5.5.2 Decontamination . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
5.5.3 Data splitting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
5.5.4 Tokenization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
5.5.5 Dictionary building and binarization . . . . . . . . . . . . . . . . . . . . . 58
5.5.6 Final sanity checks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
5.6 Training . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
5.6.1 Considerations on parallelization . . . . . . . . . . . . . . . . . . . . . . . 59
5.6.2 Considerations on the training environment . . . . . . . . . . . . . . . . . 59
5.6.3 Effective batch size and gradient accumulation . . . . . . . . . . . . . . . 60
5.6.4 Deep learning backend and distributed training . . . . . . . . . . . . . . . 61
5.6.5 Launcher . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
5.6.6 Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
5.6.7 Hyperparameters and training objective . . . . . . . . . . . . . . . . . . . 63
5.7 Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64

6 Applications and results 65


6.1 Corpora generation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
6.1.1 Catalan . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
6.1.2 General-domain Spanish . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
6.1.3 Biomedical Spanish . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
6.1.4 MT4All . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
6.2 Model generation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
6.2.1 RoBERTca: the Catalan RoBERTa . . . . . . . . . . . . . . . . . . . . . 73

7 Discussion 84
7.1 Results analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84
7.2 Limitations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86
7.3 Impact statement . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86

8 Conclusions and future work 88

References 90

A Source code and logs 100

B Cleaning pipeline parameters 100

C Model-ready pipeline parameters 104

D Data samples (before and after data cleaning) 104

E Model usage 104


1 Introduction

Monolingual corpora have become central to the training of Natural Language Processing (NLP)
models. Long before that, in this field it was thought that supervision was required for obtain-
ing useful enough representations, even in the pre-training stage [1] [2]. More recently, with
the advent of Word2vec [3] and other algorithms for learning word embeddings, unsupervised
pre-training started to become competitive and even state-of-the art for a number of NLP tasks.
ELMo [4] introduced the concept of deep contextualized word embeddings, by taking the hidden
states of a recurrent architecture together with the word embeddings themselves as represen-
tations. It was the first language model explicitly pre-trained with the intent of transferring
knowledge to a number of downstream tasks.
The recently introduced Transformer architecture [5] has revolutionized the NLP scene. Although initially
intended for machine translation, it has been shown to also serve as a powerful deep
learning backend for pre-training large language models, as in the seminal works of the encoder-based
system BERT [6] and the decoder-based architecture GPT [7] [8].
Researchers have proposed many improvements to BERT and GPT, both in terms of performance
and efficiency. On the one hand, alternative pre-training algorithms such as the one in ELECTRA
[9] enable a more efficient usage of data. On the other hand, neural-level modifications to the
Transformer such as the one in ALBERT [10] provide different trade-offs for computation time
and memory requirements.
In addition, there is a vibrant ecosystem of open pre-trained models and libraries to leverage
them. Today, transferring knowledge from most of these models (except the closed ones or those
computationally infeasible for most institutions, such as GPT-3 [11]) is easier than ever, thanks to
the flourishing of a powerful and diverse ecosystem of libraries. We especially highlight the usability
of Huggingface’s Transformers [12], a library with many implementations of Transformer-based
models, as well as an ecosystem of pre-trained weights for these models to which everyone is
welcome to contribute. These libraries can even be used for training models from scratch, but
it is not the typical use-case, and they do not solve the problem of obtaining enough data for
different languages and domains.
In parallel, some works have shown that monolingual corpora can effectively be leveraged for ma-
chine translation systems as well, both in semi-supervised or even unsupervised settings [13] [14]
[15]. Besides, and more related to the more general NLP models mentioned above, monolingual
corpora can be leveraged by cross-lingual models as well, such as XLM [16]. By cross-lingual
models we mean models that explicitly learn from data in different languages (instead of just
concatenating data in all the languages, as in the case of the multilingual BERT), which can
then transfer knowledge to machine translation. Thus, the development of monolingual corpora
for a wide range of languages and domains is of the utmost importance, together with the
training of the models themselves. Once the resources and infrastructure for training models
from monolingual corpora are deployed, we can transfer the knowledge to numerous downstream
tasks.
However, there are some aspects that, to the best of our knowledge, and from our point of view,
have been partially neglected, at least in comparison with the vast amount of efforts devoted
to the architectural improvements, English-centric resources, and tools for leveraging existing
pre-trained models. First, few works state explicit details on data collection and preprocessing

steps (unlike the neural architectures themselves, which are extensively documented and usually
open-sourced), in spite of the fact that data are arguably the most important part of these
models (without data, there would be no models). Obtaining large, model-ready English corpora,
especially if from the general domain, is straightforward. Doing so for low-resource languages
or domains, not so much. It can be challenging even for languages with hundreds of millions of
speakers. In addition, some methodologies, such as the criteria for decontaminating training data
from evaluation sentences, good practices for keeping track of the state of the corpora, or the
importance of maintaining document-level corpora, are not generally well-established.3 Second,
practical details on training models from scratch are usually less understood and shared than
the ones related to the use of these models for transferring knowledge to downstream tasks. One
of the reasons this may be the case is that using pre-trained models for downstream tasks is
a common use-case for many companies, institutions, and individuals. Training from scratch
is not a possibility for most users. Nevertheless, we observe a middle ground between massive
language models only feasible for training by big companies and institutions4 and every single
researcher training his or her own model. At least in Europe, there is a network of publicly owned
(or funded) High-Performance Computing centers that could be able to train (and, in fact, has
already done with some degree of success5 ) relatively big models. Otherwise, apart from facing
a reproducibility crisis6, we risk neglecting many languages and domains that big companies and
institutions may not be interested in enough.
In this work, we propose a complete cleaning and processing pipeline for building new corpora for
training this kind of model, which is especially relevant in the case of non-English models and
low-resource domains. The cleaning pipeline is built from scratch with a high degree of linguistic
sensitivity and the purpose of being generic, extensible, and big-data and HPC-ready. Once the
cleaning process has ended and the data are cleaned and formatted, the corpora are organized,
decontaminated from sentences present in the target evaluation benchmarks, tokenized, and binarized.
Regarding the training, we leverage existing libraries, adapting them to the specifics of the HPC cluster and the
data in use. Ideally, we would like to have an end-to-end pipeline, from data collection to model
evaluation, although the scope of the project is limited as a master thesis, and we leave some
steps as future work. Figure 1 shows the high-level overview of the desired architecture. Note
that in this work we especially focus on the data processing, and show the corresponding recipes
for effective model training, but having a general understanding of the whole process is key to
success. The proposed architecture is heterogeneous in the sense that different components run
on different kinds of machines.
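As a toy illustration of the flow just described (cleaning raw documents and then decontaminating the training data against the evaluation benchmarks), consider the following sketch. Every function name and heuristic in it is an illustrative stand-in, not the actual pipeline described in Section 5, which operates at document level, in parallel, and with far more refined filters.

```python
# Illustrative stand-ins only: the real cleaning pipeline (Section 5) is
# document-level, parallel, and linguistically much more refined.

def clean_document(doc, min_sentence_len=20):
    """Toy cleaner: naive sentence split plus a minimum-length filter."""
    sentences = [s.strip() for s in doc.split(".") if s.strip()]
    return [s for s in sentences if len(s) >= min_sentence_len]

def decontaminate(train_sentences, eval_sentences):
    """Drop training sentences that also appear in evaluation benchmarks."""
    eval_set = set(eval_sentences)
    return [s for s in train_sentences if s not in eval_set]

raw_docs = ["A short line. This sentence is long enough to be kept. "
            "This other long sentence overlaps with an evaluation set."]
eval_benchmark = ["This other long sentence overlaps with an evaluation set"]

cleaned = [s for doc in raw_docs for s in clean_document(doc)]
train = decontaminate(cleaned, eval_benchmark)  # ready for tokenization and binarization
print(train)
```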
Far from being a theoretical exercise or a set of experiments with toy data, we apply our pro-
cessing pipeline to real-world use cases, and show how the generated corpora can be used to train
models from scratch at scale. Specifically, both the Spanish PlanTL and the MT4All CEF (for
reference, both mentioned in Acknowledgements) directly benefit from this work.
One may wonder why simply using existing pre-trained systems is not always a solution. First
of all, these systems have been trained with the data that were available at the time of their
development, and by available we don’t always mean literally existing, but being either model-
ready or easy to preprocess. For instance, multilingual BERT does include Catalan, but only the
3 For instance, see Section 4 in the recent article on the GPT-3 model [11].
4 And the rest of NLP researchers only being able to fine-tune these models.
5 See the case of the Finnish BERT [17], which was trained in a European HPC cluster. We will review it in Section 3.
6 https://www.wired.com/story/artificial-intelligence-confronts-reproducibility-crisis/

Wikipedia. Instead, in this work we collect and preprocess a Catalan corpus orders of magnitude
larger. Apart from that, building a language or domain-specific model with an emphasis on
data collection and preprocessing has already been shown to outperform multilingual and general
domain models. We refer to the paradigmatic cases of the Finnish BERT [17] and BioBERT [18],
described in Section 3.

Figure 1: Desired end-to-end architecture overview: High-level overview of the proposed het-
erogeneous (i.e., running on different kinds of clusters) architecture. In this work, we especially
focus on the preprocessing components, which we build from scratch, and show that the results can
be leveraged by the training component, which we adapt to the specifics of our data and HPC
cluster. Source: Own elaboration.

1.1 Thesis structure


In this section, we introduced our work. In the next section, Section 2, we contextualize our pro-
posal by detailing the required background in terms of NLP, deep learning, and data generation,

which motivates the development of this system. In Section 3, we describe the related work,
and how our method fits in as one of the natural next steps. Section 4 describes the data on
which we apply our system, and the environment in which we run it. Even if our method aims
to be as generic as possible, these data serve both as the initial motivating example for building
these tools and as validation of our architecture. In Section 5, we extensively describe our proposal,
with fine-grained details. In Section 6 we present our results, which are discussed in Section 7.
Finally, we arrive at conclusions in Section 8.

1.2 How to read this thesis


In case the reader wants to first read the most important parts of this thesis, he or she is advised
to read this introduction for an overview of the thesis. Then, Section 2 can be skipped in
case the reader is already familiar with Transformer-based models, except Section 2.4, which we
especially recommend as a summary of subword tokenization and vocabulary building strategies.
The reader is advised not to skip Section 3, as a survey of the related work. In this thesis,
the description of the preprocessing steps is not a mere formality, but central to the work, and
we devote a considerable amount of the subsections to it. Still, the most important parts are
described in Sections 5.3 and 5.5. Regarding training, we especially recommend Section 5.6.3.
For a report of the results and their discussion, we advise reading both Sections 6 and 7.

2 Background

For understanding the motivation of our proposal, we must first contextualize the NLP scenario
in which we believe our approach is a sensible next step. In this section, we describe the state-
of-the-art architecture in NLP and the implications of the scalability of models based on it, as
well as other concepts that are relevant to the development of the rest of the thesis.

2.1 Transformer
We can argue that the closest ancestors of the Transformer7 architecture were recurrent Seq2seq
[19] architectures with encoder-decoder attention. This architecture, as many other NLP models,
leveraged embedding layers (depicted in Figure 2), one for the encoder and another one for the
decoder. Figure 3 depicts the vanilla recurrent Seq2seq architecture.
The problem of the vanilla recurrent Seq2seq was that the model had to compress the source
sequence (of variable size) into a fixed size vector, and this caused an information bottleneck that
prevented the model from recalling the whole original sentence. This motivated the introduction
of an attention mechanism, namely the one known as Bahdanau attention (named
after its author) [22].
By attention we mean the mechanism by which a model learns a set of attention weights. These
weights are akin to the other parameters of the neural network, but their value depends, dynam-
ically, on the specific input (thanks to being computed by other weights, which are fixed for all
inputs). Each attention weight can be thought of as the relative importance of each component of
the input and is multiplied by the respective component, outputting a context vector, c:

c_i = \sum_{j=1}^{m} a_{ij} s_j

a_i = \mathrm{softmax}(f_{att}(h_i, s_j))
where s_1, s_2, ..., s_m are the hidden states with respect to which the attention is paid (i.e., the
preceding ones), and h_1, h_2, ..., h_n are the hidden states whose value will be determined
depending on the context vector (i.e., the new ones). f_att is the attention function, and it de-
pends on the specific attention implementation. The original attention used in Seq2seq systems,
Bahdanau attention, is also known as additive attention and is computed as follows:

f_{att}(h_i, s_j) = v_a^T \tanh(W_a [h_i ; s_j])
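To make the formulas concrete, here is a minimal sketch of additive attention for a single decoder state, with made-up dimensions and randomly initialized (untrained) weights:

```python
import torch
import torch.nn.functional as F

d, m = 8, 5                          # hidden size and number of encoder states (assumed)
s = torch.randn(m, d)                # encoder hidden states s_1, ..., s_m
h = torch.randn(1, d)                # current decoder hidden state h_i
W_a = torch.randn(d, 2 * d)          # untrained weights, for illustration only
v_a = torch.randn(d)

# f_att(h_i, s_j) = v_a^T tanh(W_a [h_i ; s_j]), computed for every j at once
concat = torch.cat([h.expand(m, d), s], dim=-1)   # (m, 2d)
scores = torch.tanh(concat @ W_a.T) @ v_a         # (m,)
a = F.softmax(scores, dim=0)                      # attention weights
c = (a.unsqueeze(-1) * s).sum(dim=0)              # context vector c_i
print(c.shape)                                    # torch.Size([8])
```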

But there are other implementations, such as multiplicative variants [23]. For additional infor-
mation on attention, we refer to [24]. The transformer uses scaled dot product attention, as we
will see. This was regarding the implementation. Nevertheless, attention can also be classified
in terms of which states attention is paid to. The original attention was encoder-decoder atten-
tion, but apart from that, the Transformer also uses self-attention (i.e., attention with respect to
7 Apart from the original article, we recommend the following resource for getting a better understanding of this model: http://jalammar.github.io/illustrated-transformer/

Figure 2: Embedding layer: Embedding layer used in virtually all neural NLP architectures,
including both the original Seq2seq and the Transformer. For each token, the embedding layer,
which works as a lookup table, retrieves the corresponding dense vector, a better representation
than just one-hot vectors. This representation is more compact and the distance correlates with
semantic distance. This layer is differentiable and, thus, the word embeddings can be learned
end-to-end. Source: Own elaboration, first appeared in [20].

the input itself), both in the encoder (encoder self-attention) and in the decoder (decoder self-
attention), unlike RNN-based architectures. Self-attention can only be multiplicative. Instead of
having the attention mechanism as a complementary component, the Transformer is built using
scaled-dot product attention as the fundamental building block.
The Transformer, like the recurrent Seq2seq architecture, is also composed of an encoder and a
decoder. In the encoder, the tokens are not input sequentially as in RNN architectures. Instead,
they can all be input at once (if running on a GPU), since there is no specific order in which the inputs
must be passed through the model in each layer. First of all, the tokens are passed through a
conventional embedding table (as in Figure 2), which serves as a differentiable lookup table that
retrieves the word vector corresponding to each word (no different with respect to plain recurrent
Seq2seq). In the embedding layer there is one subtle, yet vital, difference that we will see later on.
Then, the encoder is composed of a stack of identical encoder blocks. Each of them is composed
of a scaled dot-product self-attention layer followed by a linear layer. In the former, self-attention
vectors are computed pair-wise (each token attends with respect to the rest, including itself),
which, for a given layer, can be done in parallel since there are no dependencies. Crucially, there

Figure 3: Original Seq2seq architecture: The word vector of each token in source sequence is
input one by one into a recurrent encoder. Then, the recurrent decoder autoregressively generates
the target tokens (with teacher forcing, i.e., using the real target tokens instead of the predicted
ones, during training). Source: Own elaboration adapted from [21], first appeared in [20].

are residual connections, as in [25]. These connections skipping one layer (going directly to the
next one) allow to retain information from the previous layers. Both attention and linear layers
are normalized with layer normalization8 [27]. Without these details, the Transformer would not
train properly.
Each encoder block takes as input the outputs of the previous block (except for the first one,
which takes the vectors from the embedding layer). The embeddings generated from these
attentional layers are known as contextual embeddings, in the sense that there is one vector for
each token, but the value depends on the other ones.
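A minimal sketch of such an encoder block, using PyTorch's built-in multi-head attention and the post-norm arrangement of the original article (the sizes are illustrative):

```python
import torch
import torch.nn as nn

class EncoderBlock(nn.Module):
    """Self-attention plus a position-wise feed-forward layer, each wrapped
    in a residual connection followed by layer normalization."""

    def __init__(self, d_model=512, n_heads=8, d_ff=2048):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ff = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(),
                                nn.Linear(d_ff, d_model))
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x):
        attn_out, _ = self.attn(x, x, x)   # self-attention: Q, K and V are the input itself
        x = self.norm1(x + attn_out)       # residual connection + layer norm
        x = self.norm2(x + self.ff(x))     # residual connection + layer norm
        return x

block = EncoderBlock()
tokens = torch.randn(2, 10, 512)           # (batch, sequence length, embedding size)
print(block(tokens).shape)                 # torch.Size([2, 10, 512])
```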
The decoder works similarly. It performs decoder self-attention with the target tokens that have
already been predicted (note that, during training, masking is required to prevent the decoder
from seeing future target tokens that have not yet been predicted, which would otherwise result in a model
that would not learn properly). In inference, this cannot be done in parallel, and it is instead run
autoregressively as in Seq2seq (i.e., for each time step, the decoder takes its own previous outputs
as input). The decoder is, thus, also composed of a stack of identical blocks, but each of them
has three layers (instead of two, as the encoder). Apart from the self-attention and linear layers,
it has an additional layer for performing encoder-decoder attention, to attend to the encoder
outputs (the ones of the last layer) as in recurrent Seq2seq with Bahdanau attention. After the
last decoder block, the Transformer decoder has a projection layer for actually predicting the
specific tokens that should be output.
We said that the embedding layer has one subtle but relevant difference with the vanilla one.
The Transformer itself can be seen as a fully-connected graph neural network9 for working with
sets, since unlike RNNs, it does not have an inherent notion of sequence order. For preventing
the model from degenerating into a bag of words-like model, positional embeddings are summed
to the input embeddings. In the original article, they tried both learned positional embeddings
and sinusoidal positional embeddings, which performed nearly identically; the latter was chosen
because it may extrapolate better to longer sequences:
8 Notice the difference with batch normalization [26]. In layer normalization, the normalization is independent from other samples in the batch.
9 https://thegradient.pub/transformers-are-graph-neural-networks/

PE[pos, 2i] = \sin\left( \frac{pos}{10000^{2i / d_{model}}} \right)
Newer variants of the Transformer do use learned positional embeddings, but with other imple-
mentations10 . Figure 4 shows an overview of the Transformer architecture.
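A minimal sketch of these sinusoidal positional embeddings (the original article uses the sine for even dimensions and the cosine for odd ones; the sizes below are illustrative):

```python
import torch

def sinusoidal_positional_encoding(max_len, d_model):
    """PE[pos, 2i] = sin(pos / 10000^(2i/d_model)); cosine for odd dimensions."""
    positions = torch.arange(max_len).unsqueeze(1)    # (max_len, 1)
    dims = torch.arange(0, d_model, 2)                # even dimensions 2i
    angles = positions / (10000 ** (dims / d_model))  # (max_len, d_model / 2)
    pe = torch.zeros(max_len, d_model)
    pe[:, 0::2] = torch.sin(angles)
    pe[:, 1::2] = torch.cos(angles)
    return pe                                         # summed to the input embeddings

print(sinusoidal_positional_encoding(128, 512).shape)  # torch.Size([128, 512])
```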
There is one low-level (but important) detail we have not gone through yet: the attention im-
plementation. As we said, it is called scaled dot product attention. It is a key-value attention
variant, in the sense that query vector and key vector pairs are used to compute a similarity
measure (specifically, the dot product). Each hidden vector h_i is split into a key k_i and a value
v_i. In the Transformer, this is implemented as:

\mathrm{softmax}\left( \frac{Q K^T}{\sqrt{d_k}} \right) V = Z

The query, key and value vectors of a given token are projected with a linear layer that decreases
the dimensionality of the embedding of the token. A typical dimension would be 64, when the
embedding itself is of 512 elements (in the original Transformer; there are scaled up versions of
it). First of all, we compute the attention score by taking the dot product between the query
vector and the key vector of the word we are scoring. The query vector is the one of the token we
are interested in, and the key is the one we are measuring the relative importance with respect
to the query. So, when computing the new contextual embedding for a given token, we will take
the dot product between its query vector with respect to each of the key vectors we are attending
to. The score is then divided by the square root of the number of dimensions of the query, vector
and value vectors (we said 64; so 8). This division is why it is called scaled dot product. We
apply softmax to each of the scaled scores for a given token and a set of tokens it is attending
to (so that they sum 1), and they are then summed. This sum is multiplied by the value vector
(which comes from the token we are interested in). These operations must be done for all token
pairs, and it is easy to see that they can be efficiently expressed as matrix operations in the
formula above. Notice the versatility of this mechanism. In self-attention layers, queries, keys,
and values come from the input itself. In the encoder-decoder attention layers, the exact same
mechanism is used, but this time the keys and values come from the output of the last encoder
layer, while queries come from the decoder itself. Note that this is only one attention head.
The Transformer uses multiple heads, all of them running in parallel. The motivation for this
multi-head attention is that each head can then specialize on different aspects (e.g., one head
could learn to focus on syntactic features). Figure 5 depicts the scaled dot product multi-head
attention. Figure 6 shows an attention map.
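A minimal sketch of this computation with several heads; the sizes mirror the original Transformer (embedding 512, 8 heads of dimension 64), the weights are random, and the final linear projection that merges the heads back together is omitted:

```python
import torch
import torch.nn.functional as F

batch, seq_len, d_model, n_heads = 2, 10, 512, 8
d_k = d_model // n_heads                                  # 64 dimensions per head
x = torch.randn(batch, seq_len, d_model)                  # token embeddings

# Down-sampling projections for queries, keys and values (one fused matrix
# each, later reshaped into heads, as efficient implementations do).
W_q, W_k, W_v = (torch.randn(d_model, d_model) for _ in range(3))

def split_heads(t):
    return t.view(batch, seq_len, n_heads, d_k).transpose(1, 2)   # (B, H, T, d_k)

Q, K, V = split_heads(x @ W_q), split_heads(x @ W_k), split_heads(x @ W_v)
scores = Q @ K.transpose(-2, -1) / d_k ** 0.5             # scaled dot products (B, H, T, T)
Z = F.softmax(scores, dim=-1) @ V                         # weighted sum of values (B, H, T, d_k)
Z = Z.transpose(1, 2).reshape(batch, seq_len, d_model)    # concatenate the heads
print(Z.shape)                                            # torch.Size([2, 10, 512])
```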
The original Transformer improved the BLEU score on the WMT 2014 English-to-German trans-
lation task by 2 points and has ever since been the standard architecture in neural machine
translation. But with the advent of BERT and GPT-like models, its influence has affected the
whole NLP field. Thanks to being purely attentional, it outperforms RNNs for modeling long-
range dependencies. This comes, though, at the cost of quadratic complexity (O(n^2)) with
respect to the number of tokens. This has motivated the development of approximations of the
attention mechanism of the Transformer (e.g., O(n log n)) [29].

10 See relative positional embeddings [28].

Figure 4: Transformer architecture: Both the input tokens and the (right-shifted) target tokens
are input to their respective embedding layers. In both cases, positional embeddings are summed
to encode the position of each token, since the Transformer does not have an inherent notion
of sequentiality. The encoder is composed of a stack of identical blocks, each of them having a
multi-head attention layer and a linear layer. Both of them exhibit residual connections and layer
normalization, for easing the optimization. The decoder is built likewise, but with an additional
layer to compute the encoder-decoder attention. The decoder’s self-attention, unlike the encoder
self-attention, has to be masked to prevent the model from seeing target tokens before they
have been actually predicted. After the last decoder block, there is a linear projection to the
dimension of the target vocabulary size, followed by a softmax (since the output must be a
probability distribution over the vocabulary). Source: Own elaboration based on figure 1 in [30],
first appearing in [20].

Figure 5: Multi-head attention: Scaled dot-product attention is derived from V (values), K
(keys) and Q (queries). Each of these vectors is computed from a down-sampling projection
coming from the required token embedding. The query represents the token we are computing.
The key represents the token with respect to which we are computing the relative importance
(attention score). The scores, once scaled (and passed through a softmax), are then multiplied
by the value vector. In self-attention layers, queries, keys, and values come from the input itself.
In the encoder-decoder attention layers, the exact same mechanism is used, but this time the
keys and values come from the output of the last encoder layer, while queries come from the
decoder itself. These operations can be efficiently expressed as matrix operations. In the depicted
example, there are four heads. This means that the same computation is performed four times,
each of them independently (with different weights). Then, the results are concatenated and
passed through another linear layer. The motivation for having multiple heads is letting the
Transformer attend to different aspects at the same time; each head can specialize in detecting
different features, which is reminiscent of the convolutional filters. Source: Own elaboration
adapted from figure 2 in [30], originally appearing in [20].

Figure 6: Attention map in a sequence-to-sequence task: In machine translation tasks, the
encoder-decoder attention mechanism acts as a soft aligner between source and target sequences.
In this attention map, we observe that the English word (of the original sentence) that has
the most importance when re-writing in Spanish the word ”económico” (economic) is, indeed,
”economic”. In self-attentive settings, though, attention maps will reveal relations between the
words of the sentence itself (e.g., adjectives contextualizing a noun, or a co-reference and the
corresponding entity). Source: Own elaboration based on figure 1 in [30], first appearing in [20].

2.2 Transformer language models


As we said, even though the Transformer was originally designed for machine translation, now
it has become ubiquitous in the entire NLP field, being the state-of-the-art architecture in most
tasks. Apart from being specifically trained for a given task, the Transformer has especially
shined in terms of transfer learning capabilities. There may be different reasons why this is the
case, but mainly, we identify two of them:

• Scalability: Transformer language models have been shown to scale with data and number
of parameters to unprecedented sizes in machine learning [31]. Unsupervised pre-training
is more effective the more data are used and the bigger the model, so it benefits from
architectures that scale.
• Pre-training tasks friendliness and versatility: In deep learning, researchers use different
surrogate tasks for pre-training models without supervision. This is also known as self-
supervised learning 11 . Even if many of the pre-training objectives that are used with
Transformers could be used with other architectures, these attentional models happen
to fit well in those (e.g., masked language modeling, the main pre-training objective in
encoder-based models, could also be used with a bidirectional LSTM [32], but it would
hardly benefit from it as an effective pre-training strategy at sufficient scale, which is
difficult to obtain in the case of RNNs). For discriminative tasks, in token classification, one
can directly take the contextual embeddings of the token. In sentence classification, it is
trivial to inject a special token representing the sentence. Natural Language Understanding
(NLU) and Natural Language Generation (NLG) can easily be formulated as sequence-to-
sequence tasks, if required.

There were two seminal works as far as transformer language models are concerned, which have
been the basis of many other variants. They were proposed almost in parallel.
11 See this talk by the deep learning pioneer Yann LeCun, who coined the term: https://www.youtube.com/watch?v=8TTK-Dd0H9U.

2.2.1 Decoder-based models

Decoder-based Transformer language models, or GPT-like models, basically take the Transformer
decoder and throw away the encoder. The sub-layer of the decoder block that performs
encoder-decoder attention is also removed, since the decoder is no longer conditioned on
the outputs of any encoder. This model is trained with the conventional language modeling task,
that is, autoregressively predicting the next token from the previous ones. The original work
that crafted this approach successfully applied it on Wikipedia [33].
Nevertheless, the series of works that made this approach famous were the ones of Generative
Pre-trained Transformer (GPT), starting with the original GPT [7], which essentially scaled
up Transformer decoder language models, except for a few modifications, such as replacing the
Rectified Linear Units (ReLU) by Gaussian Error Linear Units [34]. They showed that these
pre-training could transfer knowledge to a number of language understanding benchmarks. The
representations from the last layer can be used for different tasks, once fine-tuned. GPT-2 [8],
with as many as a 1.5B parameters, and trained on even more data, was essentially a scaled up
version of GPT. One of the advantages of the GPT approach, apart from the extreme simplicity
of the training objective (pure language modeling), is that these models are a natural fit for
generative tasks (e.g., open question answering), and this is further exploited in GPT-2. Instead
of using the model as a feature extractor, in the article of GPT-2 there is more emphasis on
directly using its generated text. The most recent version, GPT-3 [11], which, again, is basically a
scaled up version of its ancestors12 , has as many as 175B parameters (more than 10x with respect
to any existing non-sparse language model at the time), and instead of fine-tuning, the authors
claim that the model has outstanding few-shot (or even zero-shot) capabilities. No weights are
modified for downstream tasks; instead, a few examples are shown as context in the inference
itself.
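A minimal sketch of this autoregressive objective: a causal mask keeps every position from attending to future tokens, and the loss asks position i to predict token i + 1. A PyTorch encoder layer with such a mask is used here as a stand-in for a decoder-only (GPT-like) block, and all sizes are illustrative.

```python
import torch
import torch.nn as nn

vocab_size, d_model, seq_len, batch = 1000, 64, 12, 2
tokens = torch.randint(0, vocab_size, (batch, seq_len))   # a toy batch of token ids

embed = nn.Embedding(vocab_size, d_model)
block = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
lm_head = nn.Linear(d_model, vocab_size)                  # projection over the vocabulary

# Boolean upper-triangular mask: True marks positions that may NOT be attended,
# so position i only sees positions <= i (the causal, decoder-style constraint).
causal_mask = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)

hidden = block(embed(tokens), src_mask=causal_mask)
logits = lm_head(hidden)                                  # (batch, seq_len, vocab_size)

# Shifted targets: the prediction at position i is scored against token i + 1.
loss = nn.functional.cross_entropy(
    logits[:, :-1].reshape(-1, vocab_size), tokens[:, 1:].reshape(-1))
print(float(loss))
```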

2.2.2 Encoder-based models

Encoder-based Transformer language models, or BERT-like models, take an alternative approach,


by throwing away the decoder. BERT (Bidirectional Encoder Representations from Transform-
ers) [6] is a Transformer encoder, plus a linear layer (followed by a softmax) for predicting
the tokens (as a sort of tiny decoder). Like GPT, it also replaces the ReLU activation with
GELU. In addition, BERT injects an additional token, [CLS], which can be used for extracting
sentence-level information. Apart from that, and being considerably scaled up with respect to
the encoder used in the original Transformer, there are no architectural differences with it.
The real novelty of BERT came from the training algorithm. Instead of the conventional language
modeling task, BERT is trained on the masked language modeling task, which consists of the
prediction of the tokens that have been randomly masked in the input (i.e., replaced with the
special token [MASK]). The authors found that, empirically, masking 15% of the tokens was the optimal
setting (the model did not learn properly with more masks than that). Similarly to conventional
language modeling, this task might not be particularly useful on its own, but it is a surrogate task
that happens to be efficient for pre-training representations13 . BERT introduces an additional
12 They even re-used the same vocabulary of GPT-2. This time, though, they implemented two modifications for dealing with the massive scale of the model. First, they use alternating vanilla Transformer-decoder blocks and sparse ones, as in [35]. Second, they use a floating point precision of 16 bits (instead of 32).
13 The same way Word2Vec’s [3] skip-gram training objective, the one of predicting the words in the context of another one, is useless on its own, but results in efficient word embedding learning.
14 I.e., detecting whether one sentence is logically coherent with the other, whether there is a contradiction, or, alternatively, whether the two sentences are completely unrelated.
pre-training objective. Each example is composed of two sentences. Since the dataset is a
document-level corpus, these two sentences can be sampled such that 50% of the time the second
one is the sentence actually following the first one, and 50% of the time it is a random sentence.
This is predicted from the final representation of the special token [CLS], which is supposed to
learn to extract sentence-level information (i.e., a sentence embedding). The separation between
the two sentences is indicated with the special token [SEP], which can also be used in downstream
tasks implying two sentences, such as text entailment14 . Segment embeddings are also summed
to help BERT identify which token belongs to which sentence (sentence A vs. sentence B).
Figure 7 depicts a simplified schema of BERT’s pre-training strategy.

Figure 7: BERT pre-training objectives: Two sentences are input to the model, a Transformer
encoder, separated with the special token [SEP]. A random subset (with a proportion of 0.15)
of the tokens is masked. BERT is trained to predict the masked tokens (the other ones do not
need to be predicted). Apart from that, it has to predict whether the second sentence
actually goes after the first one, or it is a random sentence, with a logistic regression from the
last representation of the [CLS] token, which is injected together with the other ones. In this
case, BERT must predict 1 since the second sentence is the one that actually goes after the first
one, not a random one. In the example sentences, words are not split into subwords for the sake
of clarity. Source: Own elaboration.
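A minimal sketch of the masked language modeling corruption (simplified: the actual BERT recipe additionally replaces a fraction of the selected tokens with random tokens or leaves them unchanged; the ids below are made up):

```python
import torch

vocab_size, mask_id = 30000, 4                   # made-up vocabulary size and [MASK] id
tokens = torch.randint(10, vocab_size, (2, 16))  # a toy batch of token ids

selected = torch.rand(tokens.shape) < 0.15       # ~15% of the positions
inputs = tokens.clone()
inputs[selected] = mask_id                       # corrupt the input with [MASK]

labels = torch.full_like(tokens, -100)           # -100 is ignored by cross-entropy,
labels[selected] = tokens[selected]              # so only masked positions are predicted

print(inputs[0])
print(labels[0])
```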

After having been pre-trained on large corpora, BERT can then be used in different downstream
tasks. As usual in deep learning, transfer learning can be applied by using the model as a
feature extractor (leaving the weights frozen), or fine-tuning (updating the weights), usually
attaching a linear layer as the classifier or regressor. For token-level tasks (e.g., Named Entity
Recognition and Classification, NERC), the representations of the last encoder block are used.
Instead, for sentence-level tasks, the representation of the [CLS] token is used. The main BERT can
also be used for generation, but in a cumbersome way (by recursively demasking), and it is not
the intended use-case. The representations of BERT are said to be bidirectional since the model
uses the encoder (instead of the decoder, autoregressive), so each contextual token representation
depends on both the tokens from its left and right, instead of only the ones from the left. In
addition, even in inference all contextual embeddings can be computed in parallel (for each
encoder block). On the other side of the coin of this trade-off, as we said, BERT is not especially
suitable for language generation tasks.
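A minimal sketch of the two transfer modes described above, with a randomly initialized stand-in playing the role of the pre-trained encoder and the [CLS] vector assumed to sit at position 0 (everything here is illustrative):

```python
import torch
import torch.nn as nn

d_model, num_classes = 768, 3
encoder = nn.TransformerEncoder(                       # stand-in for a pre-trained BERT
    nn.TransformerEncoderLayer(d_model, nhead=12, batch_first=True), num_layers=2)
classifier = nn.Linear(d_model, num_classes)           # task-specific head

# Feature extraction: freeze the pre-trained weights and train only the head.
# (For full fine-tuning, simply leave requires_grad set to True on the encoder.)
for p in encoder.parameters():
    p.requires_grad = False

embedded = torch.randn(4, 32, d_model)                 # toy batch of embedded tokens
hidden = encoder(embedded)                             # (batch, seq_len, d_model)
cls_vector = hidden[:, 0]                              # representation of the [CLS] token
logits = classifier(cls_vector)                        # sentence-level prediction
print(logits.shape)                                    # torch.Size([4, 3])
```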
BERT improved the state-of-the-art in numerous NLU benchmarks. Not only that, but the
authors claimed BERT to be flexible enough to be transferable to the downstream task of choice
of the user, at least to some extent. The authors also released different versions of BERT, which
set a sort of standard for the models based on it:

• Cased vs uncased: The authors released both cased and uncased models.

• English vs multilingual: For the multilingual BERT, the authors just concatenated data
from different languages. In this case, there is one additional process of transfer learning,
from languages with more data to low-resource languages, even to languages that were not
seen in training.

• Base vs Large: The BERT architecture can, of course, be instantiated in arbitrary sizes, as
long as the shapes fit (e.g., the embedding dimension, which in the Transformer is set to be
the same for all blocks, must be a multiple of the number of heads, since for merging back
the outputs of each head they must be concatenated and projected back to the embedding
size). However, the archetypal sizes established in BERT are typically used as a reference.
The base model has 12 layers, 12 attention heads, and an embedding size of 768 (resulting
in 110 million parameters). The large model has 24 layers, 16 attention heads, and an
embedding size of 1024 (resulting in 340M parameters); a back-of-the-envelope parameter
count for the base configuration is sketched right after this list.
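As a rough sanity check on these figures, here is a back-of-the-envelope count for the base configuration, ignoring biases, layer norms, and the pooler, and assuming the original WordPiece vocabulary of roughly 30k entries and 512 positions:

```python
vocab, positions, segments = 30522, 512, 2
hidden, ffn, layers = 768, 3072, 12

embeddings = (vocab + positions + segments) * hidden   # token + position + segment tables
attention_per_layer = 4 * hidden * hidden              # Q, K, V and output projections
ffn_per_layer = 2 * hidden * ffn                       # the two feed-forward linear layers
total = embeddings + layers * (attention_per_layer + ffn_per_layer)
print(f"{total / 1e6:.1f}M parameters")                # ≈ 108.8M, close to the reported 110M
```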

Since BERT, many works based on it have proposed different modifications. One of the most
widely used ones15 , RoBERTa (Robustly Optimized BERT Pretraining Approach) [36] is essen-
tially BERT but without the next sentence prediction task and trained for longer and with more
data. The next sentence prediction did not appear to help in their ablation studies, and thus was
removed from the training procedure. Training for longer and with more data, instead, resulted
in a more robust model that outperformed the original BERT in different NLU benchmarks.
RoBERTa also removed the segment embeddings, since not training with the next sentence pre-
diction objective made them redundant. In addition, RoBERTa used dynamic masking (instead
of BERT’s static masking), meaning that each sentence could be masked in different ways each
time it was passed through the model during training. In RoBERTa, the batch size used was
even larger than in the case of BERT, which is thought to be beneficial. Note that other lan-
guage models might incorporate their own additional embeddings (e.g., XLM [37] sums language
embeddings).
There are many other works based on BERT16 . XLM [37] is an extension of BERT that makes
it explicitly multilingual17 such that cross-lingual representations are learned. It can be used for
15 In Section 3, we will see that many of the works that consisted of the development of a BERT-like model for a certain language or domain ended up choosing RoBERTa.
16 https://medium.com/@phylypo/a-survey-of-the-state-of-the-art-language-models-up-to-early-2020-aba824302c6
17 BERT had a multilingual version from the beginning, but all text is input to the model as in the monolingual version.

initializing an encoder of a machine translation system. ALBERT [38] is a lite version of BERT
that severely decreases memory usage by cross-layer parameter sharing (in a pseudo-recurrent
way, as in [39]) and a clever embedding matrix factorization. There is even BART [40], a model
which combines both BERT and GPT. In fact, it is exactly like the original Transformer, since
it has both an encoder and a decoder. It is trained to restore the original sentence, which is
corrupted (by different means, such as token permutation, document rotation...). The authors
refer to this pre-training objective as language denoising. Apart from the typical downstream
tasks such as token classification (to which the knowledge can be transferred using the last
contextual embedding of the last token in the decoder), since it is a sequence-to-sequence model,
it can be fine-tuned to perform summarization, or as a decoder in machine translation systems.
A multilingual version of this model, mBART [41], shows transfer learning capabilities from the
high-resource languages to the low-resource ones, and serves as a powerful initialization for a
machine translation system, if fine-tuned.
Another line of research related to BERT-like models consists of compressing the models, which
are usually huge in terms of parameters and required computation. One possibility is distillation
[42], in which a smaller version of the original model is trained with the whole probability
distribution output by the original model (instead of only the actual targets). The other
possibility is applying either model quantization18 or pruning (for instance, removing some of
the attention heads, as in [43]). Note, however, that in these cases the training algorithm is
roughly equivalent to the original one. What changes is that, a posteriori, a compressed version
of the model is generated. Instead, ELECTRA [9] does so by modifying the training algorithm
itself, instead of compressing the model once it has been trained. In ELECTRA, an auxiliary
small BERT model is learned as usual. Then, the actual model is trained to predict whether a
given token comes from the original input or has been predicted from by the auxiliary model.
This pre-training objective turns out to be considerably more data and compute efficient the its
alternatives, since the model receives signal from all tokens (instead of only the masked ones),
and the projection layer just has to model a logistic regression. ELECTRA is reminiscent of,
and could even be considered to be to some extent, contrastive learning19 . Table 1 shows a
comparison of some of the aspects of several Transformer language models.
18 https://pytorch.org/docs/stable/quantization.html
19 To simplify, in contrastive learning, the model is trained to discriminate between instances being positive or negative examples of a certain property. See [44].
Model | Base architecture | Pre-training objective | Convenient for...
BERT | Transformer encoder | MLM + NSP | Discriminative tasks
RoBERTa | Transformer encoder | MLM | Discriminative tasks
GPT | Transformer decoder | LM | Generative tasks
BART | Full Transformer | LD | Seq-to-seq tasks
ALBERT | Universal Transformer encoder20 | MLM + SOP | Discriminative tasks
ELECTRA | Transformer encoder | MLD | Discriminative tasks

Table 1: Comparison between some of the Transformer language models: MLM means Masked
Language Modeling; NSP stands for Next Sentence Prediction; LM means Language Modeling;
LD stands for Language Denoising, which in turn consists of different corruptions (e.g., token
permutation, document rotation,...); SOP means Sentence Order Prediction; MLD means Masked
Language Discrimination. Notice that most models are based on the Transformer encoder, while
the ecosystem of decoder or full Transformer models is considerably less populated. Note that
all models benefit from document-level data. First, because most pre-training objectives require
a notion of order in sentences. Second, because even the ones that only use masked language
modeling, benefit from being able to learn on longer context windows. Regarding the last column,
by convenient we mean that the model is especially adequate for the corresponding task, but it
does not necessarily mean that it cannot perform the other ones. For instance, since BART is
a full Transformer, and it is trained to restore the original sequence from a corrupted version
of it, it transfers well to sequence-to-sequence tasks such as summarization, although it is still
considerably competitive (even if not state-of-the-art) in token classification tasks.

2.2.3 Evaluation

Language modeling has an intrinsic metric that can be computed without the need of any anno-
tations, namely, perplexity:
\mathrm{PPL}(X) = \exp\left(-\frac{1}{t}\sum_{i=1}^{t}\log p_\theta(x_i \mid x_{<i})\right)

It measures the ability of the language model to predict words given the previous ones, which
is a direct measure of the quality of the language model. The lower the perplexity, the better.
Perplexity can be used for monitoring the training of the model, early stopping, or model selec-
tion, and even for the evaluation itself. Nevertheless, we must take into account some important
considerations. Perplexities of language models with different vocabularies (arising from differ-
ent tokenization or vocabulary seen in training) are not directly comparable. More precisely,
perplexities of different systems are not comparable if the denominator depends on the segmen-
tation. That is, perplexities per predicted token may not be comparable, but perplexities per character (or per word, if an exogenous criterion for determining words is imposed) are [46]. If
the tokenization is imposed by the benchmark, perplexity can be used as a proper evaluation
metric, such as in the Wikitext-103 benchmark [47]. Second, for computing the perplexity of
fixed-size models correctly, one must apply a sliding-window strategy21 . For instance, GPT-2
20 Cross-layer parameter sharing as a sort of pseudo-recurrence, as in [45].
21 https://huggingface.co/transformers/perplexity.html
has a maximum context window of 1024. Then, we cannot directly compute pθ (xi |x<i ) when t
is greater than 1024. Instead, we must break the sequence into sub-sequences with length of the
size of the model maximum context window.
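To make the sliding-window strategy concrete, the following is a minimal sketch of how perplexity can be computed with a strided window using the Huggingface transformers library; the checkpoint name, window size, and stride are illustrative assumptions, and the token count is approximate (as in the library's own documentation on perplexity).

    # Minimal sketch: strided sliding-window perplexity for a fixed-context causal LM.
    import math
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_name = "gpt2"  # illustrative choice of a fixed-context model
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name)
    model.eval()

    def perplexity(text, max_length=1024, stride=512):
        encodings = tokenizer(text, return_tensors="pt")
        seq_len = encodings.input_ids.size(1)
        nll_sum, n_tokens, prev_end = 0.0, 0, 0
        for begin in range(0, seq_len, stride):
            end = min(begin + max_length, seq_len)
            trg_len = end - prev_end  # tokens actually scored in this window
            input_ids = encodings.input_ids[:, begin:end]
            target_ids = input_ids.clone()
            target_ids[:, :-trg_len] = -100  # ignore the overlapping context tokens
            with torch.no_grad():
                loss = model(input_ids, labels=target_ids).loss  # mean NLL over scored tokens
            nll_sum += loss.item() * trg_len
            n_tokens += trg_len
            prev_end = end
            if end == seq_len:
                break
        return math.exp(nll_sum / n_tokens)

A larger stride re-uses less context and is faster, at the cost of a slightly pessimistic perplexity estimate.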
Ideally, for a complete evaluation of a model, it should be passed through a series of bench-
marks for extrinsic evaluation. There are several supervised benchmarks that assess the Natural
Language Understanding capabilities, many of them grouped in the General Language Under-
standing Evaluation (GLUE [48]) 22 . GLUE is composed of different tasks, including Semantic
Text Similarity [49], Text Entailment23, and others. Since state-of-the-art models now score extremely high on this benchmark, a new, more challenging version of it, SuperGLUE [50], was recently introduced. The original GLUE dates back to 2018, so notice how fast the benchmark has become somewhat obsolete, a clear indication of the pace of NLP progress. Generative
and sequence-to-sequence models are also evaluated in tasks such as summarization or machine
translation. Encoder models can also be evaluated in terms of their usefulness as initialization
for an encoder in a machine translation system. On the other hand, SQuAD [51] is a question-answering dataset of more than 100,000 questions, and it is also typically used for evaluating
these models.
In the case of multilingual language models, there are benchmarks for evaluating cross-lingual
representations [52]. Models for languages other than English obviously need benchmarks in the corresponding language, which are expensive and difficult to obtain. Recently, a GLUE-like
benchmark for French was released [53]. In the case of the Finnish BERT [54], for instance, they
used classical Finnish datasets (such as a Part-Of-Speech dataset). Another potential source of
benchmarks is translating English evaluation datasets (with the help of machine translation).
Likewise, models targeting specific domains will require specialized benchmarks. For instance,
BioBERT [18] was evaluated on biomedical NERC tasks. NukeBERT24 [55], a BERT model for
the nuclear physics domain, was evaluated on a specific benchmark for nuclear physics question
answering, NQUAD.
The methodology for evaluation is of vital importance. First of all, we note that there are no
standard guidelines for training data decontamination from sentences used in the evaluation,
as discussed in Section 4 in [11]. If we are training a computer vision model on ImageNet, we
will use the train set, and then evaluate on the standard test set. But if only the benchmark is
standard, and in the case of the pre-training data, one is encouraged (and for a good reason)
to use as much data as possible, with millions and millions of documents coming from massive
crawlings and book dumps, it is not unimaginable that some evaluation sentences might leak. In the case of GPT-3, the authors admit that there was a bug in the script that was supposed
to decontaminate the pre-training data, and this affected at least some of the benchmarks used
in the article.
Finally, regarding fairness in model evaluation, comparing models trained with different amounts
of data is perfectly fair as long as what is evaluated is the model itself, not the training algorithm
or the architecture. When comparing models in a given discriminative downstream task, the usual
approach is adding a linear layer (the classifier itself) and fine-tuning the model. Obviously, the
compared models must be evaluated using the same transfer learning strategy (e.g., not fine-
tuning the proposed model and then just using the baseline as feature extractor).
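As an illustration of this transfer learning strategy, the following is a minimal sketch using the Huggingface transformers library; the checkpoint name and the number of labels are illustrative assumptions. Freezing the encoder corresponds to the feature-extractor baseline mentioned above, while leaving all parameters trainable corresponds to full fine-tuning.

    # Minimal sketch: pre-trained encoder plus a randomly initialized linear classifier.
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    checkpoint = "bert-base-cased"  # illustrative checkpoint
    tokenizer = AutoTokenizer.from_pretrained(checkpoint)
    model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)

    # Feature-extractor baseline: freeze the encoder and train only the classification head.
    # For full fine-tuning, simply skip this loop.
    for param in model.base_model.parameters():
        param.requires_grad = False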
22 https://gluebenchmark.com/leaderboard
23 https://demo.allennlp.org/textual-entailment
24 Yes, that is a legitimate BERT.
2.3 Unsupervised machine translation
As we said, machine translation systems can also benefit from unsupervised signals (and, there-
fore, monolingual corpora). The mapping between the embedding spaces of the source and target languages can be learned either off-line or jointly.

2.3.1 Off-line mapping

In off-line mapping learning, we have two different spaces that have already been learned and
are fixed, and we want to learn a function that maps from one space into the other.
Vecmap [56–59] is an algorithm for learning cross-lingual word embedding mappings. It can run
with different degrees of supervision, including a fully unsupervised mode. For learning the mapping without supervision, Vecmap starts from the assumption that the two embedding spaces are
roughly isomorphic, which is not unrealistic taking into account that both spaces refer to nat-
ural languages. However, this assumption will hold less strongly in the case of very dissimilar
languages or domains.
Let X and Z be the embedding matrices in two different languages (each row holds the embedding vector of a given word), without any alignment between them. The goal of Vecmap is to learn W_X and W_Z such that XW_X and ZW_Z lie in the same embedding space. In addition, Vecmap also learns a dictionary D, in which D_ij = 1 iff the i-th word in X is the translation of the j-th word in Z. Unsupervised Vecmap consists of the following steps:

1. Embedding normalization: As a preprocessing step, Vecmap normalizes the embeddings according to their length and mean-centers each dimension. Apart from easing the learning process, this normalization guarantees that the final embeddings will have unit length.

2. Initialization: Firstly, we compute the similarity matrices (i.e., the Gram matrix XX^T)
of each embedding matrix. Because of the isomorphism assumption, one similarity matrix
should be equal to the other, if some (unknown) row and column permutations were applied.
Since trying all the combinations is not feasible, Vecmap sorts the rows of each similarity
matrix. If X and Z were strictly isomorphic, nearest neighbour matching would be enough
for building the mapping.

3. Robust self-learning: Starting from this initialization, Vecmap iteratively applies two steps,
until convergence:

(a) Compute the orthogonal mapping that maximizes the similarities for D.
(b) Update D using nearest neighbours matching.

To us, this mechanism is reminiscent of K-means; a minimal sketch of this self-learning loop is given after this list.

4. Symmetric re-weighting: Re-weight both embedding matrices according to the cross-correlation.

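The following NumPy sketch illustrates the core of the self-learning loop (steps 3a and 3b) under simplifying assumptions: X and Z are length-normalized embedding matrices, the initial dictionary D is given as pairs of row indices, and refinements of the actual Vecmap implementation (CSLS-style retrieval, stochastic dictionary induction, frequency cut-offs, re-weighting) are omitted.

    # Simplified sketch of the self-learning loop: alternate between solving the orthogonal
    # mapping for the current dictionary (orthogonal Procrustes) and updating the dictionary
    # by nearest-neighbour matching in the mapped space.
    import numpy as np

    def self_learning(X, Z, D, n_iters=10):
        W = np.eye(X.shape[1])
        for _ in range(n_iters):
            # (a) Orthogonal mapping maximizing the similarities for the current dictionary D.
            src_idx, trg_idx = D[:, 0], D[:, 1]
            U, _, Vt = np.linalg.svd(X[src_idx].T @ Z[trg_idx])
            W = U @ Vt  # maps the space of X into the space of Z
            # (b) Update D with nearest-neighbour matching.
            sims = (X @ W) @ Z.T
            D = np.stack([np.arange(X.shape[0]), sims.argmax(axis=1)], axis=1)
        return W, D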
Apart from its direct application to bilingual lexicon extraction, Vecmap can be used to initialize
unsupervised statistical machine translation systems, as in [14]. Then, the translations of the
statistical system can be used as a seed to train an unsupervised neural machine translation
system [60], and are further refined.

Remarkably, one of the direct applications of the corpora generated with some of the tools
proposed in this work, has been precisely leveraging Vecmap, within the MT4All project.

2.3.2 Joint learning

Despite the success of some unsupervised off-line mappings, it has been shown that jointly
learning the cross-lingual mappings could be, in some cases, more effective than the off-line
setting. The pre-trained embeddings may be optimal on their own, but when a shared representation is required they may lead to local minima. By back-translation [13] we mean the generation
and use of synthetic parallel sentences. Say that we have a system for translating (not necessarily
well) from language A to language B. We then translate monolingual sentences of language A
to language B. The synthetic sentences can be used to train a translation system from B to
A. Back-translation can be applied starting from a supervisedly pre-trained system, to increase
the training data, but purely unsupervised approaches are also possible by using iterative back-
translation [15], alternatively generating synthetic data in both directions while the system is
iteratively improved.
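The following sketch makes the iterative procedure explicit; it is purely schematic and not tied to any specific toolkit, so the translation and training routines are passed in as placeholder callables.

    # Schematic sketch of iterative back-translation. mono_a and mono_b are monolingual
    # corpora; model_ab translates A -> B and model_ba translates B -> A.
    # translate(model, sentences) and train(model, src, tgt) are placeholders for the
    # actual decoding and training steps of whichever MT toolkit is used.
    def iterative_back_translation(model_ab, model_ba, mono_a, mono_b,
                                   translate, train, n_rounds=3):
        for _ in range(n_rounds):
            # A -> B translations provide synthetic sources for the B -> A direction
            # (the target side of the synthetic pairs is always genuine text).
            synth_b = translate(model_ab, mono_a)
            model_ba = train(model_ba, src=synth_b, tgt=mono_a)
            # ... and vice versa.
            synth_a = translate(model_ba, mono_b)
            model_ab = train(model_ab, src=synth_a, tgt=mono_b)
        return model_ab, model_ba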
In the case of machine translation, zero-shot translations (i.e. translation of unseen pairs in train-
ing) can be obtained by supervisedly training multilingual systems with shared representations
and language tags25 [61]. However, this approach still requires supervision for at least some pairs.
A recent work based on BART, mBART [41], applies the same model and training procedure of
BART but with multilingual (but not parallel) corpora, up-sampling the low-resource languages.
mBART can be used for unsupervised machine translation with iterative back-translation (so,
no supervision at all). For doing so, the system is constrained not to output subwords not be-
longing to the target language by masking the output probabilities of subwords with less than
1% occurrences. As we saw in Section 2.2.2, cross-lingual encoder language models such as XLM
can be used for initializing machine translation systems as well.
Regarding the evaluation, we know that in machine translation BLEU [62] is the most commonly
used metric. Some works in unsupervised MT tried to use an unsupervised BLEU (the one
obtained from translating into one language and then translating back to the original language,
and comparing with the original source) [15]. Nevertheless, this approach has the problem that
degenerate solutions (e.g. a system with the two directions, A → B and B → A, that always
translates to the same sentence) can obtain high BLEU scores. Thus, in practice, parallel sets
are used for evaluating unsupervised models.
Recently, though, some works have warned that assuming that no parallel corpora are available
but large monolingual corpora are may not be realistic in many real-life scenarios [63].

2.4 Tokenization, subwords, and vocabulary building


Ever since the advent of NLP with classical techniques, tokenization has been the first step
(assuming a clean and well-defined corpus) in building language technology systems [64].
Typically, either general-purpose, heuristic tokenizers, such as Moses tokenizer [65], or tokenizers
25 By language tag we mean a special token that tells the system to which language the input should be translated (e.g., <2EN>).
especially built for a given language or domain, such as Stanford’s tokenizer for Arabic26, have been used.

The emergence of Transformers has coincided with the partial27 abandonment of these classical tokenizers. The reason is that the newer, subword-based tokenizers have been shown to generally improve the results, not that the Transformer architecture has any special limitation or need. In fact, the Transformer architecture can work with any discrete sequence (be it characters, words, subwords, or even pixels or audio fragments [66]).
Subword tokenization was invented just a bit before Transformers. Interestingly enough, just like
Transformers, the original motivation of subword tokenization was precisely machine translation.
Specifically, the authors of the seminal work in this regard aimed (successfully) at improving the performance of machine translation systems on rare words [67].
The proposed method, Byte Pair Encoding (BPE), is based on a compression algorithm with
the same name [68]. The original algorithm compresses data by recursively replacing the most
frequent pair of bytes with a new symbol (an unused byte). In the case of machine translation,
the idea was to do so with characters instead of bytes. For instance, BPE could encode the word ’internationalization’ as ’inter## national## ization’, with ’#’ being the symbol to denote subword separation, according to the relative frequencies in the training set. This might (or might not) correlate with morphological features, but note that it is purely based on character counts
and, therefore, language and domain-agnostic. Ever since, some variants of this algorithm have
been proposed. For instance, as in the BPE compression algorithm, researchers have proposed
to use bytes as the unit to encode, instead of characters [69].
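To make the merge-learning procedure concrete, the following is a toy sketch in the spirit of the original BPE algorithm; the tiny word-frequency dictionary and the number of merges are illustrative, and real implementations work on full corpora and store the learned merge operations.

    # Toy sketch of BPE merge learning: start from character-level symbols and repeatedly
    # merge the most frequent adjacent pair.
    import re
    from collections import Counter

    def get_pair_stats(vocab):
        pairs = Counter()
        for word, freq in vocab.items():
            symbols = word.split()
            for i in range(len(symbols) - 1):
                pairs[(symbols[i], symbols[i + 1])] += freq
        return pairs

    def merge_pair(pair, vocab):
        bigram = re.escape(" ".join(pair))
        pattern = re.compile(r"(?<!\S)" + bigram + r"(?!\S)")  # match whole symbols only
        merged = "".join(pair)
        return {pattern.sub(merged, word): freq for word, freq in vocab.items()}

    # Words are represented as space-separated symbols, with an end-of-word marker.
    vocab = {"l o w </w>": 5, "l o w e r </w>": 2, "n e w e s t </w>": 6, "w i d e s t </w>": 3}
    for _ in range(10):  # the number of merge operations is a hyperparameter
        pairs = get_pair_stats(vocab)
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        vocab = merge_pair(best, vocab)
        print(best)  # each printed pair is one learned merge operation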
Machine translation systems, like language models, suffer from the open vocabulary problem
(i.e., in inference an arbitrary number of new words unseen in training may appear), but at the
same time, a character-level tokenization may be less efficient and more difficult to learn (due to having longer sequences). As a middle ground, taking the best of both worlds, BPE will build a given word from the subwords in its vocabulary. Otherwise, it can still build it character-by-
character, if no subwords correspond to the word, mostly solving the open vocabulary problem
(instead of assigning a special <UNKNOWN> token to unseen words, or words in the train set below
a certain frequency threshold).
BPE is learned from the training set. Once it has built the vocabulary from it, it freezes it and
applies the same tokenization to both validation and test sets. Recovering the initial text is as
easy as replacing the BPE special characters (followed by a space) by an empty string.
BPE, though, introduces another hyperparameter, the number of operations. The more merges,
the larger the vocabulary 28 . There are studies on the effect of the vocabulary size [70], and one
of their main takeaways is that especially in low-resource scenarios, it may be sensible to set a
low number of BPE operations.
Vanilla BPE and some of its variants, such as BPE dropout [71] (a regularization technique
consisting in making the process of merging tokens or characters stochastic), still assume a pre-
tokenization by a classical tokenizer. Instead, more recent alternatives such as sentencepiece [72]
assume no previous tokenization and rely completely on statistics from the train corpus. The
authors of sentencepiece show that their purely language-agnostic approach outperforms other alternatives. Apart from a BPE-based algorithm, they experiment with a unigram-based alternative.
26 https://nlp.stanford.edu/nlp/javadoc/javanlp/edu/stanford/nlp/international/arabic/process/ArabicTokenizer.html
27 Note that they are still extensively used, in many cases in conjunction with subword tokenizers.
28 Specifically, |V| = #ops + |Characters|
Wordpiece [73] is another frequency-based tokenization algorithm by Google but, unlike the other systems, it is not open-source and fewer details are known about it. Table 2 shows a comparison of some
of the aspects of several existing tokenizers.
Despite their name, most BPE tokenizers do not actually operate in the byte space. They
typically operate in the character or word (if the text has been pre-tokenized) space. In [74],
they introduced a Byte-level BPE (BBPE), which indeed operates in the byte space. This has
the effect of literally eliminating the out-of-vocabulary problem, since the vocabularies built with
this technique start from the possible 256 values of bytes, and then, as in BPE, recursively apply
symbol merging while keeping the symbols from the previous operations. Thus, this vocabulary
can represent any Unicode string, even if byte by byte in case of characters not seen in training.
In SentencePiece and Wordpiece, for instance, characters unseen in training will still be replaced
with the [UNK] (unknown) token.
Since tokenization has to be applied to large corpora, at least in the case of language models, some
implementations focused on efficiency have been proposed. For instance, FastBPE29 is a C++
implementation of the original BPE tokenizer. The tokenizers library, released by Huggingface,
provides a fast30 implementation of several subword tokenizers.
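As an example of how little is needed to build a subword vocabulary directly from raw text, the following is a minimal sketch with recent versions of the sentencepiece Python package; the file names, vocabulary size, and model type are illustrative assumptions.

    # Minimal sketch: train a SentencePiece model on a raw-text corpus (no pre-tokenization)
    # and use it to segment text into subwords.
    import sentencepiece as spm

    spm.SentencePieceTrainer.train(
        input="corpus.txt",       # one sentence or document per line (illustrative path)
        model_prefix="subwords",  # produces subwords.model and subwords.vocab
        vocab_size=32000,
        model_type="bpe",         # "unigram" is the other common choice
        character_coverage=1.0,
    )

    sp = spm.SentencePieceProcessor(model_file="subwords.model")
    print(sp.encode("internationalization", out_type=str))
    # e.g., ['▁intern', 'ation', 'al', 'ization'] (the actual split depends on the corpus)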
At the end of the day, the choice of the tokenization strategy to use is another set of hyperpa-
rameters when building language models, but it is a very relevant aspect that can hardly31 be
changed after training. For instance, one can fine-tune a pre-trained model on a given dataset if
the original one did not have enough data from a given language or domain, but if the vocabulary
was built with text from a very different language or domain, the model may struggle even after
fine-tuning.
The data that have been used for building the vocabulary should be as representative as possible
of those that will be seen during inference. This will be relevant in the comparison of two domain-
specific BERT-based systems, BioBERT and SciBERT, that we will see in Section 3.1. For the
same reason, in the multilingual model mBART [41] the documents of languages with less data
were over-sampled when building the vocabulary, solving a problem (the one of underrepresented
languages in the vocabularies of multilingual language models) that was already known since at
least the publication of Multilingual BERT.
Table 2 shows a comparison of some of the available tokenizers, while Table 3 gives more details
on the vocabulary building of some of the models we saw.

Algorithm | Needs pre-tokenization | Fast implementation | Example of model
BPE | Yes | Rust, C++ | GPT
Byte-level BPE | Yes | Rust | GPT-2
WordPiece | Yes (unknown) | Unknown | BERT
SentencePiece | No | No (Python) | CamemBERT

Table 2: Comparison between some of the available tokenizers.


29 https://github.com/glample/fastBPE
30 Thanks, among other reasons, to being based on the Rust programming language (https://www.rust-lang.org/).
31 Recently, some possible workarounds have been proposed. See [75] and [76].
Model | Pre-tokenizer | Tokenizer | Vocab. size | Multilingual | Lang. oversampling
GPT | Spacy32 | BPE | 40,478 | No | -
GPT-2 | Spacy | Byte-level BPE | 50,257 | Yes | No
RoBERTa33 | Spacy | Byte-level BPE | 50,257 | Yes | No
M-BERT | No | Byte-level BPE | 110,000 | Yes | No
mBART | No | SentencePiece | 40,000 | Yes | Yes

Table 3: Comparison of the tokenization and vocabulary settings of some of the models discussed.

2.5 Corpora generation and processing methodologies


Most if not all Transformer language models benefit from having document-level34 corpora. In
fact, most of the different pre-training objectives are not even possible without this level of granularity (e.g., one cannot apply the next sentence prediction task with a sentence-level corpus).
Even in the case of models with training objectives that do not necessarily require document-level
corpus, such as language modeling (GPT) or masked language modeling (RoBERTa), they can
still benefit from document-level corpora for modeling long-range dependencies. Needless to say,
a document-level corpus can always be transformed into a sentence-level one, but not the other
way around.
Unfortunately, most available corpora are sentence-level. In some cases, corpora that could be
document-level have to be shuffled for privacy or copyright concerns. For instance, OSCAR [77], a large repository of monolingual corpora for many languages, is, by default, provided with the sentences shuffled. One might contact the authors and ask them for the document-level version, but it is not openly available otherwise. In addition, even if OSCAR itself is multilingual, many
languages and domains are generally underrepresented.
There are certain methodologies common in pipelines for corpus generation, including language
detection and document or sentence deduplication. Onion [78] is a renowned tool for document-
level deduplication. Several language identifiers exist, each of them with different trade-offs, such as Fasttext's [79] language identifier35. Data should be cleaned enough to remove useless noise, but not so aggressively that the model becomes less robust to certain perturbations in inference (e.g.,
misspellings) [54]. Ideally, some tracing over the used data should be kept (i.e., basic metadata),
and decontamination of targeted evaluation benchmarks should be guaranteed, even if this last
point is not yet well-established [74] [11].
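As a simple illustration of the language identification step mentioned above, the following sketch uses fastText's pre-trained identifier; the model file must be downloaded beforehand, and the target language and confidence threshold are illustrative assumptions.

    # Minimal sketch: keep a document only if the fastText language identifier assigns
    # the target language with sufficient confidence.
    import fasttext

    lid = fasttext.load_model("lid.176.bin")  # pre-trained identifier, downloaded beforehand

    def keep_if_language(text, target="ca", threshold=0.7):
        labels, probs = lid.predict(text.replace("\n", " "))  # predict() rejects newlines
        lang = labels[0].replace("__label__", "")
        return lang == target and probs[0] >= threshold

    print(keep_if_language("Això és una frase en català."))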
High-Performance Computing clusters can be leveraged for processing big quantities of data.
In [74], Spark36 , a set of tools for big data analytics, was used for preprocessing, but raw libraries
for parallelism and distributed computing can be used as well, such as OpenMPI37 . In Section
3, we will see in more detail existing cleaning pipelines and available corpora.

32 https://spacy.io/
33 Re-used GPT-2 vocabulary.
34 By document-level we mean that each instance is a sequence of ordered sentences, such as a small paragraph or even a whole book.
35 https://fasttext.cc/blog/2017/10/02/blog-post.html
36 https://spark.apache.org/
37 https://www.open-mpi.org/
3 Related work

After having contextualized the required background, we now go through the specific works that
are most closely related to our proposal.

3.1 Domain-specific models


In principle, Transformer language models are supposed to be general enough that they can transfer knowledge to numerous tasks. For a given downstream task, the user is supposed
to take a pre-trained model and fine-tune it for the targeted dataset.
Nevertheless, a number of domain-specific models have been released, with the motivation of
improving the performance in certain kinds of text. Two main approaches for doing so have
been proposed:

• Fine-tuning an existing model: Notice that we mean re-training with unlabelled data with
the same pre-training objective of the model (but starting with the pre-trained weights),
not fine-tuning to the desired downstream task. BioBERT [18] is the canonical example of
this approach. It is based on BERT-Base (not only the architecture itself, but the weights
as well), and further pre-trained on 18B tokens from the biomedical domain, extracted from PubMed38 abstracts and full articles from PubMed Central (PMC39).
BioBERT clearly outperformed BERT in different biomedical text mining benchmarks,
from NERC to relation extraction.

• Building a model from scratch: If enough data are available, one might start the pre-
training from scratch, as in the case of SciBERT [80]. SciBERT is also based on BERT-
base, but trained from the ground up. SciBERT was trained on scientific articles from
Semantic Scholar40 , with a total of 3.17B tokens, similar to the scale of the corpus of the
original BERT. SciBERT outperformed BioBERT in most benchmarks. The authors hypothesize that one of the reasons why this may be the case is the vocabulary. Crucially,
since BioBERT starts from the pre-trained weights of BERT, it has to re-use the same vo-
cabulary. Instead, SciBERT builds the vocabulary from scratch with in-domain text, using
SentencePiece. The overlap between the new vocabulary and that of the original BERT is 42%, meaning that the frequencies in the use of subwords are considerably different.

There are domain-specific BERTs, such as PatentBERT [81], which uses the same approach
as BioBERT but in the domain of patents. The general idea is that there may be a certain point beyond which further pre-training BERT on in-domain (unlabelled) data, which is expensive, is worth the cost. Furthermore, there may be another threshold beyond which it is worth pre-training from scratch
(rather than fine-tuning). Figure 8 shows a schematic view of the two approaches.
Another example of domain-specific BERTs is the one specifically fine-tuned on the domain of
nuclear physics [55], a challenging, low-resource domain. The authors had to build a corpus by
38 https://pubmed.ncbi.nlm.nih.gov/
39 https://www.ncbi.nlm.nih.gov/pmc/
40 https://www.semanticscholar.org/
Figure 8: Further in-domain pre-training vs. training from scratch: The two main approaches
for building domain-specific language models are the following ones: (i) Further in-domain pre-
training (above): Take a language model pre-trained in the general domain, re-using its vocabulary,
and continue pre-training it in the in-domain (but unlabelled) corpus. Finally, for applying the
model to the desired downstream task, the user can fine-tune it using the corresponding annotated
dataset. This approach has the advantage of leveraging all the pre-trained knowledge of the
language model trained in the general domain, but the disadvantage of having a vocabulary not
adapted to the desired domain. The paradigmatic example is the BioBERT model. (ii) Training
from scratch: Instead, if enough data are available, one can directly pre-train the model on
the in-domain (but unlabelled) data. This approach has the advantage of having a vocabulary
especially built for the domain of choice, but potentially less data. The paradigmatic example is
the SciBERT model. Source: Own elaboration.

developing a text processing pipeline for retrieving tokens from PDFs with Optical Character
Recognition, sentence splitting, and language detection with NLTK41 , but they only obtain
8 million tokens, several orders of magnitude below the number of tokens used to train the
original BERT. For this reason, they take the BioBERT approach (fine-tuning), but they try to
41 https://www.nltk.org/
circumvent the problem of the vocabulary by adding around 100 domain words using BERT’s
UNUSED tokens (BERT reserved around 100 tokens for future usages).
Related to the issue of the vocabulary when using a model in a domain sufficiently different
from the one it was trained on, but in the case of languages, [75] showed that a BERT model can be trained on a given language (e.g., English) and its vocabulary then replaced by the one of the target language (by re-initializing the embedding layer). The model is re-trained with all the
weights frozen, except the ones of the embedding layer. We observe that this finding could have
implications in future domain adaptation techniques (apart from the cross-lingual transfer use
case).

3.2 Language-specific models


Similarly to what we saw in the case of domain models, many works have suggested that a
language-specific model can outperform multilingual models. This was already known in the case
of the original BERT, in which the monolingual English BERT outperformed the multilingual
one in English downstream tasks. Nevertheless, it was not that clear for languages for which not
so many resources were available.
The first language-specific models targeted languages with the most resources (which does not
necessarily mean the ones with the most speakers). Google had already released a Chinese
BERT in the official repository of the original BERT. Two French BERT-like models were re-
leased almost simultaneously, developed in parallel by two different groups, FlauBERT [53] and
CamemBERT42 [82]. This time, they used RoBERTa, and, in fact, many language-specific mod-
els are based on this architecture, which seems to be a sensible default choice43 .
Regarding CamemBERT, the authors showed that preprocessing a web crawling for building
a language-specific model was more effective than relying on multilingual models (which typically rely on Wikipedia for non-English data). As tokenizer, they used SentencePiece with a vocabulary size of 32k. They used the French portion of OSCAR. Surprisingly enough, the version pre-trained on only44 4GB of OSCAR already outperformed the multilingual BERT, and, even more surprisingly, its performance was not that far from that of the model pre-trained on 130GB of OSCAR. According to the authors, the takeaway is that well-processed web crawls are considerably more useful than Wikipedia due to their diversity in terms of domains and
genres (while Wikipedia is mostly uniform in this regard). For evaluation, they used existing
classical NLP datasets in French, such as Part-Of-Speech tagging, dependency parsing, and
natural language inference (entailment) datasets, as well as NERC.
In the case of FlauBERT, they released both a French language model and a new evaluation
benchmark, FLUE45 , the French version of the GLUE benchmark. As far as the models are
42 The study of the naming of BERT-like models could, indeed, be a whole master thesis on its own.
43 RoBERTa is supposed to work better than BERT, while having the exact same architecture and a simplified pre-training algorithm. It has been battle-tested in many situations. If a given language does not have an established language-specific baseline, it probably makes more sense to first start with RoBERTa, and then investigate whether more exotic models are worth the investment. This has been the usual approach in most languages. In addition, if one wants to investigate whether having a language-specific model for a given language outperforms the multilingual baseline, and to see why this is the case, the more variables remain constant, the better. Training from the ground up a model substantially different from BERT, for instance, may complicate this analysis.
44 Notice that, being plain text, 4GB is still a considerable size.
45 https://github.com/getalp/Flaubert/tree/master/flue
concerned, the large version clearly outperformed its multilingual counterpart, and even Camem-
BERT. They relied as well on Common Crawl or OSCAR, but starting with more data (250GB
of text) and then cleaning them aggressively, ending up with 41 GB. Concatenated with other
corpora, this resulted in a dataset of 71 GB. It could be the case that the small improvements
in performance in the case of CamemBERT when increasing the data size from 4GB to 130 GB
were caused by the model not being big enough to take advantage of the increase, since they used
RoBERTa base instead of RoBERTa large (unlike FlauBERT). In this case, they used a pretok-
enizer, Moses, and then vanilla BPE, with a vocabulary size of 50k. FlauBERT was trained in
a French HPC center, not on GPUs from cloud providers.
In some other language-specific models, the authors leveraged the release of the language-specific model itself to, in addition, experiment with other, more general settings. For instance, an Italian
model named GilBERTo 46 was released, using the same approach as CamemBERT, with OSCAR
as the source of the corpus and a SentencePiece vocabulary of 32k tokens. But another Italian
model, UmBERTo, introduced whole word masking, which had been implemented in an updated
version of the original repository of BERT. Both the original BERT and RoBERTa randomly
masked 15% of the tokens, which happen to be subwords. In UmBERTo [83], the authors made
the masking apply to the whole word (even if this implied masking more than one consecutive
subword), which seemed to improve training, forcing the model to extract more information
(instead of, perhaps, guessing from local correlations). Works using SentencePiece, without any
pre-tokenization, can still do whole word masking, by using whitespaces as delimiters.
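The following simplified sketch illustrates the idea for a WordPiece-style vocabulary, where subword continuations are marked with '##': masking decisions are taken at the word level, so that all subwords of a selected word are masked together. It omits details of real implementations, such as the 80/10/10 replacement scheme and the per-sequence masking budget.

    # Simplified sketch of whole word masking over WordPiece-style tokens.
    import random

    def whole_word_mask(tokens, mask_prob=0.15, mask_token="[MASK]"):
        # Group indices of subwords belonging to the same word.
        words = []
        for i, tok in enumerate(tokens):
            if tok.startswith("##") and words:
                words[-1].append(i)
            else:
                words.append([i])
        masked = list(tokens)
        for word in words:
            if random.random() < mask_prob:
                for i in word:  # mask every subword of the selected word
                    masked[i] = mask_token
        return masked

    print(whole_word_mask(["the", "international", "##ization", "effort"], mask_prob=0.5))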
We are especially interested in the Finnish BERT47 [54]. It is one of the works that puts the most emphasis on data cleaning. Not only do they give importance to this factor, but they also document it and devote a large portion of their article to this matter, and we will see more about it in Section 3.3.
Regarding the vocabulary, again, they observed that the multilingual BERT tokenizer produced
too many tokens per Finnish word (in contrast to the case of English words). This tokenization is
thought to make the process of learning more difficult, especially since subwords end up being too
short, such that they are less linguistically interpretable. The authors used a Finnish pretokenizer
and then applied BPE, showing that with the new vocabularies considerably fewer subwords per word were generated. The Finnish BERT was trained in a Nordic HPC center.
Apart from documenting their cleaning of the Finnish text, which perhaps should not be literally
copied for other languages but at least be used as a source for inspiration48 , they went one step
forward and published Wikibert [84]. It is an open-source49 and generic pipeline for training
BERTs for many different languages. This pipeline is supposed to support all Indo-European
languages, and consists of the following steps:

1. Download Wikipedia dump.

2. Extract raw text from the Wikipedia dump.


46 https://github.com/idb-ita/GilBERTo
47 We are so interested that some members of TeMU-BSC, including myself, travelled to the middle of Norway to attend their seminar on practical tips on training BERT from scratch, in the Winter School of the Nordic Language Processing Laboratory: http://wiki.nlpl.eu/index.php/Community/training
48 For instance, for building the SVM classifiers, they leverage an existing corpus for Finnish, so this specific technique might not be directly usable in other languages. However, most proposed heuristics can be used in many Indo-European languages.
49 https://github.com/spyysalo/wiki-bert-pipeline
3. Download the corresponding UDPipe50 model.

4. Sentence splitting.

5. Filter documents with hand-written heuristics.

6. Sample sentences accordingly.

7. Basic pre-tokenization.

8. Create vocabulary and tokenize (with SentencePiece).

9. Create TF records51 .

The authors provide numerous pre-trained models with this pipeline, which works mostly out-
of-the-box (yet, not including the training scripts themselves) for many languages, provided they
have enough Wikipedia entries and a UDPipe model available. For instance, Catalan is one of
the compatible languages because it fulfills the two requirements. However, while this system
is an excellent choice for obtaining reasonable baselines, the resulting models will not have seen
data from crawlings (just Wikipedia). Unrestricted web crawlings are larger than Wikipedia,
and in the work of CamemBERT it is claimed that web crawlings are also more diverse in terms
of genre and domains than Wikipedia, which generally implies better generalization. In addition,
processing Wikipedia is not as challenging as doing so with crawlings (so it is unclear whether
this pipeline's preprocessing techniques would generalize to unrestricted crawlings). Finally, the
system is intended for using BERT’s original Tensorflow 1 repository, but many other libraries
and models have been introduced ever since.
Other relevant examples of successful language-specific models for languages with less resources
than English or French are the Basque BERT (BERTEus) [85], the Korean BERT [86], the
Estonian BERT [87], or the Dutch BERT [88]. In the case of the Basque BERT, the authors made an observation regarding tokenization in multilingual models, consistent with other views we
have seen. The tokenizer of the multilingual BERT, when applied to Basque (a language seen in
training, but underrepresented in the training set) tends to split words into shorter subwords that
happen to be less interpretable. The vocabulary built from Basque corpora results in subwords
that are closer to being linguistically interpretable, due to the relative frequencies. Table 4 shows
some examples of these tokenization artifacts.
Regarding Spanish, a Spanish BERT, named BETO [89], was recently released, and outperformed
its multilingual counterpart in a number of benchmarks. The architecture used, RoBERTa-base
but with the number of attention heads and embedding size of RoBERTa-large, was trained using
whole word masking. As far as the vocabulary is concerned, they used 31k subwords built with
SentencePiece. Regarding the data, they used an aggregation of different Spanish corpora with
almost no preprocessing [90]. We will see more about this preprocessing in Section 3.3.
50 UDPipe is an open-source trainable pipeline for tokenization, tagging, lemmatization, and dependency parsing. See https://github.com/ufal/udpipe
51 This pipeline assumes that the code of the original BERT, which is based on Tensorflow 1, will be used. The original repository works with TF records, a data format optimized for this library.
Word | M-BERT subword tokenization | Language-specific subword tokenization
etxerantz52 | et #xer #ant #z | etxera #ntz
medikuarenera53 | medi #kua #rene #ra | mediku #aren #era
valtiovarainministeri54 | valt #io #vara #in #minister #i | valtiovarain #ministeri
vaihtuu55 | vai #htuu | vaihtuu

Table 4: Tokenization artifacts when text in underrepresented languages (the same happens with
domains with different enough lexicon) is tokenized: When a multilingual tokenizer is applied
to underrepresented languages, it is common to see that a considerable number of subwords per
word is generated, and these subwords tend to be less interpretable. It is hypothesized that they
are more difficult to learn (since sequences become artificially longer). Source: Recollected from
the articles of the Basque and Finnish BERTs, respectively.

Model | Architecture | Hardware | Time | Batch | WWM56 | Data
CamemBERT | RoBERTa-base | 256 V100 | 1d | 8192 tok | Yes | 138GB
FlauBERT | RoBERTa-large57 | 128 V100 | 16d | 8192 tok | No | 71GB
BERTeus | BERT-base | N TPUs | 1M steps | 256 sent | Yes | 225M tok
Finnish BERT | BERT-base58 | 8 V100s | 1M steps | 1120-160 sent59 | No | 3.3B tok
BETO | RoBERTa-base | N TPUs | 2M steps | 2048-256 sent60 | Yes | 3B tok

Table 5: Comparison between some of the available language-specific models. The same units
are used whenever possible (depending on how the authors report them). For example, some
works report training time with time units (can be normalized to days, ”d”), but other ones
provide the number of training steps, instead; the same happens with batch size (tokens, ”tok”,
or sentences, ”sent”), and data size (file size, in gigabytes, ”GB”, or number of tokens, ”tok”).
Notice that BERT and RoBERTa use the same architecture, but a different pre-training task
(models with BERT use the additional task of next sentence prediction). For the sake of brevity,
we denote it with the same field, architecture. Note as well that all NVIDIA V100 mentioned in
this table, and in most of the works, are actually the 32GB version (instead of 16GB).

The original BERT provided cased and uncased models, as well as base and large architec-
tures. Most of the works, though, do not provide these 4 combinations, due to computational
constraints. For instance, in the case of the Finnish BERT, they provided uncased and cased
versions of the BERT-base architecture. Regarding model-size, if a given domain or language
does not have yet a specific baseline or does not have tons of data, it may make sense to start
52 "To the house" (Basque).
53 "To the doctor" (Basque). Notice the interpretability of the subwords: mediku ("doctor"), aren ("'s"), era ("to the").
54 "Finance minister" (Finnish). Again, notice that there are more subwords and that they are more meaningful.
55 "Exchange" (Finnish).
56 Whole Word Masking.
57 They also trained a RoBERTa-base version, in this case on 32 V100 for 410 hours.
58 Two versions with the same architecture: cased and uncased.
59 They use two batch sizes. The first one, for most of the training, with a maximum sequence length of 128. Then, with a maximum sequence length of 512, the batch size is decreased to fit in memory. This was thought to help training, but it is disputed and more recent models do not usually do it.
60 Same as in the case of the Finnish BERT.
with the base architecture. Regarding casing, even though the uncased versions might outper-
form their cased counterparts in specific benchmarks, cased models should be the priority, since
they generally perform better as reported in the literature61 .
Table 5 shows basic comparative information between some of the available language-specific
models in terms of the models themselves, while Table 6 does so in terms of their vocabulary,
including some domain-specific ones.

Model | Tokenization/vocabulary building | Vocabulary size
CamemBERT | SentencePiece | 32k
FlauBERT | Moses + BPE | 50k
BERTeus | WordPiece with regularization | 50k
Finnish BERT | BERT's basic tokenizer + SentencePiece | 50k
BETO | SentencePiece | 32k
BioBERT | BERT's vocabulary (BasicTokenizer + WordPiece) | 30k
SciBERT | Domain vocabulary with SentencePiece | 30k

Table 6: Comparison between the vocabulary building strategies of some of the language and
domain-specific models.

Encoder-based models are considerably more popular in the domain and language-specific scene,
and we observe a series of reasons why this is the case. First of all, they are more conve-
nient for fine-tuning for discriminative downstream tasks (i.e., token-level and sentence-level
classifications), which are the most common use cases. Second, the multilingual BERT is a well-
established baseline for numerous languages, against which language or domain-specific models
can be compared in NLU tasks. In the case of generative or sequence-to-sequence models, there are no such well-established multilingual baselines and tasks. Finally, obtaining useful representations
may be easier than generating coherent sentences, especially in languages or domains with less
resources. Nevertheless, we observe that it would be interesting to explore the development of
generative and sequence-to-sequence models for a variety of languages and domains. A recent
work introduced a French BART [91], being competitive with RoBERTa-based French models in
discriminative tasks, while gaining generative capabilities.
For a reference of the results of language-specific models, we recommend a recent study on the
matter [92].

3.3 Unlabelled corpora generation


For training the kind of models we saw, large corpora are required. Even without annotations,
compiling huge amounts of text and processing it can still be challenging.
FTFY (Fix That For You) [93] is a widely used Python library to fix several encoding problems,
as well as applying unicode normalization62 . Notably, it was used to clean the crawlings and
books GPT models were trained on [74], as well as in many other works. In these works, apparently,
61 It depends on the task and the model, but after reviewing numerous works (the ones cited in this master thesis), it seems that cased models are generally preferred. Uncasing might ease the learning process by decreasing the vocabulary size, but at the same time there is a loss of information.
62 https://unicode.org/reports/tr15/
they followed a preprocessing-lightweight approach. Apart from using FTFY for fixing encoding
errors, it seems that they did not apply many other preprocessing steps, for a number of reasons.
Specifically, these models are huge and may need less cleaning. In addition, they precisely wanted
to prove that their models do not need extensive preprocessing.
However, OpenAI never open-sourced the preprocessing scripts. Furthermore, they never released
the crawling-based datasets they built, such as the dataset referred to as WebText in their articles.
In addition, the books corpus is no longer publicly available in the original repository, and the
books2 dataset they mention was created by OpenAI and never released63 . In other words, we
are not really sure about the details of their preprocessing, and perhaps they applied some additional steps (e.g., in the WebText corpus, extracted from a crawling, we understand that they extract
the text itself from the HTML).
This encoding-related preprocessing can, indeed, fix certain problems of raw text, but it cannot
magically distinguish between natural and non-natural text, or between high-quality and low-
quality text. Furthermore, regardless of the quality, the user may be interested in a specific kind
of text (e.g., text in a specific language). In other words, FTFY and other encoding-related tools
are only one part of the story, albeit a crucial one.
As a successful example of end-to-end corpora generation process, Paracrawl [94] consists of
a set of large-scale parallel corpora for European languages, targeting sentence-level machine
translation. Interestingly, apart from the datasets themselves, they have released their crawling
and cleaning pipeline, Bitextor. This pipeline comprises several steps, namely 1. Crawling,
2. HTML preprocessing, normalization, and information augmentation, 3. Document alignment,
4. Sentence alignment within documents identified as parallel, and 5. Filtering of noise, deduplication, and output formatting. Bitextor was designed with scalability in mind, being able to deal with
large amounts of data, and it is HPC-friendly (e.g., it integrates with the SLURM64 job manager,
installed in many supercomputers and clusters). Bicleaner, a tool for filtering parallel data, is
integrated in this pipeline. Bicleaner ranked among the best toolkits for cleaning parallel corpora
at WMT18 [95]. As we will see, Paracrawl will serve us as inspiration, but their use-case (focus
on parallel corpora) is different from ours (monolingual/unlabelled corpora), even if at least
some of their components could be useful to our needs. We observe that:

1. Their focus on machine translation influenced the whole design of the pipeline towards the goal of obtaining parallel corpora, while a focus on monolingual text could probably lead to obtaining more text.

2. Their language identifiers65 did not work especially well in some of our use cases (e.g., Catalan, biomedical Spanish).

3. We needed more flexibility in terms of the applied filters and input and output formats.

4. We needed document-level deduplication, and we do care about document coherency.
Regarding monolingual corpora generation, the WaCky initiative [96], which dates back to 2009,
is worth mentioning. It was one of the first large-scale projects for building monolingual, un-
labelled corpora from the web. They produced 1B-token datasets for English, German, and
Italian, and shared the tools for doing so. At the time, this was considered to be large, but,
nowadays, this size would not be considered impressive taking into account that those languages
63 See https://github.com/pytorch/fairseq/issues/2947 in the case of RoBERTa and https://gist.github.com/alvations/4d2278e5a5fbcf2e07f49315c4ec1110 in the case of GPT. We were particularly aware that for certain languages and domains obtaining corpora for training language models could be problematic, but it seems that even for English this can be the case as well.
64 https://slurm.schedmd.com/documentation.html
65 Namely, Cld2 and Cld3: https://github.com/google/cld3
are not low-resource. A more recent (2012) approach based on the WaCky initiative introduced
a software toolkit that applies basic cleaning, simple connected text detection (by removing boil-
erplate), and deduplication [97]. The pipeline used66 is an impressive piece of engineering, even
if the fact that it is based on Pascal might make it less attractive to other developers nowadays.
For removing boilerplate, they use a simple neural network with hand-engineered features such
as the ratio of text characters vs. markup characters, or the ratio of uppercase vs. lowercase
characters. They generate document-level corpora, removing near duplicate documents.
In addition, there is CommonCrawl67 , which is a large-scale multilingual crawling. While it
can be used for extracting parallel sentences [98], perhaps it is better known for generating the
OSCAR corpus. OSCAR (”Open Super-large Crawled ALMAnaCH coRpus”) [99] [100] is a
multilingual (yet, not parallel) corpus precisely targeting models that benefit from unlabelled
text. Once the data are downloaded from the CommonCrawl repositories, the authors apply a
language identifier to organize the corpus into different files, one for each language. Specifically,
they use FastText’s [101] [102] language identifier68 . Thus, unlike Paracrawl, their scripts do
not discard sentences without a parallel counterpart. Like Paracrawl, though, they also open-
sourced their pipeline, GoClassy, which is considerably simpler than Paracrawl’s Bitextor. It
does not include the scripts for downloading the data, but they do include utilities for language
identification, as mentioned, and sentence-level deduplication. Sentence-level OSCAR is publicly
available, but for obtaining the document-level version, one must directly contact the authors.
Some of the language-specific models that have been published have released their code for
cleaning the data. However, in most cases the code was quite specific to their respective use
case. We highlight the case of BETO [89], the first Spanish BERT, for which the authors aggregated
different existing Spanish corpora and concatenated them using a very simple preprocessing
script. As an example, we observed that this script did not split sentences correctly (e.g., dots
in acronyms were detected as end of sentences), and all corpora were concatenated sentence-by-
sentence, thus missing document-level boundaries.
As said in Section 3.2, we find especially interesting the case of the Finnish BERT, since the
relevant paper gives more details about text cleaning than usual. They aggregated text from
different sources, namely, two news corpora, a corpus coming from a large forums website in
Finland, and unrestricted web crawls. The cleanup and filtering consisted of the following steps:

1. Removal of the header and tag material.

2. SVM-based removal of automatically generated text.

3. Aggressive filtering using language detection and hand-written heuristics (e.g., removing documents with too high a ratio of digits or of non-Finnish alphabet characters; see the sketch after this list).

4. SVM-based removal of morpho-syntactically badly formed sentences.
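The sketch below illustrates hand-written heuristics in the spirit of step 3; the thresholds and the (incomplete) target alphabet are illustrative assumptions, not the filters actually used for the Finnish BERT.

    # Illustrative rule-based document filter: discard documents with too high a ratio of
    # digits or too low a ratio of characters from the target alphabet.
    TARGET_ALPHABET = set("abcdefghijklmnopqrstuvwxyzåäö .,")

    def keep_document(doc, max_digit_ratio=0.1, min_alpha_ratio=0.8):
        text = doc.lower()
        if not text:
            return False
        digit_ratio = sum(c.isdigit() for c in text) / len(text)
        alpha_ratio = sum(c in TARGET_ALPHABET for c in text) / len(text)
        return digit_ratio <= max_digit_ratio and alpha_ratio >= min_alpha_ratio

    docs = ["Tämä on tavallinen suomenkielinen kappale.", "404 404 404 $$$ ###"]
    print([keep_document(d) for d in docs])  # [True, False]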


66 https://github.com/rsling/texrex
67 https://commoncrawl.org/
68 https://fasttext.cc/docs/en/language-identification.html
Work | Data source | Data size before cleaning | Data size after cleaning (%)
Finnish BERT | Crawlings | 13.5B tok | 3.3B tok (22.76%)
FlauBERT | CommonCrawl | 215GB | 43.4GB (20.19%)

Table 7: Results of two of the cleaning strategies studied: We observe considerably aggressive
strategies when cleaning crawlings, resulting in large reductions in corpus size. Sizes are provided
in different units due to the different criteria used to report them in the original works, but relative
sizes are also provided for easing the comparison.

Table 7 shows statistics of the different cleaning strategies in some of the referenced works.
Finally, regarding the specific data we are targeting in this work, we highlight:

• General domain Spanish: We already saw that for training the Spanish BERT, BETO,
the authors collected and preprocessed data from many different sources, totalling around
3.3B tokens. The corpus is, though, sentence-level, and we identified some problems in the
sentence splitting.

• Biomedical Spanish: For the English biomedical domain, in the works of BioBERT and
SciBERT, large amounts of scientific articles (e.g., from PubMEd) were extracted, ranging
between 3B and 18B tokens. In the case of Spanish, similar collections (yet, in a smaller
scale; 182M tokens in total) have been generated for learning domain word embeddings
[103], aggregating articles from Scielo69 and health articles in Wikipedia. In this case, the
problem is that this scale, while enough for word embeddings, could be too small for a
language model. In addition, domain documents collected from Wikipedia were not always
actually related to the biomedical domain (using Wikipedia tags can be problematic).

• General domain Catalan: For most languages, including Catalan, Multilingual BERT was
trained on the corresponding Wikipedia, with around 200M tokens. caWac [104], the
largest Catalan corpus ever published to date, is composed of more than 780M tokens
coming from a large-scale crawling of the top .cat domains. For the release, authors
tokenized and tagged the corpus. CuCWeb [105], a predecessor of caWac, consisted of
a corpus of 166M tokens coming from a crawling. Table 8 shows how existing Catalan
crawlings compare.

• MT4All language pairs: The MT4All project targets the following language pairs: 1. Finnish,
Norwegian, Latvian ↔ English for the financial domain; 2. Ukrainian, Georgian, and
Kazakh ↔ English for the legal domain; 3. Norwegian, Spanish, German ↔ English for
the customer support; 4. Spanish ↔ English for the biomedical domain; 5. Basque, and
Catalan ↔ English for the general domain. While the biomedical and general domain use
cases are covered by the referenced literature (and, in the case of Basque, see [106] and the
aforementioned Basque BERT, BERTEus), the other domains are considerably specific.
The literature on generating corpora for those domains is limited, especially for languages
other than English. In this work, we will explore the Finnish, Norwegian, Latvian, Catalan,
Basque, and Spanish use cases.
69 https://scielo.org/es/
Work | Source | Size | Preprocessing | Document-level
CuCWeb (2006) [105] | Spanish websites (IP) | 166M | LI, TOK, SP, TAG | Yes
caWac (2014) [104] | Top .cat domains | 780M | LI, TOK, SP, TAG | Yes
OSCAR (2019) [99] | CommonCrawl | 728M | LI, SP | No

Table 8: Existing Catalan crawlings. LI means Language Identification; TOK means tokeniza-
tion; SP means sentence splitting; TAG means tagging.

3.4 Summary and conclusions on the state of the art


As far as domain and language-specific models are concerned, we have seen that developing
models from the ground up has proven to be worth the effort, provided that the domain is challenging
enough and meets a minimum requirement of available resources, or that the targeted
language has enough data available. One of the main advantages of building new, specific
models is that the vocabulary is especially crafted for them, resulting in considerably shorter
sequences. In a multilingual (or general-domain) model, since the vocabulary is shared among
languages (or domains), many words will be split into subwords, because the vocabulary will
not have enough entries for all terms. This especially harms the underrepresented languages or
domains. In a language or domain-specific model, more words will have their own token, which
results in shorter sequences that are easier and more efficient to model.
Regarding corpora generation, we have seen that the NLP community has a long tradition
of developing unlabelled corpora. The scale and preprocessing techniques of these corpora have
evolved with the needs of the NLP models and, conversely, NLP models have also evolved
as a function of the available corpora. The idea of understanding the web as a corpus is decades
old. At first, the generated corpora did not need to be especially large, were generated only for
high-resource languages in the general domain, and were preprocessed for more classical models (e.g.,
released tokenized at word level and automatically tagged with Part-Of-Speech). Nowadays,
new datasets are becoming increasingly large, and what yesterday was considered large for
general-domain English is today believed to be tiny even for biomedical Spanish.
Also, exploring sources other than the web is becoming more common.
While some pipelines for text processing exist, we believe it is worth developing a new one from
the ground up, even if leveraging existing libraries and components, in order to optimize flexibility
and configurability, and to target models that need unlabelled corpora without imposing
model-specific decisions (e.g., tokenization). We take, though, a great deal of inspiration from
these existing cleaning pipelines. We have seen that language detection, document boundary
detection, boilerplate removal, and deduplication are among the main challenging tasks,
at least for the kind of corpora generation of our interest. We have also seen that carefully
hand-crafted heuristics are extremely effective, and we will base some of our rule-based
components on the ones described in this section. There is room for improvement and creativity,
though. We considered making use of machine learning, as in the Finnish BERT preprocessing
or the work based on the WaCky corpora. We concluded that, while definitely promising, this
approach can be problematic in our case, due to the different array of use cases we are interested
in, as well as the lack of literature for some of them. We will use machine learning, though, for
language identification, with an innovative approach targeting efficient detection.

Work Model Data Parameters Catalan vocabulary
mBERT [6] BERT Wikipedia (200M tokens) 110M
Wikibert ca [84] BERT Wikipedia (200M tokens) 110M X
Calbert70 ALBERT Oscar (720M tokens, no cleaning) 12M X

Table 9: Existing Catalan language models: mBERT (Google's Multilingual BERT) was trained
with large amounts of multilingual data, but for Catalan specifically it only saw around 200M tokens,
from the Catalan Wikipedia. Wikibert ca is a monolingual model trained only on the Catalan
Wikipedia. Finally, Calbert is an ALBERT model trained with the Catalan section of OSCAR,
with no preprocessing. Calbert was published as an experiment in the GitHub repository cited
in the table, but it does not have any accompanying paper or evaluation.

Finally, regarding the specific languages and domains we want to generate new corpora for
(and, ideally, models), we observe that while there are already corpora and models for Catalan,
general-domain Spanish, and biomedical Spanish, they are considerably smaller than those for
other languages. The same applies to the languages targeted by the already mentioned MT4All
project, namely Finnish, Norwegian, Latvian, and Basque. Specifically for the case of Catalan,
Table 9 shows existing Catalan language models.

70 https://github.com/codegram/calbert

4 Settings

In this section, we describe the settings which we intend to use for this project. These settings
will serve both as initial motivation (even if the approach aims to be more generic than just
addressing these specific use cases) and as validation of the methods.

4.1 Data
Regarding the data, we will apply our proposed text processing pipeline to a heterogeneous set
of corpora, in terms of languages, domains, scales, and targeted use cases. Using our pipeline
with each of these data sets has its own motivation, as we will see, but at the same time, it will
show the performance of the pipeline under different scenarios.

4.1.1 Catalan

In the case of Catalan, we will make use of the following raw data sources:

• New crawlings: During 2020, three new Catalan crawlings were run at the Barcelona
Supercomputing Center (BSC). The first one targeted the top .cat, .ad (Andorra), and
.barcelona domains. The second one targeted websites from the Catalan Government, for
which the government, in collaboration with BSC, had to explicitly allow access to BSC
crawlers. Finally, a similar crawling, also with explicit access allowance, was done for the
Catalan News Agency website. Section 6.1.1 gives more details about these crawlings.

• Existing crawlings: In Section 3.3, the caWac corpus, a relatively large crawling, consisting
of 780M tokens, is presented. Instead of using the tokenized version that was published,
we contacted the authors in order to obtain the original raw version, to which we apply
our cleaning pipeline described in Section 5.3.

• Wikipedia: The Catalan Wikipedia consists of approximately 200M tokens at the time of
writing this section. It is naturally document-level, and using a lightweight preprocessing
one can easily obtain high-quality text.

• Other sources: The Catalan section of the OSCAR corpus, which is a mostly unprocessed
document-level web corpus consisting of around 720M tokens. Sentences from OpenSubtitles71
and the DOGC dataset72 are also publicly available. In all cases, after inspection, it
became clear that they needed to be preprocessed before being usable.

4.1.2 General-domain Spanish

In the case of general-domain Spanish, the following sources are available:


71 http://opus.nlpl.eu/OpenSubtitles-v2018.php
72 http://opus.nlpl.eu/DOGC.php

• New crawling: The Spanish National Library (Biblioteca Nacional de España, BNE73) provided
a massive web crawling of the top 4557 .es domains, with a depth of 5. The crawling was
conducted between 22/05/2019 and 13/10/2019. All in all, the crawling resulted in
616703 files, each line of which is a JSON object with the data and metadata of a specific
web page. The huge size of the raw data (around 45TB of raw WARC files, which are even
larger than the corresponding JSON) is remarkable and makes it challenging to process.
The text was extracted from the WARC files with Selectolax74. The software used for the crawling was
the one used by the Internet Archive75. After inspecting the raw data, the need for cleaning is
clear, since it cannot even be assumed that all websites are in Spanish.

• BETO corpora aggregation: In Section 3.3, we mentioned that for building a Spanish
BERT, BETO, the authors already aggregated and preprocessed different sources for
general-domain Spanish, including Wikipedia, OSCAR, and others. All in all, the cor-
pus consists of almost 3B tokens, after a light preprocessing (by the authors of BETO).

4.1.3 Biomedical Spanish

For biomedical Spanish, we intend to process data from these sources:

• A new Spanish biomedical crawling: The Spanish Health Web Corpus (referred to later simply as
the ”Medical Crawler”), or ”Corpus Web Salud Español (CoWeSE)”, is a new crawling run at
BSC. The selected websites belong to at least one of the following categories: 1. Medical
communities, 2. Scientific communities, 3. Medical journals, 4. Research centres, 5. Phar-
maceutical companies, 6. Informative websites about health issues, 7. Patient associations,
8. Personal blogs from healthcare professionals, 9. Hospital websites, 10. Public health or-
ganizations. In total, it consists of more than 4500 websites, stored in one WARC file
each. While mostly in Spanish, text in Catalan and Galician is also present.

• TeMU’s biomedical Spanish collection: The Text Mining Unit of Barcelona Supercomputing
Center76 has collected different corpora of biomedical articles and anonymized clinical
histories, including cardiology, covid-19, and radiology:

– CardioCC, CovidCC, and RadioCC: Corpora extracted from cardiology clinical cases
(150k tokens), covid-19 clinical cases (82k tokens), and clinical cases involving radiol-
ogy (177k tokens), respectively.
– Libros Casos Clinicos: A miscellaneous collection of clinical cases, totalling more than
1 million tokens.
– EMEA: EMEA is a corpus of biomedical text retrieved from the European Medicines
Agency77 (EMEA). It consists of around 13.8M tokens.
– Patents: A corpus generated with patents (related to the biomedical domain), com-
posed of more than 14M tokens.
73 http://www.bne.es/
74 https://pypi.org/project/selectolax/
75 https://github.com/internetarchive/heritrix3
76 https://temu.bsc.es/
77 https://www.ema.europa.eu/en

– BARR2 Background set: The background set of TeMU’s Biomedical Abbreviation
Recognition and Resolution task78 : Corpus extracted from biomedical literature in
Spanish used in an abbreviation recognition and resolution task. The text itself
is document-level and basically consists of abstracts from biomedical articles, with
28.87M tokens.
– REEC: ”Registro Español de Estudios Clínicos”79 (REEC) is the Spanish Registry of
Clinical Studies. The resulting corpus consists of 4.58M tokens.
– SciELO: A corpus made of biomedical articles, extracted from the SciELO reposi-
tory80 . It consists of a document-level corpus of 61.84M tokens.
– Mespen Medline: Corpus generated from medical articles extracted from Medline81 .
This repository is better known for storing papers in English, but it has a portion of
Spanish articles as well. In total, it has almost 110M tokens.
– PDFs general: A document-level, miscellaneous collection of biomedical Spanish text
extracted from PDFs. It consists of 129.12M tokens.

For more information on those, we refer to TeMU’s website82 .

• Wikipedia Life Sciences: For this work we use a new crawling of the health-related articles
in the Spanish Wikipedia, run at BSC, with a more sophisticated strategy than the one
in [103] (traversing the nodes belonging to health-related categories up to a certain depth,
and discarding records belonging to a list of categories not related to, but potentially conflated
with, health topics).

4.1.4 MT4All

We have already mentioned MT4All. It is a European project aimed at providing bilingual
resources (including machine translation systems) for under-resourced language pairs. MT4All
plans to leverage the unsupervised machine translation approach proposed in [107], described in
Section 2.3. Thus, it needs to collect monolingual corpora. BSC, as a partner of the project,
is in charge of the data collection and processing and the HPC work packages. We mentioned
above the different use cases targeted by the project. Specifically in this thesis, we are focusing
on the data preprocessing required by the following ones: financial domain (Finnish, Norwegian,
Latvian ↔ English), biomedical domain (Spanish ↔ English), and general domain (Basque,
Catalan ↔ English).

• Financial domain for Finnish, Norwegian, Latvian ↔ English: One option could be selecting
financial domain text from large general domain corpora, although this could hardly be
effective for languages other than English (because they will not have enough data of this
domain in a general crawling). Instead, we opt for running new crawlings on websites
specifically linked to the financial domain. These crawlings will be cleaned with the cleaning
pipeline described in Section 5.3.
78 https://temu.bsc.es/BARR2/
79 https://reec.aemps.es/reec/public/web.html
80 http://scielo.isciii.es/scielo.php
81 https://medlineplus.gov/
82 https://temu.bsc.es/

• Biomedical domain for Spanish ↔ English: For biomedical Spanish, we refer to Section
4.1.3. For biomedical English, the data collected in BioBERT [18] and SciBERT [80] can
be used out-of-the-box, so it is not really a concern.

• General domain: Basque, Catalan ↔ English. For Catalan, we refer to Section 4.1.1. For
English and Basque, we will apply the cleaning pipeline described in Section 5.3 to the
English and Basque sections of the OSCAR corpus, respectively.

4.2 Environment
As we said, the proposed architecture (meaning the general architecture of the system, not the
neural architecture) is heterogeneous in the sense that each component may leverage a different
kind of device.
Regarding data gathering, most of the raw data used for this project came from web crawlings.
I was not personally involved in this part of the process, but for the sake of completeness, let
us describe the hardware that was used. The crawls run at BSC by the Operations
department83 were executed on a regular Linux virtual machine with 6 cores, parallelized with
MPI. We do not know the specifics of the hardware used in the BNE crawling, which was executed
on the BNE servers. The output of this crawling was transferred to BSC's storage facilities.
All data, regardless of their origin and state during the whole process, is stored at the large
storage facilities of BSC, which leverage IBM's General Parallel File System (GPFS). Data
partitions connected to the compute nodes are SSD-based.
As far as the text processing pipelines are concerned, they are run on a supercomputer, MareNostrum484.
This supercomputer is based on Intel Xeon Platinum processors (Skylake). Regarding
the operating system, it runs SuSE Linux Enterprise Server. Each compute node has 48 cores
(2 sockets of Intel Xeon Platinum 8160 CPUs with 24 cores each @ 2.10GHz) and 96GB of main
memory.
Regarding training (and, in fact, also evaluation), the models themselves run on a GPU cluster,
specifically a PowerPC cluster in which each node has 4 x NVIDIA V100 (Volta) GPUs with
16GB of HBM2 memory85. Each node has 2 x IBM Power9 8335-GTH @ 2.4GHz (3.0GHz on turbo, 20
cores and 4 threads/core, 160 threads per node in total), 512GB of main memory, and 2 units
of SSD of 1TB as local storage. The operating system is Red Hat Enterprise Linux Server 7.5,
with CUDA 10.1 and the required drivers pre-installed. Notice that these GPUs have
16GB of memory, unlike all the works using NVIDIA V100 we saw in Sections 2 and 3 (which used
32GB GPUs); this will pose some challenges (see Section 5.6.3).
Both in MareNostrum 4 and in CTE-POWER, nodes are interconnected with InfiniBand (IB)86, a
computer networking system with low latency and high throughput.

83 Namely, the new Catalan crawlings, the biomedical crawling, and the MT4All crawlings.
84 https://www.bsc.es/support/MareNostrum4-ug.pdf
85 https://www.bsc.es/support/POWER_CTE-ug.pdf
86 https://www.mellanox.com/pdf/whitepapers/IB_Intro_WP_190.pdf

5 Methods

Once we have seen the required background, the current state of the art, and the settings we are
targeting, we will go through the specifics of our proposal. First of all, let us clarify what we
mean by document, document boundaries, or document-level corpora, since these concepts will
be of the utmost importance in this section and the rest of the thesis. We understand a
document as a sequence of sentences. Note that by document we do not mean a file of
a computer system, but a unit of meaning, composed of a number of sentences. A given file may
contain one or more documents. For instance, if we take one of the plain text files extracted from
a Wikipedia dump, each of the articles may be considered a document. Paragraphs themselves
could also be considered documents, but in this work we have kept as much context as
possible. Document boundaries may be denoted by markup (e.g., different items in an XML),
or by conventions in a plain text file (typically, documents are separated by one empty line).
Finally, a document-level corpus is one that preserves those boundaries. As we have seen,
most algorithms for training language models require document-level corpora and, even when
they do not, document-level corpora let the model learn long-range dependencies.
In this work, we propose a method for generating new corpora for training language models at
scale, even though the method is generic enough to create datasets for any NLP model requiring
large unlabelled corpora (e.g., unsupervised or semi-supervised machine translation, or word
embedding algorithms). We then show how the generated corpora can be effectively used for
training language models from scratch.

5.1 Overview
The whole process comprises the following steps:

1. Data gathering and storage: It consists of the execution of new crawlings or, alternatively,
downloading or collecting existing corpora. In the case of challenging domains, the mere
fact of gathering existing datasets is a relevant process that cannot be overlooked. It is
of the utmost importance to list all the available corpora and to properly collect their
respective metadata.

2. Cleaning and formatting: We build from scratch a cleaning pipeline that takes raw
text as input, in a number of different formats, and cleans it. The process is model-agnostic enough
not to lose information or be too specific for a given architecture, but opinionated enough
to force certain properties that we believe to be generally desirable (e.g., keeping document
boundaries whenever possible).

3. Model-ready preprocessing: Once corpora have been cleaned, they have to be specifically
preprocessed according to the needs of the targeted application and model. We consider
this process as part of the model building phase itself. Typically, this step will involve a
subword-based tokenization, but this is not always the case.

4. Model training: Once the data is ready, the model can be trained. We will document the
specifics of training language models from scratch.

5. Evaluation: In the case of the generated corpora, it would be both costly and technically
complicated to intrinsically evaluate a specific cleaning process. We will, instead, provide
qualitative analysis of the results of the cleaning. Regarding models, language models
must be evaluated on downstream tasks since the pre-training metrics are not necessarily
informative of their usefulness.

5.2 Data gathering and storage


While I collaborated in the data gathering and storage process, I was not personally in charge
of it, and it is therefore out of the scope of this thesis to describe it in detail. Nevertheless, for
the rest of the steps, we will have to take into account that we are using data coming from
heterogeneous sources, in terms of format, quality, domain, language, and size. Several new
crawlings targeting the desired domains and languages were conducted. In most cases, the
crawlings were run on BSC's virtual machines; the BNE crawling was run at BNE facilities and
transferred to BSC. On the other hand, existing corpora had to be searched for and documented,
both public datasets and datasets belonging to BSC that had not been indexed or documented.
All raw data have been stored at BSC's storage facilities, while the associated metadata has been
conveniently registered.

5.3 Cleaning and formatting pipeline


At the beginning of this section (Section 5), we have already described our high-level desideratum
for cleaning and formatting text. However, text cleaning is an ambiguous term. For instance,
some works will consider lower-casing as part of the cleaning process. Instead of arguing about
the semantics of text cleaning, we will clarify the kind of text processing methods we are keen
on, at least in this phase. In unsupervised or semi-supervised settings, i.e., the ones that can
best leverage the kind of corpora we are generating, we have seen that the more data, the
better, even if noisy (up to a point). In addition, an overly aggressive normalization is not realistic
at inference time and will make the models less robust to noise. On the other hand, there are certain
kinds of noise that will most likely not only be useless, but will also waste model capacity on
learning out-of-scope cases (e.g., modeling the non-natural text87 present in raw crawlings).
Table 10 shows some examples of desired and undesired cleaning processes as per our goals.
87 Modeling non-natural text, like programming languages and equations, can be of extreme interest. But, unless
aiming to build models with huge capacity (à la GPT-3), which is not realistic for most languages and domains,
this is out of the scope of our interest in building the model. In addition, many sequences of non-natural text will
hardly be interesting to any researcher (e.g., sequences of apparently random characters, present in the Catalan
section of OSCAR).

Example Desired? Reason
Removing non-natural language Desired Unnecessary noise, waste of compute,
probably not present in inference
Discarding text in other languages Desired Not useful, waste model capacity in
unnecessary use cases
Spell checking Undesired Not realistic in inference, can cause lack of
robustness to typos
Terminology normalization Undesired Not realistic in inference, prevent the
model from learning synonyms
Lower-casing/true-casing Undesired Loss of information, too opinionated
Tokenization Undesired Loss of information, too opinionated
Removing boilerplate Desired Document coherence
Removing duplicates Desired Waste of compute and storage, potential
risk of overfitting

Table 10: Examples of desired and undesired cleaning: The term of text cleaning is ambiguous
(e.g., is normalizing text with spell checkers part of the cleaning?). We show specific examples
of text processing techniques we will apply in the phase of cleaning and formatting. Some of the
methods declared as undesired in this table will be applied, but in the model-ready processing
pipeline, not in the cleaning pipeline. We consider preprocessing steps such as tokenization as
part of the model building phase itself.

5.3.1 Design and features

The cleaning pipeline has been designed from scratch with the goal of being as generic (while
targeting European languages), flexible, extendable, and memory and compute-efficient as pos-
sible:

• Genericity: Without claiming to be universal, the cleaning pipeline is able to handle a
diversity of languages and domains, with shared components.
• Flexibility: The program is heavily parameterized. Most functionalities can be customized
via command-line options.
• Extendability: It is easily extendable by sub-classing or adding methods.
• Memory-efficiency: The pipeline can process terabytes of input since data transformations
are lazily evaluated.
• Compute-efficiency: The sequential version was profiled and optimized. Apart from that,
if the program has access to multiple cores, different parallel implementations are available.

The cleaning pipeline takes as input raw text (in whichever format is required), and produces
output in the specified format. It is model agnostic in the sense that it does not prepare the
output for any specific model (e.g., it does not tokenize or add special tokens). Nevertheless,
it is strongly opinionated in the sense that it was conceived with the goal of keeping document
boundaries, which were considered to be of increasing importance, and several components are
designed with this goal in mind.

Figure 9: Cleaning pipeline overview.

Figure 9 gives an overview of its design.
In this subsection, we will go through the different components, while giving a global under-
standing of the whole process.

5.3.2 Data parser

This component parses raw data. The parent DataParser class implements the heavy-lifting
utilities for efficiently guessing the encoding (unless explicitly provided) of a file and parsing it,
opening binary files if required, etc. For encoding guessing, it leverages the chardet module88 .
We also experimented with UnicodeDammit89 and MagicLib90, but the chardet library seemed more
complete (in the case of text) and better documented. We lazily feed the contents of each file to
chardet’s UniversalDetector, until a given confidence threshold on the encoding is obtained,
or the process times out (and we assign UTF-8 by default). The parser also lists all the files
to process in the given directory, which will be later distributed to the different workers in the
parallel mode. In fact, the parser could be easily extended to directly read from any stream, not
necessarily based on local files (e.g., a live crawling). In practice, for large crawlings, it is better
to store the results of the raw crawling and then, as a separate step, apply the cleaning pipeline
(which allows, for instance, to sample from the raw data and test different cleaning strategies).
The data parser has the responsibility of keeping document boundaries, if the raw data provides
them (for instance, the Wikipedia data parser parses each article as an independent document),
or even trying to generate them, if they can be inferred. For instance, in the case of web
crawlings, each page within a website is considered a document, but only taking the content
inside paragraphs to try to avoid boilerplate (e.g., copyright notices or footers), with the goal of
returning connected text. Document metadata is also parsed, as long as it is provided by the
corresponding input format.
Inheriting from the parent class, extending the program to parse different formats takes a few
lines of code. The child class must only implement the parse_file method and declare the
targeted extensions (if any), such as "*.xml"; a minimal sketch is given at the end of this
subsection. As we said, the core of the implementation is built around Python generators91,
which allow the data parser to lazily read potentially large files with a minimal memory footprint.
There are several parsers implemented:

• BSC crawling JSON parser: Parses a JSON-based format used in some of the crawlings
(especially, in the BNE crawling) run or organized by BSC. Apart from the raw text, it
has other fields related to metadata, such as keywords of the original website, the title, or
the URL.

• Fairseq LM parser: Document-level format, sentence-by-sentence (one sentence in each
line) plain text file. Documents are separated by a single empty line. Despite the name we
chose92 , this format is quite universal.
88 https://pypi.org/project/chardet/
89 https://www.crummy.com/software/BeautifulSoup/bs4/doc/
90 https://github.com/threatstack/libmagic
91 https://wiki.python.org/moin/Generators
92 Since to our knowledge it was originally described in Fairseq's repository: https://github.com/pytorch/fairseq

• Onion parser: Document-level format allowing to store metadata of each document. It is
used internally by the program. The name refers to the Onion deduplication tool93 .

• Sentence parser: Sentence-level, plain text file format. Sentences are provided line by line.

• Document parser: Similar to the Onion parser, but for external usage (e.g., without dedu-
plication marks).

• Textfile parser: Similar to Fairseq LM and sentence parser, but in this case, all text in the
same file is assumed to belong to the same document (instead of relying on empty lines as
document boundaries, or being a sentence-level format).

• WARC parser: The WARC94 (Web ARChive) format stores multiple resource records (data
objects), typically consisting of websites dumped from a crawling. The raw crawlings used
in this project were stored in this format.

• Wikipedia parser: Parses extracted Wikipedia dumps, keeping its metadata and considering
each article as a document.

The factory pattern95 is used to build these parsers in a simple way. In the first implementations,
the data parser was implemented as a streamer of documents. Later, while keeping the use of
generators, it was rethought as a mapper (mapping paths to streams of documents), to simplify
the parallel implementation.
To the best of our knowledge, no other text cleaning pipeline implements so many formats by
default, and the encoding guessing feature is not common in other cleaners we have seen. Our
focus on extendability and genericity is especially observed in this component.
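To make the extension mechanism more concrete, the following is a minimal, simplified sketch (not the actual implementation; everything except the parse_file convention is illustrative) of a generator-based parent class with chardet-backed encoding guessing, together with a hypothetical child parser:

from pathlib import Path
from chardet.universaldetector import UniversalDetector


class DataParser:
    """Sketch of the parent class: encoding guessing plus lazy parsing."""

    def guess_encoding(self, path, default="utf-8"):
        detector = UniversalDetector()
        with open(path, "rb") as f:
            for chunk in f:
                detector.feed(chunk)
                if detector.done:  # stop as soon as the confidence is high enough
                    break
        detector.close()
        return detector.result.get("encoding") or default

    def parse(self, directory, pattern="*.txt"):
        # Lazily map each file path to a stream of documents (generators all the way down).
        for path in Path(directory).glob(pattern):
            yield from self.parse_file(path)


class PlainTextParser(DataParser):
    """Hypothetical child class: each file is treated as a single document."""

    def parse_file(self, path):
        encoding = self.guess_encoding(path)
        with open(path, encoding=encoding, errors="replace") as f:
            yield [line.rstrip("\n") for line in f if line.strip()]

In the real pipeline, analogous child classes also attach the available metadata (e.g., URLs or keywords) to each document.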

5.3.3 Encoding fixer

The encoding fixer module leverages ftfy [93] to fix common encoding errors, also known as
mojibake. This component assumes that, encoding errors aside, the encoding of the text is
already UTF-8 (which is guaranteed, at least up to a point, by the data parser). In addition, it
applies Unicode NFKC normalization96, meaning that certain Unicode characters are normalized
with an equivalence table. For instance, the standalone ellipsis character (”…”, U+2026) is
converted to three dots (”...”). We chose ftfy for encoding fixing because it is extensively used
in the literature, and it worked very well in our tests.
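As a rough sketch (ftfy's fix_text is its main entry point; the example string is ours), this component essentially amounts to:

import unicodedata

import ftfy


def fix_encoding(text: str) -> str:
    # Repair mojibake and other common encoding artifacts.
    fixed = ftfy.fix_text(text)
    # Apply Unicode NFKC normalization (e.g., the standalone ellipsis becomes three dots).
    return unicodedata.normalize("NFKC", fixed)


print(fix_encoding("El aÃ±o pasadoâ€¦"))  # should print roughly: El año pasado...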

5.3.4 Prefilterer

The prefilterer has two main objectives:

1. Apply simple string transformations: Markup is either removed or replaced by sentence
or document boundaries. This is crucial, since just removing all markup tags (instead of
93
http://corpus.tools/wiki/Onion
94
https://iipc.github.io/warc-specifications/specifications/warc-format/warc-1.0/
95
https://refactoring.guru/design-patterns/factory-method
96
See https://unicode.org/reports/tr15/.

especially treating tags such as <br>) will collapse separate sentences into a single one,
without even an empty space between the end of a sentence and the beginning of the next
one. In the example provided in the documentation of the BeautifulSoup497 library, which
can be easily used to extract pure text from markup text, one can observe that parts of
the same sentence can get separated in different lines (which is undesired), so in this case
it is better to manually build the regular expressions.

2. Early filtering of bad documents: Operations and filters applied to documents are increasingly
expensive in terms of computation time. One of the goals of the prefilterer is to
discard as fast as possible what we consider potentially bad documents. For instance,
documents that are too short (according to a configurable threshold) in terms of characters or
heuristically estimated tokens are discarded, as are documents in which there is presumably
little to no natural language (inferred from the percentages of different kinds of characters, or
the presence of certain kinds of strings). The presence of text in one of the desired languages
(as requested by the user) is also tested, first via the percentage of characters in the alphabet
of the language in question and then via the fast language identifier, with a low threshold98.
A sketch of this early-exit logic is given at the end of this subsection.

Regarding the language identifier, we experimented with three language identifiers mentioned in
Section 3.2:

• cld3

• FastText-based language identifier.

• LangId.

To our surprise, Google's language identifier, the one used in Paracrawl, did not perform as well
as expected in our tests, at least in some languages and domains (Catalan, biomedical Spanish).
In this component, we ended up using the FastText one, which offered the best trade-off between
performance and speed. We observed that this language identifier was quite sensitive to the
presence of URLs, so URLs are removed before inputting the text to it. Note that language
identification can also serve to discard non-natural text. Once some documents have been
discarded, further transformations and filters, more demanding in terms of computation, are
applied.
At the end of the day, the prefilterer does a great deal of the heavy-lifting work of the cleaning
pipeline. It compiles many hand-written heuristics99 that generally work and that, in any case,
can be explicitly deactivated or modulated via threshold or deactivation parameters.
It is worth noticing that up to this point, the considered unit has been the document as a
whole (in case of the sentence-level input formats or datasets, each document consists of a single
sentence), but still as a raw sequence of characters, without any notion of sentence.
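The following sketch illustrates the early-exit logic of the prefilterer with two of these cheap checks (the thresholds and the toy alphabet are illustrative, not the pipeline's actual parameters); the text that survives, with URLs stripped, is what would then be handed to the fast language identifier:

import re

URL_RE = re.compile(r"https?://\S+")
TARGET_ALPHABET = set("abcdefghijklmnopqrstuvwxyzàáèéíïòóúüçñ ")  # toy example


def prefilter(document: str, min_chars: int = 200, min_alpha_ratio: float = 0.8):
    """Cheap document-level checks, ordered from fastest to slowest (early exit)."""
    # 1. Too short to be useful?
    if len(document) < min_chars:
        return None
    # 2. Too many characters outside the target alphabet (likely non-natural text)?
    lowered = document.lower()
    if sum(ch in TARGET_ALPHABET for ch in lowered) / len(lowered) < min_alpha_ratio:
        return None
    # 3. Strip URLs before any language identification (they confuse the classifier).
    return URL_RE.sub("", document)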
97 https://www.crummy.com/software/BeautifulSoup/bs4/doc/
98 See the cascade of language identifiers detailed in Section 5.3.10.
99 Many more than the ones summarized in this section, for the sake of brevity. For a full reference, see Appendix B.

5.3.5 Sentence splitter

Regarding the sentence splitting, we investigated different libraries, with the following require-
ments:

• Work out-of-the-box with as many languages as possible, at least with the European ones.

• Be as fast as possible.

• Be as lightweight (in terms of dependencies and run-time requirements) as possible.

• Related to the previous point, avoid tokenization.

The role of the cleaning pipeline is generating corpora and, thus, it is supposed to be model-
agnostic. Tokenization is part of the model building phase (e.g., one researcher might want to
experiment with another tokenization, without having to re-execute the cleaning process). Thus,
we would rather avoid a sentence splitter that tokenizes, since we would have to detokenize again,
which might be a waste of resources, in case other lighter approaches work out of the box.
As we saw in Section 3, the WikiBERT pipeline used UDPipe, a trainable NLP pipeline capable
of performing tokenization, dependency parsing, and other tasks. On the other hand, FreeLing
[108–112] is a very powerful NLP pipeline written in C++, including tokenization, Part-Of-Speech
tagging, lemmatization, and others. Its sentence splitting, which requires tokenization, is
considerably effective in many languages. Nevertheless, we discarded both, even though they are
the most powerful sentence splitters, since they do not fulfill our requirements.
On the other end of the spectrum, we find the light preprocessing script of BETO, the Spanish
BERT we saw. We revisited their simple script and, for sentence splitting, we observed that it
failed in several cases. For instance, it splits sentences when encountering acronyms.
Instead, as a middle ground between the two approaches, we use Sentence Splitter100, a heuristic
yet effective sentence splitter. It does not require tokenization and has no dependencies, yet its
rules are considerably effective. It has explicit support for many languages, including Catalan and
Spanish, and it also works well with others that are not officially supported, such as Basque.
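Its API is minimal; a usage sketch with a toy Catalan sentence of ours:

from sentence_splitter import SentenceSplitter

splitter = SentenceSplitter(language="ca")
text = "El Dr. Puig va arribar tard. La reunió ja havia començat."
for sentence in splitter.split(text):
    print(sentence)
# Expected: two sentences; the non-breaking prefix "Dr." should not trigger a spurious split.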

5.3.6 Sentence filter

By sentence filtering we mean the application of sentence-level filters, even (and, perhaps,
especially) in the case of document-level corpora. One could argue, with reason, that filtering
individual sentences in a document-level corpus could damage the document coherence. At the
same time, and especially in the case of crawlings, the presence of certain artifacts might damage
the coherence of the document even more. On the one hand, the document might be littered
with placeholder text (messages about cookies, privacy, copyright, and so on). On the other hand,
non-natural text might appear (this is sometimes the case in Wikipedia). Recall that until this point
the filters have been applied at document level. The fact that the text of a document as a whole
is accepted does not guarantee that all its sentences are necessarily natural text.
100 A Python module based on the original Perl scripts developed by Philipp Koehn and Josh Schroeder: https://github.com/mediacloud/sentence-splitter

Our proposal is, thus, to filter individual sentences, but with caution. We apply a cascade of
language identifiers101 to all sentences, individually. Non-natural text, or placeholder sentences in
a language different from the targeted one, will most likely be discarded in this step. Apart from
that and other heuristics, sentences that are repeated within the same document can optionally
be removed.

101 See Section 5.3.10.

5.3.7 Normalizer

By normalization we mean classical NLP normalizations such as punctuation normalization.
By default, we should generally not apply this kind of normalization, since variation makes
the model more robust to different kinds of input. Still, punctuation normalization can be
considerably useful in low-resource scenarios (especially if targeting machine translation), such
as in the case of the corpora generated for the MT4All project. At the end of the day, it is an
optional step.
Perhaps one of the most interesting filters is the one based on the standard deviation of the
character length per sentence. This filter was created after the development of the original
implementation of the pipeline (and was applied as a separate script), but the normalizer is a
sensible component in which to inject it, since it requires the documents to already be split into
sentences and it is more expensive than other functions that have already discarded documents.
The filter is defined as follows. First, we count the number of characters per sentence. Then,
we compute the standard deviation of this value for a given document. If the resulting value is
extremely low and the document has at least a certain number of sentences, the document is
almost certainly abnormal (a minimal sketch of the filter is given after the examples below). In
the case of the Catalan aggregation of corpora, we inspected by hand all the documents detected
by this criterion and, indeed, all of them were abnormal. We observed that all these documents
belonged to one of the three following categories:

• Documents coming from PDFs: Corpora coming from existing datasets had specifically been
treated with PDF extraction tools. Nevertheless, the crawling turned out to contain some PDFs
(which we were not aware of). These documents are distinguished from the other two cases by
checking the consistent lack of punctuation at the end of the sentences (and the presence of
punctuation marks in the middle of the sentence) and can be fixed with a regular expression.
For example, a document with more than 10k lines (but considerably fewer actual sentences)
was detected in the crawling of the Generalitat websites, containing lines such as the
following:
Dictamen sol·licitat pel Parlament de Catalunya, amb relació
al Dictamen de la Comissió d'Organització i
Administració de la Generalitat i Govern Local, sobre la...

The second and third lines are actually part of the same sentence as the first one.

• Non-natural text: Fortunately, this case does not typically happen if the rest of the clean-
ing pipeline components have been applied. Nevertheless, when running the filter as a
standalone script on the original Catalan OSCAR corpus, documents such as the following
appeared:

in3 9AA in3 9AB in3 9AC in3 9AD in3 9AE in3 9AF in3 9AG in3 9AH...
in3 9BA in3 9BB in3 9BC in3 9BD in3 9BE in3 9BF in3 9BG in3 9BH...
in3 9EA in3 9EB in3 9EC in3 9ED in3 9EE in3 9EF in3 9EG in3 9EH...

Believe it or not, this document was present in the Catalan part of the OSCAR corpus.
These kinds of documents are removed but, as we said, this case should never occur if the
previous components of the cleaner have been applied.

• Documents consisting of identical sentence repetitions, except for a specific detail (e.g., a phone
number, or a date):

CATCert certifica que en data 15-11-2018 13:59:42 va segellar el contingut
d'aquest fitxer i des de llavors no ha estat modificat.
CATCert certifica que en data 11-10-2018 08:54:40 va segellar el contingut
d'aquest fitxer i des de llavors no ha estat modificat.
CATCert certifica que en data 11-10-2018 08:53:54 va segellar el contingut
d'aquest fitxer i des de llavors no ha estat modificat.
CATCert certifica que en data 11-10-2018 08:53:01 va segellar el contingut
d'aquest fitxer i des de llavors no ha estat modificat.

These documents are trimmed.
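A minimal sketch of the filter itself (the thresholds are illustrative, not the ones actually used):

import statistics


def is_abnormal(document, min_sentences=10, min_std_chars=2.0):
    """Flag documents whose sentences have suspiciously uniform character lengths."""
    lengths = [len(sentence) for sentence in document]
    if len(lengths) < min_sentences:
        return False
    # A near-zero standard deviation over many sentences points to broken extraction
    # (e.g., PDF line wrapping) or templated repetitions.
    return statistics.pstdev(lengths) < min_std_chars

Documents flagged this way are then fixed, removed, or trimmed, depending on which of the three categories above they fall into.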

5.3.8 Document filter

From the data parsing to the normalization, each worker prints in its own file all the documents
it generates from the files it is assigned, using the Onion format. Then, in the document filter
component, which is a reducer instead of a mapper, all Onion files are concatenated, and global
deduplication is applied as follows:

• Using Unix's standard utility awk, we remove sentences that are globally repeated more than a
certain number of times. We have found that this heuristic helps to remove placeholder sentences
(e.g., many pages happen to have the exact same copyright disclaimer), and if the repetition
threshold is high enough it will rarely remove sentences it was not supposed to, especially
if only applied to the beginning or the end of the document. We have tried to make the
implementation as Python-based as possible, but in this case the speed-up provided by
awk (called from Python) was worth it (a sketch of the underlying logic is given at the end
of this subsection).

• The Onion deduplication tool, which is based on n-gram repetition frequencies, filters out
documents with a certain level of duplication with respect to the other ones. The
threshold is configurable. For a better understanding of Onion, we strongly recommend
taking a look at the Ph.D. thesis that originated this tool [113].

Both awk and Onion, being implemented in C, are extremely fast. Once the deduplication has
been applied, the output formatter, the next and final component, is in charge of converting the
Onion format into the desired final format. The intermediate results before the deduplication are
also stored, in case the use-case may require them. The program can be run just in deduplication
mode, which is convenient in the case the user desires to deduplicate the result of an aggregation
of corpora that, individually, have already been passed through the rest of the components.
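For reference, the logic behind the awk pass can be summarized in Python as follows (a simplified, in-memory two-pass frequency filter; in the pipeline the equivalent is done with awk over the concatenated Onion files, and the threshold is configurable):

from collections import Counter


def drop_frequent_sentences(documents, max_repetitions=10):
    """documents: a list of documents, each one a list of sentences."""
    # First pass: count how many times each sentence appears globally.
    counts = Counter(sentence for doc in documents for sentence in doc)
    # Second pass: keep only the sentences that are not over-repeated.
    return [[s for s in doc if counts[s] <= max_repetitions] for doc in documents]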

5.3.9 Output formatter

Several output formatters are implemented:

• Fairseq LM: Prints in the aforementioned Fairseq LM format (see Section 5.3.2).

• Sentence: Similarly as before, prints sentence-by-sentence (it is a sentence-level format).

• Onion: Prints text in the aforementioned Onion format. This is used internally by the
program.

5.3.10 Implementation and performance

From the very beginning, this program was targeting the BNE crawling. Since this crawling
amounts to ≈45 TB of raw data, the program was implemented with HPC and memory efficiency
in mind. At the same time, it has the requirement of being usable both for small and large corpora.

Lazy evaluation The program is implemented around Python generators102 . In practice, this
means that the program only loads in memory the contents when it requires them, implying a
tiny memory footprint.

Early exit One of the design principles of this pipeline is the faster a filter, the earlier it is
applied. In other words, we apply first the faster filters (unless there is some dependency), to
exit early from a document if it is not promising enough to apply the more expensive procedures.
Sticking to this technique, as simple as it seems, can easily be overlooked when there are numerous
filters and components, but it is considerably effective.

Sequential optimization Before parallelizing the program, the sequential execution was
optimized. Once the first version had been built, it was profiled using PDB103. Parallel
implementations can, in the best-case scenario, decrease execution time by a linear factor of N,
assuming N workers and a completely parallelizable algorithm (which is hardly ever the case).
Profiling revealed that the language identifier of choice was directly accountable for 50 to 60% of
the execution time, depending on the run. Figures 10 and 11 give more details on the profiling.

Cascade of language identifiers By cascade of language identifiers we mean our solution of
applying a number of language identifiers. Let us define a binary classifier wrapping a language
identifier, such that the classifier returns a positive answer if at least one of the desired languages
appears in the input text with a certain confidence threshold. First, we apply the identifiers with
high recall (but low precision), with a low confidence threshold. Then, once most of the documents
that we are confident are not in the desired language have been discarded, the slower (yet
better-performing) language identifiers are applied, with a higher threshold. This idea was
inspired by the well-known Viola-Jones algorithm for face detection [114]. In the case of
language identification, to the best of our knowledge, the only work we have found using a
similar strategy is [115]. In practice, we found that the best combination used only two
classifiers: LangId, as the slow yet precise language identifier, and FastText, as the faster
alternative with a decent enough recall. Other language identifiers did not perform well
in our tests, as described in Section 5.3.4. This resulted in speed-ups of around 1.8x, as shown
in Table 11. Figure 12 shows the proposed cascade of language identifiers.

Figure 10: Cleaning profiling (graph call): The filter_by_lang method is directly accountable
for more than 50% of execution time, as extracted by the Python debugger. While it is true that
this could be partially sped up by not requesting the normalized probabilities (and just taking
the argmax), the language identifier itself would still be considerably slow, and the normalized
probabilities are needed by the program heuristics. In the figure, the graph call is cropped for
clarity.

102 https://wiki.python.org/moin/Generators
103 The Python debugger. See: https://docs.python.org/3/library/pdb.html
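A sketch of the resulting two-stage cascade (the model path, target language, and thresholds are illustrative, and the decision logic is simplified with respect to the pipeline's actual heuristics; lid.176.bin is the off-the-shelf FastText language identification model):

import fasttext
from langid.langid import LanguageIdentifier, model as langid_model

fast_lid = fasttext.load_model("lid.176.bin")  # fast, decent recall
slow_lid = LanguageIdentifier.from_modelstring(langid_model, norm_probs=True)  # slower, more precise


def is_target_language(text, target="ca", fast_threshold=0.3, slow_threshold=0.7):
    flat = text.replace("\n", " ")
    # Stage 1: cheap identifier with a permissive threshold (high recall).
    labels, probs = fast_lid.predict(flat, k=1)
    if labels[0] != f"__label__{target}" or probs[0] < fast_threshold:
        return False
    # Stage 2: slower identifier with a stricter threshold (higher precision).
    lang, prob = slow_lid.classify(flat)
    return lang == target and prob >= slow_threshold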

Figure 11: Cleaning profiling (counts): Call count means number of times the corresponding
method has been called during the execution; time (in milliseconds) means total time (i.e., the
overall time of that function, including the internal calls to other functions); own time means
the time spent on the method itself (ignoring internal function calls).

Implementation Input Output (sentences) Time (s) Speed-up


Vanilla LangId Wiki subset 6049 41.33 1x
Best Cascade (FastText + LangId) Wiki subset 5950 23.24 1.8x

Table 11: Speed-up obtained by the cascade of language identifiers: The best cascade of lan-
guage identifiers combination we found obtained a remarkable speed-up of 1.8x, while implying
a minimal degradation in terms of output size.

Map-reduce The program is inspired by MapReduce104, a programming paradigm convenient
for big data. Borrowing concepts from functional programming, mappers are functions
independently applied to each element (in this case, each document); reducers take a sequence of
elements and transform them in a dependent manner. This eases parallelization and
simplifies the code base. Apart from mappers and reducers, we define streamers, the components
that generate a stream of raw text (typically, the data parsers).
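A toy sketch of the three roles and how they compose (the function names and the trivial mapper are illustrative; the pipeline's actual mappers and reducers are the components described in the previous subsections):

def stream_paths(paths):
    """Streamer: lazily maps file paths to raw documents."""
    for path in paths:
        with open(path, encoding="utf-8") as f:
            yield f.read()


def normalize_whitespace(document):
    """Mapper: applied independently to each document."""
    return " ".join(document.split())


def deduplicate(documents):
    """Reducer: needs to see the whole sequence of documents."""
    seen = set()
    for document in documents:
        if document not in seen:
            seen.add(document)
            yield document


# Composition: streamer -> mappers -> reducer, all lazily evaluated.
# unique = list(deduplicate(normalize_whitespace(d) for d in stream_paths(["a.txt", "b.txt"])))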

First parallelization strategy The first parallelization strategy correctly assumed that all
data sources (i.e., files) would not necessarily imply the same amount of work. For instance, one
file could be smaller, or be easily filtered. This could imply load balancing problems. For this
reason, in the first parallelization strategy we tried, there were two kinds of parallel workers.
The first ones, the streamers, were in charge of reading files in the background, providing more data
on an as-needed basis. Mappers were applied in parallel to each document. Then, sequentially, the
reducers were executed. The problem with this implementation was that coordinating the two kinds
of workers implied a big overhead, and the implementation became unnecessarily complicated.
Figure 13 shows the poor performance revealed by the parallel profiling of this strategy.

Figure 13: Traces obtained from the first parallelization strategy: Many barriers, i.e., processes
blocking while waiting for others, can be appreciated. Most of the chart is blue (meaning idle, i.e.,
useless, execution time).

Figure 12: Proposed cascade of language identifiers: First, we apply the ones with high recall
(but low precision), with a low confidence threshold. Then, once most of the documents that we
are confident are not in the desired language have been discarded, the slower (yet better-performing)
language identifiers are applied, with a higher threshold. This idea was inspired by the well-known
Viola-Jones algorithm for face detection [114].

104 https://www.ibm.com/analytics/hadoop/mapreduce

Second parallelization strategy For this reason, the second parallel implementation was
simplified. Instead of having streamers as a separate kind of workers that had to coordinate with
mappers, they were treated as another kind of mapper, which mapped file paths to a stream of
data. Figure 14 shows the improved performance of this strategy. Apart from obtaining a better
speed-up, this implementation scaled considerably better with the number of cores. This parallel
implementation, as well as the first one, was implemented with Python's multiprocessing module,
and in our experiments it performed well with up to 48 CPUs (the maximum number of
CPUs per node in MareNostrum 4). Since not all mappers can be serialized (due to having
non-Python dependencies), they are initialized in each worker, instead of being initialized once
and then copied.

Distributed mode The parallel implementation leveraged Python's multiprocessing library.
This module, while being good enough for intra-node parallelization, does not support distributed
execution (i.e., execution with multiple nodes), so it was not enough for our needs. After com-
paring different libraries, we opted for a recent, yet powerful library, Ray [116]. Ray especially
targets AI applications, but in this case we did not use its hyperparameter search or distributed
reinforcement learning capabilities; we use it merely as a library for distributing jobs across
nodes. Remarkably, the speed-up obtained when testing the distributed execution of the cleaning
pipeline was considerably higher than expected. The reason is that Ray appears to be faster
than Python's multiprocessing in vanilla intra-node parallelism. Table 12 shows the speed-ups
obtained by the distributed implementation.
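The distributed driver itself is conceptually simple; a hedged sketch of how file paths can be fanned out to workers across nodes with Ray (assuming the Ray cluster has already been started on the allocated nodes, and with clean_file standing in for the per-file mapping logic):

import ray

ray.init(address="auto")  # attach to the Ray cluster running on the allocated nodes


@ray.remote
def clean_file(path):
    # Placeholder for the per-file mapping logic (parsing, prefiltering, splitting, ...).
    return f"cleaned {path}"


paths = ["shard-0000.warc", "shard-0001.warc"]  # illustrative inputs
futures = [clean_file.remote(p) for p in paths]
results = ray.get(futures)  # blocks until all workers, on all nodes, have finished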

Implementation Nodes Total cores Time Speed-up


Parallel (Python MP) 1 48 46131.7 1x
Distributed (Ray) 3 144 14672.5 3.14x

Table 12: Speed-up obtained by the distributed implementation: We benchmarked the dis-
tributed implementation using the Catalan crawling, consisting of 312GB of WARC files.

Figure 14: Traces obtained from the second parallelization strategy: No barriers are appreciated.
The performance is still not optimal, but the bottleneck is the access to disk, and the reduction
does not leverage all the cores. The performance was further improved when using the Ray
backend, even for intra-node parallelism.

Checkpointing and logging Each time a document is processed, its mapped output is written
to disk in the internal Onion format. Since this format stores metadata, in case the execution was
interrupted for some reason, the state of the cleaning could be restored from these files, but in a
cumbersome manner. For this reason, the cleaning pipeline implements explicit checkpointing,
saving the state such that it can be easily recovered by just launching the pipeline in the same
output directory. Two different checkpointing backends are provided, each of them with different
trade-offs. In addition, the pipeline implements basic logging to monitor the state of the cleaning
process, circumventing the potential bugs (if the implementation had not been careful enough)
induced by logging within a parallel or distributed execution. Both the logs and the arguments
used in the execution are stored in the corresponding output directory (the latter, in a JSON
dictionary), for further reproducibility and traceability.

Deployment Due to the Internet access restrictions in MareNostrum 4 and, furthermore, the
non-Python dependencies, the cleaning pipeline is containerized first in a Docker105 image, and
then converted to Singularity106. Singularity, similarly to Docker, is a system for creating container
images, with the dependencies and environment variables of the program. Unlike Docker,
though, it is especially intended for supercomputers. Generating the Singularity image is more
complicated than the Docker one, since no Internet access can be assumed, and mounting volumes
with write permissions (for being able to write to the host file system) can be problematic
depending on the configuration of the host. The deployment is automated with scripts.

5.3.11 Parameters

The cleaning pipeline is highly configurable and most of its behaviour can be modified via
command-line (or configuration) arguments. Appendix B provides the whole list of parameters
at the time of writing this section.

105 https://www.docker.com/
106 https://sylabs.io/docs/

5.4 Metadata and aggregation


In this project, we have considered metadata and traceability to be of the utmost importance,
at two levels:

• Corpus-level: I have collaborated in the design of an internal system for organizing and
documenting corpora and their metadata. However, it is out of the scope of this master thesis.

• Document-level: In the internal Onion format we saw, each of the documents keeps the
information of the original document (e.g., URL, id, keywords,...). This means that in case
it was necessary, we could easily sample from cleaned corpora (e.g., retrieve documents
with a given keyword).

For additional traceability, the specific version of the cleaning pipeline and the arguments used
are logged and stored in JSON files. Besides, the cleaning pipeline registers all the operations
applied to each document, even though this information is not usually collected unless the program
is run in debug mode.
As far as aggregation is concerned, with the goal of generating corpora as large and diverse as
possible, we aggregate corpora within the same target category (e.g., ”all Catalan corpora”, or
”all biomedical Spanish corpora”). For doing so, simple concatenation might be sub-optimal.
Instead, we do the following:

• The concatenation itself is done preserving the maximum granularity available. For instance,
if a sentence-level corpus is concatenated to a document-level aggregation, the
sentences of the sentence-level corpus are concatenated with interleaved newline characters
(see the sketch after this list). Otherwise, document boundaries would lose their meaning
and a model could not be trained with document-level units (or could, but wrongfully).
Note that if the aggregation is required to be sentence-level, it is trivial to make it so,
but not the other way around.

• We allow for optional oversampling in case a small corpus has some interesting, underrepresented
features, especially for vocabulary building, but it is not applied by default.

• Once concatenated, the aggregation is deduplicated again, using the cleaning pipeline itself
(but only its deduplication component), to remove cross-corpora duplicates.
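A sketch of the granularity-preserving concatenation for the first point above (file names and the function are illustrative; the convention of one empty line between documents is assumed):

def append_sentence_level(aggregation_path, sentence_corpus_path):
    """Append a sentence-level corpus, turning every sentence into its own document."""
    with open(aggregation_path, "a", encoding="utf-8") as out, \
         open(sentence_corpus_path, encoding="utf-8") as sentences:
        for sentence in sentences:
            sentence = sentence.strip()
            if sentence:
                out.write(sentence + "\n\n")  # one-sentence document plus boundary line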

5.5 Model-ready pipeline


We have seen that the cleaning pipeline can take as input raw text in different formats (coming,
for instance, from web crawlings) and clean and format it in the desired format, typically the
language modeling format. The cleaning and formatting pipeline is, though, model-agnostic. Its
output can be used by different kinds of models as long as they use unlabelled corpora (word
embeddings, sentence embeddings, document embeddings, unsupervised machine translation,
and all kinds of language models), even though models built for modeling longer context windows
will benefit more from the fact that the cleaning pipeline is designed to preserve document

boundaries. Steps such as tokenization (or, for instance, tagging) are out of the scope of the
cleaning pipeline itself, and are considered part of the model building part. In this section, we
describe the model-ready pipeline, a second text processing pipeline that takes as input the output
of the cleaning process, and outputs datasets that can directly be used for training language
models. However, in this work, since we are generating new corpora, we run the components as
a pipeline, as there are no existing intermediate results to leverage.
The implementation of this second pipeline was considerably easier than the one we have described
so far. The latter took months to develop, and had to be built and parallelized mostly
from scratch. In addition, the input space of the cleaning pipeline is, indeed, full of surprises.
Here, instead, we leverage existing libraries (e.g., a tokenization library) that provide their own
parallelization. Moreover, the input will be considerably more uniform.
Another difference with respect to the cleaning pipeline is that now the need to run an individual
component in isolation will be considerably more common. Storing intermediate results of the
cleaning process for executing individual cleaning or formatting components will typically be a
waste of time (except the intermediate results just before deduplication), even though we saw
that filters and components can be inhibited via the command line arguments. Instead, when
making the data ready for the model, it might make sense to just run a pre-trained tokenizer, or
to re-use an existing train-validation-test split. In other words, the intermediate results of each
of the components can have an intrinsic value. For this reason, unlike the cleaning pipeline, the
model-ready pipeline is implemented as a series of standalone Python scripts, and they can be
glued together via a provided parameterized Bash script. This pipeline, though, like the previous
one, is also executed on the HPC CPU cluster.
Like the cleaning pipeline, this second pipeline will be open-sourced in the near future, but in
the meantime, we provide its code as supplementary material, attached to the delivery of this
thesis.

5.5.1 Document-level statistics and sanity checks

The first step of this pipeline extracts some document-level statistics. Specifically, it computes:

• The total number of tokens107 .

• The total number of documents.

• The average and standard deviation of lines (sentences) per document.

• The average and standard deviation of tokens (words) per document.

• The number of very long documents (i.e., more than 100 sentences).

Thanks to these simple statistics, I realized (and fixed) that the aggregation of Catalan corpora had
been done incorrectly, since two sentence-level corpora had been concatenated to the rest without
first separating the sentences as independent documents, which made the standard deviations
of lines and tokens per document abnormally large. This would have had the harmful effect of
107
Note that here the text has not yet been tokenized, so the number of tokens is approximated with a simple
whitespace-based word count.

considering the unrelated sentences of the sentence-level corpora as being semantically related
(i.e., assuming that the second sentence should go after the first one, and so on).
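
As an illustration, the following is a minimal sketch (not the actual script of the pipeline) of how
these document-level statistics can be computed, assuming the language modeling format described
earlier (one sentence per line, documents separated by blank lines):

import statistics
import sys

def document_stats(path):
    docs = []  # one (n_sentences, n_tokens) pair per document
    sentences, tokens = 0, 0
    with open(path, encoding="utf-8") as corpus:
        for line in corpus:
            line = line.strip()
            if not line:  # blank line: document boundary
                if sentences:
                    docs.append((sentences, tokens))
                sentences, tokens = 0, 0
            else:
                sentences += 1
                tokens += len(line.split())  # whitespace-based approximation
    if sentences:  # last document, if the file does not end with a blank line
        docs.append((sentences, tokens))
    sent_counts = [s for s, _ in docs]
    tok_counts = [t for _, t in docs]
    print("documents:", len(docs))
    print("tokens:", sum(tok_counts))
    print("sentences/doc: %.1f (std %.1f)"
          % (statistics.mean(sent_counts), statistics.pstdev(sent_counts)))
    print("tokens/doc: %.1f (std %.1f)"
          % (statistics.mean(tok_counts), statistics.pstdev(tok_counts)))
    print("documents with more than 100 sentences:",
          sum(1 for s in sent_counts if s > 100))

if __name__ == "__main__":
    document_stats(sys.argv[1])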

5.5.2 Decontamination

As far as decontamination is concerned, in case we know for sure that target evaluation bench-
marks will not be present in the language modeling corpus, we do not apply any further step.
This is the case, for instance, if the evaluation benchmark was specifically generated for a shared task and was
not sourced from the web. Otherwise, we explicitly filter out these sentences. Many works
ignore this procedure and, in fact, it cannot necessarily be considered unfair
evaluation (except perhaps in certain question-answering datasets, in which the answer might
leak), since during the pre-training step the data does not have labels. Nevertheless, since the
generated corpora are big enough (and the conflicting sentences represent a negligible percentage
of them), we play it safe and filter them out, which at the very least makes the benchmark results
more reliable: the sentences will never have been seen by the model, so we can better
estimate its generalization capabilities.
The function that performs this decontamination does so heuristically, due to computational
constraints. It basically simplifies sentences by removing accents, punctuation signs, casing,
and spaces, and performs string-level comparisons. This might have the potential side-effect
of decreasing the coherence of the affected documents, but, again, the number of conflicting
sentences is negligible.
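
The following minimal sketch illustrates the kind of normalization used (the exact rules of our
implementation may differ slightly):

import string
import unicodedata

def normalize(sentence):
    # Strip accents/diacritics
    decomposed = unicodedata.normalize("NFKD", sentence)
    stripped = "".join(c for c in decomposed if not unicodedata.combining(c))
    # Remove punctuation and whitespace, and lowercase everything
    simplified = "".join(c for c in stripped
                         if c not in string.punctuation and not c.isspace())
    return simplified.lower()

def decontaminate(corpus_sentences, benchmark_sentences):
    # Filter out corpus sentences that also appear in an evaluation benchmark
    banned = {normalize(s) for s in benchmark_sentences}
    return [s for s in corpus_sentences if normalize(s) not in banned]

# The second sentence collides with the benchmark despite superficial differences
print(decontaminate(["Una frase qualsevol.", "Això és   UNA prova!"],
                    ["Això es una prova"]))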

5.5.3 Data splitting

Before training a model, as usual in machine learning, we generate the validation and test sets.
For doing so, we have a script that generates them in a reproducible way. Because we have kept
document boundaries whenever it was possible, and the generated corpora are considerably large,
this process is not as trivial as randomly shuffling lines. Besides, for each corpus, we observed the
document length statistics (in terms of sentences and tokens), and set length thresholds accordingly,
so that the hold-out subsets contain documents of reasonable length (avoiding an
abnormally large document, especially in the validation set, which is periodically evaluated).
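
A minimal sketch of a reproducible document-level split consistent with this description follows;
the hold-out sizes, the length threshold and the seed are placeholders, not necessarily the values
used in this work:

import random

def split_documents(documents, valid_docs=2000, test_docs=2000,
                    max_holdout_sentences=100, seed=42):
    # documents: a list of documents, each document being a list of sentences
    rng = random.Random(seed)  # fixed seed, for reproducibility
    indices = list(range(len(documents)))
    rng.shuffle(indices)
    train, valid, test = [], [], []
    for i in indices:
        document = documents[i]
        short_enough = len(document) <= max_holdout_sentences
        if short_enough and len(valid) < valid_docs:
            valid.append(document)  # hold out only reasonably short documents
        elif short_enough and len(test) < test_docs:
            test.append(document)
        else:
            train.append(document)
    return train, valid, test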

5.5.4 Tokenization

This pipeline implements subword tokenization. It is essentially a command-line parameterization
of Huggingface's Tokenizers library108 , a fast tokenization library with different implementations
of subword tokenization algorithms. It also has the option to use Google's original
sentencepiece implementation109 , since we ran into some bugs when trying to use the Huggingface
implementation of this tokenizer.
For this work, we opted for GPT-2's Byte-level BPE, the option used in the original
RoBERTa, with a total vocabulary size of 54k, a value we have observed to be common in
the literature; still, our program allows other options.
108
https://github.com/huggingface/tokenizers
109
https://github.com/google/sentencepiece

The user can specify a given number of placeholder tokens (in case one wants to reserve entries in
the vocabulary to introduce custom tokens when fine-tuning or performing domain adaptation),
although in this work we do not (apart from reserving the special tokens required by the model,
in the case of RoBERTa, <s>, <pad>, </s>, and <mask>).
Since we plan to use Fairseq (as described in Section 5.6), there is one aspect that must be taken
into account. In Fairseq, unlike in Huggingface Tokenizers and Sentencepiece, special tokens are
not explicit in the dictionary. Thus, we explicitly request in the call to the tokenizers library not
to include special tokens (except for the additional ones desired by the user). Otherwise, these
special tokens would be duplicated, which is a known issue in models such as CamemBERT110 .
Unlike the original RoBERTa and GPT-2, we prepend a whitespace to all sentences, to guarantee
that the first word is tokenized the same way as words in the middle or end of the sentence (in
Byte-level BPE, whitespaces are included in the tokenization). In inference, this whitespace is
automatically prepended by the tokenizer, so the user does not have to bother about it.
This component has two sub-components that can be run in isolation, namely, the training of
the tokenizer, and the application of the learned tokenization.
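
A minimal sketch of the training sub-component, as it could be written with Huggingface's Tokenizers
library, is shown below; the file paths are placeholders, and the exact options of our parameterized
script may differ:

from tokenizers import ByteLevelBPETokenizer

tokenizer = ByteLevelBPETokenizer()
tokenizer.train(
    files=["train.txt"],  # hypothetical path to the cleaned training split
    vocab_size=54000,     # vocabulary size used in this work
    min_frequency=2,
)
tokenizer.save_model("tokenizer_output")  # writes vocab.json and merges.txt

# The leading whitespace guarantees that the first word of the sentence is
# tokenized like any other word (Byte-level BPE encodes whitespaces)
print(tokenizer.encode(" Una frase d'exemple en català.").tokens)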

5.5.5 Dictionary building and binarization

In the dictionary building phase, each token in the vocabulary is assigned an integer. By
binarization, we mean an off-line preprocessing step in which plain text
files are converted into binary, non-human-readable files by replacing each of the tokens with
its corresponding index in the dictionary, which makes subsequent loading more efficient.
Many libraries do not implement binarization. In our case, for binarizing
purposes, we use Fairseq's [117] preprocessing utilities.
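
Conceptually (in practice we rely on Fairseq's preprocessing utilities rather than on code like the
following sketch), dictionary building and binarization can be illustrated as:

import numpy as np

def build_dictionary(tokenized_lines):
    # Assign an integer to every token in the vocabulary
    dictionary = {}
    for line in tokenized_lines:
        for token in line.split():
            dictionary.setdefault(token, len(dictionary))
    return dictionary

def binarize(tokenized_lines, dictionary, path):
    # Replace every token by its index and store the result as a binary file
    ids = [dictionary[token]
           for line in tokenized_lines for token in line.split()]
    np.asarray(ids, dtype=np.int32).tofile(path)

lines = ["Ġaixò Ġés Ġun Ġexemple", "Ġun Ġaltre Ġexemple"]
vocabulary = build_dictionary(lines)
binarize(lines, vocabulary, "corpus.bin")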

5.5.6 Final sanity checks

As can be deduced from the description of the previous component, and as we will see in
the next section, we will be using Fairseq for training the models. Once the model-ready pipeline
has been executed, it automatically launches a dummy training procedure with Fairseq, to check
that the data can be correctly loaded by the model. This component can be easily extended to
do the same with any training library of choice.
Once these sanity checks have passed, the actual training procedure can be launched.

5.6 Training
In this section, we describe how we propose to train language models, taking into account our
data and hardware. Some of the corpora generated with the tools proposed in this work have been
used for training the unsupervised neural machine translation systems in the MT4All project.
For this reason, we describe this corpora generation process targeting unsupervised machine
translation, showing that the data pipeline is generic and versatile enough to be used for different
languages, domains, and tasks (e.g., machine translation). Nevertheless, even if I did some
110
See https://github.com/pytorch/fairseq/issues/1309.

collaborations, I was not in charge of training these models, and thus the building process of
these translation models themselves remains out of the scope of this thesis. Instead, we focus on
the training of the language models (and not the machine translation models). However, many
of the procedures described in this section would still hold for large enough Transformer machine
translation systems.

5.6.1 Considerations on parallelization

The training procedure will have to be distributed among different devices, given the model and
data sizes. Let us start by clarifying the difference between two kinds of deep learning training
parallelizations:

• Data parallelism: Batches are distributed among devices. Note that, in this case, each
device (typically, a GPU) has to instantiate a whole copy of the model. Each device
computes its own passes on the samples it has received. Notice that by data parallelization
we do not mean the parallelization of just the data loaders. Since we are training (and
not running inference), parameters must be updated. For doing so, after the devices have
computed the forward and backward passes, the gradients are aggregated and the parameters
updated and synchronized across devices.

• Model parallelism: The model itself is distributed among devices. The parameters are not
synchronized across nodes since they are not shared, unless data parallelism is also applied
(such that a given copy of a given part of the model is present in more than one device).

In this case, we will use the former, as we will see.
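
For illustration, the following is a minimal, single-machine sketch of data parallelism with PyTorch's
DistributedDataParallel (Fairseq, which we use, handles this internally); the model, data and
hyperparameters are toy placeholders:

import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, DistributedSampler, TensorDataset

def worker(rank, world_size):
    os.environ["MASTER_ADDR"] = "127.0.0.1"
    os.environ["MASTER_PORT"] = "29500"
    dist.init_process_group("gloo", rank=rank, world_size=world_size)

    # Every process holds a full copy of the model (data parallelism)
    model = DDP(torch.nn.Linear(10, 2))
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

    # The distributed sampler gives each process a disjoint shard of the data
    dataset = TensorDataset(torch.randn(64, 10), torch.randint(0, 2, (64,)))
    loader = DataLoader(dataset, batch_size=8,
                        sampler=DistributedSampler(dataset, world_size, rank))

    for inputs, targets in loader:
        optimizer.zero_grad()
        loss = torch.nn.functional.cross_entropy(model(inputs), targets)
        loss.backward()   # gradients are averaged across processes here
        optimizer.step()  # every copy applies the same, synchronized update
    dist.destroy_process_group()

if __name__ == "__main__":
    mp.spawn(worker, args=(2,), nprocs=2)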

5.6.2 Considerations on the training environment

After having extensively reviewed the literature, we did not observe any work having used a
computing device (either Google TPUs or NVIDIA V100 GPUs) with less than 32GB of memory,
at least for pre-training a language model from scratch. The reason why this happens is two-
fold. First of all, and most importantly, the batch size is a crucial hyperparameter in deep
learning, and especially in unsupervised settings. In the case of language models, all models we
saw in Section 2 require large batch sizes for learning properly. In GPUs with 16GB of RAM, since the
models themselves have a large number of parameters, once they have been allocated, little space
remains for the data itself. The second reason is that the larger the batch size, the faster the
computation, which is especially important when training large models.
Still, we will see that 16GB GPUs can effectively be used for pre-training language models at
scale. We can consider several possibilities:

• Model parallelism: We can split a single, large model across different devices. PyTorch
[118], for instance, supports model parallelism, and OpenAI trained the gigantic GPT-3
with this setting. Nevertheless, model parallelism has more synchronization overhead than
data parallelism, and it is more suited for extremely large models that do not even fit in
large GPUs or TPUs.

59
• Model optimizations: As we saw, models such as ALBERT use certain optimizations to
be more memory-efficient. Nevertheless, especially for new domains and languages, we
may want to run other, less memory-efficient models, either as a baseline111 or because we need a
specific model with more demanding requirements.

• Training optimizations: Several optimizations can be applied, such as using a floating point
precision of 16 bits (instead of the usual 32 bits).

Notice that data parallelism (i.e., distributing the batches across compute devices) does not
necessarily solve the problem of training large models by itself. First of all, the model itself must
fit in each of the devices. Second, the remaining memory in each device must be enough to hold
a large enough batch and make training on GPUs efficient (otherwise it would
not compensate for the overhead of sending data from the CPU to the GPU).
In our case, apart from using a precision of 16 bits, which will improve the efficiency, we will use
gradient accumulation, which will improve the performance.

5.6.3 Effective batch size and gradient accumulation

A very important (yet, less widely known than others) concept in deep learning is the effective
batch size, which is defined as:

ebs = bspcd × cd × uf
where ebs is the effective batch size, bspcd is the batch size per compute device (e.g., batch size
per GPU), cd is the number of compute devices (e.g., GPUs), and uf is the update frequency.
The concept of effective batch size is central to our work.
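For instance, with the configuration we will use for RoBERTca (Section 6.2.1), a batch size of 8
sentences per GPU, 8 GPUs, and an update frequency of 32 yield an effective batch size of
ebs = 8 × 8 × 32 = 2048 sentences.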
The hyperparameter that affects the performance of a model (its results in terms of the desired
metric, not the efficiency) is the effective batch size, not the batch size per compute device.
The batch size per compute device only affects the speed of the training procedure, provided the
effective batch size is left constant. In other words, if we want to reproduce a specific experiment,
we have to replicate their effective batch size; if we want to make two different experiments
directly comparable, effective batch size should remain constant. This consideration is often
overlooked in the deep learning community. Since most books assume single-GPU settings, these
two concepts are usually conflated, but they must not be confused.
This equation means that we can keep the effective batch size constant while decreasing the batch
size per device, provided we proportionally increase the number of compute devices and the
update frequency. By update frequency we mean the frequency with which we apply updates to
the model parameters (not backpropagation; backpropagation is still computed in every batch,
otherwise this trick would be useless). The usual setting in deep learning is setting this hyper-
parameter to 1, which means that every time a batch of instances is passed through the model,
the forward and backward passes are computed, and then the computed gradients are used to
update the model parameters depending on the algorithm of the optimizer of choice. If we set
111
As we will see, we believe the reasonable thing to do in a language or domain without reference models is to
first run a standard, widely-used model, such as BERT or RoBERTa, and then (not before) experiment with more
exotic variants.

the update frequency to N > 1, we will wait until the model has seen (i.e., computed the
forward and backward passes for) N batches before updating the parameters. This has the effect of
simulating a larger batch size (i.e., increasing the effective batch size), and is known as gradient
accumulation. For illustration, let us see a simple implementation of gradient accumulation in
PyTorch112 :

model.zero_grad()                                   # Reset gradient tensors

for idx, (inputs, targets) in enumerate(train_dataloader):
    predictions = model(inputs)                     # Forward pass
    loss = loss_function(predictions, targets)      # Compute loss function
    loss = loss / update_frequency                  # Normalize loss (if averaged)
    loss.backward()                                 # Backward pass (gradients accumulate)
    if (idx + 1) % update_frequency == 0:           # Wait for N backward steps
        optimizer.step()                            # Optimizer step
        model.zero_grad()                           # Reset gradient tensors
        if (idx + 1) % evaluation_steps == 0:       # Evaluate the model when we
            evaluate_model()                        # have no gradients accumulated

Notice that in PyTorch, when calling backward(), the computational graph is freed unless stated
otherwise, but the gradients computed for the leaf tensors (the parameters) are accumulated in
their .grad attributes rather than overwritten. This has the effect of summing the gradients of the batches within each
accumulation, making the updates based on more samples (and, thus, more representative).

5.6.4 Deep learning backend and distributed training

For training, we considered the following libraries:

• The original Google’s BERT repository113 , which has been receiving several updates ever
since its release: It has been leveraged by language-specific models such as the Finnish
BERT.

• NVIDIA’s Nemo toolkit114 : This library, based on Pytorch Lightning [119] and different
NVIDIA modules, implements utilities and examples for training both speech and NLP
models at scale.

• Huggingface Transformers115 : This well-known library in the NLP community implements


many different Transformer-based language architectures. In addition, it provides pre-
trained models (both built-in and third-party, via a model hub) and evaluation scripts for
common benchmarks.

• Fairseq [120], FAIR’s116 sequence modeling toolkit, based on PyTorch: Originally focused
on machine translation, now supports numerous sequence modeling tasks. It was the
112
Adapted from https://gist.github.com/thomwolf/ac7a7da6b1888c2eeac8ac8b9b05d3d3
113
https://github.com/google-research/bert
114
See https://developer.nvidia.com/nvidia-nemo, https://github.com/NVIDIA/NeMo, and https://github.com/NVIDIA/DeepLearningExamples/tree/master/PyTorch/LanguageModeling/BERT
115
https://github.com/huggingface/transformers
116
Facebook Artificial Intelligence Research

library originally used for training RoBERTa, XLM, and BART, and has been leveraged
by language-specific models such as CamemBERT.

We discarded Google’s repository for technical reasons. It does not support training with GPUs
with less than 32GB of memory, while the available GPUs in our environment are NVIDIA V100
with 16GB of memory. More precisely, it might support it, but its authors advise against using it in
this setting, since the batch size would have to be too small to train properly. In addition, this
repository only supports BERT (and not other models) and, being based on Tensorflow 1, it is
not especially scalable.
Regarding NVIDIA’s NEMO, being developed by this company, one can expect state-of-the-
art performance in terms of GPU usage and distributed training. It is based on PyTorch.
Nevertheless, its community (e.g., Github issues) is less active than those of the other libraries we have
investigated.
As far as Huggingface Transformers is concerned, it originally was not intended for pre-training
models from scratch117 . Nevertheless, it has more recently been extended to do so. Originally
based on PyTorch, it now supports both Tensorflow and PyTorch training. Some of the
strengths of this library are its extensive documentation and the implementation of many differ-
ent models.
Table 13 shows how these libraries compare, from our point of view (thus, subjective). We
decided to use Fairseq since we believed it to be a compromise between available features and
maturity in terms of having been used for pre-training language models at scale (i.e., it was
the library used by FAIR to train the original RoBERTa, XLM and BART models, and both
BERT and GPT-2 have been successfully reproduced with it). In addition, both Fairseq and
Huggingface provide respective utilities for the interoperability between these two libraries118 ,
so in case we wanted to deploy a pre-trained Fairseq model to Huggingface's hub or use
Huggingface's evaluation scripts with it, we could easily do so. Crucially in our case, Fairseq implements
gradient accumulation (even if, e.g., Huggingface also has this feature) and 16-bit precision.

Library Model diversity Training optimizations Usability/Extendability


Google BERT * * *
NVIDIA NEMO ** *** *
Huggingface Transformers *** ** ***
Fairseq ** *** ***

Table 13: Comparison between libraries for training language models.

Regarding Fairseq’s 16-bit precision (FP16) implementation119 , this library implements NVIDIA’s
recommendation of using mixed precision (using FP16 to compute the forward/backward passes
and the losses, but updating the model parameters in FP32). The reason why mixed precision is
recommended is that pure FP16 might slightly degrade the quality of the model. It also implements
117
https://twitter.com/Thom_Wolf/status/1122466524860702729
118
E.g., see https://github.com/huggingface/transformers/blob/master/src/transformers/models/roberta/convert_roberta_original_pytorch_checkpoint_to_pytorch.py and https://github.com/pytorch/fairseq/blob/master/fairseq/models/huggingface/hf_gpt2.py
119
See https://github.com/pytorch/fairseq/issues/1047

vanilla FP16, which is more memory-efficient. We will be using the former, since it still allows
for a large enough batch size in our case and obtains better performance.
Apart from that, unlike most libraries, Fairseq implements binarization for further efficiency, as
described in Section 5.5.5.
Regarding the distributed training, we will use multi-node distributed training with data par-
allelism. This means that there will be three levels of parallelism, specifically, between nodes,
between GPUs within a single node, and within the GPU itself. Fairseq provides a distributed
mode based on the c10d package, a distributed backend that comes built into PyTorch. Table
14 shows the results of one of the benchmarks we ran for checking the performance of Fairseq in
our environment, showing that executions scale properly with the number of GPUs.

Nodes GPUs/node GPUs Time (min) Epochs Epoch/min Epoch/min/GPUs


1 2 2 126.23 4 0.03168 0.01580
1 4 4 64.02 4 0.06248 0.01562
4 4 16 24.16 4 0.1657 0.010347

Table 14: Fairseq RoBERTa baseline on the CTE-POWER cluster: We train RoBERTa-base
for 4 epochs, with the relatively small dataset (100M tokens once pre-tokenized) of Wikitext103
[47], and measure execution time. The most efficient runs are the ones with a single node,
because the overhead of synchronization within a single node is smaller than that of inter-node
synchronization. Nevertheless, the results show that executions scale properly with the
number of GPUs.

5.6.5 Launcher

For automating the launch of the training jobs, we build an automated launcher that executes
the main Fairseq training script in the master node, and then connects to each of the requested
nodes and attaches them to the same training run.

5.6.6 Architecture

We leave the training of other architectures as future work. Having prepared the corpora in a
roughly universal language modeling format and shown effective training in our environment,
training with other architectures will involve only changes in the parameters of the training script.
In this work, we will use RoBERTa, meaning that the model will be, architecture-wise, like the
original BERT (see Section 2).

5.6.7 Hyperparameters and training objective

Since we cannot afford to conduct hyperparameter searches, we suggest basing our hyperparameters on those of
the original model. We do modify the batch size and update frequency for the reasons mentioned
in Section 5.6.3. In Section 6, when we detail the specific applications and results, we will give
more details about this matter. In addition, we will see how the training process evolved. Note
that, being a RoBERTa model, we use dynamic masking (instead of static masking, as in BERT), and we do not
use the next sentence prediction task.

5.7 Evaluation
Unfortunately, we cannot afford to extrinsically evaluate the cleaning pipeline by training from
scratch the same model with different cleaning configurations, due to computational constraints.
However, we will try to do a qualitative analysis of a representative subset of the discarded,
transformed, or allowed sentences, in Section 7. For evaluating the models, we convert the
Fairseq models to Huggingface, and implement with this library standard benchmarks for lan-
guage models, consisting of evaluating the transfer learning capabilities in downstream tasks.
Unfortunately, precisely because we are targeting languages or domains that do not have as
many resources as general-domain English, apart from the lack of training data, there is usually
a lack of established benchmarks. The development of new benchmarks for Catalan is an ongoing
research line that goes beyond this master thesis. Nevertheless, we can find some benchmarks,
such as a Catalan Named Entity Recognition and Classification (NERC) task. For evaluating, we fine-tune
the models with the train set of the downstream task and then evaluate the fine-tuned models
in the test sets. We apply the exact same settings for all the compared models. Again, in the
case of Catalan, the most obvious baseline is Google’s Multilingual BERT.

6 Applications and results

In this section, we describe the application of the proposed methodologies (described in Section
5), in terms of corpora and model generation.

6.1 Corpora generation


By applying our cleaning pipeline to a heterogeneous set of raw textual data, we will try to
validate the cleaning methodology under different scenarios.

6.1.1 Catalan

In the case of Catalan, we are both generating three new datasets and cleaning existing ones
with the cleaning pipeline introduced in this work. The three new datasets originate from:

• General crawling: A new crawling targeting the 440 top .cat domains, the top 30 .ad
(Andorra) domains, and the top 5 .barcelona domains, with a depth of 5. The criterion
for selecting these domains was to take all the websites with these TLDs120 within the
worldwide top 1M (in visitors) websites121 . The crawling was executed during March-April
2020.

• Catalan Government Crawling: A crawling consisting of websites of the Catalan Government,
which typically ban crawlers but explicitly allow BSC. The crawling was
conducted during September 2020.

• Catalan News Agency Crawling: A crawling consisting of the Catalan News Agency122
news, which BSC is explicitly allowed to crawl (the website typically blocks crawlers).

Interestingly, these new corpora are so recent that the term ”coronavirus” appears with enough
frequency to have its own token in the vocabulary when building the model, in Section 6.2.
Open Subtitles, caWac, Wikipedia, and the Catalan section of the DOGC corpus are preprocessed
with the new cleaning pipeline. All cleaned corpora are aggregated (and deduplicated again) in
a new corpus that, to the best of our knowledge, is the biggest Catalan corpus ever generated
for training language models, with as many as 1.7B tokens, being deduplicated and mostly
document-level. Table 17 shows statistics of the different corpora before and after cleaning. The
resulting corpus has:

• 9,892,770 documents.

• 1,758,388,896 tokens.

• 7.4 sentences per document (standard deviation of 24.2).

• 177.7 tokens per document (standard deviation of 589.7).


120
Top Level Domain: https://en.wikipedia.org/wiki/Top-level_domain
121
https://domaintyper.com/top-websites/most-popular-websites-with-cat-domain
122
https://www.acn.cat/

• 1074.5 characters per document (standard deviation of 3558.4).

• A minimum of 1 sentence per document (coming from the non-document-level corpora).

• A maximum of 23,217 sentences per document (coming from an amateur novel found in
the crawling).

• 58,234 documents with more than 100 sentences, 445 documents with more than 445 sen-
tences, and 1 document with more than 10,000 sentences. Very long documents were man-
ually inspected to verify the correctness of the concatenation and cleaning (e.g., sentence-
level corpora must be concatenated with an empty line between sentences).

For qualitatively evaluating the cleaning, we provide a random sample with the text before and
after cleaning123 . The sample consists of the original sampled sentences, the result after the
cleaning, and the recorded operations of the cleaning pipeline. In Section 7.1, we will refer to
this sample, when analyzing the results.

6.1.2 General-domain Spanish

In sections 3.3 and 4.1, we mentioned the collection and preprocessing of more than 3.3B tokens
for BETO, a Spanish BERT developed at the University of Chile. After inspecting the corpus,
we can confirm that the sentence splitting process presents some artifacts (e.g., around acronyms).
However, at this point, we believe processing it again with our cleaning pipeline is not a priority,
to save computation resources.
In the case of the BNE corpus, there was no alternative, since the dataset is new and had never
been preprocessed. Figure 15 shows the SLURM jobs of execution of the cleaning pipeline on
the raw data from the BNE corpus, on MareNostrum 4.
The dataset was divided in three parts, and the mapping part of the cleaning pipeline (i.e., all
cleaning steps before document deduplication) is applied separately to each of them via three
(the fourth job in Figure 15 corresponds to the biomedical crawling dataset) SLURM jobs of 50
nodes (3 × 50 × 48 = 7200 CPUs).
In 48 hours and with these 150 nodes124 , the cleaning pipeline processed 58.8% of the JSONs
of the dataset (213001/408692, 22001/75275, and 45001/132736, for each of the three parts,
respectively). When aggregating the three parts, we obtain the following statistics:

• File size (plain text, but including metadata): 1.5 TB + 141 GB + 950 GB = 2.591 TB

• Tokens: 62,168,462,497 + 8,893,804,670 + 57,223,860,084 = 128,286,127,251

• Sentences: 3,027,392,167 + 492,746,364 + 2,950,186,720 = 6,470,325,251

• Documents: 168,761,312 + 31,609,562 + 174,041,060 = 374,411,934

123
Available at https://docs.google.com/spreadsheets/d/10F792DSr2yg3OI7Tyj5c3p55WjkjVAKARaNDDgubYYs/edit?usp=sharing ("debug" tab)
124
Note that the time limit for jobs on MareNostrum 4 is precisely 48 hours, and for a single job one cannot
request more than 50 nodes.

Figure 15: Cleaning pipeline execution on BNE: The dataset was divided in three parts, and the
mapping part of the cleaning pipeline (i.e., all cleaning steps before document deduplication)
is applied separately to each of them via three (the fourth job in the figure corresponds to the
biomedical crawling dataset) SLURM jobs of 50 nodes (3 × 50 × 48 = 7200 CPUs).

Corpus Plain text size Tokens


BETO aggregation 18GB 3B
Spanish Oscar 150GB 25.6B
60% BNE (not deduplicated) 2.5TB 128.3B

Table 15: Comparison between BNE and other big Spanish corpora: The 60% of BNE (the
one corresponding to our cleaning execution), once cleaned, is clearly bigger than the existing
Spanish corpora (although it has not yet been deduplicated). However, coming from so many
different websites, it is reasonable to assume that deduplication will not shrink the corpus by more than
one order of magnitude. Note that the plain text size includes some metadata, in the case of
BNE.

As a comparison, the Spanish section of the OSCAR corpus consists of 25.6B tokens (while

the partial results of BNE consist of 128.3B tokens), with around 150GB (of plain text). The
collection of the preprocessed corpora of the BETO work resulted in a corpus of around 3B tokens
(18GB of plain text). Furthermore, the dataset is document-level, allowing for the modeling of
long-range dependencies. The caveat, though, is that it has not yet been deduplicated, so the final
size will be actually smaller and it cannot be directly compared (since OSCAR is deduplicated).
Table 15 shows a simple comparison between the three aforementioned corpora. We provide
the state before and after the cleaning of a random sample125 . We will make references to this
sample in Section 7.1.

6.1.3 Biomedical Spanish

After processing the data sources mentioned in Section 4.1, the resulting total corpus consists of
almost 1B tokens. We observe that the data coming from the new biomedical crawling dominates
the dataset (almost 750M tokens out of 972M, once deduplicated). While some of the other
used data sources may be of higher quality or relevance (e.g., scientific articles), the biomedical
crawling is essential to give the aggregated corpus enough scale and diversity for training language
models.
All cleaned corpora are aggregated (and deduplicated again) in a new corpus that, to the best
of our knowledge, is the biggest biomedical Spanish corpus ever generated for training language
models, with almost 1B tokens, being deduplicated and mostly document-level. Table 18
shows statistics of the different corpora before and after cleaning. The resulting corpus has:

• 2,154,539 documents.

• 967,848,477 tokens.

• 43,583,833 sentences.

• 20.2 sentences per document (standard deviation of 3507.1).

• 449.2 tokens per document (standard deviation of 3507.1).

• 2839.1 characters per document (standard deviation of 23157.1).

• A minimum of 1 sentence per document (coming from the non-document-level corpora).

• A maximum of 130,571 sentences per document (coming from an amateur novel found in
the crawling).

• 59,767 documents with more than 100 sentences, 849 documents with more than 445 sen-
tences, and 70 document with more than 10,000 sentences. Very long documents were man-
ually inspected to verify the correctness of the concatenation and cleaning (e.g., sentence-
level corpora must be concatenated with an empty line between sentences).

In general, we observe more diversity in terms of document length than in the case of Catalan.
This is the case because some of the aggregated corpora comprise (anonymized) clinical stories,
125
Available at https://drive.google.com/drive/folders/1t3DZXPF0F6FEM3bJjgaKIbesh0W1ojWd?usp=sharing.
The sample bne.zip contains the raw data sample, while the output.txt file contains the final, clean output.

or articles. At the time of writing this thesis, this corpus is being used to train a domain-specific
RoBERTa, which goes beyond the scope of this project. For qualitatively evaluating the cleaning, which is
more sensitive because of this challenging domain126 , we provide results on a sample of more
than 400 sentences per biomedical corpus127 . We will make references to this sample in Section
7. The sample consists of the original sampled sentences, the result after the cleaning, and the
recorded operations of the cleaning pipeline.

6.1.4 MT4All

Table 19 shows statistics of the different corpora before and after cleaning. Note that this table
omits the case of Catalan and biomedical Spanish, which were already described in detail. The
generated corpora are document-level, but the project in which they are applied, MT4All, consists
of sentence-level machine translation. Thus, before feeding the text into the model, empty lines
(the document boundaries) are removed, and a sentence-level deduplication is applied. This
further preprocessing, along with tokenization, is applied by the model preprocessing itself (not
in the cleaning phase). Recall that the cleaning process is designed to be as model-agnostic
as possible, so these corpora can be used in the future for document-level machine translation
models, or models with other tokenization strategies.
These corpora, including the Catalan corpus (but not yet the biomedical Spanish one, which will
be used in a more advanced phase of the MT4All project) are used in the MT4All project for
building machine translation systems without supervision. In the case of Finnish, Latvian, and
Norwegian, the domain-specific corpora are used together with the respective general-domain
OSCAR corpora. Specifically, this is done using the methodology proposed in [107, 121, 122],
described in Section 2.3, using Monoses, Fairseq, Vecmap, and Moses128 . Table 16 shows the
preliminary results. The BLEU scores are still relatively low, especially in the case of the pairs
involving Basque and Finnish. It has to be noted that these results are obtained without any
supervision whatsoever. In Section 7.1 we will further discuss these results.

BLEU BLEU
English→Basque 5.1 Basque→English 12.1
English→Catalan 25.3 Catalan→English 25.6
English→Latvian 17.5 Latvian→English 15.1
English→Finnish 9.1 Finnish→English 8.8
English→Norwegian 25.9 Norwegian→English 23.4

Table 16: Preliminary results of the MT4All project: The BLEU scores are still relatively low,
especially in the case of the pairs involving Basque and Finnish. In Section 7.1 we will discuss
these results.

126
E.g., a language identifier will assign lower probability to sentences with many domain-specific terms.
127
Available at https://docs.google.com/spreadsheets/d/1iuzZiIrca_B2XapD1pML4cW23Ewlu83lldm9W69dtmc/edit?usp=sharing
128
https://github.com/artetxem/monoses

Corpus | Original size (GB/tokens) | Final size (GB/tokens) | Final # sentences | Nodes/CPUs | Time | Cleaning
Catalan Open Subtitles | 0.02GB/3.52M tok | 0.02GB/3.52M tok | 0.61M | - | - | None
Catalan Oscar | 8.30GB/1,358M tok | 4.00GB/695.37M tok | 31.39M | 1/48 | 14h | SP, DEDUP
Catalan Web Corpus (cawac) | 3.50GB/780M tok | 3.60GB/650.98M tok | 27.35M | 3/144 | 2h | LANG, Q, SP, DEDUP
Catalan Wikipedia | 1.20GB/198.36M tok | 0.98GB/167.47M tok | 6.86M | 3/144 | 1h | LANG, Q, SP, DEDUP
Crawling General | 312.00GB/- | 2.60GB/434.82M tok | 19.45M | 3/144 | 4h | LANG, Q, SP, DEDUP
Generalitat crawling | 75.00GB/- | 0.25GB/39.12M tok | 1.57M | 1/48 | 4h | LANG, Q, SP, DEDUP
Catalan News Agency | 0.49GB/81.28M tok | 0.45GB/75.61M tok | 3.06M | 1/48 | 1h | SP, DEDUP
Parallel DOGC | 0.80GB/126.65M tok | 0.80GB/126.65M tok | 10.92M | - | - | None
TOTAL | 401.31GB/2,547.81M tok | 12.69GB/2,193.4M tok | 101.21M | 12/576 | 26h |
All concat and dedup | 12.69GB/2,193.4M tok | 11.00GB/1,758.3M tok | 83.06M | 1/48 | 1h | Q, DEDUP

Table 17: Cleaned (or generated from scratch) Catalan corpora, where SP means sentence splitting; DEDUP means deduplication; Q
means Quality filter (including all the cleaning heuristics in the cleaning pipeline); and LANG means language identification filter. In
the case of the Catalan Open Subtitles and Parallel DOGC, the cleaning was applied after the concatenation (for this reason in the
table it says Cleaning: None). In the case of the new crawlings, the original state consisted of WARC files. This explains the large
size (and impossibility to count tokens without first removing the HTML).
Corpus | Original size (GB/tokens) | Nodes/CPUs | Time | Final size (GB/tokens) | # sentences | Cleaning
CardioCC | 0.001GB/149,904 | 1/48 | 1h | 0.001GB/0.15M | 9,970 | SP, 1W S, DEDUP
RadioCC | 0.001GB/177,366 | 1/48 | 1h | 0.001GB/0.17M | 9,948 | SP, 1W S, DEDUP
Libros Casos Clinicos | 0.007GB/1,137,555 | 1/48 | 1h | 0.007GB/1,024,797 | 68,833 | SP, 1W S, DEDUP
Covid-CC | 0.001GB/82,201 | 1/48 | 1h | 0.001GB/82,091 | 3,896 | SP, 1W S, DEDUP
EMEA | 0.087GB/13,797,362 | 1/48 | 1h | 0.034GB/5,377,448 | 284,575 | SP, 1W S, DEDUP
Patents | 0.087GB/14,022,520 | 1/48 | 1h | 0.084GB/13,463,387 | 253,924 | SP, 1W S, DEDUP
Wikipedia Life Sciences | 0.126GB/18,771,176 | 1/48 | 1h | 0.088GB/13,890,501 | 832,027 | SP, 1W S, DEDUP
BARR2-Background | 0.188GB/28,868,022 | 1/48 | 1h | 0.159GB/24,516,442 | 1,029,600 | SP, 1W S, DEDUP
PUBMED | 0.31GB/1,957,479 | 1/48 | 1h | 0.013GB/1,858,966 | 103,674 | SP, 1W S, DEDUP
REEC | 0.048GB/4,581,755 | 1/48 | 1h | 0.028GB/4,283,453 | 220,726 | SP, 1W S, DEDUP
SciELO | 0.401GB/61,838,972 | 1/48 | 1h | 0.38GB/60,007,289 | 2,668,231 | SP, 1W S, DEDUP
Mespen Medline | 1.2GB/6,864,901 | 1/48 | 1h | -/4,166,077 | 322,619 | SP, 1W S, DEDUP
PDFs general | 15GB/109,124,996 | 1/48 | 1h | 0.631GB/97,146,139 | 5,252,481 | SP, 1W S, DEDUP
medical crawler | 626GB/- | 50/2400 | 48h | 4.5GB/746,368,185 | 32,766,976 | LANG, Q, SP, 4W S, DEDUP
TOTAL | 643.457GB/261,373,209 | 63/3024 | 61h | 5.927GB/972,184,775.32 | 43,827,480 |
All concat and dedup | 643.46GB/972,184,775.32 | 1/48 | 1h | 5.9GB/967,847,439 | 43,583,833 | CONCAT, DEDUP

Table 18: Generated or preprocessed biomedical Spanish corpora: {N}-W S means minimum # words per sentence.
Corpus | Original size | Final size | Final # sentences | Nodes/CPUs | Time | Cleaning
Basque Crawling | 2.1GB | 0.794GB/106.857M tok | 7.448M | 1/48 | 1h | LANG, Q, SP, DEDUP
English Finance Crawling | 636GB | 0.357GB/67.732M tok | 3.840M | 1/48 | 7h | LANG, Q, SP, DEDUP
Finnish Finance Crawling | 20GB | 0.348GB/44.613M tok | 3.669M | 1/48 | 4h | LANG, Q, SP, DEDUP
Latvian Finance Crawling | 1.8GB | 0.068GB/8.827M tok | 0.502M | 1/48 | <0.1h | LANG, Q, SP, DEDUP
Norwegian Finance Crawling | 75GB | 0.510GB/93.523M tok | 5.618M | 1/48 | 2h | LANG, Q, SP, DEDUP

Table 19: Cleaned (or generated from scratch) corpora for MT4All. See Table 17 for the meaning of the used abbreviations. The
original size is not reported in terms of tokens due to being WARC files.
6.2 Model generation
The generated corpora have a great potential to feed new language models. In this work, we
describe the use case of the Catalan corpus for building a language-specific RoBERTa.

6.2.1 RoBERTca: the Catalan RoBERTa

Training We omit the test runs of the training procedure. We follow the methodology de-
scribed in Section 5. Regarding the learning rate, we train the model for 115k updates, 10k
of them being warm-up updates with a peak learning rate of 0.0005. This means that, due to
technical issues related to the training of large architectures, especially Transformers [123], the
initial learning rate is set to a tiny value; then, it is gradually increased during the warm-up
phase. This increases the stability and final performance of the learning process. Then, the
learning rate is slowly decreased following a polynomial decay scheduler. Figure 16 shows the
evolution of the learning rate in the course of training. Figure 17 shows the scale of the loss
during training. Fairseq automatically scales the loss in case overflows are detected, which is not
uncommon when using FP16. The scale is more unstable at the beginning, during the warm-up
steps, when gradients have more orders of magnitude. Figure 18 shows the gradient norm during
training.
We use a maximum of 512 positions, which means that the maximum context window observed
in training will consist of 512 tokens. In inference, a sliding window will be used to simulate
larger contexts. We train the model with 8 NVIDIA V100 GPUs of 16GB for 192 hours. We use
a device batch size of 8 sentences per GPU, with approximately 512 tokens per sample. Tokens
coming from different documents are never merged together. If the document is longer than 512
tokens, it is divided into different batches accordingly. Together with an update frequency of 32,
the effective batch size consists of 2048 sentences. Figure 19 shows the words per batch (WPB), which
hover around 400k tokens. The number is not constant since documents are never merged and
sentences are never cut, and sentences have different numbers of tokens.
We regularize with a dropout [124] of 0.1 (including attention weights), and a weight decay of
0.01. We use the Adam [125] optimizer with β1 = 0.9, β2 = 0.98, ε = 1e−6 . We make use of a
floating-point precision of 16 bits in the setting recommended by NVIDIA (as detailed in Section
5).
The best checkpoint is selected according to the perplexity in the validation set. We keep all
the checkpoints for potential further studies in the future. Figure 20 shows the evolution of the
perplexity during training. Figures 21 and 22 show the best validation loss recorded during the
training procedure and the training and validation loss progress, respectively.

Figure 16: RoBERTca learning rate: During the warmup steps, the learning rate is progressively
increased from a tiny value to the peak value (0.0005). Then, it decreases with a polynomial
decay.

Figure 17: RoBERTca loss scale: The loss scale is automatically adjusted when overflows are
detected, which is not especially uncommon when using FP16. These overflows are more common
when gradients are larger and the scale has not been yet adjusted, at the beginning of training.
Then, it is more stable.

Figure 18: The gradient norm is considerably bigger at the beginning, during the warmup phase.
Then, the gradient norm is more stable.

Figure 19: RoBERTca Words Per Batch (WPB): The number is not constant since documents
are never merged and sentences are never cut, and sentences have different numbers of tokens.

Figure 20: RoBERTca perplexity: Perplexity consistently decreases in both train and validation
sets. It is used for selecting the best checkpoint. The plot is difficult to visualize due to the large
changes in scale of the metric, during training.

Figure 21: RoBERTca best validation loss: The best cross-entropy obtained in the validation
set kept decreasing. When the training finishes, it seems to have converged.

Figure 22: RoBERTca loss curves: The model seems to have converged.

Evaluation As intrinsic evaluation, we conduct the following analyses:

• Vocabulary and tokenization: Table 20 shows the number of tokens per sentence in RoBERTca’s
test set, as per the tokenizers of the different evaluated models.

Model Tokens per sentence


RoBERTca 33.93
mBERT 41.26
Wikibert ca 38.44

Table 20: Tokens per sentence

This comes as no surprise, since RoBERTca (like Wikibert) has a vocabulary specifically
built for Catalan (and with more vocabulary entries and having seen more diverse
data than Wikibert). Many Catalan words have their own token in RoBERTca's vocabulary
(instead of being built from subwords), including ”coronavirus” (the weird characters
in the case of RoBERTca are normal artifacts caused when trying to visualize individual
BBPE tokens, which are based on bytes):

sentence: "L'epidèmia de coronavirus va causar una gran crisi econòmica."

roberca: ['ĠL', "'", 'epidÃ¨mia', 'Ġde', 'Ġcoronavirus', 'Ġva', 'Ġcausar',
'Ġuna', 'Ġgran', 'Ġcrisi', 'ĠeconÃ²mica', '.']

mbert: ['L', "'", 'epi', '##d', '##èm', '##ia', 'de', 'corona', '##vir',
'##us', 'va', 'causar', 'una', 'gran', 'crisi', 'e', '##con', '##òmica', '.']

wikibert: ['l', "'", 'epid', '##emi', '##a', 'de', 'corona', '##vir',
'##us', 'va', 'causar', 'una', 'gran', 'crisi', 'econom', '##ica', '.']

• Mask filling accuracy: In RoBERTca's test set, we compute the accuracy of predicting
masked tokens (i.e., we mask one token and ask the model to retrieve the original one,
as in training). The results are not directly comparable, since each model has a different
tokenization, but they serve as a sanity check. The results are shown in Table 21, and a
minimal sketch of this computation is given after this list.

Model Avg. Accuracy Avg. Confidence


RoBERTca 61.4% 55%
mBERT 46.9% 70%
Wikibert ca 53.1% 57%

Table 21: Mask filling comparison

• Qualitative analysis: We attach some outputs of the model from inputs chosen by us
(and not directly present in the dataset), for the sake of curiosity. The sentences are
inevitably cherry-picked, without having an established methodology or dataset for a better
evaluation, and we cannot extract conclusions. The results are, though, curious, and we

can get an intuition of the kind of outputs produced by the model. At least, we can affirm
that there seems to be a certain gender bias, and that in the attached examples (except the
gender bias one) the outputs of the model are reasonably correct in terms of pure semantics
and linguistics. Figures 23, 24, 25, 26, and 27 show some prompts we believed to be
interesting for the reader. Some of them are extracted from the test set, and some others
are manually written for the sake of curiosity. For all the predictions of RoBERTca,
mBERT, and Wikibert ca on the test set, we attach the logs129 .
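
The following is a minimal sketch of the mask filling accuracy computation referred to above (here
using a public multilingual checkpoint purely for illustration; in our evaluation, the converted
RoBERTca checkpoint and the corresponding baselines are used instead):

from transformers import AutoTokenizer, AutoModelForMaskedLM
import torch

model_name = "xlm-roberta-base"  # illustrative public model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForMaskedLM.from_pretrained(model_name).eval()

def mask_filling_accuracy(sentence):
    # Mask every token in turn and check whether the top prediction recovers it
    ids = tokenizer(sentence, return_tensors="pt")["input_ids"][0]
    hits, total = 0, 0
    for i in range(1, len(ids) - 1):  # skip the special start/end tokens
        masked = ids.clone()
        masked[i] = tokenizer.mask_token_id
        with torch.no_grad():
            logits = model(masked.unsqueeze(0)).logits[0, i]
        hits += int(logits.argmax().item() == ids[i].item())
        total += 1
    return hits / total

print(mask_filling_accuracy("El coronavirus va causar una gran crisi."))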

Figure 23: RoBERTca prompt 1: Our model seems to be aware of the fact that a year is
composed of 365 days, unlike mBERT and Wikibert. Obviously, we cannot extract conclusions
from a single prompt, but we find it interesting.

(a) Output for ”Ell” (he). (b) Output for ”Ella” (she).

Figure 24: RoBERTca prompt 2: Our model seems to have encoded a certain gender bias, induced
by the bias present in the data (especially news, we hypothesize). In particular, with the prompt
”He/She works as a <mask> at Bellvitge’s Hospital”, in the case of ”He” the model predicts
”doctor”, while in the case of ”She”, it predicts ”nurse”.

These analyses serve as a sanity check, but they are limited and we will not be able to extract
further conclusions.

129
Logs available at https://drive.google.com/file/d/1_A9aIMagGURu0qN1J5AQHg7blU1T1TMF/view?usp=sharing

Figure 25: RoBERTca prompt 3: Our model is able to predict the ”coronavirus” token (when
prompted with ”The <mask> pandemic has caused a new economic crisis.”), thanks to being
present in the training data.

Figure 26: RoBERTca prompt 4: The model successfully infers that it has to predict a female
name. When prompted with a sentence referring to a female, named ”<mask> Villegas”, the
model infers that it has to predict a female name (Montse, Núria, Marta,...).

Figure 27: RoBERTca prompt 5: In this specific example, RoBERTca shows a better under-
standing of units of length than mBERT. The sentence is taken from the RoBERTca test set.
It speaks about a runway with a certain length and width, and the masked token is the one
corresponding to ”meters”, referring to the width. The context is that the only unit that makes
sense as a runway width is, precisely, meters.

As extrinsic evaluation, we evaluate the usefulness of RoBERTca contextual representations when
fine-tuned in a downstream task, namely, Named Entity Recognition, Part-Of-Speech tagging,
Question Answering, Text Classification, and Semantic Textual Similarity. Table 22 shows the
results in these tasks for RoBERTa and two reference models. As a Catalan-specific baseline, we
take the Catalan Wikibert (Wikibert ca) [84]; as a multilingual baseline, we take the multilingual
BERT (mBERT). They are described in sections 2.2 and 3.2, respectively.

Model Task Accuracy Precision Recall F1 Exact match

RoBERTca NER (AnCora) 0.99 0.88 0.89 0.88 -


mBERT NER (AnCora) 0.99 0.86 0.87 0.87 -
Wikibert ca NER (AnCora) -130 - - - -

RoBERTca POS (AnCora) 0.99 0.99 0.99 0.99 -


mBERT POS (AnCora) 0.99 0.99 0.99 0.99 -
Wikibert ca POS (AnCora) 0.98 0.98 0.98 0.98 -

RoBERTca QA (XQUAD) - - - 0.683 0.457


mBERT QA (XQUAD) - - - 0.692 0.470
Wikibert ca QA (XQUAD) - - - 0.649 0.430

RoBERTca QA (BSC) - - - 0.854 0.710


mBERT QA (BSC) - - - 0.871 0.736
Wikibert ca QA (BSC) - - - 0.846 0.710

RoBERTca Text Class. (BSC) 0.56 - - - -


mBERT Text Class. (BSC) 0.49 - - - -
Wikibert ca Text Class (BSC) 0.53 - - - -

RoBERTca STS (BSC) 0.28 - - - -


mBERT STS (BSC) 0.31 - - - -
Wikibert ca STS (BSC) 0.25 - - - -

Table 22: RoBERTca evaluation: The preliminary results in Named Entity Recognition, Part-
Of-Speech, Question Answering, Text Classification, and Semantic Textual Similarity show that
RoBERTca is at least competitive with the existing baselines for Catalan (mBERT, the multi-
lingual BERT, and Wikibert ca, a BERT trained only with the Catalan Wikipedia).

For the first two tasks, we use the Ancora corpus131 , consisting of around 11k train sentences in
130
The execution of the script crashed for an unknown reason.
131
http://clic.ub.edu/corpus/en

the case of NER and 13k in the case of POS. For question answering, we use 1. TeMU's Catalan
translation of XQUAD [126] (a multilingual question answering dataset), consisting of around
1,000 annotated questions; 2. TeMU's 14k manually generated questions from Wikipedia articles.
For Text Classification, we use TeMU’s text classification dataset, automatically constructed
from news (and the corresponding tags) extracted from the Catalan News Agency132 . It consists
of 219k sentences (the largest benchmark, by far). For Semantic Textual Similarity, again, we
use a TeMU’s benchmark, this time of 3k manually annotated pairs of sentences. TeMU-BSC
group is still in the process of developing these benchmarks (and others), so they are still a
preliminary version and have not yet been published. The train-validation-test splits, the number
of sentences, and the total number of benchmarks (e.g., a pronoun resolution benchmark is in
the works) may be different in the final version.
We note that our model is always better than Wikibert ca, having seen more data. In the case
of the comparison with the multilingual BERT, the latter is remarkably competitive. In fact,
with these results, our conclusion is that it outperforms our model, which is nevertheless always close to its
results. However, our model is more efficient because it produces shorter sequences due to the
tokenization, as shown before in Table 20. Interestingly, the text classification task is the only
one in which both Catalan-specific models outperform it, and for clear reasons: the number of train
sentences (we said that the text classification dataset is the largest of the used benchmarks, by
far) and the language-specific vocabulary (as described in Section 7.1). Also, this evaluation must
be taken with a grain of salt because the set of evaluation benchmarks developed by TeMU-BSC
is still a preliminary version. We hypothesize that the difference in performance will increase in
favour of RoBERTca once we obtain new evaluation datasets and increase the number of training
sentences, which is a work in progress. In Section 7.1, we will further analyze these results.

132
https://www.acn.cat/

7 Discussion

In this section, we will discuss the applications and results presented in Section 6.

7.1 Results analysis


Regarding corpora generation, we have shown the versatility of the cleaning pipeline under
different scenarios. We have seen that it can effectively clean and format corpora regardless of
factors such as 1. format (WARC from crawlings, plain text), 2. size (from a few megabytes
to terabytes of text), 3. target application (language model pre-training, unsupervised machine
translation), 4. language (from Finnish to Catalan), or 5. domain (Finance, biomedical).
Regarding the application to a variety of scenarios itself, we observe that the parameterization
of the cleaning pipeline is key to its adaptability. For instance, in the case of the biomedical
crawling, one can decrease the language identifier threshold to account for the fact that text from
the biomedical domain can still be recognised as Spanish, but with considerably less confidence.
When we apply the pipeline to biomedical text that we are sure consists only of Spanish
text, we inhibit the language identifier. Some of the more aggressive cleaning heuristics are also
inhibited on, for example, certain datasets composed of clinical stories. Clinical stories datasets
are relatively small and we do not want them to be excessively trimmed down.
In the case of Catalan, even if using some existing corpora, we have compiled the biggest corpus
ever for language modeling in this language. The format is ready for document-level language
modeling, and we have proven that it can be effectively used for training a state-of-the-art
language model and an unsupervised machine translation system with a decent (though still improvable)
BLEU score.
In the case of biomedical Spanish, we have compiled and preprocessed arguably the largest corpus
for training language models on the biomedical domain. The resulting corpus is extremely diverse.
It ranges from clinical stories to informal comments from patients, and includes scientific and
informative articles.
Regarding general-domain Spanish, we have seen that the partial (≈ 60%) preprocessing of the
BNE dataset has led to a massive corpus. We are not sure of its real scale, due to the fact that
it has not been deduplicated, but its pre-deduplication size (almost 130B tokens) is promising,
one order of magnitude away from the largest existing corpus in Spanish (the Spanish OSCAR
consists of 26B tokens to date), to our knowledge. The caveat is that it has not yet been
deduplicated (unlike OSCAR, so they are not directly comparable; OSCAR, though, has not
been cleaned). After inspecting a randomly sampled subset of it (see Section 3.3), we conclude
that the text is ready for applying language modeling.
After inspecting the different cleaned datasets and the applied operations (see Appendix D), we observe that the outputs of the cleaning pipeline are generally sensible. We do observe certain artifacts, most of them coming from the original text. For instance:

• Some sentences ending with "..." seem to be cut. This is an artifact present in the raw data, typically caused by collapsed text on the website that the crawler did not expand to retrieve the whole sentence.

• Quotations containing more than one sentence (separated by a full stop) are split into different sentences. This is probably the desired behaviour for documents as a whole, but it leaves individual sentences with unclosed quotation marks.

• Global sentence deduplication is indeed useful for removing boilerplate, while typically not removing relevant sentences.

• The text from the biomedical crawling generally has more artifacts than the general-domain text. Still, its quality is remarkable for a corpus coming from a domain-specific crawling.

Regarding the sentences that have been filtered out, while it is true that some well-formed sentences were discarded, they were mostly uninteresting ones (e.g., short, repeated sentences). We believe that our bet on a mostly rule-based approach, aided by a few statistical models (essentially, the language identifiers), has proven effective. However, writing these rules has been considerably time-consuming, and we do not know how well they would adapt to unseen scenarios. Finding a way to introduce more machine learning components (beyond the language identifiers) in a sufficiently generic way would be an interesting option to explore.
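As an illustration of the statistical component already in place, the following is a minimal sketch of confidence-thresholded language identification with fastText [101, 102]. The model path and threshold value are assumptions for the example, not the exact pipeline settings.

import fasttext

# Hypothetical model path and threshold; the pipeline exposes thresholds as CLI flags (Appendix B).
LID_MODEL_PATH = "lid.176.bin"  # off-the-shelf fastText language-identification model
THRESHOLD = 0.7                 # can be lowered, e.g., for domain-specific text such as biomedical Spanish

lid = fasttext.load_model(LID_MODEL_PATH)

def keep_sentence(sentence: str, target_lang: str = "es") -> bool:
    # fastText labels look like "__label__es"; newlines must be removed before predicting
    labels, probs = lid.predict(sentence.replace("\n", " "), k=1)
    return labels[0] == f"__label__{target_lang}" and probs[0] >= THRESHOLD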
One of the objectives from the very beginning was to keep document boundaries whenever possible. This has been successful, and not only in appearance: when inspecting the resulting documents, sentences exhibit clear coherence (i.e., each sentence follows from the previous ones), with few exceptions, mostly caused by undetected boilerplate.
Regarding the MT4All machine translation systems, the BLEU scores are relatively low. The best-performing system is Catalan-English. Leaving aside the fact that it is a general-domain use case and arguably easier to model than Finnish, Catalan is also one of the use cases we put the most effort into, apart from Spanish (which, so far, has not been used in the MT4All project). The unsupervised machine translation approach used in MT4All (Section 2.3) was originally tested on general-domain corpora of relatively close languages (English, German, French); it was thus unclear how the system would behave in the MT4All settings. In the cases in which general-domain data were used to increase the size of a domain-specific corpus, the domain-specific samples should probably have been given more weight, instead of simply concatenating the two corpora.
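One simple way of giving more weight to the in-domain data, which we did not apply here and mention only as an illustration, is oversampling it when building the combined training file. The file names and oversampling factor below are hypothetical.

# Hedged sketch: duplicate (oversample) the smaller in-domain corpus instead of plain concatenation.
OVERSAMPLE_FACTOR = 4  # illustrative value

with open("combined.train.txt", "w", encoding="utf-8") as out:
    with open("general_domain.txt", encoding="utf-8") as f:
        out.writelines(f)
    for _ in range(OVERSAMPLE_FACTOR):
        with open("domain_specific.txt", encoding="utf-8") as f:
            out.writelines(f)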
As far as RoBERTca is concerned, after observing its outputs we are confident that the model was trained correctly. When evaluated on downstream tasks, it is competitive with the existing baselines for Catalan. We note, though, that the benchmarks available for evaluating Catalan language modeling are limited at this point, and perhaps too easy for a noticeable difference to show. Building more evaluation benchmarks for Catalan is an ongoing research line at TeMU-BSC. We have seen that the model's vocabulary contains many Catalan words as whole tokens (instead of being composed of cumbersome subwords). We also observe that, even though the model has been trained on a document-level corpus, it typically does not see long documents at once, due to its 512-token context window. The strong performance of mBERT in the benchmarks can additionally be explained by the fact that cross-lingual transfer learning is easier for Catalan, because Romance languages are over-represented in Wikipedia (the source of mBERT's multilingual data).

In the RoBERTca results, we observe an interesting phenomenon. The only benchmark in which both language-specific models outperform the multilingual baseline, and with a remarkable difference, is text classification. We are unsure whether the corresponding dataset or task has any intrinsic property that makes it easier for language-specific models than the other tasks. Nevertheless, there is one clear difference in terms of data size: the text classification dataset is, by far, the largest evaluation benchmark among the ones we have used. It seems that multilingual BERT outperforms the language-specific models when fewer fine-tuning data are available. Multilingual BERT has seen considerably more pre-training data (even if in other languages) than the Catalan-specific models, so it performs better in very low-resource scenarios. In addition, Romance languages make up a large portion of the multilingual BERT dataset, and they can transfer knowledge to the Catalan representations. When more training data are available in the fine-tuning dataset, the language-specific models are better: their language-specific vocabulary clearly compensates for having seen less data during pre-training. The vocabulary is not useful per se; the corresponding representations also need to be trained on a large enough dataset (which is why RoBERTca outperforms Wikibert ca in this benchmark and in the rest of them). This observation about dataset size is roughly consistent with the results in the other benchmarks.

7.2 Limitations
One of the main limitations of this work is that, while we provide enough information to validate our approach (i.e., it works) and we try to justify our decisions, we do not provide enough evidence that our specific corpus generation approach is better than alternative ones, other than inspecting the resulting corpora. This is a common situation in the literature, but a problem nonetheless. We are certain, however, that the cleaning process has correctly formatted the text and has removed non-natural and out-of-domain (or out-of-language) text, which at the very least would have wasted GPU cycles and model capacity on relations that are not needed.
Apart from that, we have observed certain artifacts in the generated corpora, mostly inherited from the original text. In the case of the MT4All machine translation systems, the BLEU scores are still relatively low; the results are preliminary and need more iterations, but they must be improved. As for the generated RoBERTa, while it is competitive with the existing baselines for Catalan, we would have expected a larger difference, given that it is specific to Catalan and was trained on a Catalan corpus orders of magnitude larger than those of existing systems (both mBERT and Wikibert ca used the Catalan Wikipedia). Nevertheless, the problem could lie in the benchmarks used for evaluation, which essentially correspond to relatively easy tasks with little margin for improvement. We expect the difference in performance with respect to the baselines to increase when the new evaluation benchmarks for Catalan become available. What we can affirm is that our tokenizer generates considerably fewer tokens per sentence, being more efficient in this regard than the baselines; a sketch of this comparison is shown below.
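This is only a sketch: 'pretrained_directory' is the local RoBERTca checkpoint (as in Appendix E), 'bert-base-multilingual-cased' is the public mBERT checkpoint, and the sample sentences are illustrative.

from transformers import AutoTokenizer

tokenizer_ca = AutoTokenizer.from_pretrained("pretrained_directory")
tokenizer_mbert = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")

sentences = [
    "El corpus s'ha netejat i deduplicat abans d'entrenar el model.",
    "Les frases en català es tokenitzen en menys subparaules.",
]

for name, tok in [("RoBERTca", tokenizer_ca), ("mBERT", tokenizer_mbert)]:
    avg = sum(len(tok.tokenize(s)) for s in sentences) / len(sentences)
    print(f"{name}: {avg:.1f} tokens per sentence on the sample")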

7.3 Impact statement


The generated Catalan model is hardly usable for malicious purposes since, being an encoder, its generative abilities are limited. The contextual representations do seem to exhibit certain biases caused by the training corpus (e.g., some outputs appear to be biased with respect to gender), and the user must be well aware of this problem. We do see potential for misuse in the case of the BNE dataset: its scale makes it feasible to train at least GPT-2-level models for Spanish, and the problematic uses of GPT-2 for English are well known (https://openai.com/blog/better-language-models/).
Regarding possible environmental concerns, BSC clusters are known to be environmentally friendly (https://www.bsc.es/news/bsc-news/the-new-bsc-machine-europe%E2%80%99s-%E2%80%9Cgreenest%E2%80%9D-supercomputer), but training RoBERTca has still consumed a non-negligible amount of electricity while not providing remarkable gains with respect to the multilingual baseline. However, training is only executed once, while inference is expected to be run numerous times. The RoBERTca tokenizer generates considerably fewer tokens per sentence than the mBERT or Wikibert ca counterparts, meaning that RoBERTca inference is more compute- and memory-efficient (because sequences are shorter). Among the generated corpora, BNE is especially concerning in terms of the computation needed. Nevertheless, an advantage of our model-agnostic approach is that, once the data have been preprocessed, they need not be cleaned again.

8 Conclusions and future work

To sum up, we have introduced a new pipeline for preprocessing corpora and training language models, which is in turn composed of different sub-pipelines: a cleaning pipeline, a model-ready preprocessing pipeline, and the training and evaluation scripts.
Regarding the generated corpora, aggregating both existing datasets (re-processed with our cleaning pipeline) and new ones has resulted in the largest Catalan corpus to date, to the best of our knowledge. In the case of general-domain Spanish, by (partially) preprocessing BNE we have shown that the cleaning pipeline is suitable for large-scale corpora. The results are still preliminary and must be deduplicated, but we cannot rule out having generated the largest Spanish corpus for training language models to date. We have further demonstrated the flexibility of the cleaning pipeline by applying it to generate, again, arguably the largest biomedical Spanish corpus for training language models, as well as the corpora for the rest of the MT4All language pairs. This demonstrates flexibility in terms of language, scale, domain, and target task.
We have learned the importance of dataset generation, a machine learning step that is often underrated (unlike model building). We have seen that it can be a considerably long, time- and resource-consuming process.
Regarding model generation, we have shown that the resulting corpora can effectively be used for building a language model from scratch. We have built the first Catalan RoBERTa, which is the Catalan language model that has seen the most language-specific data to date. The preliminary evaluation shows that it is competitive with the existing baselines, but without a significant difference. We hypothesize that RoBERTca will clearly outperform the baselines once more difficult and complete benchmarks are built, but this remains to be investigated.
As a limitation, we note the lack of proper evaluation metrics for corpus cleaning; we suggest possible solutions to this problem as future work. All in all, we have shown that our proposed tools can be effectively used for generating corpora and training language models from scratch.
As future work, we suggest several lines:

• Regarding the cleaning pipeline, we suggest introducing more machine learning components, but, unlike previous work, in a more generic way that works out of the box for different languages and domains. One possibility could be training a model on the input-output pairs of the pipeline itself and observing whether it generalizes to other languages and domains. The machine learning model, though, should be lightweight; otherwise it would not be feasible to apply it to large quantities of raw data (see the sketch after this list).

• We suggest building benchmarks for intrinsic quantitative evaluation of the cleaning pro-
cess, inspired by the CleanEval shared task [127].

• In the same way that language identifiers are a common component of many preprocessing pipelines, including ours, domain classifiers could be used to organize crawled corpora into domains (short of manually tagging each page; we have observed that website keywords are not especially reliable). This would ease the process of generating domain-specific corpora.

• As far as model generation is concerned, a Spanish biomedical RoBERTa using the data and tools from this work is being trained at the time of writing this thesis. But this is just the beginning: there are many architectures (with different trade-offs) to explore for Catalan and Spanish (both general-domain and biomedical). Nevertheless, while we believe that a BERT-like model specific to languages or domains with enough data is generally required (as a baseline, and because inference can be more efficient thanks to tokenizing into shorter sequences), the training of further models will have to be justified in terms of potential utility, so as not to waste computational resources. The generated data could also be used to feed multilingual and multi-domain models.

• One of the strengths of our approach is that we have taken document boundaries into account. RoBERTca already uses them, but it often does not see the whole document during training (due to its maximum context window of 512 tokens). In the MT4All project, documents are not used at all. We would be especially keen on investigating document-level unsupervised machine translation, on the one hand, and language modeling with longer dependencies (for instance, using the Linformer architecture [128], a model with a linear approximation of attention), on the other. The document-level data are ready, and Linformer is implemented in Fairseq (https://github.com/pytorch/fairseq/tree/master/examples/linformer).

• At TeMU-BSC, there is ongoing work on generating benchmarks for further evaluating models for these domains and languages.
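As a concrete illustration of the first point above, a lightweight learned filter could be bootstrapped from the decisions the rule-based pipeline already makes. The sketch below uses fastText's supervised mode [101, 102]; the training file name, label scheme and hyperparameters are assumptions for the example, not an implemented component.

import fasttext

# Hypothetical training file: one sentence per line, labelled with the decision taken by the
# rule-based pipeline, e.g. "__label__keep <sentence>" or "__label__discard <sentence>".
model = fasttext.train_supervised(
    input="pipeline_decisions.train.txt",
    lr=0.5,
    epoch=5,
    wordNgrams=2,
)

labels, probs = model.predict("Exemple de frase neta extreta d'un document.")
print(labels[0], float(probs[0]))  # e.g. "__label__keep" and its confidence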

Personally, I believe I have applied many of the skills I learned in the Master in Artificial Intelligence, ranging from data processing to deep learning, without forgetting more classical NLP approaches, such as the use of language identifiers and rule-based systems for cleaning corpora.

References

[1] S. Ruder, “Neural transfer learning for natural language processing,” Ph.D. dissertation,
National University of Ireland, Galway, 2019.

[2] S. Ruder, M. E. Peters, S. Swayamdipta, and T. Wolf, “Transfer learning in natural lan-
guage processing,” in Proceedings of the 2019 Conference of the North American Chapter
of the Association for Computational Linguistics: Tutorials, 2019, pp. 15–18.

[3] T. Mikolov, I. Sutskever, K. Chen, G. Corrado, and J. Dean, “Distributed representations


of words and phrases and their compositionality,” CoRR, vol. abs/1310.4546, 2013.
[Online]. Available: http://arxiv.org/abs/1310.4546

[4] M. E. Peters, M. Neumann, M. Iyyer, M. Gardner, C. Clark, K. Lee, and L. Zettlemoyer,


“Deep contextualized word representations,” CoRR, vol. abs/1802.05365, 2018. [Online].
Available: http://arxiv.org/abs/1802.05365

[5] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and


I. Polosukhin, “Attention is all you need,” CoRR, vol. abs/1706.03762, 2017. [Online].
Available: http://arxiv.org/abs/1706.03762

[6] J. Devlin, M. Chang, K. Lee, and K. Toutanova, “BERT: pre-training of deep bidirectional
transformers for language understanding,” CoRR, vol. abs/1810.04805, 2018. [Online].
Available: http://arxiv.org/abs/1810.04805

[7] A. Radford, “Improving language understanding by generative pre-training,” 2018.

[8] A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, and I. Sutskever, “Language models
are unsupervised multitask learners,” 2019.

[9] K. Clark, M.-T. Luong, Q. V. Le, and C. D. Manning, “Electra: Pre-training text encoders
as discriminators rather than generators,” 2020.

[10] Z. Lan, M. Chen, S. Goodman, K. Gimpel, P. Sharma, and R. Soricut, “ALBERT:


A lite BERT for self-supervised learning of language representations,” CoRR, vol.
abs/1909.11942, 2019. [Online]. Available: http://arxiv.org/abs/1909.11942

[11] T. B. Brown, B. Mann, N. Ryder, M. Subbiah, J. Kaplan, P. Dhariwal, A. Neelakantan,


P. Shyam, G. Sastry, A. Askell, S. Agarwal, A. Herbert-Voss, G. Krueger, T. Henighan,
R. Child, A. Ramesh, D. M. Ziegler, J. Wu, C. Winter, C. Hesse, M. Chen, E. Sigler,
M. Litwin, S. Gray, B. Chess, J. Clark, C. Berner, S. McCandlish, A. Radford, I. Sutskever,
and D. Amodei, “Language models are few-shot learners,” 2020.

[12] T. Wolf, L. Debut, V. Sanh, J. Chaumond, C. Delangue, A. Moi, P. Cistac, T. Rault,


R. Louf, M. Funtowicz, and J. Brew, “Huggingface’s transformers: State-of-the-art
natural language processing,” CoRR, vol. abs/1910.03771, 2019. [Online]. Available:
http://arxiv.org/abs/1910.03771

[13] R. Sennrich, B. Haddow, and A. Birch, “Improving neural machine translation


models with monolingual data,” CoRR, vol. abs/1511.06709, 2015. [Online]. Available:
http://arxiv.org/abs/1511.06709

[14] M. Artetxe, G. Labaka, and E. Agirre, “Unsupervised statistical machine translation,” in
Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing.
Brussels, Belgium: Association for Computational Linguistics, Oct.-Nov. 2018, pp.
3632–3642. [Online]. Available: https://www.aclweb.org/anthology/D18-1399

[15] G. Lample, L. Denoyer, and M. Ranzato, “Unsupervised machine translation using


monolingual corpora only,” CoRR, vol. abs/1711.00043, 2017. [Online]. Available:
http://arxiv.org/abs/1711.00043

[16] G. Lample and A. Conneau, “Cross-lingual language model pretraining,” Advances in


Neural Information Processing Systems (NeurIPS), 2019.

[17] A. Virtanen, J. Kanerva, R. Ilo, J. Luoma, J. Luotolahti, T. Salakoski, F. Ginter, and


S. Pyysalo, “Multilingual is not enough: BERT for finnish,” CoRR, vol. abs/1912.07076,
2019. [Online]. Available: http://arxiv.org/abs/1912.07076

[18] J. Lee, W. Yoon, S. Kim, D. Kim, S. Kim, C. H. So, and J. Kang, “Biobert: a
pre-trained biomedical language representation model for biomedical text mining,” CoRR,
vol. abs/1901.08746, 2019. [Online]. Available: http://arxiv.org/abs/1901.08746

[19] I. Sutskever, O. Vinyals, and Q. V. Le, “Sequence to sequence learning with


neural networks,” CoRR, vol. abs/1409.3215, 2014. [Online]. Available: http:
//arxiv.org/abs/1409.3215

[20] J. Armengol Estapé, “Neural machine translation and linked data,” Jul 2019. [Online].
Available: http://hdl.handle.net/2117/168617

[21] I. Sutskever, O. Vinyals, and Q. V. Le, “Sequence to Sequence Learning with Neural
Networks,” arXiv e-prints, p. arXiv:1409.3215, Sep 2014.

[22] D. Bahdanau, K. Cho, and Y. Bengio, “Neural machine translation by jointly learning to
align and translate,” CoRR, vol. abs/1409.0473, 2015.

[23] M.-T. Luong, H. Pham, and C. D. Manning, “Effective Approaches to Attention-based


Neural Machine Translation,” arXiv e-prints, p. arXiv:1508.04025, Aug 2015.

[24] S. Ruder, “Deep Learning for NLP Best Practices,” http://ruder.io/deep-learning-nlp-


best-practices/, 2017.

[25] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,”
CoRR, vol. abs/1512.03385, 2015. [Online]. Available: http://arxiv.org/abs/1512.03385

[26] S. Ioffe and C. Szegedy, “Batch normalization: Accelerating deep network training by
reducing internal covariate shift,” CoRR, vol. abs/1502.03167, 2015. [Online]. Available:
http://arxiv.org/abs/1502.03167

[27] J. Lei Ba, J. R. Kiros, and G. E. Hinton, “Layer Normalization,” arXiv e-prints, p.
arXiv:1607.06450, Jul 2016.

[28] P. Shaw, J. Uszkoreit, and A. Vaswani, “Self-attention with relative position


representations,” CoRR, vol. abs/1803.02155, 2018. [Online]. Available: http:
//arxiv.org/abs/1803.02155

[29] Y. Tay, M. Dehghani, D. Bahri, and D. Metzler, “Efficient transformers: A survey,” 2020.

[30] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and


I. Polosukhin, “Attention Is All You Need,” arXiv e-prints, p. arXiv:1706.03762, Jun 2017.

[31] J. Kaplan, S. McCandlish, T. Henighan, T. B. Brown, B. Chess, R. Child, S. Gray, A. Rad-


ford, J. Wu, and D. Amodei, “Scaling laws for neural language models,” 2020.

[32] S. Hochreiter and J. Schmidhuber, “Long short-term memory,” Neural Comput., vol. 9,
no. 8, pp. 1735–1780, Nov. 1997. [Online]. Available: https://doi.org/10.1162/neco.1997.9.8.1735

[33] P. J. Liu, M. Saleh, E. Pot, B. Goodrich, R. Sepassi, L. Kaiser, and N. Shazeer,


“Generating wikipedia by summarizing long sequences,” CoRR, vol. abs/1801.10198,
2018. [Online]. Available: http://arxiv.org/abs/1801.10198

[34] D. Hendrycks and K. Gimpel, “Bridging nonlinearities and stochastic regularizers with
gaussian error linear units,” CoRR, vol. abs/1606.08415, 2016. [Online]. Available:
http://arxiv.org/abs/1606.08415

[35] G. Zhao, J. Lin, Z. Zhang, X. Ren, Q. Su, and X. Sun, “Explicit sparse transformer:
Concentrated attention through explicit selection,” CoRR, vol. abs/1912.11637, 2019.
[Online]. Available: http://arxiv.org/abs/1912.11637

[36] Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer,
and V. Stoyanov, “Roberta: A robustly optimized BERT pretraining approach,” CoRR,
vol. abs/1907.11692, 2019. [Online]. Available: http://arxiv.org/abs/1907.11692

[37] G. Lample and A. Conneau, “Cross-lingual language model pretraining,” CoRR, vol.
abs/1901.07291, 2019. [Online]. Available: http://arxiv.org/abs/1901.07291

[38] Z. Lan, M. Chen, S. Goodman, K. Gimpel, P. Sharma, and R. Soricut, “Albert: A lite
bert for self-supervised learning of language representations,” 2019.

[39] M. Dehghani, S. Gouws, O. Vinyals, J. Uszkoreit, and Lukasz Kaiser, “Universal trans-
formers,” 2018.

[40] M. Lewis, Y. Liu, N. Goyal, M. Ghazvininejad, A. Mohamed, O. Levy, V. Stoyanov, and


L. Zettlemoyer, “BART: denoising sequence-to-sequence pre-training for natural language
generation, translation, and comprehension,” CoRR, vol. abs/1910.13461, 2019. [Online].
Available: http://arxiv.org/abs/1910.13461

[41] Y. Liu, J. Gu, N. Goyal, X. Li, S. Edunov, M. Ghazvininejad, M. Lewis, and L. Zettlemoyer,
“Multilingual denoising pre-training for neural machine translation,” 2020.

[42] V. Sanh, L. Debut, J. Chaumond, and T. Wolf, “Distilbert, a distilled version of bert:
smaller, faster, cheaper and lighter,” 2019.

[43] E. Voita, D. Talbot, F. Moiseev, R. Sennrich, and I. Titov, “Analyzing multi-head self-
attention: Specialized heads do the heavy lifting, the rest can be pruned,” 2019.

[44] P. H. Le-Khac, G. Healy, and A. F. Smeaton, “Contrastive representation learning: A
framework and review,” IEEE Access, vol. 8, p. 193907–193934, 2020. [Online]. Available:
http://dx.doi.org/10.1109/ACCESS.2020.3031549

[45] M. Dehghani, S. Gouws, O. Vinyals, J. Uszkoreit, and L. Kaiser, “Universal transformers,”


CoRR, vol. abs/1807.03819, 2018. [Online]. Available: http://arxiv.org/abs/1807.03819

[46] S. J. Mielke, “Can you compare perplexity across different segmentations?” Apr 2019.
[Online]. Available: https://sjmielke.com/comparing-perplexities.htm

[47] S. Merity, C. Xiong, J. Bradbury, and R. Socher, “Pointer sentinel mixture models,”
CoRR, vol. abs/1609.07843, 2016. [Online]. Available: http://arxiv.org/abs/1609.07843

[48] A. Wang, A. Singh, J. Michael, F. Hill, O. Levy, and S. R. Bowman, “GLUE: A


multi-task benchmark and analysis platform for natural language understanding,” CoRR,
vol. abs/1804.07461, 2018. [Online]. Available: http://arxiv.org/abs/1804.07461

[49] G. Majumder, D. P. Pakray, A. Gelbukh, and D. Pinto, “Semantic textual similarity meth-
ods, tools, and applications: A survey,” Computacion y Sistemas, vol. 20, pp. 647–665, 12
2016.

[50] A. Wang, Y. Pruksachatkun, N. Nangia, A. Singh, J. Michael, F. Hill, O. Levy,


and S. R. Bowman, “Superglue: A stickier benchmark for general-purpose language
understanding systems,” CoRR, vol. abs/1905.00537, 2019. [Online]. Available:
http://arxiv.org/abs/1905.00537

[51] P. Rajpurkar, J. Zhang, K. Lopyrev, and P. Liang, “Squad: 100, 000+ questions for
machine comprehension of text,” CoRR, vol. abs/1606.05250, 2016. [Online]. Available:
http://arxiv.org/abs/1606.05250

[52] A. Conneau, G. Lample, R. Rinott, A. Williams, S. R. Bowman, H. Schwenk, and


V. Stoyanov, “XNLI: evaluating cross-lingual sentence representations,” CoRR, vol.
abs/1809.05053, 2018. [Online]. Available: http://arxiv.org/abs/1809.05053

[53] H. Le, L. Vial, J. Frej, V. Segonne, M. Coavoux, B. Lecouteux, A. Allauzen, B. Crabbé,


L. Besacier, and D. Schwab, “Flaubert: Unsupervised language model pre-training for
french,” 2019.

[54] A. Virtanen, J. Kanerva, R. Ilo, J. Luoma, J. Luotolahti, T. Salakoski, F. Ginter, and


S. Pyysalo, “Multilingual is not enough: Bert for finnish,” 2019.

[55] A. Jain, D. N. M. Meenachi, and D. B. Venkatraman, “Nukebert: A pre-trained language


model for low resource nuclear domain,” 2020.

[56] M. Artetxe, G. Labaka, and E. Agirre, “A robust self-learning method for fully unsu-
pervised cross-lingual mappings of word embeddings,” in Proceedings of the 56th Annual
Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2018,
pp. 789–798.

[57] ——, “Generalizing and improving bilingual word embedding mappings with a multi-step
framework of linear transformations,” in Proceedings of the Thirty-Second AAAI Confer-
ence on Artificial Intelligence, 2018, pp. 5012–5019.

[58] ——, “Learning bilingual word embeddings with (almost) no bilingual data,” in Proceedings
of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1:
Long Papers), 2017, pp. 451–462.

[59] ——, “Learning principled bilingual mappings of word embeddings while preserving mono-
lingual invariance,” in Proceedings of the 2016 Conference on Empirical Methods in Natural
Language Processing, 2016, pp. 2289–2294.

[60] ——, “An effective approach to unsupervised machine translation,” CoRR, vol.
abs/1902.01313, 2019. [Online]. Available: http://arxiv.org/abs/1902.01313

[61] M. Johnson, M. Schuster, Q. V. Le, M. Krikun, Y. Wu, Z. Chen, N. Thorat, F. B. Viégas,


M. Wattenberg, G. Corrado, M. Hughes, and J. Dean, “Google’s multilingual neural
machine translation system: Enabling zero-shot translation,” CoRR, vol. abs/1611.04558,
2016. [Online]. Available: http://arxiv.org/abs/1611.04558

[62] K. Papineni, S. Roukos, T. Ward, and W.-J. Zhu, “Bleu: a method for automatic
evaluation of machine translation,” in Proceedings of the 40th Annual Meeting of
the Association for Computational Linguistics. Philadelphia, Pennsylvania, USA:
Association for Computational Linguistics, Jul. 2002, pp. 311–318. [Online]. Available:
https://www.aclweb.org/anthology/P02-1040

[63] M. Artetxe, S. Ruder, D. Yogatama, G. Labaka, and E. Agirre, “A call for more rigor in
unsupervised cross-lingual learning,” 2020.

[64] N. Indurkhya and F. J. Damerau, Handbook of Natural Language Processing, 2nd ed. Chap-
man & Hall/CRC, 2010.

[65] P. Koehn, H. Hoang, A. Birch, C. Callison-Burch, M. Federico, N. Bertoldi, B. Cowan,


W. Shen, C. Moran, R. Zens, C. Dyer, O. Bojar, A. Constantin, and E. Herbst, “Moses:
Open source toolkit for statistical machine translation,” in Proceedings of the 45th Annual
Meeting of the ACL on Interactive Poster and Demonstration Sessions, ser. ACL ’07. USA:
Association for Computational Linguistics, 2007, p. 177–180.

[66] R. Child, S. Gray, A. Radford, and I. Sutskever, “Generating long sequences


with sparse transformers,” CoRR, vol. abs/1904.10509, 2019. [Online]. Available:
http://arxiv.org/abs/1904.10509

[67] R. Sennrich, B. Haddow, and A. Birch, “Neural Machine Translation of Rare Words with
Subword Units,” arXiv e-prints, p. arXiv:1508.07909, Aug 2015.

[68] P. Gage, “A new algorithm for data compression,” C Users J., vol. 12, no. 2, pp. 23–38,
Feb. 1994. [Online]. Available: http://dl.acm.org/citation.cfm?id=177910.177914

[69] C. Wang, K. Cho, and J. Gu, “Neural machine translation with byte-level subwords,”
CoRR, vol. abs/1909.03341, 2019. [Online]. Available: http://arxiv.org/abs/1909.03341

[70] S. Ding, A. Renduchintala, and K. Duh, “A call for prudent choice of subword
merge operations,” CoRR, vol. abs/1905.10453, 2019. [Online]. Available: http:
//arxiv.org/abs/1905.10453

[71] I. Provilkov, D. Emelianenko, and E. Voita, “Bpe-dropout: Simple and effective
subword regularization,” CoRR, vol. abs/1910.13267, 2019. [Online]. Available:
http://arxiv.org/abs/1910.13267

[72] T. Kudo and J. Richardson, “Sentencepiece: A simple and language independent subword
tokenizer and detokenizer for neural text processing,” CoRR, vol. abs/1808.06226, 2018.
[Online]. Available: http://arxiv.org/abs/1808.06226

[73] Y. Wu, M. Schuster, Z. Chen, Q. V. Le, M. Norouzi, W. Macherey, M. Krikun,


Y. Cao, Q. Gao, K. Macherey, J. Klingner, A. Shah, M. Johnson, X. Liu, L. Kaiser,
S. Gouws, Y. Kato, T. Kudo, H. Kazawa, K. Stevens, G. Kurian, N. Patil, W. Wang,
C. Young, J. Smith, J. Riesa, A. Rudnick, O. Vinyals, G. Corrado, M. Hughes,
and J. Dean, “Google’s neural machine translation system: Bridging the gap between
human and machine translation,” CoRR, vol. abs/1609.08144, 2016. [Online]. Available:
http://arxiv.org/abs/1609.08144

[74] A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, and I. Sutskever, “Language models
are unsupervised multitask learners,” 2019.

[75] M. Artetxe, S. Ruder, and D. Yogatama, “On the cross-lingual transferability of


monolingual representations,” CoRR, vol. abs/1910.11856, 2019. [Online]. Available:
http://arxiv.org/abs/1910.11856

[76] W. de Vries and M. Nissim, “As good as new. how to successfully recycle english gpt-2 to
make models for other languages,” 2020.

[77] P. J. Ortiz Suárez, B. Sagot, and L. Romary, “Asynchronous Pipeline for Processing
Huge Corpora on Medium to Low Resource Infrastructures,” in 7th Workshop on the
Challenges in the Management of Large Corpora (CMLC-7), P. Bański, A. Barbaresi,
H. Biber, E. Breiteneder, S. Clematide, M. Kupietz, H. Lüngen, and C. Iliadi, Eds.
Cardiff, United Kingdom: Leibniz-Institut für Deutsche Sprache, Jul. 2019. [Online].
Available: https://hal.inria.fr/hal-02148693

[78] J. Pomikálek, “onion,” 2011, LINDAT/CLARIAH-CZ digital library at the Institute of


Formal and Applied Linguistics (ÚFAL), Faculty of Mathematics and Physics, Charles
University. [Online]. Available: http://hdl.handle.net/11858/00-097C-0000-000D-F67B-7

[79] P. Bojanowski, E. Grave, A. Joulin, and T. Mikolov, “Enriching word vectors


with subword information,” CoRR, vol. abs/1607.04606, 2016. [Online]. Available:
http://arxiv.org/abs/1607.04606

[80] I. Beltagy, K. Lo, and A. Cohan, “Scibert: Pretrained language model for scientific text,”
in EMNLP, 2019.

[81] J.-S. Lee and J. Hsiang, “PatentBERT: Patent classification with fine-tuning a pre-trained
BERT model,” World Patent Information, vol. 61, no. 101965, 2020.

[82] L. Martin, B. Müller, P. J. O. Suárez, Y. Dupont, L. Romary, É. V. de la Clergerie,


D. Seddah, and B. Sagot, “Camembert: a tasty french language model,” CoRR, vol.
abs/1911.03894, 2019. [Online]. Available: http://arxiv.org/abs/1911.03894

[83] B. Magnini, A. Cappelli, E. Pianta, M. Speranza, V. Bartalesi Lenzi, R. Sprugnoli, L. Ro-
mano, C. Girardi, and M. Negri, “Annotazione di contenuti concettuali in un corpus ital-
iano: I - cab,” in Proc.of SILFI 2006, 2006.

[84] S. Pyysalo, J. Kanerva, A. Virtanen, and F. Ginter, “Wikibert models: deep transfer
learning for many languages,” 2020.

[85] F.-L. Julié and E. Berti, “d+1 formalism in einstein-scalar-gauss-bonnet gravity,”


Physical Review D, vol. 101, no. 12, Jun 2020. [Online]. Available: http:
//dx.doi.org/10.1103/PhysRevD.101.124045

[86] S. Lee, H. Jang, Y. Baik, S. Park, and H. Shin, “Kr-bert: A small-scale korean-specific
language model,” 2020.

[87] H. Tanvir, C. Kittask, and K. Sirts, “Estbert: A pretrained language-specific bert for
estonian,” 2020.

[88] W. de Vries, A. van Cranenburgh, A. Bisazza, T. Caselli, G. van Noord, and M. Nissim,
“Bertje: A dutch BERT model,” CoRR, vol. abs/1912.09582, 2019. [Online]. Available:
http://arxiv.org/abs/1912.09582

[89] J. Cañete, G. Chaperon, R. Fuentes, J.-H. Ho, H. Kang, and J. Pérez, “Spanish pre-trained
bert model and evaluation data,” in PML4DC at ICLR 2020, 2020.

[90] J. Cañete, “Compilation of large spanish unannotated corpora,” May 2019. [Online].
Available: https://doi.org/10.5281/zenodo.3247731

[91] M. K. Eddine, A. J. P. Tixier, and M. Vazirgiannis, “Barthez: a skilled pretrained french


sequence-to-sequence model,” 2020.

[92] D. Nozza, F. Bianchi, and D. Hovy, “What the [mask]? making sense of language-specific
bert models,” 2020.

[93] R. Speer, “ftfy,” Zenodo, 2019, version 5.5. [Online]. Available: https://doi.org/10.5281/zenodo.2591652

[94] M. Esplà, M. Forcada, G. Ramı́rez-Sánchez, and H. Hoang, “ParaCrawl: Web-scale


parallel corpora for the languages of the EU,” in Proceedings of Machine Translation
Summit XVII Volume 2: Translator, Project and User Tracks. Dublin, Ireland: European
Association for Machine Translation, Aug. 2019, pp. 118–119. [Online]. Available:
https://www.aclweb.org/anthology/W19-6721

[95] P. Koehn, H. Khayrallah, K. Heafield, and M. Forcada, “Findings of the wmt 2018 shared
task on parallel corpus filtering,” 01 2018, pp. 726–739.

[96] M. Baroni, S. Bernardini, A. Ferraresi, and E. Zanchetta, “The wacky wide web:
A collection of very large linguistically processed web-crawled corpora,” Language
Resources and Evaluation, vol. 43, no. 3, pp. 209–226, 2009. [Online]. Available:
http://www.jstor.org/stable/27743614

[97] R. Schäfer and F. Bildhauer, “Building large corpora from the web using a new efficient
tool chain,” in LREC, 2012.

[98] J. Kudela, I. Holubová, and O. Bojar, “Extracting parallel paragraphs from
common crawl,” CoRR, vol. abs/1804.10413, 2018. [Online]. Available: http:
//arxiv.org/abs/1804.10413

[99] P. J. Ortiz Suárez, L. Romary, and B. Sagot, “A monolingual approach to


contextualized word embeddings for mid-resource languages,” in Proceedings of the
58th Annual Meeting of the Association for Computational Linguistics. Online:
Association for Computational Linguistics, Jul. 2020, pp. 1703–1714. [Online]. Available:
https://www.aclweb.org/anthology/2020.acl-main.156

[100] P. Ortiz Suárez, B. Sagot, and L. Romary, “Asynchronous pipelines for processing huge
corpora on medium to low resource infrastructures,” 07 2019.

[101] A. Joulin, E. Grave, P. Bojanowski, and T. Mikolov, “Bag of tricks for efficient text clas-
sification,” arXiv preprint arXiv:1607.01759, 2016.

[102] A. Joulin, E. Grave, P. Bojanowski, M. Douze, H. Jégou, and T. Mikolov, “Fasttext.zip:


Compressing text classification models,” arXiv preprint arXiv:1612.03651, 2016.

[103] F. Soares, M. Villegas, A. Gonzalez-Agirre, M. Krallinger, and J. Armengol-Estapé,


“Medical word embeddings for Spanish: Development and evaluation,” in Proceedings of
the 2nd Clinical Natural Language Processing Workshop. Minneapolis, Minnesota, USA:
Association for Computational Linguistics, Jun. 2019, pp. 124–133. [Online]. Available:
https://www.aclweb.org/anthology/W19-1916

[104] N. Ljubešić and A. Toral, “caWaC – a web corpus of Catalan and its application to
language modeling and machine translation,” in Proceedings of the Ninth International
Conference on Language Resources and Evaluation (LREC’14). Reykjavik, Iceland:
European Language Resources Association (ELRA), May 2014, pp. 1728–1732. [Online].
Available: http://www.lrec-conf.org/proceedings/lrec2014/pdf/841_Paper.pdf

[105] G. Boleda, S. Bott, C. Castillo, R. Meza, T. Badı́a, and V. Lopez-Grimau, “Cucweb: a


catalan corpus built from the web,” in Conference of the European Chapter of the Associ-
ation for Computational Linguistics.

[106] I. Leturia, “The web as a corpus of basque,” Ph.D. dissertation, 06 2014.

[107] M. Artetxe, G. Labaka, and E. Agirre, “An effective approach to unsupervised


machine translation,” in Proceedings of the 57th Annual Meeting of the Association for
Computational Linguistics. Florence, Italy: Association for Computational Linguistics,
July 2019, pp. 194–203. [Online]. Available: https://www.aclweb.org/anthology/P19-1019

[108] X. Carreras, I. Chao, L. Padró, and M. Padró, “Freeling: An open-source suite of language
analyzers,” in Proceedings of the 4th International Conference on Language Resources and
Evaluation (LREC’04), 2004.

[109] J. Atserias, B. Casas, E. Comelles, M. González, L. Padró, and M. Padró, “Freeling 1.3:
Syntactic and semantic services in an open-source nlp library,” in Proceedings of the fifth
international conference on Language Resources and Evaluation (LREC 2006). Genoa,
Italy: ELRA, May 2006.

[110] L. Padró, M. Collado, S. Reese, M. Lloberes, and I. Castellón, “Freeling 2.1: Five years
of open-source language processing tools,” in Proceedings of 7th Language Resources and
Evaluation Conference (LREC’10), La Valletta, Malta, May 2010.

[111] L. Padró, “Analizadores multilingües en freeling,” Linguamatica, vol. 3, no. 2, pp. 13–20,
December 2011.

[112] L. Padró and E. Stanilovsky, “Freeling 3.0: Towards wider multilinguality,” in Proceedings
of the Language Resources and Evaluation Conference (LREC 2012). Istanbul, Turkey:
ELRA, May 2012.

[113] J. Pomikálek, “Removing boilerplate and duplicate content from web corpora,” Ph.D.
dissertation, Masaryk university, Faculty of informatics, Brno, Czech Republic, 2011.

[114] P. Viola and M. Jones, “Rapid object detection using a boosted cascade of simple features,”
in Proceedings of the 2001 IEEE Computer Society Conference on Computer Vision and
Pattern Recognition. CVPR 2001, vol. 1, 2001, pp. I–I.

[115] D. Kosmajac and V. Keselj, “Slavic language identification using cascade classifier ap-
proach,” in 2018 17th International Symposium INFOTEH-JAHORINA (INFOTEH),
2018, pp. 1–6.

[116] P. Moritz, R. Nishihara, S. Wang, A. Tumanov, R. Liaw, E. Liang, M. Elibol, Z. Yang,


W. Paul, M. I. Jordan, and I. Stoica, “Ray: A distributed framework for emerging ai
applications,” in Proceedings of the 13th USENIX Conference on Operating Systems Design
and Implementation, ser. OSDI’18. USA: USENIX Association, 2018, p. 561–577.

[117] M. Ott, S. Edunov, A. Baevski, A. Fan, S. Gross, N. Ng, D. Grangier, and M. Auli,
“fairseq: A fast, extensible toolkit for sequence modeling,” in Proceedings of NAACL-HLT
2019: Demonstrations, 2019.

[118] A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen,


Z. Lin, N. Gimelshein, L. Antiga, A. Desmaison, A. Kopf, E. Yang, Z. DeVito,
M. Raison, A. Tejani, S. Chilamkurthy, B. Steiner, L. Fang, J. Bai, and S. Chintala,
“Pytorch: An imperative style, high-performance deep learning library,” in Advances in
Neural Information Processing Systems 32, H. Wallach, H. Larochelle, A. Beygelzimer,
F. d'Alché-Buc, E. Fox, and R. Garnett, Eds. Curran Associates, Inc., 2019,
pp. 8024–8035. [Online]. Available: http://papers.neurips.cc/paper/9015-pytorch-an-imperative-style-high-performance-deep-learning-library.pdf

[119] W. Falcon, “Pytorch lightning,” GitHub. Note: https://github.com/PyTorchLightning/pytorch-lightning, vol. 3, 2019.

[120] S. Edunov, M. Ott, , and S. Gross, “Fairseq: Facebook ai research sequence-to-sequence


toolkit written in python.” [Online]. Available: https://github.com/pytorch/fairseq

[121] M. Artetxe, G. Labaka, and E. Agirre, “Unsupervised statistical machine translation,”


CoRR, vol. abs/1809.01272, 2018. [Online]. Available: http://arxiv.org/abs/1809.01272

[122] ——, “Bilingual lexicon induction through unsupervised machine translation,” in


Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics.

Florence, Italy: Association for Computational Linguistics, July 2019, pp. 5002–5007.
[Online]. Available: https://www.aclweb.org/anthology/P19-1494

[123] M. Popel and O. Bojar, “Training tips for the transformer model,” CoRR, vol.
abs/1804.00247, 2018. [Online]. Available: http://arxiv.org/abs/1804.00247

[124] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, “Dropout:


A simple way to prevent neural networks from overfitting,” Journal of Machine Learning
Research, vol. 15, pp. 1929–1958, 06 2014.

[125] D. Kingma and J. Ba, “Adam: A method for stochastic optimization,” International Con-
ference on Learning Representations, 12 2014.

[126] M. Artetxe, S. Ruder, and D. Yogatama, “On the cross-lingual transferability of monolin-
gual representations,” CoRR, vol. abs/1910.11856, 2019.

[127] M. Baroni, F. Chantree, A. Kilgarriff, and S. Sharoff, “Cleaneval: a competition for


cleaning web pages,” in Proceedings of the Sixth International Conference on Language
Resources and Evaluation (LREC’08). Marrakech, Morocco: European Language
Resources Association (ELRA), May 2008. [Online]. Available: http://www.lrec-conf.org/proceedings/lrec2008/pdf/162_paper.pdf

[128] S. Wang, B. Li, M. Khabsa, H. Fang, and H. Ma, “Linformer: Self-attention with linear
complexity,” arXiv preprint arXiv:2006.04768, 2020.

A Source code and logs

The code will be open-sourced at TeMU's GitHub (https://github.com/temu-bsc). In the meantime, it is provided as a non-distributable ZIP file attached to the delivery of this thesis. The code is organized as follows:

• corpus-cleaner/: Cleaning pipeline.

• corpus-utils-lm/: Model-ready pipeline (tokenization, etc).

• training/: RoBERTa training scripts and utilities (Fairseq).

• evaluation/: Utilities for loading the model in Huggingface and fine-tuning on the downstream tasks. The evaluation data is not provided in the delivery, but it will be published soon (if required, it can be sent on request).

The arguments used in these repositories are mostly described either in the thesis itself or in the remaining appendices. The text logs of the RoBERTca training are also provided, in the training/ directory.

B Cleaning pipeline parameters

usage: clean.py [-h] [--input-path INPUT_PATH] [--output-path OUTPUT_PATH]


[--input-format INPUT_FORMAT] [--output-format OUTPUT_FORMAT]
[--checkpoint-backend {shelve,file}]
[--components COMPONENTS [COMPONENTS ...]] [--parallel]
[--log-every-iter LOG_EVERY_ITER] [--backend BACKEND]
[--only-reduce] [--only-reduce-output] [--debug]
[--extensions EXTENSIONS [EXTENSIONS ...]]
[--encoding ENCODING]
[--encoding-threshold ENCODING_THRESHOLD]
[--encoding-error-policy ENCODING_ERROR_POLICY]
[--url-doc URL_DOC] [--warc-warn] [--no-lang-filter-document]
[--no-language-normalization] [--no-replace-emails]
[--no-remove-hashtags-mentions] [--no-remove-tags]
[--no-space-normalization] [--no-replace-urls]
[--char-length-filter-document CHAR_LENGTH_FILTER_DOCUMENT]
[--no-head-filter] [--digits_filter DIGITS_FILTER]
[--remove-citations] [--lang-chars-filter LANG_CHARS_FILTER]
[--alphanum-filter ALPHANUM_FILTER]
[--uppercase-filter UPPERCASE_FILTER]
[--alphabet-filter ALPHABET_FILTER [ALPHABET_FILTER ...]]
[--lang-filter LANG_FILTER [LANG_FILTER ...]]
[--initial-lang-filter-threshold INITIAL_LANG_FILTER_THRESHOLD]
[--dictionary-filter-doc DICTIONARY_FILTER_DOC]
[--seg-sentences]
[--char-length-filter-sentence CHAR_LENGTH_FILTER_SENTENCE]
[--word-length-filter-sentence WORD_LENGTH_FILTER_SENTENCE]
[--digits-filter-sentence DIGITS_FILTER_SENTENCE]
[--profanity-check]
[--fast-lang-filter-threshold FAST_LANG_FILTER_THRESHOLD]
[--slow-lang-filter-threshold SLOW_LANG_FILTER_THRESHOLD]
[--no-lang-filter-sentence]
[--no-lang-filter-sentence_src_tgt]
[--code-threshold CODE_THRESHOLD]
[--dictionary-filter-sen DICTIONARY_FILTER_SEN]
[--no-dedup-same-doc-sentences] [--no-src-tag-filter]
[--spell-check] [--terminology-norm TERMINOLOGY_NORM]
[--punctuation-norm]
[--document-deduplication-threshold DOCUMENT_DEDUPLICATION_THRESHOLD]
[--remove-glob-rep-sen REMOVE_GLOB_REP_SEN]
[--dedup-buffer DEDUP_BUFFER]
name

positional arguments:
name A name to identify the run

optional arguments:
-h, --help show this help message and exit
--input-path INPUT_PATH
Input data directory
--output-path OUTPUT_PATH
Output data directory
--input-format INPUT_FORMAT
Input data format
--output-format OUTPUT_FORMAT
Output data format
--checkpoint-backend {shelve,file}
Shelve is more convenient but file is more robust. For
distributed executions, we recommend file.
--components COMPONENTS [COMPONENTS ...]
Elements of the pipeline
--parallel Run the cleaner in parallel
--log-every-iter LOG_EVERY_ITER
Log the pipeline every N iterations (-1, silent)
--backend BACKEND Parallel backend (mp or ray)
--only-reduce Only document filter
--only-reduce-output Only document filter for output files
--debug Activate the debug error mode to compare the original
and cleaned sentences
--extensions EXTENSIONS [EXTENSIONS ...]

File extensions to work with (eg. json)
--encoding ENCODING Input encoding format (eg. utf-8). If set to auto, the
program tries to guess the encoding
--encoding-threshold ENCODING_THRESHOLD
Encoding threshold if --encoding auto
(ignored otherwise). If the encoding detector is not
above this threshold, it assigns utf-8.
--encoding-error-policy ENCODING_ERROR_POLICY
Encoding error policy (same options as open())
--url-doc URL_DOC Path to a url list (plain text, one url per line) that
should be filtered and processed
--warc-warn Enable warnings of WARC parser
--no-lang-filter-document
Avoid applying language filter on documents
--no-language-normalization
Avoid applying language-specific normalization
--no-replace-emails Avoid replacing email adresses with "[EMAIL]"
--no-remove-hashtags-mentions
Remove hashtags and mentions.
--no-remove-tags Avoid removing XML/HTML tags
--no-space-normalization
Avoid normalizing white spaces
--no-replace-urls Avoid replacing URLs with "[URL]"
--char-length-filter-document CHAR_LENGTH_FILTER_DOCUMENT
Minimum char length per document. Set to 0 not to
apply any filter.
--no-head-filter Avoid filtering documents coming from a crawler (having
a "heads" attribute) with common HTTP errors.
--digits_filter DIGITS_FILTER
Maximum allowed proportion of digit characters
--remove-citations If used, remove citations in the common square
brackets format, e.g [34]
--lang-chars-filter LANG_CHARS_FILTER
Maximum allowed proportion of characters not belonging
to the alphabet of the language
--alphanum-filter ALPHANUM_FILTER
Maximum allowed proportion of non-alphanumeric
characters
--uppercase-filter UPPERCASE_FILTER
Maximum allowed proportion of uppercase characters
--alphabet-filter ALPHABET_FILTER [ALPHABET_FILTER ...]
Alphabets that should be present (eg. LATIN)
--lang-filter LANG_FILTER [LANG_FILTER ...]
List of languages that should be allowed when filtering
by lang. If not set, no filtering is applied.
--initial-lang-filter-threshold INITIAL_LANG_FILTER_THRESHOLD
If --lang-filter is set, minimum threshold for the
initial lang identifier
--dictionary-filter-doc DICTIONARY_FILTER_DOC
Path to dictionary (plain text, one term per line) of
terms that should not appear in a document
--seg-sentences Segment wrongfully concatenated sentences.
--char-length-filter-sentence CHAR_LENGTH_FILTER_SENTENCE
filter sentences shorter than a given minimum
character length
--word-length-filter-sentence WORD_LENGTH_FILTER_SENTENCE
filter sentences shorter than a given minimum word
length
--digits-filter-sentence DIGITS_FILTER_SENTENCE
Maximum allowed proportion of digit characters in the
sentence
--profanity-check filter sentences with sensible content
--fast-lang-filter-threshold FAST_LANG_FILTER_THRESHOLD
If --lang-filter is set, minimum threshold for the
faster lang identifier
--slow-lang-filter-threshold SLOW_LANG_FILTER_THRESHOLD
If --lang-filter is set, minimum threshold for the
slower lang identifier
--no-lang-filter-sentence
Avoid applying language filter on sentences
--no-lang-filter-sentence_src_tgt
Avoid applying language filter on sentences with
"src=" pattern
--code-threshold CODE_THRESHOLD
Threshold (percentage) of code-like chars and tokens to
filter a sentence (-1 to deactivate)
--dictionary-filter-sen DICTIONARY_FILTER_SEN
Path to dictionary (plain text, one term per line) of
terms that should not appear in a sentence
--no-dedup-same-doc-sentences
Do not deduplicate sentences in the same document.
--no-src-tag-filter Do not remove sentences with the pattern "src=".
--spell-check Apply spell checking.
--terminology-norm TERMINOLOGY_NORM
Path to a terminology dictionary to apply
normalization
--punctuation-norm Apply punctuation normalization.
--document-deduplication-threshold DOCUMENT_DEDUPLICATION_THRESHOLD
Threshold for document de-duplication, expressed as
the percentage of sentence overlap between documents
--remove-glob-rep-sen REMOVE_GLOB_REP_SEN
Whether to remove corpus-level repeated sentences
(threshold of repetitions; -1 to deactivate)
--dedup-buffer DEDUP_BUFFER

Deduplication buffer size, in bytes (default:
100000000)

C Model-ready pipeline parameters

usage: main.py [-h] [--gzip] [--decontaminate-fit-all-memory]


--excluded-sentences EXCLUDED_SENTENCES
[--docs-valid-test DOCS_VALID_TEST]
[--min-len-valid-test MIN_LEN_VALID_TEST]
[--max-len-valid-test MAX_LEN_VALID_TEST] [--lines]
--vocab-name VOCAB_NAME [--vocab-size VOCAB_SIZE] [--seed SEED]
[--min-frequency MIN_FREQUENCY]
[--reserve-tokens RESERVE_TOKENS]
[--limit-alphabet LIMIT_ALPHABET] [--no-clean-text]
[--no-handle-chinese-chars] [--strip-accents] [--lowercase]
[--no-fairseq] [--next-sentence-prediction]
[--no-show-progress] [--tokenizer {wordpiece,bbpe}]
[--extra-tokens EXTRA_TOKENS [EXTRA_TOKENS ...]]
--fairseq-workers FAIRSEQ_WORKERS
[--test-total-updates TEST_TOTAL_UPDATES]
[--test-max-sentences TEST_MAX_SENTENCES]
corpus_path output_path_name language
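A hypothetical invocation of this pipeline is sketched below; the paths, vocabulary name and sizes are illustrative, not the settings used for RoBERTca.

import subprocess

# Hedged example of running the model-ready pipeline on an already cleaned corpus.
subprocess.run([
    "python", "main.py",
    "data/catalan_corpus.txt",   # corpus_path (hypothetical)
    "output/ca_model_ready",     # output_path_name (hypothetical)
    "ca",                        # language
    "--excluded-sentences", "data/excluded_sentences.txt",
    "--vocab-name", "roberta-ca",
    "--vocab-size", "52000",     # illustrative vocabulary size
    "--tokenizer", "bbpe",
    "--fairseq-workers", "16",
], check=True)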

D Data samples (before and after data cleaning)

• Catalan: Available on Google Drive ("debug" tab).

• BNE (general-domain Spanish): Available on Google Drive.

• Biomedical Spanish: Available on Google Drive.

E Model usage

RoBERTca can be used as follows (the example shows mask filling, but the code for other tasks is very similar):

from transformers import FillMaskPipeline
from transformers import AutoTokenizer, AutoModelForMaskedLM

# Load the pretrained weights and the matching tokenizer from a local directory
tokenizer_cat = AutoTokenizer.from_pretrained('pretrained_directory')
model = AutoModelForMaskedLM.from_pretrained('pretrained_directory')
pipeline_cat = FillMaskPipeline(model=model, tokenizer=tokenizer_cat, top_k=1)
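The pipeline can then be queried as in the following sketch; the sentence is illustrative, and using the tokenizer's own mask token avoids hard-coding it:

# Fill the masked token; with top_k=1, only the most likely candidate is returned
sentence = f"La capital de Catalunya és {tokenizer_cat.mask_token}."
print(pipeline_cat(sentence))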

Instead of the local pretrained directory of the model (pretrained_directory), the user will be able to pass the name of the model itself once we upload it to Huggingface's hub, which will be done soon. In the meantime, the weights of the model can be provided on request.
