
Unsupervised Domain Clusters in Pretrained Language Models

Roee Aharoni (1) and Yoav Goldberg (1,2)

(1) Computer Science Department, Bar Ilan University
(2) Allen Institute for Artificial Intelligence
[email protected]


Abstract

The notion of "in-domain data" in NLP is often over-simplistic and vague, as textual data varies in many nuanced linguistic aspects such as topic, style or level of formality. In addition, domain labels are many times unavailable, making it challenging to build domain-specific systems. We show that massive pre-trained language models implicitly learn sentence representations that cluster by domains without supervision – suggesting a simple data-driven definition of domains in textual data. We harness this property and propose domain data selection methods based on such models, which require only a small set of in-domain monolingual data. We evaluate our data selection methods for neural machine translation across five diverse domains, where they outperform an established approach as measured by both BLEU and by precision and recall of sentence selection with respect to an oracle.

[Figure 1: A 2D visualization of average-pooled BERT (bert-base-uncased) hidden-state sentence representations using PCA. The colors represent the domain for each sentence: it, koran, subtitles, medical, law.]

1 Introduction

It is common knowledge in modern NLP that using large amounts of high-quality training data is a key aspect in building successful machine-learning based systems. For this reason, a major challenge when building such systems is obtaining data in the domain of interest. But what defines a domain? Natural language varies greatly across topics, styles, levels of formality, genres and many other linguistic nuances (van der Wees et al., 2015; van der Wees, 2017; Niu et al., 2017). This overwhelming diversity of language makes it hard to find the right data for the task, as it is nearly impossible to well-define the exact requirements from such data with respect to all the aforementioned aspects. On top of that, domain labels are usually unavailable – e.g. in large-scale web-crawled data like Common Crawl [1], which was recently used to train state-of-the-art pretrained language models for various tasks (Raffel et al., 2019).

Domain data selection is the task of selecting the most appropriate data for a domain from a large corpus given a smaller set of in-domain data (Moore and Lewis, 2010; Axelrod et al., 2011; Duh et al., 2013; Silva et al., 2018). In this work, we propose to use the recent, highly successful self-supervised pre-trained language models, e.g. Devlin et al. (2019); Liu et al. (2019), for domain data selection. As pretrained LMs demonstrate state-of-the-art performance across many NLP tasks after being trained on massive amounts of data, we hypothesize that the robust representations they learn can be useful for mapping sentences to domains in an unsupervised, data-driven approach. We show that these models indeed learn to cluster sentence representations by domain without further supervision (e.g. Figure 1), and quantify this phenomenon by fitting Gaussian Mixture Models (GMMs) to the learned representations and measuring the purity of the resulting unsupervised clustering.

[1] https://commoncrawl.org/
We then propose methods to leverage these emergent domain clusters for domain data selection in two ways:

• Via distance-based retrieval in the sentence embedding space induced by the pretrained language model.

• By fine-tuning the pretrained language model for binary classification, where positive examples are from the domain of interest.

Our methods enable selecting relevant data for the task while requiring only a small set of monolingual in-domain data. As they are based solely on the representations learned by self-supervised LMs, they do not require additional domain labels, which are usually vague and over-simplify the notion of domain in textual data. We evaluate our method on data selection for neural machine translation (NMT) using the multi-domain German-English parallel corpus composed by Koehn and Knowles (2017). Our data selection methods enable training NMT models that outperform those trained using the well-established cross-entropy difference method of Moore and Lewis (2010) across five diverse domains, achieving a recall of more than 95% in all cases with respect to an oracle that selects the "true" in-domain data.

Our contributions in this work are as follows. First, we show that pre-trained language models are highly capable of clustering textual data to domains with high accuracy in a purely unsupervised manner. Second, we propose methods to select in-domain data based on this property using vector-space retrieval and positive-unlabeled fine-tuning of pretrained language models for binary classification. Third, we show the applicability of our proposed data selection methods on a popular benchmark for domain adaptation in machine translation. An additional contribution is a new, improved data split we create for this benchmark, as we point out issues with previous splits used in the literature. The code and data for this work are publicly available [2]. We hope this work will encourage more research on understanding the data landscape in NLP, enabling practitioners to "find the right data for the task" in the age of massive models and diverse data sources.

[2] https://github.com/roeeaharoni/unsupervised-domain-clusters

2 Emerging Domain Clusters in Pretrained Language Models

2.1 Motivation

The proliferation of massive pretrained neural language models such as ELMo (Peters et al., 2018), BERT (Devlin et al., 2019) or RoBERTa (Liu et al., 2019) has enabled great progress on many NLP benchmarks (Wang et al., 2018, 2019a). Larger and larger models trained on billions of tokens of raw text are released at an ever-increasing pace (Raffel et al., 2019), enabling the NLP community to fine-tune them for the task of interest. While many works tried to "probe" those models for the morphological, syntactic and semantic information they capture (Tenney et al., 2019; Goldberg, 2019; Clark et al., 2019), an important aspect of language remained overlooked in this context – the domain the data comes from, often referred to as the "data distribution".

The definition of domain is many times vague and over-simplistic (e.g. "medical text" may be used for biomedical research papers and for clinical conversations between doctors and patients, although the two vary greatly in topic, formality etc.). A common definition treats a domain as a data source: "a domain is defined by a corpus from a specific source, and may differ from other domains in topic, genre, style, level of formality, etc." (Koehn and Knowles, 2017). We claim that a more data-driven definition should take place, as different data sources may have sentences with similar traits and vice versa – a single massive web-crawled corpus contains texts in numerous styles, topics and registers. Our analysis in Section 2 shows examples of such cases, e.g. a sentence discussing "Viruses and virus-like organisms" in a legal corpus.

We hypothesize that massive pretrained LMs can learn representations that cluster to domains, as texts from similar domains will appear in similar contexts. We test this hypothesis across several large, publicly-available pretrained LMs; we explore both masked language models (MLMs) and auto-regressive LMs.

2.2 Method

We encode multi-domain data at the sentence level into vector representations. We then cluster these vector representations for each model using a Gaussian Mixture Model (GMM) with k pre-defined clusters. We chose GMM as our clustering approach as it allows soft assignments (vs. hard assignments as in e.g. K-means), which we think fits the task better (as a sentence can be seen as drawn from a mixture of several domains) [3]. In all cases, to create a sentence representation we perform average pooling of the last hidden state (before the softmax layer) for each token in the sentence [4]. To accelerate the clustering process and enable visualization we also experiment with performing dimensionality reduction with PCA over the sentence vectors before clustering them. We experiment with k in 5, 10 and 15 to test how adding flexibility would improve the domain clustering accuracy.

[3] See further discussion comparing GMMs and K-means in Daume (2009).
[4] Using the penultimate layer or others may result in better performance; we leave this for future work.
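To make the pipeline concrete, the following is a minimal sketch of the procedure described above using the HuggingFace Transformers and scikit-learn libraries. The model name, the PCA dimensionality and all variable names are illustrative assumptions, not the authors' exact code.

import torch
from transformers import AutoTokenizer, AutoModel
from sklearn.decomposition import PCA
from sklearn.mixture import GaussianMixture

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")
model.eval()

def embed(sentences, batch_size=32):
    """Average-pool the last hidden state over the (non-padding) tokens of each sentence."""
    vectors = []
    with torch.no_grad():
        for i in range(0, len(sentences), batch_size):
            batch = tokenizer(sentences[i:i + batch_size], padding=True,
                              truncation=True, return_tensors="pt")
            hidden = model(**batch).last_hidden_state           # (B, T, H)
            mask = batch["attention_mask"].unsqueeze(-1)        # (B, T, 1)
            vectors.append((hidden * mask).sum(dim=1) / mask.sum(dim=1))
    return torch.cat(vectors).numpy()

X = embed(sentences)                                            # `sentences`: list of strings
X_reduced = PCA(n_components=50).fit_transform(X)               # optional dimensionality reduction
gmm = GaussianMixture(n_components=5, covariance_type="full", max_iter=150)
cluster_ids = gmm.fit_predict(X_reduced)                        # hard labels derived from the soft GMM

The GMM settings here (full covariance matrices, at most 150 EM iterations) follow the details given in Appendix A.3.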

                  k=5             k=10            k=15
Random            15.08 (±0.0)    16.77 (±0.0)    17.78 (±0.0)
LDA               24.31 (±0.99)   26.73 (±2.19)   30.79 (±2.97)

                  with PCA (n=50)                                  without PCA
                  k=5             k=10            k=15             k=5      k=10     k=15
word2vec          53.65 (±0.79)   68.14 (±2.58)   73.44 (±0.68)    45.93    65.80    76.26
BERT-base         87.66 (±0.24)   88.02 (±1.10)   88.37 (±0.66)    85.74    85.08    86.37
BERT-large        85.64 (±6.13)   87.61 (±0.26)   89.07 (±0.53)    68.56    86.53    86.99
DistilBERT        83.68 (±7.14)   86.31 (±0.86)   87.53 (±0.85)    79.00    86.42    88.14
RoBERTa-base      79.05 (±0.10)   86.39 (±0.90)   86.51 (±0.28)    70.21    80.35    81.49
RoBERTa-large     80.61 (±0.33)   89.04 (±0.15)   89.94 (±0.23)    69.88    81.07    85.91
GPT-2             70.30 (±0.05)   84.76 (±0.30)   82.56 (±1.29)    37.82    39.02    41.45
XLNet             55.72 (±0.69)   68.17 (±3.93)   72.65 (±1.92)    30.36    32.96    48.55

Table 1: Unsupervised domain clustering as measured by purity for the different models. Best results are marked in bold for each setting.

2.3 Models and Baselines

For MLM-based models we use BERT (Devlin et al., 2019), DistilBERT (Sanh et al., 2019) and RoBERTa (Liu et al., 2019) (in both the base and large versions). For autoregressive models we use GPT-2 (Radford et al., 2018) and XLNet (Yang et al., 2019). In all cases we use the implementations from the HuggingFace Transformers toolkit (Wolf et al., 2019). We also evaluated three additional, simpler baselines. The first is using representations from word2vec (Mikolov et al., 2013), where we average-pooled the word vectors for the tokens that were present in the model vocabulary. The second is using Latent Dirichlet Allocation (LDA, Blei et al., 2003), which is a classic approach to unsupervised clustering of text [5]. We also report results for a baseline which assigns sentences by sampling randomly from a uniform distribution over the clusters.

[5] We used the LDA implementation provided in the Gensim toolkit: https://radimrehurek.com/gensim/

2.4 Evaluation

To evaluate the unsupervised domain clustering we used the multi-domain corpus proposed by Koehn and Knowles (2017), which includes textual data in five diverse domains: subtitles [6], medical text (PDF documents from the European Medicines Agency), legal text (legislative text of the European Union), translations of the Koran, and IT-related text (manuals and localization files of open-source software). This dataset includes parallel sentences in English and German; for this experiment we used the English portion of the data. See more details on the dataset in Section 3.1. We used 2000 distinct sentences from each domain. To evaluate whether the resulting clusters indeed capture the domains the data was drawn from we measure the clustering purity, which is a well-known metric for evaluating clustering (Manning et al., 2008). To measure the clustering purity, we assign each unsupervised cluster the most common "true" domain among the sentences assigned to that cluster, and then compute the accuracy according to this majority-based cluster-domain assignment (note that in this case several unsupervised clusters can be assigned to the same domain). In cases where randomness is involved we run each experiment five times with different initializations and report the mean and variance of the purity metric for each model.

[6] From http://www.opensubtitles.org/
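For reference, the purity computation described above can be sketched as follows (the function and variable names are assumptions; `cluster_ids` holds the unsupervised cluster assignments and `domain_labels` the true corpus labels):

import numpy as np
from collections import Counter

def clustering_purity(cluster_ids, domain_labels):
    """Assign each unsupervised cluster its most common true domain and
    return the accuracy of this majority-based cluster-domain assignment."""
    cluster_ids = np.asarray(cluster_ids)
    domain_labels = np.asarray(domain_labels)
    correct = 0
    for c in np.unique(cluster_ids):
        members = domain_labels[cluster_ids == c]
        correct += Counter(members.tolist()).most_common(1)[0][1]
    return correct / len(domain_labels)

Note that, as in the paper, several unsupervised clusters may be mapped to the same domain by this procedure.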

            Predicted:  it    koran   subtitles   medical   law
True: it               1927      0         55         16      2
      koran               4   1767        225          0      4
      subtitles          47     21       1918          9      5
      medical           340      0         82       1413    165
      law               206      0         10         58   1726

Figure 2: A confusion matrix for clustering with k=5 using BERT-base.

2.5 Results and Discussion

As can be seen in Table 1, pre-trained language models are indeed highly capable of generating sentence representations that cluster by domains, resulting in up to 87.66%, 89.04% and 89.94% accuracy when using k=5, k=10 and k=15 clusters, respectively, across 10,000 sentences in 5 domains. We find these scores remarkably high given our straight-forward average-pooling strategy and that no domain-supervision was involved in the process of learning the pre-trained representations. Figure 3 also demonstrates the quality of the obtained clusters in 2D using the BERT-base model, where the ellipses describe the mean and variance parameters learned for each cluster by the GMM with k = 5 [7].

We note that some classes of models did better than others: while all vector-based models did far better than the random and LDA baselines [8], the MLM-based models dominated in all cases over word2vec and the auto-regressive models. This may be explained by the fact that the MLM-based models use the entire sentence context when generating the representations for each token, while the auto-regressive models only use the past context, and word2vec uses a limited window context. Using PCA improved performance in most cases and especially for the auto-regressive models, although the results for the MLMs remain high in both cases – suggesting that these models encode the information very differently.

[7] Similar visualizations for additional models are available in the supplementary material.
[8] Note that the LDA models were trained using the multi-domain data alone, and did not utilize additional pretraining as in the other, more successful models. This may explain their relatively weak performance.

[Figure 3: A 2D visualization of the unsupervised GMM clustering (bert-base-uncased) for the same sentences as in Figure 1. Legend: it, koran, subtitles, medical, law.]

2.6 Analysis

As can be seen in Figure 3, in some areas the domains are somewhat overlapping in the embedding space, which may lead to outlier cases where examples from one domain are assigned to a cluster of another domain. We plot a confusion matrix (Figure 2) to analyze this further based on the clustering with BERT-base and k=5. We first note that the outlier sentences are much shorter than the average sentence length in the corpus (11.62 tokens on average for outliers vs. 20.5 tokens on average in general). This makes sense as shorter sentences contain less information, making it harder to assign them to an appropriate cluster. Table 2 shows examples of outlier sentences, assigned to clusters of domains different from their originating domain. We can see that in many cases the assignments are sensible – for example for sentences originating from the subtitles corpus, a sentence that mentions "great priest" is assigned to the Koran cluster, a sentence that mentions "The International Criminal Court in The Hague" is assigned to the Law cluster, a sentence that mentions "the virus" is assigned to the Medical cluster and so on. This strengthens our claim that defining domains based on the corpus they originated from is over-simplistic, and using a data-driven approach may enable finding better domain assignments across different corpora.

Subtitles assigned to Koran:
  I am Spa'am, high priest of the boars.
  Joseph, go in peace, and the Lord be with you.
Subtitles assigned to Medical:
  Oxygen supply at 50%.
  Or it can help her walk again if the virus is kept in check with this.
Subtitles assigned to IT:
  Push it up to the front of the screen.
  Polyalloy requires programming to take permanent form.
Subtitles assigned to Law:
  Statutes, transcripts, redacted immunity agreements.
  The Security Council therefore must press for his immediate referral to the International Criminal Court in The Hague.
Law assigned to Medical:
  - Viruses and virus-like organisms
  where the glucose content is equal to or less than the fructose content.
Law assigned to IT:
  "INFORMATION SOCIETY STATISTICS
  This document must be attached to the certificate and field with it, except where there is a computerised checking system.
Medical assigned to Law:
  This will be introduced by a Regulation adopted by the European Commission.
  The marketing authorisation was renewed on 22 May 2002 and 22 May 2007.
Medical assigned to IT:
  An updated and improved version of the CD-ROM was issued to all subscribers during the first half of the year.
  - All tables will be based on generic and not product-specific data.
IT assigned to Medical:
  R65: Harmful: may cause lung damage if swallowed
  Automatic Red-Eye Removal
IT assigned to Subtitles:
  At the end we say good bye.
  What would you like to do for your next shot?

Table 2: Sentences from one domain which were assigned to another domain by the BERT-based clustering, k=5.

The domain that attracted the largest number of outliers is the IT domain cluster, with 597 sentences assigned to it from other domains. Looking more closely we find that more than half of these sentences (340 out of 597) included numbers (e.g. "34% 25% 34%" (from medical), "(b) reference number 20 is deleted;" (from law), "(Command of Prostration # 1)" (from Koran) or "The message, R2." (from subtitles)). As numbers appear in many different contexts, they may be harder to assign to a specific domain by the context-aware language models in such short sentences. The second largest attractor of outliers is the Subtitles cluster, with 372 sentences assigned to it from other domains. We find that most of these sentences contain personal pronouns or question marks (228 out of 372, 61.2%) while the ratio of such sentences in the entire corpus is only 40%. Examples include "Why did you choose the name & amarok;?" (from IT), or "What is Avonex?" (from Medical). This may be expected as the subtitles corpus mainly includes transcriptions of spoken, conversational language, and "conversation tends to have more verbs, more personal pronouns, and more questions" (Conrad and Biber, 2005). Another possible reason for the subtitles domain to attract outliers is the fact that this is the least-topical cluster: movies and TV series may discuss diverse topics, unlike medical, religious, legal and technical texts that may have a more cohesive topic.

3 Neural Machine Translation in a Multi-Domain Scenario

As we showed that pre-trained language models are indeed very useful in clustering sentence representations by domains in an unsupervised manner, we now seek to harness this property for a downstream task – domain data selection for machine translation. Domain data selection is the task of selecting examples from a large corpus which are as close as possible to the domain of interest, given a smaller set of in-domain examples. The selected examples can be used to either (1) train a domain-specific model from scratch (Axelrod et al., 2011), (2) fine-tune a pre-trained general-domain model (Sajjad et al., 2017; Silva et al., 2018), or (3) prioritize data for annotation as in an Active-Learning framework, if only monolingual data is available (Haffari et al., 2009). To demonstrate the need for domain data selection and set the stage for our data selection experiments, we perform preliminary experiments with NMT in a multi-domain scenario.

3.1 Multi-Domain Dataset

To simulate a diverse multi-domain setting we use the dataset proposed in Koehn and Knowles (2017), as it was recently adopted for domain adaptation research in NMT (Hu et al., 2019; Müller et al., 2019; Dou et al., 2019a,b). The dataset includes parallel text in German and English from five diverse domains (Medical, Law, Koran, IT, Subtitles; as discussed in Section 2), available via OPUS (Tiedemann, 2012; Aulamo and Tiedemann, 2019).

In a preliminary analysis of the data we found that in both the original train/dev/test split by Koehn and Knowles (2017) and in the more recent split by Müller et al. (2019) there was overlap between the training data and the dev/test data [9].

Fixing these issues is important, as it may affect the conclusions one draws from experiments with this dataset. For example, as overlapping development sets favor memorization of the training set, one may choose checkpoints and report results on over-fitting models. This is especially relevant with neural sequence-to-sequence models, as they are highly susceptible to memorization (Aharoni and Goldberg, 2018) and hallucination (Lee et al., 2018), as confirmed by Müller et al. (2019).

[9] More details are available in the supplementary material.

To create a better experimental setting to test generalization within and across domains, we create a new data split where we ensure that no such overlap between the training, development and test sets occurs. We started from the split of Müller et al. (2019) as it included newer versions of some of the datasets [10]. Furthermore, we did not allow more than one translation of a given source or target sentence, as such cases were very frequent in the dataset and usually stand for duplicate sentence pairs (see Table 3). For example, applying this filtering reduced the size of the Koran corpus from 533,128 sentence pairs to only 17,982. Finally, following Müller et al. (2019) we cap the subtitles corpus at 500,000 sentence pairs as it is much larger than the rest. We make the new split publicly available and hope it will enable better future experimentation on this important subject [11].

[10] Their dataset is available at: https://github.com/ZurichNLP/domain-robustness
[11] https://github.com/roeeaharoni/unsupervised-domain-clusters

             Original      New Split
Medical      1,104,752     248,099
Law          715,372       467,309
IT           378,477       222,927
Koran        533,128       17,982
Subtitles    22,508,639    14,458,058

Table 3: Number of training examples for each domain in the original split (Müller et al., 2019) and in our split.
experimentation on this important subject.11 higher. For example, note how in the results for the
Medical domain-specific model (first row in Table
3.2 Cross-Domain Experiments 4), the BLEU scores on the Law and IT test sets are
Experimental Setup We follow Hu et al. (2019) much higher in comparison to those on the Koran
and train domain-specific models for all domains. and Subtitles test sets, which clusters are farther
We then evaluate each model across the different away in the visualized embedding space. Similarly,
domain test sets, enabling us to understand the ef- as the Subtitles cluster (Blue) is closer to the Koran
fect of different domains on the downstream MT cluster (Green), the highest cross-domain BLEU
performance and to set up strong baselines for data score on the Koran test set is from the Subtitles
selection experiments. We also train a general- model. To further quantify this phenomenon, we
domain model using the available data from all plot and measure Pearson’s correlation between the
domains, as it is also a common approach in multi- cosine similarity of the centroids for the English
domain scenarios (Müller et al., 2019). In all ex- BERT-based dev sentence representations for each
periments we use a similar Transformer (Vaswani domain pair, and the cross-domain BLEU score for
et al., 2017) model, and only control for the train- this domain pair. This is shown in Figure 4. We can
see the general trend where the closer the domain
10
Their dataset is available in: https://github.com/ centroids are (with a similarity of 1 for training
ZurichNLP/domain-robustness
11
https://github.com/roeeaharoni/ and evaluating on the same domain), the higher
unsupervised-domain-clusters the cross-domain BLEU is between those domains,
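A sketch of this correlation analysis follows. It assumes `dev_vectors` maps each domain to the average-pooled BERT vectors of its English dev sentences and `bleu` maps each (train domain, test domain) pair to the SacreBLEU score from Table 4; both names are assumptions for illustration.

import numpy as np
from scipy.stats import pearsonr
from sklearn.metrics.pairwise import cosine_similarity

def centroid(vectors):
    return np.mean(vectors, axis=0, keepdims=True)

similarities, bleu_scores = [], []
for a in dev_vectors:
    for b in dev_vectors:
        sim = cosine_similarity(centroid(dev_vectors[a]), centroid(dev_vectors[b]))[0, 0]
        similarities.append(sim)
        bleu_scores.append(bleu[(a, b)])

r, p_value = pearsonr(similarities, bleu_scores)   # the paper reports r = 0.81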

4 Domain Data Selection with Pretrained Language Models

As shown in the previous section, using the right data is critical for achieving good performance on an in-domain test set, and more data is not necessarily better. However, in real-world scenarios, the availability of data labeled by domain is limited, e.g. when working with large-scale, web-crawled data. In this section we focus on a data-selection scenario where only a very small number of in-domain sentences are used to select data from a larger unlabeled parallel corpus. An established method for data selection was proposed by Moore and Lewis (2010), which was also used in training the winning systems in WMT 2019 (Ng et al., 2019; Barrault et al., 2019). This method compares the cross-entropy, according to domain-specific and non-domain-specific language models, for each candidate sentence for selection. The sentences are then ranked by the cross-entropy difference, and only the top sentences are selected for training.
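Concretely, the Moore-Lewis criterion scores each candidate sentence s by the difference H_in(s) - H_general(s) between its per-word cross-entropy under an in-domain language model and under a general-domain language model, and keeps the sentences with the lowest difference. A hedged sketch is given below; the .cross_entropy() interface is a placeholder for whatever n-gram LM toolkit is used (e.g. KenLM), not a real API.

def cross_entropy_difference(sentence, in_domain_lm, general_lm):
    """Moore-Lewis score: lower means the sentence looks more in-domain."""
    return in_domain_lm.cross_entropy(sentence) - general_lm.cross_entropy(sentence)

def moore_lewis_select(candidates, in_domain_lm, general_lm, top_k):
    ranked = sorted(candidates,
                    key=lambda s: cross_entropy_difference(s, in_domain_lm, general_lm))
    return ranked[:top_k]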
While the method by Moore and Lewis (2010) is tried-and-true, it is based on simple n-gram language models which cannot generalize beyond the n-grams that are seen in the in-domain set. In addition, it is restricted to the in-domain and general-domain datasets it is trained on, which are usually small. On the contrary, pre-trained language models are trained on massive amounts of text, and, as we showed through unsupervised clustering, learn representations with domain-relevant information. In the following sections, we investigate whether this property of pretrained language models makes them useful for domain data selection.

4.1 Methods

We propose two methods for domain data selection with pretrained language models.

Domain-Cosine. In this method we first compute a query vector, which is the element-wise average over the vector representations of the sentences in the small in-domain set. We use the same sentence-level average-pooling approach as described in Section 2 to obtain sentence representations. We then retrieve the most relevant sentences in the training set by computing the cosine similarity of each sentence with this query vector and ranking the sentences accordingly.
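A minimal sketch of Domain-Cosine, reusing the embed() pooling function sketched in Section 2.2 (the function and variable names are assumptions):

import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

def domain_cosine_rank(in_domain_sentences, candidate_sentences, embed):
    """Rank general-domain candidates by cosine similarity to the average
    in-domain sentence vector (the query vector). Returns indices, best first."""
    query = np.mean(embed(in_domain_sentences), axis=0, keepdims=True)   # (1, H)
    candidates = embed(candidate_sentences)                              # (N, H)
    scores = cosine_similarity(candidates, query).ravel()                # (N,)
    return np.argsort(-scores), scores

The top-k indices of this ranking give the selected sentences (k = 500,000 in the experiments below).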
Domain-Finetune. It is now common knowledge that pretrained language models are especially useful when fine-tuned for the task of interest in an end-to-end manner (Ruder et al., 2019). In this method we fine-tune the pretrained LM for binary classification, where we use the in-domain sentences as positive examples, and randomly sampled general-domain sentences as negative examples. We then apply this classifier on the general-domain data and pick the sentences that are classified as positive as in-domain, or choose the top-k sentences as ranked by the classifier output distribution. This can be seen as an instance of positive-unlabeled learning for document-set expansion; see Jacovi et al. (2019) for a recent discussion and methodology for this task.

Negative Sampling with Pre-ranking. One problem that may arise when randomly sampling negative examples is that unlabeled in-domain sentences from the general-domain data may be sampled as negative examples – deteriorating the classifier performance. To alleviate this issue, we perform a biased sampling of negative examples. We first rank the general-domain data using the Domain-Cosine method, and then sample negative examples under a certain threshold in the ranking (in our experiments we sampled from the bottom two-thirds). Table 5 shows an ablation for such pre-ranking, measuring precision, recall and F1 for binary classification on a held-out set for each domain. When not using pre-ranking, as the training data for the domain is larger, the precision is lower – since more in-domain examples are drawn as negative samples. Using pre-ranking indeed alleviates this issue, achieving higher F1 scores in all cases. Given the results in Table 5 we always use pre-ranking in the following experiments.

             without pre-ranking        with pre-ranking
             p       r       F1         p       r       F1
Subtitles    0.722   0.984   0.833      0.964   0.978   0.971
Law          0.761   0.94    0.841      0.944   0.94    0.942
Medical      0.821   0.916   0.866      0.929   0.92    0.925
IT           0.848   0.956   0.898      0.955   0.98    0.967
Koran        0.966   0.958   0.962      0.994   0.974   0.984

Table 5: Ablation analysis showing precision (p), recall (r) and F1 for the binary classification accuracy on a held-out set, with and without pre-ranking.
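A sketch of this biased negative sampling, assuming ranked_indices is the best-first ordering returned by the Domain-Cosine sketch above:

import random

def sample_negatives_with_preranking(general_domain, ranked_indices,
                                     n_negatives, bottom_fraction=2/3):
    """Sample negatives only from the candidates ranked least similar to the
    in-domain query vector, to avoid drawing unlabeled in-domain sentences."""
    cutoff = int(len(ranked_indices) * (1 - bottom_fraction))
    bottom_pool = list(ranked_indices[cutoff:])      # the bottom two-thirds of the ranking
    return [general_domain[i] for i in random.sample(bottom_pool, n_negatives)]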
4.2 Experimental Setup

We perform data selection experiments for each domain in the multi-domain dataset. As the small set of monolingual in-domain data we take the 2000 development sentences from each domain. For the general-domain corpus we concatenate the training data from all domains, resulting in 1,456,317 sentences. To enable faster experimentation we used DistilBERT (Sanh et al., 2019) for the Domain-Cosine and Domain-Finetune methods. More technical details are available in the supplementary material. We compare our methods to four approaches: (1) the established method by Moore and Lewis (2010), (2) a random selection baseline, (3) an oracle which is trained on all the available in-domain data, and (4) the model we train on all the domains concatenated. We select the top 500k examples to cover the size of every specific in-domain dataset. We train Transformer NMT models on the selected data with a similar configuration to the ones trained in the cross-domain evaluation.

4.3 Results

                              Medical   Law    Koran   IT     Subtitles   Average
Random-500k                   49.8      53.3   18.5    37.5   25.5        36.92
Moore-Lewis-Top-500k          55        58     21.4    42.7   27.3        40.88
Domain-Cosine-Top-500k        52.7      58     22      42.5   27.1        40.46
Domain-Finetune-Top-500k      54.8      58.8   21.8    43.5   27.4        41.26
Domain-Finetune-Positive      55.3      58.7   19.2    42.5   27          40.54
Oracle                        56.5      59     15.9    43     27.3        40.34
All                           53.3      57.2   20.9    42.1   27.6        40.22

Table 6: SacreBLEU scores for the data selection experiments. Highest scores per column are marked in bold.

The results are available in Table 6. We can see that all selection methods performed much better in terms of BLEU than random selection. It is also nice to see that all selection methods performed better than using all the available data or the oracle-selected data when averaged across all domains, showing again that more data is not necessarily better in multi-domain scenarios and that data selection is a useful approach. Regarding a comparison of the data selection methods, Moore-Lewis performed better than Domain-Cosine, while Domain-Finetune performed best, showing the benefit of fine-tuning large pretrained models for the data selection task. Using the positively-labeled examples alone (Domain-Finetune-Positive) performed worse than using the top 500k examples but better than Domain-Cosine, while not requiring one to determine the number of selected sentences.

4.4 Analysis

             Moore-Lewis      D-Cosine         D-Finetune
             p       r        p       r        p       r
Medical      0.476   0.955    0.391   0.788    0.485   0.975
Law          0.836   0.894    0.841   0.899    0.902   0.965
Koran        0.35    0.985    0.36    0.989    0.36    0.998
IT           0.441   0.985    0.382   0.857    0.447   0.998
Subtitles    0.899   0.899    0.916   0.916    0.957   0.957
Average      0.6     0.944    0.578   0.89     0.63    0.979

Table 7: Precision (p) and recall (r) for data selection of 500k sentences with respect to the oracle selection.

We perform an analysis on the selected datasets, where we measure the precision and recall of sentence selection with respect to the oracle selection. The results are available in Table 7. As also reflected in the BLEU scores, the Domain-Finetune method resulted in the highest domain recall with a minimum of 97.5, while Moore-Lewis and Domain-Cosine scored 89.4 and 78.8 respectively. We find these results very appealing given that only 2000 in-domain sentences were used for selection for each domain out of 1.45 million sentences. Also note that we used DistilBERT in these experiments: we believe that using larger, non-distilled models may result in even better selection performance (although at the price of larger computational requirements).
5 Related Work

Previous works used n-gram LMs for data selection (Moore and Lewis, 2010; Axelrod et al., 2011) or other count-based methods (Axelrod, 2017; Poncelas et al., 2018; Parcheta et al., 2018; Santamaría and Axelrod, 2019). While such methods work well in practice, they cannot generalize beyond the n-grams observed in the in-domain datasets, which are usually small.

Duh et al. (2013) proposed to replace n-gram models with RNN-based LMs, with notable improvements. However, such methods do not capture the rich sentence-level global context as in the recent self-attention-based MLMs; as we showed in the clustering experiments, autoregressive neural LMs were inferior to masked LMs in clustering the data by domain. In addition, training large LMs may be prohibitive without relying on pre-training.

Regarding domain clustering for MT, Hasler et al. (2014) discovered topics using LDA instead of using domain labels. Cuong et al. (2016) induced latent subdomains from the training data using a dedicated probabilistic model.

Many works used vector-based retrieval for data selection; Ruder and Plank (2017) learn to select data using Bayesian optimization, and explored word2vec for that purpose. Duma and Menzel (2016) create paragraph vectors for data selection in the context of SMT. Wang et al. (2017) use internal representations from the NMT model to perform data selection. Bapna and Firat (2019) propose a mechanism for incorporating retrieved sentences for each instance for domain adaptation in NMT, using representations extracted from a pre-trained NMT model. Farajian et al. (2017) explored instance-based data selection in a multi-domain scenario using information retrieval methods.

Other related works on domain adaptation include Dou et al. (2019a), which adapts multi-domain NMT models with domain-aware feature embeddings that are learned via an auxiliary language modeling task. Peris et al. (2017) proposed neural-network based classifiers for data selection in SMT. For more related work on data selection and domain adaptation in the context of MT, see the surveys by Eetemadi et al. (2015) for SMT and more recently Chu and Wang (2018) for NMT.

Unrelated to MT, Ma et al. (2019) used BERT to select data for tasks from the GLUE benchmark (Wang et al., 2018). However, they assumed supervision for all the different tasks/domains, while we propose an unsupervised method requiring only a small set of in-domain data. Also in the context of pretrained language models, Gururangan et al. (2020) show the importance of additional pre-training with in-domain data to improve the downstream task-specific performance.

While previous work made important contributions to domain data selection, our work is the first to explore massive pretrained language models for both unsupervised domain clustering and for data selection in NMT.

6 Conclusions and Future Work

We showed that massive pre-trained language models are highly effective in mapping data to domains in a fully-unsupervised manner using average-pooled sentence representations and GMM-based clustering. We suggest that such clusters are a more appropriate, data-driven approach to domains in natural language than simplistic labels (e.g. "medical text"), and that it will improve over time as better and larger pretrained LMs become available.

We proposed new methods to harness this property for domain data selection using distance-based ranking in vector space and pretrained LM fine-tuning, requiring only a small set of in-domain data. We demonstrated the effectiveness of our methods on a new, improved data split we created for a previously studied multi-domain machine translation benchmark. Our methods perform similarly or better than an established data selection method and oracle in-domain training across all five domains in the benchmark.

This work just scratches the surface of what can be done on the subject; possible avenues for future work include extending this with multilingual data selection and multilingual LMs (Conneau and Lample, 2019; Conneau et al., 2019; Wu et al., 2019; Hu et al., 2020), using such selection methods with domain-curriculum training (Zhang et al., 2019; Wang et al., 2019b), applying them to noisy, web-crawled data (Junczys-Dowmunt, 2018) or to additional tasks (Gururangan et al., 2020). Another interesting avenue is applying this to unsupervised NMT, which is highly sensitive to domain mismatch (Marchisio et al., 2020; Kim et al., 2020). We hope this work will encourage more research on finding the right data for the task, towards more efficient and robust NLP.

Acknowledgements

We thank Wei Wang for early discussions on domain adaptation and data selection that inspired this work during Roee's internship in Google Translate.
text of pretrained language models, Gururangan work during Roee’s internship in Google Translate.

References

Roee Aharoni and Yoav Goldberg. 2018. Split and rephrase: Better evaluation and stronger baselines. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 719–724, Melbourne, Australia. Association for Computational Linguistics.
Mikko Aulamo and Jörg Tiedemann. 2019. The OPUS resource repository: An open package for creating parallel corpora and machine translation services. In Proceedings of the 22nd Nordic Conference on Computational Linguistics, pages 389–394, Turku, Finland. Linköping University Electronic Press.
Amittai Axelrod. 2017. Cynical selection of language model training data. arXiv preprint arXiv:1709.02279.
Amittai Axelrod, Xiaodong He, and Jianfeng Gao. 2011. Domain adaptation via pseudo in-domain data selection. In Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing, pages 355–362, Edinburgh, Scotland, UK. Association for Computational Linguistics.
Ankur Bapna and Orhan Firat. 2019. Non-parametric adaptation for neural machine translation. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 1921–1931, Minneapolis, Minnesota. Association for Computational Linguistics.
Loïc Barrault, Ondřej Bojar, Marta R. Costa-jussà, Christian Federmann, Mark Fishel, Yvette Graham, Barry Haddow, Matthias Huck, Philipp Koehn, Shervin Malmasi, Christof Monz, Mathias Müller, Santanu Pal, Matt Post, and Marcos Zampieri. 2019. Findings of the 2019 conference on machine translation (WMT19). In Proceedings of the Fourth Conference on Machine Translation (Volume 2: Shared Task Papers, Day 1), pages 1–61, Florence, Italy. Association for Computational Linguistics.
David M. Blei, Andrew Y. Ng, and Michael I. Jordan. 2003. Latent Dirichlet allocation. Journal of Machine Learning Research, 3(Jan):993–1022.
Chenhui Chu and Rui Wang. 2018. A survey of domain adaptation for neural machine translation. In Proceedings of the 27th International Conference on Computational Linguistics, pages 1304–1319, Santa Fe, New Mexico, USA. Association for Computational Linguistics.
Kevin Clark, Urvashi Khandelwal, Omer Levy, and Christopher D. Manning. 2019. What does BERT look at? An analysis of BERT's attention. arXiv preprint arXiv:1906.04341.
Alexis Conneau, Kartikay Khandelwal, Naman Goyal, Vishrav Chaudhary, Guillaume Wenzek, Francisco Guzmán, Edouard Grave, Myle Ott, Luke Zettlemoyer, and Veselin Stoyanov. 2019. Unsupervised cross-lingual representation learning at scale. arXiv preprint arXiv:1911.02116.
Alexis Conneau and Guillaume Lample. 2019. Cross-lingual language model pretraining. In Advances in Neural Information Processing Systems, pages 7057–7067.
Susan M. Conrad and Douglas Biber. 2005. The frequency and use of lexical bundles in conversation and academic prose. Lexicographica.
Hoang Cuong, Khalil Sima'an, and Ivan Titov. 2016. Adapting to all domains at once: Rewarding domain invariance in SMT. Transactions of the Association for Computational Linguistics, 4:99–112.
Hal Daume. 2009. K-means vs GMM, sum-product vs max-product.
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota. Association for Computational Linguistics.
Zi-Yi Dou, Junjie Hu, Antonios Anastasopoulos, and Graham Neubig. 2019a. Unsupervised domain adaptation for neural machine translation with domain-aware feature embeddings. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 1417–1422, Hong Kong, China. Association for Computational Linguistics.
Zi-Yi Dou, Xinyi Wang, Junjie Hu, and Graham Neubig. 2019b. Domain differential adaptation for neural machine translation. In Proceedings of the 3rd Workshop on Neural Generation and Translation, Hong Kong. Association for Computational Linguistics.
Kevin Duh, Graham Neubig, Katsuhito Sudoh, and Hajime Tsukada. 2013. Adaptation data selection using neural language models: Experiments in machine translation. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 678–683, Sofia, Bulgaria. Association for Computational Linguistics.
Mirela-Stefania Duma and Wolfgang Menzel. 2016. Data selection for IT texts using paragraph vector. In Proceedings of the First Conference on Machine Translation: Volume 2, Shared Task Papers, pages 428–434, Berlin, Germany. Association for Computational Linguistics.
Sauleh Eetemadi, William Lewis, Kristina Toutanova, and Hayder Radha. 2015. Survey of data-selection methods in statistical machine translation. Machine Translation, 29(3-4):189–223.
M. Amin Farajian, Marco Turchi, Matteo Negri, and Marcello Federico. 2017. Multi-domain neural machine translation through unsupervised adaptation. In Proceedings of the Second Conference on Machine Translation, pages 127–137, Copenhagen, Denmark. Association for Computational Linguistics.
Guillem Gascó, Martha-Alicia Rocha, Germán Sanchis-Trilles, Jesús Andrés-Ferrer, and Francisco Casacuberta. 2012. Does more data always yield better translations? In Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics, pages 152–161, Avignon, France. Association for Computational Linguistics.
Yoav Goldberg. 2019. Assessing BERT's syntactic abilities. arXiv preprint arXiv:1901.05287.
Suchin Gururangan, Ana Marasović, Swabha Swayamdipta, Kyle Lo, Iz Beltagy, Doug Downey, and Noah A. Smith. 2020. Don't stop pretraining: Adapt language models to domains and tasks. ACL.
Gholamreza Haffari, Maxim Roy, and Anoop Sarkar. 2009. Active learning for statistical phrase-based machine translation. In Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pages 415–423, Boulder, Colorado. Association for Computational Linguistics.
Eva Hasler, Phil Blunsom, Philipp Koehn, and Barry Haddow. 2014. Dynamic topic adaptation for phrase-based MT. In Proceedings of the 14th Conference of the European Chapter of the Association for Computational Linguistics, pages 328–337, Gothenburg, Sweden. Association for Computational Linguistics.
Kenneth Heafield. 2011. KenLM: Faster and smaller language model queries. In Proceedings of the Sixth Workshop on Statistical Machine Translation, pages 187–197, Edinburgh, Scotland. Association for Computational Linguistics.
Junjie Hu, Sebastian Ruder, Aditya Siddhant, Graham Neubig, Orhan Firat, and Melvin Johnson. 2020. XTREME: A massively multilingual multi-task benchmark for evaluating cross-lingual generalization. arXiv preprint arXiv:2003.11080.
Junjie Hu, Mengzhou Xia, Graham Neubig, and Jaime Carbonell. 2019. Domain adaptation of neural machine translation by lexicon induction. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy. Association for Computational Linguistics.
Alon Jacovi, Gang Niu, Yoav Goldberg, and Masashi Sugiyama. 2019. Scalable evaluation and improvement of document set expansion via neural positive-unlabeled learning. arXiv preprint arXiv:1910.13339.
Marcin Junczys-Dowmunt. 2018. Dual conditional cross-entropy filtering of noisy parallel corpora. In Proceedings of the Third Conference on Machine Translation: Shared Task Papers, pages 888–895, Belgium, Brussels. Association for Computational Linguistics.
Yunsu Kim, Miguel Graça, and Hermann Ney. 2020. When and why is unsupervised neural machine translation useless? arXiv preprint arXiv:2004.10581.
Diederik P. Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.
Philipp Koehn, Hieu Hoang, Alexandra Birch, Chris Callison-Burch, Marcello Federico, Nicola Bertoldi, Brooke Cowan, Wade Shen, Christine Moran, Richard Zens, Chris Dyer, Ondřej Bojar, Alexandra Constantin, and Evan Herbst. 2007. Moses: Open source toolkit for statistical machine translation. In Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics Companion Volume Proceedings of the Demo and Poster Sessions, pages 177–180, Prague, Czech Republic. Association for Computational Linguistics.
Philipp Koehn and Rebecca Knowles. 2017. Six challenges for neural machine translation. In Proceedings of the First Workshop on Neural Machine Translation, pages 28–39, Vancouver. Association for Computational Linguistics.
Katherine Lee, Orhan Firat, Ashish Agarwal, Clara Fannjiang, and David Sussillo. 2018. Hallucinations in neural machine translation.
Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692.
Xiaofei Ma, Peng Xu, Zhiguo Wang, Ramesh Nallapati, and Bing Xiang. 2019. Domain adaptation with BERT-based domain classification and data selection. In Proceedings of the 2nd Workshop on Deep Learning Approaches for Low-Resource NLP (DeepLo 2019), pages 76–83, Hong Kong, China. Association for Computational Linguistics.
Christopher D. Manning, Prabhakar Raghavan, and Hinrich Schütze. 2008. Introduction to Information Retrieval. Cambridge University Press.
Kelly Marchisio, Kevin Duh, and Philipp Koehn. 2020. When does unsupervised machine translation work? arXiv preprint arXiv:2004.05516.
Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S. Corrado, and Jeff Dean. 2013. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems, pages 3111–3119.
Robert C. Moore and William Lewis. 2010. Intelligent selection of language model training data. In Proceedings of the ACL 2010 Conference Short Papers, pages 220–224, Uppsala, Sweden. Association for Computational Linguistics.
Mathias Müller, Annette Rios, and Rico Sennrich. 2019. Domain robustness in neural machine translation.
Nathan Ng, Kyra Yee, Alexei Baevski, Myle Ott, Michael Auli, and Sergey Edunov. 2019. Facebook FAIR's WMT19 news translation task submission. In Proceedings of the Fourth Conference on Machine Translation (Volume 2: Shared Task Papers, Day 1), pages 314–319, Florence, Italy. Association for Computational Linguistics.
Xing Niu, Marianna Martindale, and Marine Carpuat. 2017. A study of style in machine translation: Controlling the formality of machine translation output. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 2814–2819, Copenhagen, Denmark. Association for Computational Linguistics.
Myle Ott, Sergey Edunov, Alexei Baevski, Angela Fan, Sam Gross, Nathan Ng, David Grangier, and Michael Auli. 2019. fairseq: A fast, extensible toolkit for sequence modeling. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics (Demonstrations), pages 48–53, Minneapolis, Minnesota. Association for Computational Linguistics.
Zuzanna Parcheta, Germán Sanchis-Trilles, and Francisco Casacuberta. 2018. Data selection for NMT using infrequent n-gram recovery. EAMT 2018.
F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. 2011. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830.
Álvaro Peris, Mara Chinea-Ríos, and Francisco Casacuberta. 2017. Neural networks classifier for data selection in statistical machine translation. The Prague Bulletin of Mathematical Linguistics, 108(1):283–294.
Matthew Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep contextualized word representations. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 2227–2237, New Orleans, Louisiana. Association for Computational Linguistics.
Alberto Poncelas, Gideon Maillette de Buy Wenniger, and Andy Way. 2018. Data selection with feature decay algorithms using an approximated target side. arXiv preprint arXiv:1811.03039.
Matt Post. 2018. A call for clarity in reporting BLEU scores. In Proceedings of the Third Conference on Machine Translation: Research Papers, pages 186–191, Belgium, Brussels. Association for Computational Linguistics.
Ofir Press and Lior Wolf. 2017. Using the output embedding to improve language models. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers, pages 157–163, Valencia, Spain. Association for Computational Linguistics.
Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. 2018. Improving language understanding by generative pre-training. OpenAI blog.
Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. 2019. Exploring the limits of transfer learning with a unified text-to-text transformer.
Sebastian Ruder, Matthew E. Peters, Swabha Swayamdipta, and Thomas Wolf. 2019. Transfer learning in natural language processing. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Tutorials, pages 15–18, Minneapolis, Minnesota. Association for Computational Linguistics.
Sebastian Ruder and Barbara Plank. 2017. Learning to select data for transfer learning with Bayesian optimization. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 372–382, Copenhagen, Denmark. Association for Computational Linguistics.
Hassan Sajjad, Nadir Durrani, Fahim Dalvi, Yonatan Belinkov, and Stephan Vogel. 2017. Neural machine translation training in a multi-domain scenario. arXiv preprint arXiv:1708.08712.
Victor Sanh, Lysandre Debut, Julien Chaumond, and Thomas Wolf. 2019. DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter. arXiv preprint arXiv:1910.01108.
Lucía Santamaría and Amittai Axelrod. 2019. Data selection with cluster-based language difference models and cynical selection. arXiv preprint arXiv:1904.04900.
Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016. Neural machine translation of rare words with subword units. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1715–1725, Berlin, Germany. Association for Computational Linguistics.
Catarina Cruz Silva, Chao-Hong Liu, Alberto Poncelas, and Andy Way. 2018. Extracting in-domain training corpora for neural machine translation using data selection methods. In Proceedings of the Third Conference on Machine Translation: Research Papers, pages 224–231, Belgium, Brussels. Association for Computational Linguistics.
Ian Tenney, Dipanjan Das, and Ellie Pavlick. 2019. BERT rediscovers the classical NLP pipeline. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 4593–4601, Florence, Italy. Association for Computational Linguistics.
Jörg Tiedemann. 2012. Parallel data, tools and interfaces in OPUS. In Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC-2012), pages 2214–2218, Istanbul, Turkey. European Language Resources Association (ELRA).
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems 30, pages 5998–6008. Curran Associates, Inc.
Alex Wang, Yada Pruksachatkun, Nikita Nangia, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R. Bowman. 2019a. SuperGLUE: A stickier benchmark for general-purpose language understanding systems. arXiv preprint arXiv:1905.00537.
Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel Bowman. 2018. GLUE: A multi-task benchmark and analysis platform for natural language understanding. In Proceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, pages 353–355, Brussels, Belgium. Association for Computational Linguistics.
Rui Wang, Andrew Finch, Masao Utiyama, and Eiichiro Sumita. 2017. Sentence embedding for neural machine translation domain adaptation. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 560–566, Vancouver, Canada. Association for Computational Linguistics.
Wei Wang, Isaac Caswell, and Ciprian Chelba. 2019b. Dynamically composing domain-data selection with clean-data selection by "co-curricular learning" for neural machine translation. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 1282–1292, Florence, Italy. Association for Computational Linguistics.
Marlies van der Wees. 2017. What's in a Domain? Towards Fine-Grained Adaptation for Machine Translation. Ph.D. thesis, University of Amsterdam.
Marlies van der Wees, Arianna Bisazza, Wouter Weerkamp, and Christof Monz. 2015. What's in a domain? Analyzing genre and topic differences in statistical machine translation. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 2: Short Papers), pages 560–566, Beijing, China. Association for Computational Linguistics.
Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, and Jamie Brew. 2019. HuggingFace's Transformers: State-of-the-art natural language processing. ArXiv, abs/1910.03771.
Shijie Wu, Alexis Conneau, Haoran Li, Luke Zettlemoyer, and Veselin Stoyanov. 2019. Emerging cross-lingual structure in pretrained language models. arXiv preprint arXiv:1911.01464.
Zhilin Yang, Zihang Dai, Yiming Yang, Jaime Carbonell, Ruslan Salakhutdinov, and Quoc V. Le. 2019. XLNet: Generalized autoregressive pretraining for language understanding. arXiv preprint arXiv:1906.08237.
Xuan Zhang, Pamela Shapiro, Gaurav Kumar, Paul McNamee, Marine Carpuat, and Kevin Duh. 2019. Curriculum learning for domain adaptation in neural machine translation. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 1903–1915, Minneapolis, Minnesota. Association for Computational Linguistics.
A Appendix

A.1 NMT Training

Figure 5 details the hyperparameter configuration we used to train the NMT models. We use Transformer models (Vaswani et al., 2017) in the Base configuration using the implementation provided in Fairseq (Ott et al., 2019). For all models we use a joint BPE vocabulary (Sennrich et al., 2016) learned with 32k merge operations over the concatenated corpus in both languages, enabling us to tie all the embedding layers (Press and Wolf, 2017) [12]. We perform early stopping if the BLEU score on the domain-specific development set did not improve in 10 consecutive checkpoints. We use the ADAM (Kingma and Ba, 2014) optimizer with an initial learning rate of 5·10^-4 and a maximum of 4096 tokens per batch. We trained all models on a single NVIDIA GPU. We decode using beam search with a beam size of 5. For pre-processing we used the Moses (Koehn et al., 2007) pipeline including tokenization, normalize-punctuation, non-printing character removal, truecasing and cleaning. We removed examples with sequences longer than 100 tokens from the training data (before subword segmentation).

[12] We used the implementation in https://github.com/rsennrich/subword-nmt

CUDA_VISIBLE_DEVICES=0 \
python $FAIRSEQ_PATH/train.py ${BINARIZED_DATA_DIR} \
    --arch transformer_wmt_en_de \
    --share-all-embeddings \
    --optimizer adam \
    --adam-betas '(0.9, 0.98)' \
    --clip-norm 1.0 \
    --lr 0.0005 \
    --lr-scheduler inverse_sqrt \
    --warmup-updates 4000 \
    --warmup-init-lr 1e-07 \
    --dropout 0.2 \
    --weight-decay 0.0 \
    --criterion label_smoothed_cross_entropy \
    --label-smoothing 0.1 \
    --max-tokens 4096 \
    --update-freq 5 \
    --attention-dropout 0.2 \
    --activation-dropout 0.2 \
    --max-epoch 200 \
    --seed 17 \
    -s $src \
    -t $tgt \
    --save-dir $MODEL_PATH \
    --save-interval-updates 10000 \
    --validate-interval 1

Figure 5: The hyperparameter configuration we used for NMT model training using Fairseq (Ott et al., 2019).

A.2 Data Split

Table 8 shows details about the overlap between the training, development and test sets for the different data splits of the multi-domain dataset. The overlap was computed using the English part of the corpus.

A.3 GMM Clustering

We learn GMMs with full covariance matrices, i.e. without constraints on the covariance matrices that determine the shape of each component in the mixture, as implemented in scikit-learn (Pedregosa et al., 2011). We train the models until convergence or for a maximum of 150 EM iterations.

A.4 Language Model Finetuning

We fine-tune the binary classification head for 5 epochs. We use the ADAM (Kingma and Ba, 2014) optimizer with an initial learning rate of 2·10^-5. We train the model using 4 NVIDIA GPUs with 256 sentences per batch (64 per GPU).

A.5 Moore-Lewis Implementation

We used the implementation of Moore and Lewis (2010) by Pamela Shapiro, as available in: https://github.com/pamelashapiro/moore-lewis. This implementation uses the KenLM N-Gram language model toolkit (Heafield, 2011).

A.6 Additional Visualizations

Figure 6 shows visualizations of the multi-domain dataset from additional pre-trained masked language models (BERT-large and RoBERTa), and Figure 7 shows the same visualization for autoregressive models (XLNet and GPT2).
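For completeness, a sketch of how such 2D visualizations can be produced from the pooled sentence vectors (PCA to two components, points colored by domain). The use of matplotlib and all names here are assumptions; ellipses like those in Figure 3 can additionally be drawn from the fitted GMM's means_ and covariances_ attributes.

import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

def plot_domains_2d(vectors, domains, outfile="clusters.png"):
    """Project sentence vectors to 2D with PCA and color each point by its domain label."""
    points = PCA(n_components=2).fit_transform(vectors)
    domains = np.array(domains)
    for domain in sorted(set(domains)):
        mask = domains == domain
        plt.scatter(points[mask, 0], points[mask, 1], s=4, alpha=0.5, label=domain)
    plt.legend()
    plt.savefig(outfile, dpi=200)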

                           Koehn and Knowles (2017)   Müller et al. (2019)   New Split
% dev in train
  Medical                  1090/2000 (54.5%)          1204/2000 (60.2%)      0/2000
  Koran                    0/2000                     1926/2000 (96.3%)      0/2000
  Subtitles                1183/5000 (23.66%)         638/2000 (31.9%)       0/2000
  Law                      595/2000 (29.75%)          1000/2000 (50%)        0/2000
  IT                       2496/2526 (98.81%)         783/2000 (39.15%)      0/2000
% test in train
  Medical                  571/2000 (28.55%)          516/1691 (30.51%)      0/2000
  Koran                    0/2000                     1949/2000 (97.45%)     0/2000
  Subtitles                451/5000 (9.02%)           478/2000 (23.9%)       0/2000
  Law                      649/2000 (32.45%)          966/2000 (48.3%)       0/2000
  IT                       945/1856 (50.92%)          1036/2000 (51.8%)      0/2000

Table 8: Details about the different data splits for the multi-domain corpus.

[Figure 6: 2D visualizations of the unsupervised GMM-based clustering for different pretrained MLMs (bert-large-cased and roberta-large). Legend: it, koran, subtitles, medical, law.]

[Figure 7: 2D visualizations of the unsupervised GMM-based clustering for different pretrained auto-regressive LMs (xlnet-base-cased and gpt2). Legend: it, koran, subtitles, medical, law.]
