
Language Model Adaptation to Specialized Domains through Selective Masking based on Genre and Topical Characteristics

Anas Belfathi, Ygor Gallina, Nicolas Hernandez, Richard Dufour, Laura Monceaux
LS2N, UMR CNRS 6004, Nantes Université
{firstname.lastname}@univ-nantes.fr

Abstract

Recent advances in pre-trained language modeling have facilitated significant progress across various natural language processing (NLP) tasks. Word masking during model training constitutes a pivotal component of language modeling in architectures like BERT. However, the prevalent method of word masking relies on random selection, potentially disregarding domain-specific linguistic attributes. In this article, we introduce an innovative masking approach leveraging genre and topicality information to tailor language models to specialized domains. Our method incorporates a ranking process that prioritizes words based on their significance, subsequently guiding the masking procedure. Experiments conducted using continual pre-training within the legal domain have underscored the efficacy of our approach on the LexGLUE benchmark in the English language. Pre-trained language models and code are freely available for use.

1 Introduction

Large-scale, pre-trained language models (PLMs) have become indispensable in modeling human language, significantly advancing performance across diverse Natural Language Processing (NLP) tasks (Bao et al., 2020; Guu et al., 2020; Zhang et al., 2022). Among these architectures, masked language models (MLMs) like BERT (Devlin et al., 2019) are prominent: tokens within a sequence are intentionally masked during training, compelling the model to predict these tokens based on the surrounding context. This method enables the model to grasp intricate semantic relationships and syntactic structures inherent in natural language. However, training MLMs from scratch demands substantial resources in terms of data and computing power.

In the context of specialized domains, adaptation through continual pre-training remains the conventional approach, drawing upon domain-specific data to refine pre-trained models (Chalkidis et al., 2020a; Wu et al., 2021; Ke et al., 2022; Labrak et al., 2023). This process invariably entails token masking, where critical factors such as the masking ratio and token selection play pivotal roles. Prior research efforts (Sun et al., 2019a; Joshi et al., 2020; Levine et al., 2020; Li and Zhao, 2021) have delved into fine-tuning with selectively chosen information, encompassing words, tokens, and spans.

In this paper, we introduce an original masking approach that harnesses genre and topicality information to tailor MLMs to specialized domains. Our method integrates a ranking process with meta-discourse and TF×IDF scoring to prioritize tokens based on contextual significance, guiding the masking procedure. By systematically identifying and masking tokens crucial to domain-specific contexts, we compel the model to learn to understand and predict essential domain-specific words. To illustrate the effectiveness of our strategies, we conduct experiments on the continual pre-training (CPT) of BERT models towards the legal domain, comparing various token masking strategies.

Our contributions include:

• We propose original masking strategies based on word selection (meta-discourse and TF×IDF) for language model training.
• We develop a systematic approach for incorporating the selected words effectively during the training process.
• We release open-source models and code designed for adaptable training, facilitating MLM pre-training for specific domains based on our approach¹.

¹ github.com/ygorg/legal-masking

2 Related work

The classical masking strategy in BERT (Devlin et al., 2019) masks 15% of the tokens within a given sentence. The approach involves the random replacement (10% chance), preservation (10%), or substitution with the special [MASK] token (80%) of the selected tokens. The model's objective is to accurately predict the original tokens.
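To make this baseline concrete, the following minimal Python sketch applies the classical 80/10/10 corruption to a list of token ids. It is our illustration rather than code from the paper: the function and argument names are ours, and special tokens are ignored for brevity. The -100 label follows the common convention of ignoring unmasked positions in the MLM loss.

```python
import random

def classical_bert_masking(token_ids, mask_token_id, vocab_size, mask_ratio=0.15):
    """Randomly select ~15% of positions; of those, 80% become [MASK],
    10% a random token, and 10% are left unchanged. Returns the corrupted
    sequence and the labels (original ids at selected positions, -100 elsewhere)."""
    inputs = list(token_ids)
    labels = [-100] * len(token_ids)              # -100 = ignored by the MLM loss
    n_to_mask = max(1, int(round(mask_ratio * len(token_ids))))
    for pos in random.sample(range(len(token_ids)), n_to_mask):
        labels[pos] = token_ids[pos]              # the model must predict this token
        draw = random.random()
        if draw < 0.8:
            inputs[pos] = mask_token_id           # 80%: replace with [MASK]
        elif draw < 0.9:
            inputs[pos] = random.randrange(vocab_size)  # 10%: random token
        # remaining 10%: keep the original token
    return inputs, labels
```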
In efforts to enrich the representation capabilities of MLMs, ERNIE (Sun et al., 2019b) and SpanBERT (Joshi et al., 2020) refined the classical strategy of random token masking employed by BERT. They introduced methods emphasizing the masking of entire words and spans of text, respectively, albeit still in a random manner. These approaches have demonstrated improved performance on certain domain-specific tasks.

From a different perspective, recent studies have explored dynamically altering the content masked during the training process. Yang et al. (2023) introduced a time-variant masking strategy, departing from static methods that maintain consistent content throughout training. They noted that, at certain training stages, models cease to learn from specific types of words, discerned through part-of-speech tags and the model's error rates measured by the loss. Similarly, Althammer et al. (2021) applied masking to words within noun chunks, adjusting the masking probability of such tokens.

To our knowledge, our work is the first to propose and investigate the impact of masking strategies employing semantically important words automatically selected for a specialized domain.

3 Selective Masking

While the original BERT approach employed random word selection (Devlin et al., 2019), our method selectively masks words based on their significance to a specific text genre or their topical salience within a document. We adopt a two-step approach: first, we assign a "genre specificity score" and a "topical salience score" to each word of a domain-specific corpus (Section 3.1); subsequently, we use these ranked lists to determine which words to mask (Section 3.2).

3.1 Word Weighting Approaches

We propose two automatic word weighting approaches computed from a set of domain-specific documents. The first approach, the topicality score (TF×IDF), quantifies the thematic relevance of a word to a given document. We employ the well-established TF×IDF (Jones, 1972) score, which evaluates a word's topical salience by comparing its frequency in a document to its occurrence across multiple documents.

The second approach, the specificity score for a text genre (MetaDis), assesses the extent to which a word is characteristic of a particular text genre. A genre of documents is characterized by a common structure (Biber and Conrad, 2019), often described by words or expressions known as meta-discourse (Hyland, 1998). For instance, in the legal genre of jurisprudence, this lexicon includes terms used to describe facts, present arguments, reason, or reach a final decision. While Hernandez and Grau (2003) used the inverse document frequency to assess specificity, this measure overlooks the distribution of occurrences within documents. We assume that a meta-discourse marker occurs in a consistent proportion across documents of the same genre. To capture these properties and compute a meta-discourse score, we propose the formula given in Equation 1:

    s_t = (df_t / tf_t) · (1 − std(dtf_t) / max(dtf_t)) · (df_t / N)        (1)

Here, df_t and tf_t represent the document frequency and term frequency of a specific word t, respectively, while N is the total number of documents in the corpus. dtf_t lists the number of occurrences of the word per document. The first term weights a word by its occurrence across distinct documents relative to its total number of occurrences: a word that appears only a few times per document receives a higher score. The second term favors words with consistent occurrence counts across documents, reflected by a low standard deviation. Finally, the third term emphasizes words that appear in many documents, contributing to their overall score.

3.2 Word Selection Strategy

We propose two strategies for selecting the words to mask from the weighted word list obtained previously. Our first method, TopN, selects the top 15% of words with the highest scores. The second method, Rand, aims to enhance model robustness by avoiding systematic masking of the same words. It introduces a level of weighted randomness, similar to the dynamic masking approach used in RoBERTa (Liu et al., 2019): in practice, we randomly sample words (without replacement) based on the distribution of the computed scores (see Algorithm 1 in Appendix B).
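To make the weighting and selection steps concrete, the following Python sketch computes the meta-discourse score of Equation 1 from raw per-document counts and applies the TopN and Rand strategies. This is our illustration rather than the authors' released implementation: the function names are ours, and the exact population over which std(dtf_t) is computed is not pinned down in the paper (here we use the documents that contain the word).

```python
import math
import random
from collections import Counter

def metadis_scores(documents):
    """Meta-discourse score of Equation 1 for every word of a tokenised corpus.
    `documents` is a list of documents, each a list of word strings."""
    N = len(documents)
    per_doc = [Counter(doc) for doc in documents]
    scores = {}
    for word in set().union(*per_doc):
        dtf = [c[word] for c in per_doc if word in c]   # occurrences per document
        df, tf = len(dtf), sum(dtf)                     # document / term frequency
        mean = tf / df
        std = math.sqrt(sum((x - mean) ** 2 for x in dtf) / df)
        scores[word] = (df / tf) * (1 - std / max(dtf)) * (df / N)
    return scores

def top_n(scores, ratio=0.15):
    """TopN: keep the 15% highest-scoring words as mask candidates."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[: max(1, int(ratio * len(ranked)))]

def rand(scores, k):
    """Rand: sample k words without replacement, weighted by their scores."""
    pool = dict(scores)
    chosen = []
    for _ in range(min(k, len(pool))):
        words, weights = zip(*pool.items())
        pick = random.choices(words, weights=weights, k=1)[0]  # weighted draw
        chosen.append(pick)
        del pool[pick]                                         # no replacement
    return chosen
```

For the TF×IDF variant, the same selection functions can be applied to scores produced by, for instance, scikit-learn's TfidfVectorizer, in line with the tooling mentioned in Appendix B.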
4 Experimental setup

We utilize BERT (Devlin et al., 2019) and LegalBERT (Chalkidis et al., 2020b) as our pre-trained base models. The effectiveness of our masking strategies is assessed through continual pre-training of these models, focusing on legal domain adaptation. To facilitate this process, we introduce a pre-training dataset collected specifically for this purpose, which is also used to select the masking words (Section 4.1). We then delineate the evaluation tasks (Section 4.2) and conclude with the experimental specifics (Section 4.3).

4.1 Pre-training Corpus

Sub-Corpus        # Doc     # Tokens    % Tokens
EU Case Law       29.8K     178.5M      29%
ECtHR Case Law    12.5K     78.5M       13%
U.S. Case Law     104.7K    235.5M      39%
Indian Case Law   34.8K     111.6M      19%
Total             181.8K    604.1M      100%

Table 1: Details of the 4 GB dataset used for CPT.

For continual pre-training and word masking selection, we chose to focus on the legal domain by utilizing a subset of the LexFiles corpus (Chalkidis et al., 2023) representative of the LexGLUE (Chalkidis et al., 2022) benchmark. The documents were selected to offer a balanced and diverse collection, encompassing the linguistic nuances of the domain (see Table 1).

4.2 Evaluation Tasks

We assess the performance of our models using LexGLUE (Chalkidis et al., 2022), a benchmarking framework designed specifically for the legal domain. LexGLUE encompasses a diverse array of legal tasks sourced from European, United States, and Canadian legal systems. These tasks entail multi-class and multi-label classification at the document level, with a dozen labels. Such a setup provides a rigorous test of our approach's ability to excel across a spectrum of complex tasks.

4.3 Experimental Details

For continual pre-training, we conducted sessions totaling over 20 hours using 16 V100 GPUs on the Jean Zay supercomputer. We adopted a batch size of 16 per GPU and configured gradient accumulation to 16 steps, resulting in an effective batch size of 4096, following the methodology outlined by Labrak et al. (2023). To ascertain task performance scores, we computed the average of the scores from three independent runs. Results are presented in terms of micro (µF1) and macro (m-F1) F-measure.
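As a rough illustration of this configuration (the hyperparameter names follow the Hugging Face transformers API; values not stated above, such as the learning rate and output path, are placeholders rather than the authors' settings):

```python
from transformers import TrainingArguments

# Sketch of the CPT configuration described in Section 4.3 and Appendix A:
# batch size 16 per GPU x gradient accumulation 16 x 16 V100 GPUs = 4096 sequences.
training_args = TrainingArguments(
    output_dir="cpt-legal-bert",        # hypothetical output path
    per_device_train_batch_size=16,
    gradient_accumulation_steps=16,
    num_train_epochs=10,                # 10 epochs, as reported in Appendix A
    learning_rate=5e-5,                 # placeholder; not stated in the paper
    save_strategy="epoch",
    report_to="none",
)
```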
Method ECtHR (A) ECtHR (B) SCOTUS EUR-LEX LEDGAR UNFAIR-ToS
µF1 m-F1 µF1 m-F1 µF1 m-F1 µF1 m-F1 µF1 m-F1 µF1 m-F1
BERT 62.12 52.66 69.59 60.39 69.61 58.65 71.70 54.87 87.85 82.30 95.66 80.97
+ CPT (baseline) 63.12 54.13 71.06 64.69 70.57 60.38 71.86 56.18 87.90 82.02 95.56 81.46
+ MetaDis - Rand 62.55 54.88 70.45 63.10 70.26 59.12 71.66 56.00 87.68 82.25 95.38 79.18
+ MetaDis - TopN 62.17 53.35 70.29 62.29 69.92 60.08 71.67 56.95 87.78 82.11 95.57 81.51
+ TF×IDF - Rand 63.36 56.60 71.32 64.58 69.69 59.10 71.93 55.82 87.69 82.11 95.50 78.63
+ TF×IDF - TopN 62.66 56.46 71.50 63.58 70.71 60.06 71.73 57.73 87.67 81.89 95.49 79.45
LegalBERT 63.41 53.19 72.10 63.68 73.61 61.50 71.93 55.47 87.91 81.67 95.81 81.27
+ CPT (baseline) 63.64 58.73 72.60 64.95 74.64 63.13 72.01 55.12 88.41 82.92 95.82 79.70
+ MetaDis - Rand 63.39 56.39 73.08 65.76 74.21 62.97 72.03 54.76 88.38 82.58 95.20 80.26
+ MetaDis - TopN 64.07 58.56 72.53 66.83 73.88 62.57 71.96 55.01 88.32 82.16 94.80 73.67
+ TF×IDF - Rand 63.38 56.78 72.21 65.67 73.71 62.85 71.78 55.82 88.19 82.36 95.80 82.12
+ TF×IDF - TopN 62.89 53.58 73.26 65.86 74.38 63.10 71.90 55.08 88.27 82.65 95.59 81.20

Table 2: Comparative analysis of BERT and LegalBERT performance using continual pre-training with selective masking on the LexGLUE benchmark tasks. Best performing models are indicated in bold and second-best results are underlined; green and orange shading respectively depict the MetaDis and TF×IDF scores, with darker colors highlighting improvement over the CPT (baseline).

5 Results and Discussions

Results are detailed in Table 2, each column referring to a task of the LexGLUE benchmark.

Efficacy of Continual Pre-training. Our study reveals the effectiveness of continual pre-training across all tasks and experiments for both the BERT and LegalBERT models. Specifically, the macro-F1 (m-F1) score of the baseline BERT+CPT demonstrates a notable improvement from 60.39 to 64.69 on the ECtHR (B) task, while the baseline LegalBERT+CPT shows substantial enhancements on the EUR-LEX task. These improvements, obtained even without modifying the masking strategy, suggest that our collected corpus contains new domain-specific characteristics that enrich the underlying knowledge of the language models.

Classical vs. Selective Masking. To evaluate the efficacy of our masking approaches, we compared them with the classical BERT random masking in a continual pre-training setup (+CPT (baseline)). The results indicate that our models improve over the baseline across all tasks, irrespective of the masking strategy employed. With BERT, notable improvements were observed on the ECtHR (A) and LEDGAR tasks, with scores of 54.88 and 82.25, respectively. This can be attributed to our models' enhanced ability to leverage the genre (MetaDis) and thematic (TF×IDF) information inherent in the legal domain. Conversely, for the LegalBERT models, improvements were noted on the ECtHR (B) and UNFAIR-ToS tasks under both scoring methods. This underscores the benefits of selectively masking words, especially with already adapted models. Moreover, our approach demonstrated superior results on the SCOTUS task compared to the hierarchical approaches mentioned in Chalkidis et al. (2022), while employing a streamlined and less complex model structure. This highlights the importance of choosing masking techniques that focus on domain-specific language features, avoiding complex models with extra parameters or longer training times.

MetaDiscourse vs. TF×IDF. When comparing the genre (+MetaDis) and thematic (+TF×IDF) scoring strategies on the BERT and LegalBERT models, we observed distinct patterns. For the BERT models, meta-discourse demonstrated its effectiveness in tasks where genre-specific language features play a significant role, such as the ECtHR (A), LEDGAR, and UNFAIR-ToS tasks. In contrast, TF×IDF showed its strengths in tasks that emphasize thematic relevance, such as the ECtHR (B), EUR-LEX, and SCOTUS tasks.

For the LegalBERT models, both strategies displayed similar performance on the ECtHR (B) and UNFAIR-ToS tasks. However, meta-discourse proved more effective on the ECtHR (A) task, while TF×IDF demonstrated better results on EUR-LEX. These findings suggest that thematic relevance is generally crucial for the EUR-LEX task regardless of the model's starting point, indicating that the thematic scoring aligns well with the task's nature. Conversely, genre considerations (MetaDis) are particularly beneficial for the ECtHR (A) task, emphasizing the importance of structural and stylistic language features in legal texts.

Rand vs. TopN. The experimental analysis of the BERT models revealed that the random weighting strategy (Rand) achieved higher performance on the ECtHR (A) and LEDGAR tasks for both scoring methods. On the other hand, the TopN strategy showed improvements on the EUR-LEX task with both scoring techniques. Notably, the TopN method also demonstrated higher performance on the ECtHR (B) and SCOTUS tasks when using TF×IDF, indicating its effectiveness in situations where topicality is crucial.

Regarding the LegalBERT models with meta-discourse, the Rand strategy was effective in enhancing performance on the UNFAIR-ToS, ECtHR (B), and EUR-LEX tasks, while the TopN approach made significant strides on both ECtHR (A) and (B), emphasizing the importance of focusing on the most pertinent words. When using TF×IDF with LegalBERT, improvements were observed on the ECtHR (B) and UNFAIR-ToS tasks under both strategies. Interestingly, the Rand approach achieved the best results on the EUR-LEX task. The consistent improvements across various tasks confirm the value of selective masking, though it may require customization for the task at hand.

6 Conclusion and Future Work

Our research provides conclusive evidence that our proposed automatic selective masking strategies, which integrate genre and topical characteristics, play a crucial role in refining the models' focus when adapting them to a specialized domain. We observed improvements across all tasks of the LexGLUE benchmark, which focuses on the legal domain. Notably, both the BERT and LegalBERT models demonstrated important improvements on the ECtHR and EUR-LEX tasks. While our results are encouraging, several avenues for further research emerge, including applying our approach to other domains, such as the clinical and scientific domains, to assess its generalizability. Furthermore, it is crucial to tackle the obstacles presented by multilingual models.
Ethical considerations

With respect to the potential risks and biases inherent in language models trained on legal datasets, legal corpora may comprise texts of varying quality and representativeness. The utilization of models such as BERT trained on legal texts could potentially introduce biases pertaining to fairness, the use of gendered language, the representation of minority groups, and the dynamic nature of legal standards over time. It is imperative that these biases are thoroughly evaluated and mitigated to ensure equitable performance across different demographics and to maintain currency with evolving legal norms.

Limitations

Our work, though offering valuable insights into the application of continual pre-training and selective masking techniques for language models in the legal domain, is not without limitations. Specifically, the current study concentrates solely on the BERT architecture, which restricts our ability to investigate a broader range of language models that may exhibit distinct behaviors and sensitivities to our pre-training and masking strategies. Future studies should explore other models, such as DrBERT (Labrak et al., 2023) and RoBERTa (Liu et al., 2019), to provide a more comprehensive understanding of the effects of our approach. Additionally, our study lacks a direct comparison with a model pre-trained from scratch using selective masking. Such a comparison would serve as a valuable reference point for assessing the incremental benefits of our method. Finally, we acknowledge that further hyperparameter tuning may lead to enhanced model performance.

Acknowledgements

This work was granted access to the HPC resources of IDRIS under the allocation 2023-AD011014882 made by GENCI. This research was funded, in whole or in part, by l'Agence Nationale de la Recherche (ANR), project ANR-22-CE38-0004.

References

Sophia Althammer, Mark Buckley, Sebastian Hofstätter, and Allan Hanbury. 2021. Linguistically informed masking for representation learning in the patent domain. In 2nd Workshop on Patent Text Mining and Semantic Technologies (PatentSemTech2021), New York, NY, USA. Association for Computing Machinery.

Hangbo Bao, Li Dong, Furu Wei, Wenhui Wang, Nan Yang, Xiaodong Liu, Yu Wang, Songhao Piao, Jianfeng Gao, Ming Zhou, and Hsiao-Wuen Hon. 2020. UniLMv2: Pseudo-masked language models for unified language model pre-training.

Douglas Biber and Susan Conrad. 2019. Register, Genre, and Style. Cambridge University Press.

Ilias Chalkidis, Manos Fergadiotis, Prodromos Malakasiotis, Nikolaos Aletras, and Ion Androutsopoulos. 2020a. LEGAL-BERT: The Muppets straight out of Law School. In Findings of the Association for Computational Linguistics: EMNLP 2020, pages 2898–2904, Online. Association for Computational Linguistics.

Ilias Chalkidis, Manos Fergadiotis, Prodromos Malakasiotis, Nikolaos Aletras, and Ion Androutsopoulos. 2020b. LEGAL-BERT: The Muppets straight out of Law School. In Findings of the Association for Computational Linguistics: EMNLP 2020, pages 2898–2904, Online. Association for Computational Linguistics.

Ilias Chalkidis, Nicolas Garneau, Catalina Goanta, Daniel Katz, and Anders Søgaard. 2023. LeXFiles and LegalLAMA: Facilitating English Multinational Legal Language Model Development. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 15513–15535, Toronto, Canada. Association for Computational Linguistics.

Ilias Chalkidis, Abhik Jana, Dirk Hartung, Michael Bommarito, Ion Androutsopoulos, Daniel Katz, and Nikolaos Aletras. 2022. LexGLUE: A Benchmark Dataset for Legal Language Understanding in English. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 4310–4330, Dublin, Ireland. Association for Computational Linguistics.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota. Association for Computational Linguistics.

Kelvin Guu, Kenton Lee, Zora Tung, Panupong Pasupat, and Ming-Wei Chang. 2020. REALM: Retrieval-augmented language model pre-training.

Nicolas Hernandez and Brigitte Grau. 2003. Automatic extraction of meta-descriptors for text description. In International Conference on Recent Advances in Natural Language Processing (RANLP), Borovets, Bulgaria.

Ken Hyland. 1998. Persuasion and context: The pragmatics of academic metadiscourse. Journal of Pragmatics, 30(4):437–455.

Karen Spärck Jones. 1972. A statistical interpretation of term specificity and its application in retrieval. Journal of Documentation, 28:11–21.

Mandar Joshi, Danqi Chen, Yinhan Liu, Daniel S. Weld, Luke Zettlemoyer, and Omer Levy. 2020. SpanBERT: Improving pre-training by representing and predicting spans. Transactions of the Association for Computational Linguistics, 8:64–77.

Zixuan Ke, Yijia Shao, Haowei Lin, Tatsuya Konishi, Gyuhak Kim, and Bing Liu. 2022. Continual pre-training of language models. In The Eleventh International Conference on Learning Representations.

Yanis Labrak, Adrien Bazoge, Richard Dufour, Mickael Rouvier, Emmanuel Morin, Béatrice Daille, and Pierre-Antoine Gourraud. 2023. DrBERT: A robust pre-trained model in French for biomedical and clinical domains. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 16207–16221, Toronto, Canada. Association for Computational Linguistics.

Yoav Levine, Barak Lenz, Opher Lieber, Omri Abend, Kevin Leyton-Brown, Moshe Tennenholtz, and Yoav Shoham. 2020. PMI-Masking: Principled masking of correlated spans.

Yian Li and Hai Zhao. 2021. Pre-training universal language representation. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 5122–5133, Online. Association for Computational Linguistics.

Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. RoBERTa: A robustly optimized BERT pretraining approach. CoRR, abs/1907.11692.

Yu Sun, Shuohuan Wang, Yukun Li, Shikun Feng, Xuyi Chen, Han Zhang, Xin Tian, Danxiang Zhu, Hao Tian, and Hua Wu. 2019a. ERNIE: Enhanced representation through knowledge integration. ArXiv:1904.09223 [cs].

Yu Sun, Shuohuan Wang, Yukun Li, Shikun Feng, Xuyi Chen, Han Zhang, Xin Tian, Danxiang Zhu, Hao Tian, and Hua Wu. 2019b. ERNIE: Enhanced representation through knowledge integration.

Tongtong Wu, Massimo Caccia, Zhuang Li, Yuan-Fang Li, Guilin Qi, and Gholamreza Haffari. 2021. Pretrained language model in continual learning: A comparative study. In International Conference on Learning Representations.

Dongjie Yang, Zhuosheng Zhang, and Hai Zhao. 2023. Learning Better Masking for Better Language Model Pre-training. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 7255–7267, Toronto, Canada. Association for Computational Linguistics.

Susan Zhang, Stephen Roller, Naman Goyal, Mikel Artetxe, Moya Chen, Shuohui Chen, Christopher Dewan, Mona Diab, Xian Li, Xi Victoria Lin, Todor Mihaylov, Myle Ott, Sam Shleifer, Kurt Shuster, Daniel Simig, Punit Singh Koura, Anjali Sridhar, Tianlu Wang, and Luke Zettlemoyer. 2022. OPT: Open pre-trained transformer language models.

A Continual Pre-training parameters

Before training, samples were randomly shuffled 3 times using the same seed.

We train each model using the transformers Python library for 10 epochs, which represents 4453 steps for BERT and 4396 steps for LegalBERT. The difference in the number of steps arises from the difference in tokenisation between LegalBERT and BERT, which results in a different number of training sequences.

In total, we estimate the total computation time at ≃4,100h, which breaks down into 3,200h of training time, 380h for evaluating the models, and 520h for development.

B Selection and Analysis of masked words

We detail in Algorithm 1 the process of selecting the words to mask. The TF×IDF score was computed using the scikit-learn Python package.

To gain more insight into the difference between the words selected by the two importance scores, we show in Table 3 the 50 most masked words for ≃10% of the training corpus.

Algorithm 1 Explicit masking
1: function Mask(tokens)
2:     M ← {}
3:     W ← WholeWords(tokens)
4:     S ← ScoreSequence(W)
5:     while |M| < 0.15 · |tokens| do
6:         i ← Sample(S)²
7:         w ← W[i]; remove W[i] and S[i]
8:         if |M| + |w| ≤ 0.15 · |tokens| then
9:             M ← M + w
10:        end if
11:    end while
12:    return M
13: end function

² Use the Max function for the TopN method.
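For readers who prefer executable code, a minimal Python rendering of Algorithm 1 might look as follows. This is our sketch, not the released implementation: the whole-word grouping and the per-word scoring function are assumed to be supplied by the caller, and the scores are expected as a list of non-negative weights.

```python
import random

def mask(tokens, whole_words, score_sequence, ratio=0.15, use_top_n=False):
    """Sketch of Algorithm 1: select whole words to mask until ~15% of the
    tokens are covered. `whole_words(tokens)` returns a list of words, each a
    list of token positions; `score_sequence(words)` returns one weight per word."""
    words = whole_words(tokens)
    scores = list(score_sequence(words))
    masked = []
    budget = int(ratio * len(tokens))
    while words and len(masked) < budget:
        if use_top_n:
            i = max(range(len(words)), key=lambda j: scores[j])   # footnote 2: Max for TopN
        else:
            i = random.choices(range(len(words)), weights=scores, k=1)[0]  # weighted sample
        word = words.pop(i)
        scores.pop(i)                 # remove the word and its score from the pool
        if len(masked) + len(word) <= budget:
            masked.extend(word)       # keep the word only if it still fits the 15% budget
    return masked                     # token positions to be masked
```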
TF×IDF
applicant, court, 2007, extradition, prosecutor, meshchanskiy, russian, dzhurayev, moscow, uzbekistan,
tashkent, district, custody, government, convention, article, office, decision, §, detention, russia, ccp,
4, preventive, v, minsk, federation, 2, application, uzbek, proceedings, 1, 5, criminal, procedure,
january, 38124, case, 29, merits, may, dismissed, law, rakhimovskiy, 466, request, decided, sobir,
arrest, provisions
MetaDis
general, application, decision, january, september, decided, august, 4, 28, 9, 3, issued, request, rules,
dismissed, 23, 29, indicated, basis, ordered, european, apply, be, 24, 17, date, 5, 30, held, final,
december, 26, 6, 11, mentioned, applied, specified, 12, february, placed, 2, whether, remain, first, to,
deliberated, represented, constitute, case, article

Table 3: 50 most masked words using TF×IDF and MetaDis scoring, ordered by frequency.
