
Language Model Adaptation to Specialized Domains through Selective Masking based on Genre and Topical Characteristics

Anas Belfathi, Ygor Gallina, Nicolas Hernandez, Richard Dufour, Laura Monceaux
LS2N, UMR CNRS 6004, Nantes Université
{firstname.lastname}@univ-nantes.fr

Abstract

Recent advances in pre-trained language modeling have facilitated significant progress across various natural language processing (NLP) tasks. Word masking during model training constitutes a pivotal component of language modeling in architectures like BERT. However, the prevalent method of word masking relies on random selection, potentially disregarding domain-specific linguistic attributes. In this article, we introduce an innovative masking approach leveraging genre and topicality information to tailor language models to specialized domains. Our method incorporates a ranking process that prioritizes words based on their significance, subsequently guiding the masking procedure. Experiments conducted using continual pre-training within the legal domain have underscored the efficacy of our approach on the LexGLUE benchmark in the English language. Pre-trained language models and code are freely available for use.

1 Introduction

Large-scale, pre-trained language models (PLMs) have become indispensable in modeling human language, significantly advancing performance across diverse Natural Language Processing (NLP) tasks (Bao et al., 2020; Guu et al., 2020; Zhang et al., 2022). Among these architectures, masked language models (MLMs) like BERT (Devlin et al., 2019) are prominent: tokens within a sequence are intentionally masked during training, compelling the model to predict these tokens based on the surrounding context. This method enables the model to grasp intricate semantic relationships and syntactic structures inherent in natural language. However, training MLMs from scratch demands substantial resources in terms of data and computing power.

In the context of specialized domains, adaptation through continual pre-training remains the conventional approach, drawing upon domain-specific data to refine pre-trained models (Chalkidis et al., 2020a; Wu et al., 2021; Ke et al., 2022; Labrak et al., 2023). This process invariably entails token masking, where critical factors such as the masking ratio and token selection play pivotal roles. Prior research efforts (Sun et al., 2019a; Joshi et al., 2020; Levine et al., 2020; Li and Zhao, 2021) have delved into fine-tuning with selectively chosen information, encompassing words, tokens, and spans.

In this paper, we introduce an original masking approach that harnesses genre and topicality information to tailor MLMs to specialized domains. Our method integrates a ranking process with meta-discourse and TF×IDF scoring to prioritize tokens based on contextual significance, guiding the masking procedure. By systematically identifying and masking tokens crucial to domain-specific contexts, we compel the model to learn to understand and predict essential domain-specific words. To illustrate the effectiveness of our strategies, we conduct experiments on the continual pre-training (CPT) of BERT models towards the legal domain, comparing various token masking strategies.

Our contributions include:

• We propose original masking strategies based on word selection (meta-discourse and TF×IDF) for language model training.
• We develop a systematic approach for incorporating the selected words effectively during the training process.
• We release open-source models and code designed for adaptable training, facilitating MLM pre-training for specific domains based on our approach¹.

¹ github.com/ygorg/legal-masking

2 Related work

The classical masking strategy in BERT (Devlin et al., 2019) masks 15% of the tokens within a given sentence. The approach involves the random replacement (10% chance), preservation (10%), or substitution with the special [MASK] token (80%) of the selected tokens. The model's objective is to accurately predict the original tokens.
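To make this baseline concrete, the following minimal Python sketch applies the classical 80/10/10 corruption to a list of token ids. It is our illustration rather than code from the paper: the function and argument names are ours, and special tokens are ignored for brevity. The -100 label follows the common convention of ignoring unmasked positions in the MLM loss.

```python
import random

def classical_bert_masking(token_ids, mask_token_id, vocab_size, mask_ratio=0.15):
    """Randomly select ~15% of positions; of those, 80% become [MASK],
    10% a random token, and 10% are left unchanged. Returns the corrupted
    sequence and the labels (original ids at selected positions, -100 elsewhere)."""
    inputs = list(token_ids)
    labels = [-100] * len(token_ids)              # -100 = ignored by the MLM loss
    n_to_mask = max(1, int(round(mask_ratio * len(token_ids))))
    for pos in random.sample(range(len(token_ids)), n_to_mask):
        labels[pos] = token_ids[pos]              # the model must predict this token
        draw = random.random()
        if draw < 0.8:
            inputs[pos] = mask_token_id           # 80%: replace with [MASK]
        elif draw < 0.9:
            inputs[pos] = random.randrange(vocab_size)  # 10%: random token
        # remaining 10%: keep the original token
    return inputs, labels
```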
In efforts to enrich the representation capabilities of MLMs, ERNIE (Sun et al., 2019b) and SpanBERT (Joshi et al., 2020) refined the classical strategy of random token masking employed by BERT. They introduced methods emphasizing the masking of entire words and spans of text, respectively, albeit still in a random manner. These approaches have demonstrated improved performance on certain domain-specific tasks.

From a different perspective, recent studies have explored dynamically altering the content masked during the training process. Yang et al. (2023) introduced a time-variant masking strategy, departing from static methods that maintain consistent content throughout training. They noted that, at certain training stages, models cease to learn from specific types of words, discerned through part-of-speech tags and the model's error rates measured by the loss. Similarly, Althammer et al. (2021) applied masking to words within noun chunks, adjusting the masking probability of such tokens.

To our knowledge, our work is the first to propose and investigate the impact of masking strategies employing semantically important words automatically selected for a specialized domain.

3 Selective Masking

While the original BERT approach employed random word selection (Devlin et al., 2019), our method selectively masks words based on their significance to a specific text genre or their topical salience within a document. We adopt a two-step approach: first, we assign a "genre specificity score" and a "topical salience score" to each word of a domain-specific corpus (Section 3.1); subsequently, we use these ranked lists to determine which words to mask (Section 3.2).

3.1 Word Weighting Approaches

We propose two automatic word weighting approaches computed from a set of domain-specific documents. The first approach, the topicality score (TF×IDF), quantifies the thematic relevance of a word to a given document. We employ the well-established TF×IDF (Jones, 1972) score, which evaluates a word's topical salience by comparing its frequency in a document to its occurrence across multiple documents.

The second approach, the specificity score for a text genre (MetaDis), assesses the extent to which a word is characteristic of a particular text genre. A genre of documents is characterized by a common structure (Biber and Conrad, 2019), often described by words or expressions known as meta-discourse (Hyland, 1998). For instance, in the legal genre of jurisprudence, this lexicon includes terms used to describe facts, present arguments, reason, or reach a final decision. While Hernandez and Grau (2003) used the inverse document frequency to assess specificity, this measure overlooks the distribution of occurrences within documents. We assume that a meta-discourse marker occurs in a consistent proportion across documents of the same genre. To capture these properties and compute a meta-discourse score, we propose the formula given in Equation 1:

    s_t = (df_t / tf_t) · (1 − std(dtf_t) / max(dtf_t)) · (df_t / N)        (1)

Here, df_t and tf_t represent the document frequency and term frequency of a specific word t, respectively, while N is the total number of documents in the corpus. dtf_t lists the number of occurrences of the word per document. The first term weights a word by its occurrence across distinct documents relative to its total number of occurrences: a word that appears only a few times per document receives a higher score. The second term favors words with consistent occurrence counts across documents, reflected by a low standard deviation. Finally, the third term emphasizes words that appear in many documents, contributing to their overall score.

3.2 Word Selection Strategy

We propose two strategies for selecting the words to mask from the weighted word list obtained previously. Our first method, TopN, selects the top 15% of words with the highest scores. The second method, Rand, aims to enhance model robustness by avoiding systematic masking of the same words. It introduces a level of weighted randomness, similar to the dynamic masking approach used in RoBERTa (Liu et al., 2019): in practice, we randomly sample words (without replacement) based on the distribution of the computed scores (see Algorithm 1 in Appendix B).
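To make the weighting and selection steps concrete, the following Python sketch computes the meta-discourse score of Equation 1 from raw per-document counts and applies the TopN and Rand strategies. This is our illustration rather than the authors' released implementation: the function names are ours, and the exact population over which std(dtf_t) is computed is not pinned down in the paper (here we use the documents that contain the word).

```python
import math
import random
from collections import Counter

def metadis_scores(documents):
    """Meta-discourse score of Equation 1 for every word of a tokenised corpus.
    `documents` is a list of documents, each a list of word strings."""
    N = len(documents)
    per_doc = [Counter(doc) for doc in documents]
    scores = {}
    for word in set().union(*per_doc):
        dtf = [c[word] for c in per_doc if word in c]   # occurrences per document
        df, tf = len(dtf), sum(dtf)                     # document / term frequency
        mean = tf / df
        std = math.sqrt(sum((x - mean) ** 2 for x in dtf) / df)
        scores[word] = (df / tf) * (1 - std / max(dtf)) * (df / N)
    return scores

def top_n(scores, ratio=0.15):
    """TopN: keep the 15% highest-scoring words as mask candidates."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[: max(1, int(ratio * len(ranked)))]

def rand(scores, k):
    """Rand: sample k words without replacement, weighted by their scores."""
    pool = dict(scores)
    chosen = []
    for _ in range(min(k, len(pool))):
        words, weights = zip(*pool.items())
        pick = random.choices(words, weights=weights, k=1)[0]  # weighted draw
        chosen.append(pick)
        del pool[pick]                                         # no replacement
    return chosen
```

For the TF×IDF variant, the same selection functions can be applied to scores produced by, for instance, scikit-learn's TfidfVectorizer, in line with the tooling mentioned in Appendix B.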
4 Experimental setup

We utilize BERT (Devlin et al., 2019) and LegalBERT (Chalkidis et al., 2020b) as our pre-trained base models. The effectiveness of our masking strategies is assessed through continual pre-training of these models, focusing on legal domain adaptation. To facilitate this process, we introduce a pre-training dataset collected specifically for this purpose, which is also used to select the masking words (Section 4.1). We then delineate the evaluation tasks (Section 4.2) and conclude with the experimental specifics (Section 4.3).

4.1 Pre-training Corpus

Sub-Corpus        # Doc     # Tokens    % Tokens
EU Case Law       29.8K     178.5M      29%
ECtHR Case Law    12.5K     78.5M       13%
U.S. Case Law     104.7K    235.5M      39%
Indian Case Law   34.8K     111.6M      19%
Total             181.8K    604.1M      100%

Table 1: Details of the 4 GB dataset used for CPT.

For continual pre-training and word masking selection, we chose to focus on the legal domain by utilizing a subset of the LexFiles corpus (Chalkidis et al., 2023) representative of the LexGLUE (Chalkidis et al., 2022) benchmark. The documents were selected to offer a balanced and diverse collection, encompassing the linguistic nuances of the domain (see Table 1).

4.2 Evaluation Tasks

We assess the performance of our models using LexGLUE (Chalkidis et al., 2022), a benchmarking framework designed specifically for the legal domain. LexGLUE encompasses a diverse array of legal tasks sourced from European, United States, and Canadian legal systems. These tasks entail multi-class and multi-label classification at the document level, with a dozen labels. Such a setup provides a rigorous test of our approach's ability to excel across a spectrum of complex tasks.

4.3 Experimental Details

For continual pre-training, we conducted sessions totaling over 20 hours using 16 V100 GPUs on the Jean Zay supercomputer. We adopted a batch size of 16 per GPU and configured gradient accumulation to 16 steps, resulting in an effective batch size of 4096, following the methodology outlined by Labrak et al. (2023). To ascertain task performance scores, we computed the average of the scores from three independent runs. Results are presented in terms of micro (µF1) and macro (m-F1) F-measure.
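As a rough illustration of this configuration (the hyperparameter names follow the Hugging Face transformers API; values not stated above, such as the learning rate and output path, are placeholders rather than the authors' settings):

```python
from transformers import TrainingArguments

# Sketch of the CPT configuration described in Section 4.3 and Appendix A:
# batch size 16 per GPU x gradient accumulation 16 x 16 V100 GPUs = 4096 sequences.
training_args = TrainingArguments(
    output_dir="cpt-legal-bert",        # hypothetical output path
    per_device_train_batch_size=16,
    gradient_accumulation_steps=16,
    num_train_epochs=10,                # 10 epochs, as reported in Appendix A
    learning_rate=5e-5,                 # placeholder; not stated in the paper
    save_strategy="epoch",
    report_to="none",
)
```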
Method ECtHR (A) ECtHR (B) SCOTUS EUR-LEX LEDGAR UNFAIR-ToS
µF1 m-F1 µF1 m-F1 µF1 m-F1 µF1 m-F1 µF1 m-F1 µF1 m-F1
BERT 62.12 52.66 69.59 60.39 69.61 58.65 71.70 54.87 87.85 82.30 95.66 80.97
+ CPT (baseline) 63.12 54.13 71.06 64.69 70.57 60.38 71.86 56.18 87.90 82.02 95.56 81.46
+ MetaDis - Rand 62.55 54.88 70.45 63.10 70.26 59.12 71.66 56.00 87.68 82.25 95.38 79.18
+ MetaDis - TopN 62.17 53.35 70.29 62.29 69.92 60.08 71.67 56.95 87.78 82.11 95.57 81.51
+ TF×IDF - Rand 63.36 56.60 71.32 64.58 69.69 59.10 71.93 55.82 87.69 82.11 95.50 78.63
+ TF×IDF - TopN 62.66 56.46 71.50 63.58 70.71 60.06 71.73 57.73 87.67 81.89 95.49 79.45
LegalBERT 63.41 53.19 72.10 63.68 73.61 61.50 71.93 55.47 87.91 81.67 95.81 81.27
+ CPT (baseline) 63.64 58.73 72.60 64.95 74.64 63.13 72.01 55.12 88.41 82.92 95.82 79.70
+ MetaDis - Rand 63.39 56.39 73.08 65.76 74.21 62.97 72.03 54.76 88.38 82.58 95.20 80.26
+ MetaDis - TopN 64.07 58.56 72.53 66.83 73.88 62.57 71.96 55.01 88.32 82.16 94.80 73.67
+ TF×IDF - Rand 63.38 56.78 72.21 65.67 73.71 62.85 71.78 55.82 88.19 82.36 95.80 82.12
+ TF×IDF - TopN 62.89 53.58 73.26 65.86 74.38 63.10 71.90 55.08 88.27 82.65 95.59 81.20

Table 2: Comparative analysis of BERT and LegalBERT performance using continual pre-training with selective masking on the LexGLUE benchmark tasks. Best performing models are indicated in bold and second-best results are underlined; green and orange shading respectively depict the MetaDis and TF×IDF scores, with darker colors highlighting improvement over the CPT (baseline).

5 Results and Discussions

Results are detailed in Table 2, each column referring to a task of the LexGLUE benchmark.

Efficacy of Continual Pre-training. Our study reveals the effectiveness of continual pre-training across all tasks and experiments for both the BERT and LegalBERT models. Specifically, the macro-F1 (m-F1) score of the baseline BERT+CPT demonstrates a notable improvement from 60.39 to 64.69 on the ECtHR (B) task, while the baseline LegalBERT+CPT shows substantial enhancements on the EUR-LEX task. These improvements, obtained even without modifying the masking strategy, suggest that our collected corpus contains new domain-specific characteristics that enrich the underlying knowledge of the language models.

Classical vs. Selective Masking. To evaluate the efficacy of our masking approaches, we compared them with the classical BERT random masking in a continual pre-training setup (+CPT (baseline)). The results indicate that our models improve over the baseline across all tasks, irrespective of the masking strategy employed. With BERT, notable improvements were observed on the ECtHR (A) and LEDGAR tasks, with scores of 54.88 and 82.25, respectively. This can be attributed to our models' enhanced ability to leverage the genre (MetaDis) and thematic (TF×IDF) information inherent in the legal domain. Conversely, for the LegalBERT models, improvements were noted on the ECtHR (B) and UNFAIR-ToS tasks under both scoring methods. This underscores the benefits of selectively masking words, especially with already adapted models. Moreover, our approach demonstrated superior results on the SCOTUS task compared to the hierarchical approaches mentioned in Chalkidis et al. (2022), while employing a streamlined and less complex model structure. This highlights the importance of choosing masking techniques that focus on domain-specific language features, avoiding complex models with extra parameters or longer training times.

MetaDiscourse vs. TF×IDF. When comparing the genre (+MetaDis) and thematic (+TF×IDF) scoring strategies on the BERT and LegalBERT models, we observed distinct patterns. For the BERT models, meta-discourse demonstrated its effectiveness in tasks where genre-specific language features play a significant role, such as the ECtHR (A), LEDGAR, and UNFAIR-ToS tasks. In contrast, TF×IDF showed its strengths in tasks that emphasize thematic relevance, such as the ECtHR (B), EUR-LEX, and SCOTUS tasks.

For the LegalBERT models, both strategies displayed similar performance on the ECtHR (B) and UNFAIR-ToS tasks. However, meta-discourse proved more effective on the ECtHR (A) task, while TF×IDF demonstrated better results on EUR-LEX. These findings suggest that thematic relevance is generally crucial for the EUR-LEX task regardless of the model's starting point, indicating that the thematic scoring aligns well with the task's nature. Conversely, genre considerations (MetaDis) are particularly beneficial for the ECtHR (A) task, emphasizing the importance of structural and stylistic language features in legal texts.

Rand vs. TopN. The experimental analysis of the BERT models revealed that the random weighting strategy (Rand) achieved higher performance on the ECtHR (A) and LEDGAR tasks for both scoring methods. On the other hand, the TopN strategy showed improvements on the EUR-LEX task with both scoring techniques. Notably, the TopN method also demonstrated higher performance on the ECtHR (B) and SCOTUS tasks when using TF×IDF, indicating its effectiveness in situations where topicality is crucial.

Regarding the LegalBERT models with meta-discourse, the Rand strategy was effective in enhancing performance on the UNFAIR-ToS, ECtHR (B), and EUR-LEX tasks, while the TopN approach made significant strides on both ECtHR (A) and (B), emphasizing the importance of focusing on the most pertinent words. When using TF×IDF with LegalBERT, improvements were observed on the ECtHR (B) and UNFAIR-ToS tasks under both strategies. Interestingly, the Rand approach achieved the best results on the EUR-LEX task. The consistent improvements across various tasks confirm the value of selective masking, though it may require customization for the task at hand.

6 Conclusion and Future Work

Our research provides conclusive evidence that our proposed automatic selective masking strategies, which integrate genre and topical characteristics, play a crucial role in refining the models' focus when adapting them to a specialized domain. We observed improvements across all tasks of the LexGLUE benchmark, which focuses on the legal domain. Notably, both the BERT and LegalBERT models demonstrated important improvements on the ECtHR and EUR-LEX tasks. While our results are encouraging, several avenues for further research emerge, including applying our approach to other domains, such as the clinical and scientific domains, to assess its generalizability. Furthermore, it is crucial to tackle the obstacles presented by multilingual models.
Ethical considerations

With respect to the potential risks and biases inherent in language models trained on legal datasets, legal corpora may comprise texts of varying quality and representativeness. The utilization of models such as BERT trained on legal texts could potentially introduce biases pertaining to fairness, the use of gendered language, the representation of minority groups, and the dynamic nature of legal standards over time. It is imperative that these biases are thoroughly evaluated and mitigated to ensure equitable performance across different demographics and to maintain currency with evolving legal norms.

Limitations

Our work, though offering valuable insights into the application of continual pre-training and selective masking techniques for language models in the legal domain, is not without limitations. Specifically, the current study concentrates solely on the BERT architecture, which restricts our ability to investigate a broader range of language models that may exhibit distinct behaviors and sensitivities to our pre-training and masking strategies. Future studies should explore other models, such as DrBERT (Labrak et al., 2023) and RoBERTa (Liu et al., 2019), to provide a more comprehensive understanding of the effects of our approach. Additionally, our study lacks a direct comparison with a model pre-trained from scratch using selective masking. Such a comparison would serve as a valuable reference point for assessing the incremental benefits of our method. Finally, we acknowledge that further hyperparameter tuning may lead to enhanced model performance.

Acknowledgements

This work was granted access to the HPC resources of IDRIS under the allocation 2023-AD011014882 made by GENCI. This research was funded, in whole or in part, by l'Agence Nationale de la Recherche (ANR), project ANR-22-CE38-0004.

References

Sophia Althammer, Mark Buckley, Sebastian Hofstätter, and Allan Hanbury. 2021. Linguistically informed masking for representation learning in the patent domain. In 2nd Workshop on Patent Text Mining and Semantic Technologies (PatentSemTech2021), New York, NY, USA. Association for Computing Machinery.

Hangbo Bao, Li Dong, Furu Wei, Wenhui Wang, Nan Yang, Xiaodong Liu, Yu Wang, Songhao Piao, Jianfeng Gao, Ming Zhou, and Hsiao-Wuen Hon. 2020. UniLMv2: Pseudo-masked language models for unified language model pre-training.

Douglas Biber and Susan Conrad. 2019. Register, Genre, and Style. Cambridge University Press.

Ilias Chalkidis, Manos Fergadiotis, Prodromos Malakasiotis, Nikolaos Aletras, and Ion Androutsopoulos. 2020a. LEGAL-BERT: The Muppets straight out of Law School. In Findings of the Association for Computational Linguistics: EMNLP 2020, pages 2898–2904, Online. Association for Computational Linguistics.

Ilias Chalkidis, Manos Fergadiotis, Prodromos Malakasiotis, Nikolaos Aletras, and Ion Androutsopoulos. 2020b. LEGAL-BERT: The Muppets straight out of Law School. In Findings of the Association for Computational Linguistics: EMNLP 2020, pages 2898–2904, Online. Association for Computational Linguistics.

Ilias Chalkidis, Nicolas Garneau, Catalina Goanta, Daniel Katz, and Anders Søgaard. 2023. LeXFiles and LegalLAMA: Facilitating English Multinational Legal Language Model Development. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 15513–15535, Toronto, Canada. Association for Computational Linguistics.

Ilias Chalkidis, Abhik Jana, Dirk Hartung, Michael Bommarito, Ion Androutsopoulos, Daniel Katz, and Nikolaos Aletras. 2022. LexGLUE: A Benchmark Dataset for Legal Language Understanding in English. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 4310–4330, Dublin, Ireland. Association for Computational Linguistics.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota. Association for Computational Linguistics.

Kelvin Guu, Kenton Lee, Zora Tung, Panupong Pasupat, and Ming-Wei Chang. 2020. REALM: Retrieval-augmented language model pre-training.

Nicolas Hernandez and Brigitte Grau. 2003. Automatic extraction of meta-descriptors for text description. In International Conference on Recent Advances in Natural Language Processing (RANLP), Borovets, Bulgaria.

Ken Hyland. 1998. Persuasion and context: The pragmatics of academic metadiscourse. Journal of Pragmatics, 30(4):437–455.

Karen Spärck Jones. 1972. A statistical interpretation of term specificity and its application in retrieval. Journal of Documentation, 28:11–21.

Mandar Joshi, Danqi Chen, Yinhan Liu, Daniel S. Weld, Luke Zettlemoyer, and Omer Levy. 2020. SpanBERT: Improving pre-training by representing and predicting spans. Transactions of the Association for Computational Linguistics, 8:64–77.

Zixuan Ke, Yijia Shao, Haowei Lin, Tatsuya Konishi, Gyuhak Kim, and Bing Liu. 2022. Continual pre-training of language models. In The Eleventh International Conference on Learning Representations.

Yanis Labrak, Adrien Bazoge, Richard Dufour, Mickael Rouvier, Emmanuel Morin, Béatrice Daille, and Pierre-Antoine Gourraud. 2023. DrBERT: A robust pre-trained model in French for biomedical and clinical domains. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 16207–16221, Toronto, Canada. Association for Computational Linguistics.

Yoav Levine, Barak Lenz, Opher Lieber, Omri Abend, Kevin Leyton-Brown, Moshe Tennenholtz, and Yoav Shoham. 2020. PMI-Masking: Principled masking of correlated spans.

Yian Li and Hai Zhao. 2021. Pre-training universal language representation. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 5122–5133, Online. Association for Computational Linguistics.

Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. RoBERTa: A robustly optimized BERT pretraining approach. CoRR, abs/1907.11692.

Yu Sun, Shuohuan Wang, Yukun Li, Shikun Feng, Xuyi Chen, Han Zhang, Xin Tian, Danxiang Zhu, Hao Tian, and Hua Wu. 2019a. ERNIE: Enhanced representation through knowledge integration. ArXiv:1904.09223 [cs].

Yu Sun, Shuohuan Wang, Yukun Li, Shikun Feng, Xuyi Chen, Han Zhang, Xin Tian, Danxiang Zhu, Hao Tian, and Hua Wu. 2019b. ERNIE: Enhanced representation through knowledge integration.

Tongtong Wu, Massimo Caccia, Zhuang Li, Yuan-Fang Li, Guilin Qi, and Gholamreza Haffari. 2021. Pretrained language model in continual learning: A comparative study. In International Conference on Learning Representations.

Dongjie Yang, Zhuosheng Zhang, and Hai Zhao. 2023. Learning Better Masking for Better Language Model Pre-training. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 7255–7267, Toronto, Canada. Association for Computational Linguistics.

Susan Zhang, Stephen Roller, Naman Goyal, Mikel Artetxe, Moya Chen, Shuohui Chen, Christopher Dewan, Mona Diab, Xian Li, Xi Victoria Lin, Todor Mihaylov, Myle Ott, Sam Shleifer, Kurt Shuster, Daniel Simig, Punit Singh Koura, Anjali Sridhar, Tianlu Wang, and Luke Zettlemoyer. 2022. OPT: Open pre-trained transformer language models.

A Continual Pre-training parameters

Before training, samples were randomly shuffled 3 times using the same seed.

We train each model using the transformers Python library for 10 epochs, which represents 4453 steps for BERT and 4396 steps for LegalBERT. The difference in the number of steps arises from the difference in tokenisation between LegalBERT and BERT, which results in a different number of training sequences.

In total, we estimate the total computation time at ≃4,100h, which breaks down into 3,200h of training time, 380h for evaluating the models, and 520h for development.

B Selection and Analysis of masked words

We detail in Algorithm 1 the process of selecting the words to mask. The TF×IDF score was computed using the scikit-learn Python package.

To gain more insight into the difference between the words selected by the two importance scores, we show in Table 3 the 50 most masked words for ≃10% of the training corpus.

Algorithm 1 Explicit masking
1: function Mask(tokens)
2:     M ← {}
3:     W ← WholeWords(tokens)
4:     S ← ScoreSequence(W)
5:     while |M| < 0.15 · |tokens| do
6:         i ← Sample(S)²
7:         w ← W[i]; remove W[i] and S[i]
8:         if |M| + |w| ≤ 0.15 · |tokens| then
9:             M ← M + w
10:        end if
11:    end while
12:    return M
13: end function

² Use the Max function for the TopN method.
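For readers who prefer executable code, a minimal Python rendering of Algorithm 1 might look as follows. This is our sketch, not the released implementation: the whole-word grouping and the per-word scoring function are assumed to be supplied by the caller, and the scores are expected as a list of non-negative weights.

```python
import random

def mask(tokens, whole_words, score_sequence, ratio=0.15, use_top_n=False):
    """Sketch of Algorithm 1: select whole words to mask until ~15% of the
    tokens are covered. `whole_words(tokens)` returns a list of words, each a
    list of token positions; `score_sequence(words)` returns one weight per word."""
    words = whole_words(tokens)
    scores = list(score_sequence(words))
    masked = []
    budget = int(ratio * len(tokens))
    while words and len(masked) < budget:
        if use_top_n:
            i = max(range(len(words)), key=lambda j: scores[j])   # footnote 2: Max for TopN
        else:
            i = random.choices(range(len(words)), weights=scores, k=1)[0]  # weighted sample
        word = words.pop(i)
        scores.pop(i)                 # remove the word and its score from the pool
        if len(masked) + len(word) <= budget:
            masked.extend(word)       # keep the word only if it still fits the 15% budget
    return masked                     # token positions to be masked
```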
TF×IDF
applicant, court, 2007, extradition, prosecutor, meshchanskiy, russian, dzhurayev, moscow, uzbekistan,
tashkent, district, custody, government, convention, article, office, decision, §, detention, russia, ccp,
4, preventive, v, minsk, federation, 2, application, uzbek, proceedings, 1, 5, criminal, procedure,
january, 38124, case, 29, merits, may, dismissed, law, rakhimovskiy, 466, request, decided, sobir,
arrest, provisions
MetaDis
general, application, decision, january, september, decided, august, 4, 28, 9, 3, issued, request, rules,
dismissed, 23, 29, indicated, basis, ordered, european, apply, be, 24, 17, date, 5, 30, held, final,
december, 26, 6, 11, mentioned, applied, specified, 12, february, placed, 2, whether, remain, first, to,
deliberated, represented, constitute, case, article

Table 3: 50 most masked words using TF×IDF and MetaDis scoring, ordered by frequency.
