Enhancing Medical Language Understanding: Adapting LLMs To The Medical Domain Through Hybrid Granularity Mask Learning
Longjun Fan∗ , Xiaohong Liu† , Yuhao Wang∗ , Guoxing Yang∗ , Zongxin Du∗ , Guangyu Wang∗,‡
∗ State Key Laboratory of Networking and Switching Technology, Beijing University of Posts and Telecommunications, Beijing 100876, China
† Department of Computer Science and Technology, Tsinghua University, Beijing 100084, China
Abstract—Large language models (LLMs) have made remarkable strides in natural language understanding and generation. However, their performance in specialized fields such as medicine often falls short due to the lack of domain-specific knowledge during pre-training. While fine-tuning on labeled medical data is a common approach for task adaptation, it may not capture the comprehensive medical knowledge required. In this paper, we propose a Hybrid Granularity Mask Learning (HGM) method for domain adaptation in the medical field. Our method incorporates multi-level linguistic characteristics, including tokens, entities, and subsentences, to enable the model to acquire medical knowledge comprehensively. We fine-tune medical-specific language models derived from ChatGLM-6B and Bloom-7B on downstream medical tasks and evaluate their performance. The results demonstrate a significant improvement over the baseline, affirming the effectiveness of our proposed method.

Index Terms—Language models, domain adaptation, medical knowledge, mask learning, medical question-answering

I. INTRODUCTION

Instruction-following large language models (LLMs), such as ChatGPT, PaLM, and LLaMA [1]–[3], have made significant strides in natural language understanding and generation. Built on deep learning techniques [4], LLMs show great potential in generating responses, making them valuable in a variety of applications such as language translation, question answering, and text generation. However, most LLMs are not tailor-made for the medical field. Given the abundance of specialized terminology, acronyms, and jargon in medicine, the general-domain corpus used during pre-training of these models often falls short (e.g., "ER" stands for "Emergency Room" in medicine, which may not be understood correctly). Consequently, hallucination occurs, where the generated content may not align with common medical sense. This lack of accurate medical content poses a risk of misleading users and causing potential harm [5]. Thus, it becomes imperative to adapt LLMs specifically to the medical domain to ensure their effectiveness in healthcare applications.

The predominant methodology for model adaptation is fine-tuning on labeled domain-specific data [6]. As GPT-4 is not open-source, works such as DoctorGLM [7] and HuaTuo [8] fine-tune open-source alternatives for various medical tasks using parameter-efficient fine-tuning (PEFT) [9] and adapter-based tuning. Although effective, these efforts are more concerned with task adaptation than with domain adaptation. The difference between the two approaches is illustrated in Fig. 1. Task adaptation (TA) involves directly fine-tuning the model on a downstream task (a), whereas domain adaptation (DA) involves training the model to learn domain knowledge and then fine-tuning it on the downstream task (b). Previous research has shown that DA can effectively improve performance [10].

Fig. 1. Task adaptation and domain adaptation. (a) Task adaptation: the LLM parameters store general knowledge and are fine-tuned directly on domain-specific tasks. (b) Domain adaptation: the LLM learns domain-specific knowledge through training and then undergoes fine-tuning on domain-specific tasks.

However, gaining proficiency in medical domain knowledge through traditional domain adaptation methods poses significant challenges due to the rigorous and specialized nature of the medical field. To address this issue, this paper proposes a novel approach, Hybrid Granularity Mask Learning (HGM). The learning of medical knowledge is divided into three levels: token level, entity level, and subsentence level, and mask learning is performed at these granularities, enabling the model to predict the masked content at a flexible scale. This allows the model to learn medical knowledge in a more comprehensive manner, encompassing individual characters, words, sentences, and even semantics. The three levels of training are combined in a mixed manner. After this training stage, we obtain a medical-specific LLM, which we subsequently fine-tune on downstream tasks. Compared to directly fine-tuning a general LLM on downstream tasks, the medical-specific LLM requires fewer parameter updates during fine-tuning and demonstrates enhanced performance [11].
We chose ChatGLM-6B [12] and Bloom-7B1 as our base models and collected unlabeled medical corpus knowledge data, including Med-Wiki and CMeKG2 (Chinese Medical Knowledge Graph), as well as labeled named entity recognition (NER) data. Token-level and subsentence-level mask learning used the unlabeled data, and mask learning at all three levels was conducted on the NER data. We combined these two sets of data for mask learning and obtained a medical-specific model after training. Subsequently, we collected data for downstream medical tasks and tested the model's performance on a medical question-answering task. The results showed a significant improvement compared to the baseline.

1 https://huggingface.co/bigscience/bloom-7b1
2 http://cmekg.pcl.ac.cn/

In summary, our contributions can be summarized as follows:
• We propose a novel, efficient domain adaptation method, Hybrid Granularity Mask Learning, which divides the learning of medical knowledge into multiple hierarchical granularities, including token level, entity level, and subsentence level, allowing the model to learn medical knowledge in a more comprehensive manner through mask learning.
• We conduct HGM on unlabeled medical corpus knowledge data and labeled NER data, enabling the model to understand medical texts better.
• Our method demonstrates significant performance improvement compared to the baseline through evaluation on a medical question-answering task, indicating the effectiveness of the proposed method.

II. RELATED WORK

A. Fine-tuned Models in the Biomedical Domain

Although large language models have achieved impressive performance in the general domain, their accuracy on most specialized tasks is still lacking due to a lack of domain-specific knowledge. There have therefore been many efforts to fine-tune them for specific tasks [13]–[15]. As full-parameter fine-tuning is difficult and computationally expensive, most of these efforts use Parameter-Efficient Fine-Tuning (PEFT) [9], such as DoctorGLM [7], which uses LoRA [16] fine-tuning, and ChatGLM-Med3, which uses P-tuning v2 [17]. PEFT adjusts only a subset of parameters, reducing the GPU memory requirement and enabling fine-tuning with fewer resources. Furthermore, the recent introduction of QLoRA [18] has further reduced the necessary GPU resources.

3 https://github.com/SCIR-HI/Med-ChatGLM

B. Instruction-based Domain Specialization

Instruction-based domain specialization refers to the use of task-specific prompts to enhance the capabilities of large language models. Prompts, or task-specific input texts, are designed to elicit specific model responses, guiding the content generation process of LLMs and setting expectations for the desired output. This may not seem to yield a significant improvement on its own; however, further pre-training on prompts can enhance the model's ability to follow user intent. In addition, LLMs can produce more accurate and less toxic responses with the help of prompts [19], [20]. Since LLMs are trained on large-scale corpora, prompts can sometimes guide them to lean towards utilizing high-quality data, resulting in more specialized and accurate responses.

C. Knowledge Retrieval Augmentation

Retrieval-based enhancement does not require adjusting model parameters but instead enhances LLMs by retrieving relevant information from external sources. When external knowledge sources include domain-specific information, it is crucial to prioritize the contextual information if the data source contains task-related details that contradict the model's memorized knowledge [6]. This approach can correct model predictions without the need for frequent retraining. The general practice is first to vectorize all the queries and external knowledge bases and then search for similar external information in the vector space based on the input embedding. The degree of similarity can be measured from the perspectives of word similarity, word analogy, and concept categorization [21].
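As a concrete illustration of this retrieval step, the minimal sketch below (our own example, not code from the paper; the toy hashed bag-of-words encoder and the knowledge snippets are placeholders for a real sentence encoder and a medical knowledge base) embeds the query and the external entries in a shared vector space and ranks the entries by cosine similarity.

```python
import numpy as np

def embed(text: str, dim: int = 256) -> np.ndarray:
    """Toy stand-in for a real sentence encoder: hashed bag-of-words, unit-normalized."""
    v = np.zeros(dim)
    for tok in text.lower().split():
        v[hash(tok) % dim] += 1.0
    norm = np.linalg.norm(v)
    return v / norm if norm > 0 else v

# Hypothetical external medical knowledge snippets.
knowledge_base = [
    "ER is a common abbreviation for Emergency Room.",
    "Acute upper respiratory tract infection is usually caused by viruses.",
    "Metformin is a first-line drug for type 2 diabetes.",
]
kb_vectors = np.stack([embed(doc) for doc in knowledge_base])

def retrieve(query: str, top_k: int = 2):
    """Rank knowledge entries by cosine similarity to the query embedding."""
    q = embed(query)
    scores = kb_vectors @ q  # cosine similarity, since all vectors are unit-norm
    order = np.argsort(-scores)[:top_k]
    return [(knowledge_base[i], float(scores[i])) for i in order]

print(retrieve("What does ER stand for in a hospital?"))
```

In a deployed system, the retrieved passages would be prepended to the LLM prompt, and the toy encoder would be replaced by a domain-tuned embedding model.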
III. METHOD

A. Hybrid Granularity Mask Continual Pre-training

In medical texts, a large number of medical terms are composed of several consecutive words, such as "acute upper respiratory tract infection". The meaning expressed by a single word is extremely different from that of the whole phrase. This calls for a flexible scale at the word and entity levels so that such consecutive words are encoded as a single unit and the words within the unit receive similar attention, rather than being treated as unrelated words. Furthermore, thanks to the powerful generation capability of LLMs, we also conduct mask learning at the subsentence level, and our results demonstrate the effectiveness of this approach. Inspired by ERNIE [22], we adopt a multi-scale strategy to capture multi-scale information from the sentence.

In our study, we utilized continual pre-training techniques, specifically inspired by the work of Gururangan et al. [23], to train the model on the collected data. We employed hybrid granularity mask learning, applying it at both the token level and the subsentence level on the unlabeled data. Furthermore, we extended the mask learning approach to all three levels, token-level, entity-level, and subsentence-level, leveraging the NER data.
By predicting the masked content, the model was able to learn language knowledge at a deeper level, encompassing the understanding of words, sentences, and even semantics. This comprehensive approach enhanced the model's proficiency in the medical domain.
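To make the three granularities concrete, the sketch below (our illustration, not the authors' released pipeline; the example sentence, NER span, and masking rates are hypothetical) builds one masked training sample: NER-annotated spans are replaced by a single [E-MASK], one comma-delimited clause is replaced by [S-MASK], and a fraction of the remaining tokens are replaced by [MASK]; the recorded originals become the prediction targets.

```python
import random

TOKEN_MASK, ENTITY_MASK, SUBSENT_MASK = "[MASK]", "[E-MASK]", "[S-MASK]"

def hgm_mask(tokens, entity_spans, token_rate=0.15, entity_rate=0.5, subsent_rate=0.5):
    """Return (masked token list, list of (mask type, original text) targets)."""
    masked, targets = list(tokens), []

    # Entity level: replace each selected NER span with a single [E-MASK].
    for start, end in sorted(entity_spans, reverse=True):
        if random.random() < entity_rate:
            targets.append((ENTITY_MASK, " ".join(masked[start:end])))
            masked[start:end] = [ENTITY_MASK]

    # Subsentence level: replace one comma-delimited clause with [S-MASK].
    if random.random() < subsent_rate and "," in masked:
        cut = masked.index(",") + 1
        stop = cut + masked[cut:].index(",") if "," in masked[cut:] else len(masked)
        targets.append((SUBSENT_MASK, " ".join(masked[cut:stop])))
        masked[cut:stop] = [SUBSENT_MASK]

    # Token level: BERT-style random masking of the remaining single tokens.
    for i, tok in enumerate(masked):
        if tok not in (ENTITY_MASK, SUBSENT_MASK, ",") and random.random() < token_rate:
            targets.append((TOKEN_MASK, tok))
            masked[i] = TOKEN_MASK

    return masked, targets

# Hypothetical example mirroring Fig. 2; (12, 14) marks "blood pressure" as an entity.
sentence = "Eat a light diet , if dizzy , you can take your blood pressure".split()
print(hgm_mask(sentence, entity_spans=[(12, 14)]))
```

During continual pre-training, each target is predicted from its surrounding context, which is what allows entities and clauses to be treated as single semantic units rather than unrelated tokens.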
Fig. 2. HGM and BERT MLM. The input sentence is "Eat a light diet; if dizzy, you can take your blood pressure". (a) HGM: masks "Eat" as a token, "if dizzy" as a subsentence, and "blood pressure" as an entity. (b) BERT: masked tokens: "Eat", "if", "can", and "blood".

As shown in Fig. 2, BERT [24] (Bidirectional Encoder Representations from Transformers) masks only tokens, while HGM considers tokens, entities, and subsentences. The latter treats entities and subsentences as single units, enabling it to capture semantic information more accurately. An LLM is typically a pre-trained language model based on the Transformer architecture. Assume the input text sequence is X = [t_1, t_2, ..., t_i, ..., t_T], where t_i denotes the i-th token. At the token level, the model is trained to predict masked tokens given their context. This can be formulated as a masked language modeling (MLM) task, where a random subset of tokens is replaced with a special mask token [MASK]. Let m_{t,i} be the i-th masked token and \hat{m}_{t,i} be its prediction; the token-level mask learning loss is then the negative log-likelihood of predicting the correct token, as in (1):

L_{token}(\theta) = -\sum_{i=1}^{M_t} \log p(\hat{m}_{t,i} = m_{t,i} \mid \theta)    (1)

where \theta denotes the parameters of the model and M_t is the number of [MASK]s. Assume there are E entities in the text, represented as [e_1, e_2, ..., e_j, ..., e_E], where each entity e_j typically consists of multiple tokens, denoted as [t_{j1}, t_{j2}, ...]. At the entity level, the model is trained to predict masked entity spans e_j instead of individual tokens. This can be formulated as a masked entity prediction task, where a random subset of entity spans is replaced with a special mask entity [E-MASK], and the model is trained to predict the original entity from the context. The entity-level mask learning loss is defined as (2), where M_e is the number of [E-MASK]s and \hat{m}_{e,i} is the prediction for the i-th masked entity m_{e,i}. At the subsentence level, the model is trained to predict masked subsentences. Subsentences in the text are randomly replaced with a special mask [S-MASK], and with the powerful generation capability of the LLM, the model can be trained to predict the masked clauses given the context. Similarly, the subsentence-level mask learning loss is defined as (3), where M_s is the number of [S-MASK]s and \hat{m}_{s,i} is the prediction for the i-th masked subsentence m_{s,i}.

L_{entity}(\theta) = -\sum_{i=1}^{M_e} \log p(\hat{m}_{e,i} = m_{e,i} \mid \theta)    (2)

L_{sub-sent}(\theta) = -\sum_{i=1}^{M_s} \log p(\hat{m}_{s,i} = m_{s,i} \mid \theta)    (3)

Overall, the total loss L(\theta) is a weighted sum of the token-level, entity-level, and subsentence-level mask learning losses, as in (4), where \alpha and \beta are hyperparameters that control the relative importance of each component. By minimizing this loss function during continual pre-training, the model learns language knowledge at a deeper level in terms of words, sentences, and semantics, and can then be further fine-tuned for downstream tasks.

L(\theta) = L_{token}(\theta) + \alpha L_{entity}(\theta) + \beta L_{sub-sent}(\theta)    (4)

After training, we obtain a medical-specific LLM, a language model that has learned medical knowledge.
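The following minimal PyTorch sketch shows how Eq. (4) can be assembled, assuming the data pipeline records which granularity produced each masked position; the tensor shapes, the granularity encoding, and the hyperparameter values are our own conventions, not the paper's released code.

```python
import torch
import torch.nn.functional as F

def hgm_loss(logits, labels, granularity_ids, alpha=1.0, beta=1.0):
    """Weighted sum of the three mask-learning losses, as in Eq. (4).

    logits:          (batch, seq_len, vocab) model predictions
    labels:          (batch, seq_len) target ids, -100 at unmasked positions
    granularity_ids: (batch, seq_len) 0 = token mask, 1 = entity mask,
                     2 = subsentence mask, -1 = not masked
    """
    def nll_at(level):
        keep = (granularity_ids == level) & (labels != -100)
        if keep.sum() == 0:
            return logits.new_zeros(())
        return F.cross_entropy(logits[keep], labels[keep])

    l_token, l_entity, l_subsent = nll_at(0), nll_at(1), nll_at(2)
    return l_token + alpha * l_entity + beta * l_subsent

# Smoke test with random tensors: batch of 1, sequence length 6, vocabulary of 10.
logits = torch.randn(1, 6, 10)
labels = torch.tensor([[-100, 3, -100, 7, 2, -100]])
gran   = torch.tensor([[-1, 0, -1, 1, 2, -1]])
print(hgm_loss(logits, labels, gran, alpha=0.5, beta=0.5))
```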
B. Downstream Task Adapter

Adapters are trainable modules that can be inserted between the layers of a pre-trained model without modifying its original parameters [9], and they have proven to be a promising approach for enhancing the performance of language models on specific downstream tasks [25]. Instead of retraining the entire model from scratch, adapters allow us to selectively modify certain parts of the model while keeping the majority of the parameters unchanged. This enables sustainable parameter sharing and facilitates transfer learning across different domains. By incorporating adapters into the original language model, we can fine-tune the model to adapt to the diverse requirements of various tasks.

In the context of medical applications, the unique characteristics of tasks such as medical consultations and doctor-patient dialogues make it challenging for pre-training alone to capture all the nuances and intricacies involved. Adapters can be designed and trained specifically for the task at hand, allowing the model to learn task-specific patterns and improve its performance accordingly. By incorporating adapters, we can specialize the model for these specific tasks while leveraging the knowledge acquired during pre-training.

Numerous studies have demonstrated the effectiveness of integrating adapters into models for improving performance on downstream tasks [26], [27] (e.g., SparseAdapter [28] and K-Adapter [29]). These findings support the notion that adapting a pre-trained language model with task-specific adapters can lead to enhanced performance and accurate results in the medical field.
In this paper, we implemented Low-Rank Adaptation (LoRA) [16] on the medical-specific LLMs. As Fig. 3 shows, let W_0 represent the parameters of the medical-specific LLM; for task T_i, we add a low-rank matrix B_i A_i to the medical-specific LLM. The merging process follows a simple linear relationship, and the resulting model W_i is represented as (5):

W_i = W_0 + B_i A_i    (5)

This approach not only circumvents the need for full-parameter fine-tuning, which requires substantial computation, but also allows for the shared utilization of the acquired medical knowledge. After pre-training, the medical-specific LLM was fine-tuned on downstream medical tasks such as medical question answering, with one bypass added per task.

Fig. 3. Add different task adapters for various tasks.
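To illustrate the merging relation in Eq. (5), the self-contained sketch below implements a single linear layer with a frozen base weight W_0 and one trainable low-rank bypass B_i A_i per downstream task. In our experiments LoRA is applied to the medical-specific LLM through standard PEFT tooling, so this layer only sketches the idea; the task names and rank are placeholders.

```python
import torch
import torch.nn as nn

class TaskLoRALinear(nn.Module):
    """Frozen base weight W0 shared by all tasks, plus one low-rank bypass
    (B_i A_i) per task, so that effectively W_i = W0 + B_i A_i as in Eq. (5)."""

    def __init__(self, base: nn.Linear, task_names, rank=8, scale=1.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)                 # W0 stays fixed
        d_out, d_in = base.weight.shape
        self.scale = scale
        self.A = nn.ParameterDict({t: nn.Parameter(0.01 * torch.randn(rank, d_in))
                                   for t in task_names})
        self.B = nn.ParameterDict({t: nn.Parameter(torch.zeros(d_out, rank))
                                   for t in task_names})  # zero init: W_i == W0 at start

    def forward(self, x, task: str):
        # Base projection plus the task-specific low-rank correction.
        return self.base(x) + self.scale * (x @ self.A[task].T) @ self.B[task].T

# Hypothetical usage: one bypass per medical QA dataset.
layer = TaskLoRALinear(nn.Linear(16, 16), ["meddep", "medqa", "cmedqa2"], rank=4)
x = torch.randn(2, 16)
print(layer(x, task="medqa").shape)  # torch.Size([2, 16])
```

Because only the B_i and A_i matrices receive gradients, the medical knowledge stored in W_0 is shared across all downstream tasks while each task keeps its own small bypass.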
IV. EXPERIMENTS

A. Experimental Settings

The dataset used in this work is divided into two parts: one for continual pretraining and the other for downstream tasks. The pretraining data primarily consists of unlabeled medical expository texts, such as Med-Wiki and CMeKG, as well as labeled NER data. The Med-Wiki corpus contains a large collection of medical texts without specific annotations. These texts serve as a valuable resource for the continual pretraining phase, allowing the language model to gain general medical knowledge and language understanding. CMeKG, on the other hand, is a Chinese medical knowledge graph constructed using a combination of manual curation and AI methods. It consists of 62,000 medical entities and 374,000 relation triples, providing detailed information about 9 medical entity types and 23 different medical relationships. The inclusion of CMeKG in the pretraining data enriches the model's domain-specific knowledge and understanding of medical concepts. In addition to the unlabeled data, there is also labeled NER data available. This data is specifically annotated for named entity recognition, which is a crucial task in many downstream applications. The labeled NER data helps the model learn to identify and extract important medical entities and anatomical terms. By utilizing these diverse datasets, the language model can benefit from both unlabeled text data for unsupervised learning during pretraining and labeled data for supervised learning during downstream task training. This combination of data sources helps the model acquire both general medical knowledge and task-specific information, enabling it to perform effectively on a wide range of medical tasks.

Med-Wiki is a collection of medical corpora gathered from the web, such as CPubMed4 and A-hospital5, totaling approximately 28K records. The NER data is from CHIP6. For the downstream tasks, we collected medical question-and-answer datasets, namely MedDep, MedQA, and cMedQA2 [30]. The scale of the medical QA datasets is shown in Table I.

4 http://www.chinapubmed.net/
5 http://www.a-hospital.com/

TABLE I
THE SIZES OF THE THREE MEDICAL QUESTION-ANSWERING DATASETS

Dataset  MedDep  MedQA  cMedQA2
Size     300k    300k   9550

In this paper, we define the baseline models as follows:
• Base Model (BM): We choose ChatGLM-6B and Bloom-7B as the base models. These models are used for downstream tasks with LoRA fine-tuning.
• Token-level MLM (TLM): Similar to the masking approach used in BERT [24], we perform MLM training at the token level. We continue pre-training the ChatGLM and Bloom models and then fine-tune them.
• Hybrid Granularity Masking (HGM): Medical knowledge is divided into three levels: token-level, entity-level, and subsentence-level. Mask learning is applied at these three granularities to enable the model to predict masked content.

The training process consists of two stages: Stage 1 involves continually pretraining the model to acquire medical knowledge, and Stage 2 focuses on fine-tuning the model on downstream tasks. In Stage 1, we trained the models on 4 Nvidia A40 GPUs with a learning rate of 1e-4. Each GPU had a batch size of 2, and we utilized a gradient accumulation strategy with a step of 16. We trained the models for 2 epochs. In Stage 2, the configurations remained the same, except for the learning rate, which was adjusted to 2e-5. Additionally, to ensure consistency across datasets of varying scales, we trained the models for a fixed number (1e4) of steps.
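The sketch below summarizes these settings and the gradient-accumulation loop; it assumes a Hugging Face-style model whose forward pass returns an object with a .loss attribute, and the variable names are ours. With 4 GPUs, a per-GPU batch of 2, and 16 accumulation steps, each optimizer update effectively sees 128 sequences.

```python
import torch

# Stage-specific settings as reported above; the dictionary layout is our own.
STAGES = {
    "stage1_continual_pretraining": dict(lr=1e-4, epochs=2),
    "stage2_downstream_finetuning": dict(lr=2e-5, max_updates=int(1e4)),
}
PER_GPU_BATCH, ACCUM_STEPS, NUM_GPUS = 2, 16, 4
EFFECTIVE_BATCH = PER_GPU_BATCH * ACCUM_STEPS * NUM_GPUS  # 128 sequences per update

def train_with_accumulation(model, loader, lr, max_updates):
    """Single-GPU sketch of the gradient-accumulation strategy (step of 16)."""
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr)
    optimizer.zero_grad()
    updates = 0
    for i, batch in enumerate(loader):
        # Assumes a Hugging Face-style causal LM that returns .loss when given labels.
        loss = model(**batch).loss / ACCUM_STEPS      # scale so gradients average
        loss.backward()
        if (i + 1) % ACCUM_STEPS == 0:                # one update per 16 micro-batches
            optimizer.step()
            optimizer.zero_grad()
            updates += 1
            if updates >= max_updates:
                break
    return updates
```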
B. Results

We used the Base Model (BM) as the baseline and fine-tuned both BM and HGM on the three datasets with LoRA to compare their performance on the three medical QA datasets. We chose Google-BLEU and ROUGE as the evaluation metrics. Google-BLEU is a metric specifically designed for the automatic evaluation of machine translation quality, while ROUGE is a family of metrics for evaluating automatic summarization and text generation systems.
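For reference, the snippet below is a minimal word-level ROUGE-L (F1) implementation based on the longest common subsequence; it is only an illustration of the metric, and the actual evaluation, especially on Chinese text, would typically rely on standard tooling and appropriate segmentation.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of token lists a and b."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if x == y else max(dp[i - 1][j], dp[i][j - 1])
    return dp[len(a)][len(b)]

def rouge_l_f1(reference: str, candidate: str) -> float:
    """ROUGE-L F1 between a reference answer and a generated answer."""
    ref, cand = reference.split(), candidate.split()
    lcs = lcs_length(ref, cand)
    if lcs == 0:
        return 0.0
    recall, precision = lcs / len(ref), lcs / len(cand)
    return 2 * precision * recall / (precision + recall)

print(rouge_l_f1("take your blood pressure if dizzy", "if dizzy take blood pressure"))
```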
Table II demonstrates that when using Bloom as the base model, HGM outperformed the baseline on all datasets. The average improvement in Rouge-L was over 4.8%, and the average improvement in Google-BLEU was over 4.7%.

TABLE II
THE PERFORMANCE OF HGM WHEN BLOOM IS USED AS THE BM.

Dataset   Model  Rouge-L  Google-BLEU  Rouge-2  Rouge-1
MedDep    BM     0.4439   0.2714       0.2607   0.496
MedDep    HGM    0.4985   0.3287       0.3245   0.549
MedQA     BM     0.4126   0.2292       0.2095   0.4703
MedQA     HGM    0.4562   0.2723       0.2582   0.5133
cMedQA2   BM     0.4506   0.2416       0.2361   0.516
cMedQA2   HGM    0.4924   0.2694       0.277    0.5521
Avg.      BM     0.4307   0.2513       0.2352   0.4867
Avg.      HGM    0.479    0.2981       0.288    0.5336

Table III shows that when using ChatGLM as the base model, HGM surpassed the baseline on all datasets. The average improvement in Rouge-L and Google-BLEU was approximately 1.4%.

TABLE III
THE PERFORMANCE OF HGM WHEN CHATGLM IS USED AS THE BM.

Dataset   Model  Rouge-L  Google-BLEU  Rouge-2  Rouge-1
MedDep    BM     0.5015   0.3331       0.3298   0.5627
MedDep    HGM    0.5172   0.3505       0.3483   0.5767
MedQA     BM     0.476    0.2907       0.2777   0.5377
MedQA     HGM    0.4884   0.3057       0.2924   0.5492
cMedQA2   BM     0.4775   0.2536       0.2622   0.548
cMedQA2   HGM    0.4865   0.2593       0.2705   0.5552
Avg.      BM     0.4875   0.3034       0.2991   0.5369
Avg.      HGM    0.501    0.3177       0.3148   0.5621

Overall, our approach yields improvements with both language models. Specifically, Bloom shows a greater improvement than ChatGLM, while ChatGLM achieves higher overall performance than Bloom, which can be attributed to its specialized optimization for Chinese and the use of a larger training dataset. ChatGLM was trained on a corpus of 1 trillion tokens [12], equally distributed between Chinese and English, whereas Bloom [31] was trained on 350 billion tokens covering 46 natural languages and 13 programming languages. One potential reason for ChatGLM's relatively smaller improvement compared to Bloom is the drawback of its prefix decoder architecture, which results in lower training efficiency: unlike the causal decoder structure, which calculates the loss across all tokens, the prefix decoder computes the loss only on the output without considering the input.

C. Analysis

In this subsection, we explore the contributions of the different parts of our method through an ablation study, comparing it to the traditional masked language model approach, which masks only tokens and has already been extensively studied. We focus on two variables: (1) removing the subsentence MLM, which leaves a combination of token-level and entity-level masking (-SSM); and (2) removing the entity MLM, which leaves a combination of token-level and subsentence-level masking (-EM). We conducted the ablation study with ChatGLM.

TABLE IV
ABLATION STUDY

Methods  Rouge-L  Google-BLEU  Rouge-2  Rouge-1
HGM      50.1     31.77        31.48    56.21
-SSM     49.36    31.06        30.68    55.56
-EM      49.56    31.32        30.94    55.73

Table IV presents the results of the ablation experiments, and as expected, both configurations exhibited a decrease in performance. When SSM was removed, there was a decrease of 0.74% in Rouge-L and 0.71% in Google-BLEU. When EM was removed, there was a decrease of 0.54% in Rouge-L and 0.45% in Google-BLEU. Comparatively, the base model exhibited a larger drop of 1.35% relative to our approach. The impact of EM was slightly smaller than that of SSM, which could be attributed to the ambiguity in defining the boundaries of token masks and entity masks, as some entities may consist of a single token.

Then, we implemented TLM on Bloom and ChatGLM following BERT [24]. The results of fine-tuning on the medical question-answering task are shown in Table V. When using Bloom as the base model, the improvements in Rouge-L and Google-BLEU were 1.76% and 1.72%, respectively, which were smaller than the improvements achieved by HGM (4.83% and 4.86%). When using ChatGLM as the base model, similarly smaller improvements were observed compared to HGM. It might be challenging to capture medical knowledge solely through token-level masking, while learning with a mixture of masking granularities makes it easier to acquire medical knowledge.

TABLE V
THE AVERAGE PERFORMANCE OF TLM ON THREE MEDICAL QUESTION-ANSWERING DATASETS

Base Model  Rouge-L  Google-BLEU
Bloom       0.4483   0.2685
ChatGLM     0.4892   0.3056

V. CONCLUSION

In this paper, we proposed a novel approach, Adapting Large Language Models to the Medical Domain through Hybrid Granularity Mask Learning, to enhance the performance of language models in the medical field. By deconstructing medical knowledge into three levels of learning, namely token-level, entity-level, and subsentence-level, we applied mask learning at these granularities to enable the model to predict masked content accurately. Our experimental results demonstrated
that these enhancements significantly improved the language model's ability to acquire in-depth language knowledge.

In conclusion, our approach presents a promising avenue for advancing language models' capabilities in the medical domain. Further research and development in this area can lead to substantial improvements in various medical applications, such as clinical decision support systems, medical text analysis, and information retrieval.

VI. LIMITATIONS

Although HGM achieves promising results on medical tasks compared to the baseline models, there are certain limitations. The effectiveness of the automatic evaluation metrics is limited, and it would be beneficial to introduce additional human-oriented evaluation metrics such as safety, usability, and smoothness. Our approach requires a two-stage training process, which is more complex than end-to-end model training. The recent update of ChatGLM to its v2 version7 may also need further experimentation.

7 https://github.com/THUDM/ChatGLM2-6B

ACKNOWLEDGMENT

This study was funded by the National Natural Science Foundation of China (grant 62272055), the New Cornerstone Science Foundation through the XPLORER PRIZE, and the Young Elite Scientists Sponsorship Program by CAST (2021QNRC001).

REFERENCES

[1] T. Brown, B. Mann, N. Ryder, M. Subbiah, J. D. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell et al., "Language models are few-shot learners," Advances in Neural Information Processing Systems, vol. 33, pp. 1877–1901, 2020.
[2] A. Chowdhery, S. Narang, J. Devlin, M. Bosma, G. Mishra, A. Roberts, P. Barham, H. W. Chung, C. Sutton, S. Gehrmann et al., "PaLM: Scaling language modeling with pathways," arXiv preprint arXiv:2204.02311, 2022.
[3] H. Touvron, T. Lavril, G. Izacard, X. Martinet, M.-A. Lachaux, T. Lacroix, B. Rozière, N. Goyal, E. Hambro, F. Azhar et al., "LLaMA: Open and efficient foundation language models," arXiv preprint arXiv:2302.13971, 2023.
[4] G. Wang, X. Liu, Z. Ying, G. Yang, Z. Chen, Z. Liu, M. Zhang, H. Yan, Y. Lu, Y. Gao et al., "Optimized glycemic control of type 2 diabetes with reinforcement learning: a proof-of-concept trial," Nature Medicine, pp. 1–10, 2023.
[5] Y. Lu, X. Liu, Z. Du, Y. Gao, and G. Wang, "MedKPL: a heterogeneous knowledge enhanced prompt learning framework for transferable diagnosis," Journal of Biomedical Informatics, p. 104417, 2023.
[6] C. Ling, X. Zhao, J. Lu, C. Deng, C. Zheng, J. Wang, T. Chowdhury, Y. Li, H. Cui, T. Zhao et al., "Beyond one-model-fits-all: A survey of domain specialization for large language models," arXiv preprint arXiv:2305.18703, 2023.
[7] H. Xiong, S. Wang, Y. Zhu, Z. Zhao, Y. Liu, L. Huang, Q. Wang, and D. Shen, "DoctorGLM: Fine-tuning your Chinese doctor is not a herculean task," 2023.
[8] H. Wang, C. Liu, N. Xi, Z. Qiang, S. Zhao, B. Qin, and T. Liu, "HuaTuo: Tuning LLaMA model with Chinese medical knowledge," arXiv preprint arXiv:2304.06975, 2023.
[9] N. Ding, Y. Qin, G. Yang, F. Wei, Z. Yang, Y. Su, S. Hu, Y. Chen, C.-M. Chan, W. Chen et al., "Parameter-efficient fine-tuning of large-scale pre-trained language models," Nature Machine Intelligence, vol. 5, no. 3, pp. 220–235, 2023.
[10] A. Ramponi and B. Plank, "Neural unsupervised domain adaptation in NLP: a survey," arXiv preprint arXiv:2006.00632, 2020.
[11] Q. Li, "Literature survey: domain adaptation algorithms for natural language processing," Department of Computer Science, The Graduate Center, The City University of New York, pp. 8–10, 2012.
[12] Z. Du, Y. Qian, X. Liu, M. Ding, J. Qiu, Z. Yang, and J. Tang, "GLM: General language model pretraining with autoregressive blank infilling," in Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2022, pp. 320–335.
[13] S. Li, C. Yang, Y. Yin, X. Zhu, Z. Cheng, L. Shang, X. Jiang, Q. Liu, and Y. Yang, "AutoConv: Automatically generating information-seeking conversations with large language models," in Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), 2023, pp. 1751–1762.
[14] D. Bill and T. Eriksson, "Fine-tuning a LLM using reinforcement learning from human feedback for a therapy chatbot application," 2023.
[15] R. Schumann, W. Zhu, W. Feng, T.-J. Fu, S. Riezler, and W. Y. Wang, "VELMA: Verbalization embodiment of LLM agents for vision and language navigation in street view," arXiv preprint arXiv:2307.06082, 2023.
[16] E. J. Hu, Y. Shen, P. Wallis, Z. Allen-Zhu, Y. Li, S. Wang, L. Wang, and W. Chen, "LoRA: Low-rank adaptation of large language models," in International Conference on Learning Representations, 2022. [Online]. Available: https://openreview.net/forum?id=nZeVKeeFYf9
[17] X. Liu, K. Ji, Y. Fu, W. L. Tam, Z. Du, Z. Yang, and J. Tang, "P-tuning v2: Prompt tuning can be comparable to fine-tuning universally across scales and tasks," 2022.
[18] T. Dettmers, A. Pagnoni, A. Holtzman, and L. Zettlemoyer, "QLoRA: Efficient finetuning of quantized LLMs," arXiv preprint arXiv:2305.14314, 2023.
[19] Y. Wang, Y. Kordi, S. Mishra, A. Liu, N. A. Smith, D. Khashabi, and H. Hajishirzi, "Self-Instruct: Aligning language model with self generated instructions," arXiv preprint arXiv:2212.10560, 2022.
[20] L. Ouyang, J. Wu, X. Jiang, D. Almeida, C. Wainwright, P. Mishkin, C. Zhang, S. Agarwal, K. Slama, A. Ray et al., "Training language models to follow instructions with human feedback," Advances in Neural Information Processing Systems, vol. 35, pp. 27730–27744, 2022.
[21] B. Wang, A. Wang, F. Chen, Y. Wang, and C.-C. J. Kuo, "Evaluating word embedding models: Methods and experimental results," APSIPA Transactions on Signal and Information Processing, vol. 8, p. e19, 2019.
[22] Z. Zhang, X. Han, Z. Liu, X. Jiang, M. Sun, and Q. Liu, "ERNIE: Enhanced language representation with informative entities," 2019.
[23] S. Gururangan, A. Marasović, S. Swayamdipta, K. Lo, I. Beltagy, D. Downey, and N. A. Smith, "Don't stop pretraining: Adapt language models to domains and tasks," 2020.
[24] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, "BERT: Pre-training of deep bidirectional transformers for language understanding," arXiv preprint arXiv:1810.04805, 2018.
[25] D. Emelin, D. Bonadiman, S. Alqahtani, Y. Zhang, and S. Mansour, "Injecting domain knowledge in language models for task-oriented dialogue systems," arXiv preprint arXiv:2212.08120, 2022.
[26] J. Pfeiffer, A. Kamath, A. Rücklé, K. Cho, and I. Gurevych, "AdapterFusion: Non-destructive task composition for transfer learning," arXiv preprint arXiv:2005.00247, 2020.
[27] A. Bapna, N. Arivazhagan, and O. Firat, "Simple, scalable adaptation for neural machine translation," arXiv preprint arXiv:1909.08478, 2019.
[28] S. He, L. Ding, D. Dong, M. Zhang, and D. Tao, "SparseAdapter: An easy approach for improving the parameter-efficiency of adapters," arXiv preprint arXiv:2210.04284, 2022.
[29] R. Wang, D. Tang, N. Duan, Z. Wei, X. Huang, G. Cao, D. Jiang, M. Zhou et al., "K-Adapter: Infusing knowledge into pre-trained models with adapters," arXiv preprint arXiv:2002.01808, 2020.
[30] S. Zhang, X. Zhang, H. Wang, L. Guo, and S. Liu, "Multi-scale attentive interaction networks for Chinese medical question answer selection," IEEE Access, vol. 6, pp. 74061–74071, 2018.
[31] A. Zeng, X. Liu, Z. Du, Z. Wang, H. Lai, M. Ding, Z. Yang, Y. Xu, W. Zheng, X. Xia, W. L. Tam, Z. Ma, Y. Xue, J. Zhai, W. Chen, P. Zhang, Y. Dong, and J. Tang, "GLM-130B: An open bilingual pre-trained model," 2022.