handling of query-passage interactions. Through a detailed distillation process, D2LLM effectively transfers knowledge from the teacher LLM, producing a student model that excels beyond the aforementioned models finetuned on the same datasets.

[Figure 1: prompt for symmetric search ("#Q and #P will each describe an event or problem, they may not be related. Based solely on these descriptions and your understanding of the world, judge whether #P is absolutely correct about the event in #Q, or whether #Q is absolutely correct about the event or problem in #P, please answer yes or no."), token inputs $x_i(1)\ldots x_i(L)$, $x_j(1)\ldots x_j(L)$, prompt tokens $p(1)\ldots p(L)$, aggregated embeddings $y_i^{\mathrm{agg}}$ and $y_j^{\mathrm{agg}}$, MLP $f_1$, teacher and student logits $z_{ij}^{\mathcal{T}}$ and $z_{ij}^{\mathcal{S}}$, and the Contrastive Imitation and Rank Imitation losses.]

…token that labels the query-passage pair $(X_i, X_j)$ as related or not, with the prompt $P$ aiding in adapting to different search task types (symmetric versus asymmetric). To determine the probability of "yes" or "no", we extract the corresponding rows from …
…ent nature and its learnability ensures adaptability.

IEM: After the PMA generates individual vectors for the query and passage, the IEM implicitly encodes the prompt (be it symmetric or asymmetric) and captures the query-passage interaction. As depicted in Figure 1(b)'s right panel, we concatenate the query and passage embeddings and input them into a Multi-Layer Perceptron (MLP) designed with two branches to handle the respective prompt nuances, which can be expressed as:

$$y_{ij}^{\mathcal{S}} = f_2\big(f_1([y_i^{\mathrm{agg}}, y_j^{\mathrm{agg}}])\big), \qquad (6)$$

where both $f_1$ and $f_2$ are MLPs. $f_1$ extracts elementary features from the combined embeddings, while $f_2 \in \{f_2^{\mathrm{sym}}, f_2^{\mathrm{asym}}\}$ is tailored to the two branches, processing symmetric and asymmetric searches. After deriving $y_{ij}^{\mathcal{S}}$, the student model computes its logits $z_{ij}^{\mathcal{S}}$ and score $s_{ij}^{\mathcal{S}}$ in a manner akin to the teacher model (cf. Eqs. (2)-(3)), but with a learnable linear layer $W^{\mathcal{S}}$. We underscore that the MLP operates as a flexible similarity metric, enhancing the description of the query-passage relationship beyond the commonly used cosine similarity in bi-encoders. To maintain efficiency, lightweight MLPs are utilized. Please refer to Appendix G for more discussion on IEM.
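To make the two-branch design of Eq. (6) concrete, the sketch below shows one way the IEM could be realized in PyTorch, using the dimensions reported in Appendix B (a concatenated input of two 4096-dimensional aggregated embeddings, hidden and output size 512). The class layout, the scalar "yes" head standing in for the learnable layer $W^{\mathcal{S}}$, and all names are our assumptions rather than the released implementation.

```python
import torch
import torch.nn as nn

class IEM(nn.Module):
    """Sketch of the Interaction Emulation Module: a shared feature MLP f1
    followed by two task-specific branches f2 (symmetric / asymmetric)."""

    def __init__(self, emb_dim=4096, hidden_dim=512):
        super().__init__()
        # f1: elementary features from the concatenated [query; passage] embedding.
        self.f1 = nn.Sequential(nn.Linear(2 * emb_dim, hidden_dim), nn.ReLU())
        # f2: one branch per search type, mirroring the symmetric/asymmetric prompts.
        self.f2 = nn.ModuleDict({
            "sym": nn.Sequential(nn.Linear(hidden_dim, hidden_dim), nn.ReLU()),
            "asym": nn.Sequential(nn.Linear(hidden_dim, hidden_dim), nn.ReLU()),
        })
        # Assumed scalar "yes" head standing in for the learnable layer W^S.
        self.w_s = nn.Linear(hidden_dim, 1)

    def forward(self, y_query, y_passage, task="sym"):
        y_ij = self.f2[task](self.f1(torch.cat([y_query, y_passage], dim=-1)))
        return self.w_s(y_ij).squeeze(-1)  # student logit z_ij^S

# Usage: iem = IEM(); z = iem(torch.randn(4, 4096), torch.randn(4, 4096), task="asym")
```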
3.2 Distillation

Knowledge distillation aims to impart the teacher model's capabilities to the student model. To accomplish this, we focus on three specific training objectives: contrastive, rank, and feature imitation.

3.2.1 Contrastive Imitation

For a given query $X_i$, we curate a set of positive samples $D^+$ (relevant passages) and negative samples $D^-$ (irrelevant passages). The negative sample set includes in-batch negatives and hard negatives, the latter sourced from the top-$k$ results using BM25 and bi-encoder evaluations (Ma et al., 2023; Li et al., 2023a,b), thus forming $D^- = D_I^- \cup D_H^-$. Note that a few hard negatives may potentially be latent positives, but our Contrastive Imitation can address this circumstance robustly, as shown below.

In the CI loss, $\tau$ is the temperature parameter, $s_{ij}^{\mathcal{T}}$ is the teacher's probability score for a "yes" between the pair $(i, j)$, and $z_{ij}^{\mathcal{S}}$ is the student's corresponding logit (unnormalized probability). The CI loss diverges from traditional contrastive loss by leveraging the teacher's scores $s^{\mathcal{T}}$ to account for varying relevance among samples, assigning higher weights to easy negatives than hard ones. Even if true positives incorrectly appear in $D^-$, the CI loss remains unaffected, ensuring a more robust training environment than standard contrastive loss. It also emphasizes positive samples with higher teacher scores, indicating their criticality.
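The CI loss equation itself is not reproduced above, so the snippet below is only one plausible teacher-weighted contrastive objective consistent with the stated properties (the teacher's "yes" scores $s^{\mathcal{T}}$ reweight positives, while negatives enter a softmax-style denominator); its exact weighting scheme, function name, and temperature value are assumptions.

```python
import torch

def contrastive_imitation_sketch(z_pos, z_neg, s_pos_teacher, tau=0.05):
    """Illustrative teacher-weighted contrastive loss (NOT the paper's exact equation).

    z_pos:         (P,) student logits z_ij^S for the positives of one query
    z_neg:         (N,) student logits z_ik^S for in-batch + hard negatives
    s_pos_teacher: (P,) teacher "yes" probabilities s_ij^T for the positives
    """
    # Shared denominator term built from the negatives only.
    neg_term = torch.logsumexp(z_neg / tau, dim=0)
    # log-softmax of each positive against the negatives.
    log_prob = z_pos / tau - torch.logaddexp(z_pos / tau, neg_term)
    # Teacher scores emphasize positives the teacher is confident about.
    return -(s_pos_teacher * log_prob).sum() / s_pos_teacher.sum().clamp(min=1e-6)

# Example:
# loss = contrastive_imitation_sketch(torch.randn(2), torch.randn(16), torch.tensor([0.9, 0.6]))
```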
3.2.2 Rank Imitation

While Contrastive Imitation (CI) loss effectively handles true positives and easy negatives, it does not adequately address the gradations among samples. This is where Rank Imitation (RI) steps in, focusing on distinguishing between positive and hard negative samples, as well as discerning easy from hard negatives, thus enabling the student to replicate the teacher's subtle ranking nuances.

To synchronize the student's and the teacher's ranking of positive and hard negative samples, we aim to maximize the Pearson correlation (Huang et al., 2022) between their logits. The RI loss dedicated to this alignment is:

$$\mathcal{L}_{PH}^{RI} = 1 - \mathrm{corr}(z_i^{\mathcal{T}}, z_i^{\mathcal{S}}), \qquad (7)$$

where $z_i^{\mathcal{T}} = [z_{ij}^{\mathcal{T}}]$ for $j \in D^+ \cup D_H^-$ signifies a vector of the teacher's logits for the combined set of positive and hard negative samples, and likewise for $z_i^{\mathcal{S}}$. We intentionally exclude in-batch negatives from this measure as they are generally easy negatives and lack the comparative relevance needed for meaningful ranking against the query $X_i$.

On the other hand, differentiating between hard and easy negatives is critical since hard negatives have some connection to the query, unlike easy negatives. To emphasize this, we introduce an additional RI loss for these two groups of samples:

$$\mathcal{L}_{HI}^{RI} = -\frac{1}{|D_H^-|\,|D_I^-|} \sum_{j \in D_H^-} \sum_{k \in D_I^-} \lambda_{jk} \log\big(\sigma(z_{ij}^{\mathcal{S}} - z_{ik}^{\mathcal{S}})\big), \qquad (8)$$

where $\lambda_{jk}$ is the rank comparison metric between a hard negative $j$ and an in-batch negative $k$ as determined by the teacher. The metric utilized is the normalized discounted cumulative gain (NDCG) (Järvelin and Kekäläinen, 2002). It gives a non-zero $\lambda_{jk}$ only when $z_{ij}^{\mathcal{T}} - z_{ik}^{\mathcal{T}} > 0$ in the teacher model. The design of this loss ensures that only when a sample is deemed an easy negative by the teacher does the student assign it a lower score than it would to a hard negative (i.e., $z_{ij}^{\mathcal{S}} - z_{ik}^{\mathcal{S}} > 0$). This approach allows the student to effectively differentiate between easy and hard negatives under the teacher's guidance, even when hard negatives are interspersed with in-batch negatives.
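A sketch of the two RI terms follows: Eq. (7) is implemented directly as one minus the Pearson correlation, while for Eq. (8) the NDCG-based rank comparison metric $\lambda_{jk}$ is not spelled out above, so the gating weight used below (the teacher's logit gap, clipped at zero) is a stand-in assumption.

```python
import torch

def ri_ph_loss(z_teacher, z_student):
    """Eq. (7): 1 - Pearson correlation between teacher and student logits
    over the positives and hard negatives of one query."""
    zt = z_teacher - z_teacher.mean()
    zs = z_student - z_student.mean()
    corr = (zt * zs).sum() / (zt.norm() * zs.norm() + 1e-8)
    return 1.0 - corr

def ri_hi_loss(z_t_hard, z_t_easy, z_s_hard, z_s_easy):
    """Eq. (8): pairwise logistic loss pushing student logits of hard negatives
    above in-batch (easy) negatives, gated by a teacher-derived weight lambda_jk."""
    # Pairwise logit differences, shape (|D_H|, |D_I|).
    diff_t = z_t_hard[:, None] - z_t_easy[None, :]
    diff_s = z_s_hard[:, None] - z_s_easy[None, :]
    # lambda_jk stand-in: zero whenever the teacher ranks the pair the other way
    # (the exact NDCG-based formula is not reproduced in this excerpt).
    lam = torch.clamp(diff_t, min=0.0)
    loss = -(lam * torch.nn.functional.logsigmoid(diff_s)).sum()
    return loss / (z_s_hard.numel() * z_s_easy.numel())
```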
3.2.3 Feature Imitation

CI and RI aim to align the student's output with that of the teacher, emphasizing output distillation. Working in tandem, feature distillation can also provide substantial benefits by leveraging the rich information encompassed in the teacher LLM. Directly aligning the embeddings of the classification token between the teacher and student models (i.e., $y_{ij}^{\mathcal{T}}$ and $y_{ij}^{\mathcal{S}}$) presents challenges due to the distinct architectures of the third component (see 3 in Figure 1). However, the relative relationships between embeddings for different query-passage pairs are less susceptible to such architecture variations (Liu et al., 2019). For instance, given two positive samples $X_j$ and $X_k$ as well as a negative sample $X_m$ for the same query $X_i$, $y_{ij}^{\mathcal{T}}$ is often closer to $y_{ik}^{\mathcal{T}}$ than to $y_{im}^{\mathcal{T}}$ in the feature space of the teacher, in order to produce a higher score for the pairs $(i, j)$ and $(i, k)$ than for $(i, m)$, and likewise for the student. To leverage this robustness, Feature Imitation (FI) first computes a similarity matrix for all query-passage pairs within a batch for the teacher as:

$$r_{ijk}^{\mathcal{T}} = \mathrm{sim}(y_{ij}^{\mathcal{T}}, y_{ik}^{\mathcal{T}}), \quad \forall j, k \in D^+ \cup D_H^-, \qquad (9)$$

where $\mathrm{sim}$ denotes cosine similarity, and repeats the process for the student to obtain $r_{ijk}^{\mathcal{S}}$. Note that the above similarity metric is evaluated between $y_{ij}^{\mathcal{T}}$ and $y_{ik}^{\mathcal{T}}$, not the passage embeddings $y_j^{\mathrm{agg}}$ and $y_k^{\mathrm{agg}}$ alone. The goal of FI is to minimize the $\ell_2$ norm of the difference between the teacher's and student's similarity matrices for all positive and hard negative sample combinations (i.e., $r_i^{\mathcal{T}} = [r_{ijk}^{\mathcal{T}}]$ and $r_i^{\mathcal{S}} = [r_{ijk}^{\mathcal{S}}]$ for all $j$ and $k$):

$$\mathcal{L}^{FI} = \|r_i^{\mathcal{T}} - r_i^{\mathcal{S}}\|_2^2. \qquad (10)$$

This approach guides the student to mimic the relational patterns in the teacher's representations, resulting in a deeper form of knowledge transfer.
3.2.4 Overall Loss

The collective loss is defined as a weighted sum of the above individual losses:

$$\mathcal{L} = \mathcal{L}^{CI} + \alpha \mathcal{L}_{PH}^{RI} + \beta \mathcal{L}_{HI}^{RI} + \gamma \mathcal{L}^{FI}, \qquad (11)$$

with the weights $\alpha$, $\beta$, and $\gamma$. This loss is used to train the student, including the PMA, the IEM, the linear layer $W^{\mathcal{S}}$, and the bi-encoder (i.e., 1 and 2 in Figure 1(b)). Note that the bi-encoder is trained using parameter-efficient finetuning methods, such as LoRA (Hu et al., 2021) and QLoRA (Dettmers et al., 2023). These strategies can enhance performance without imposing a significant increase in the number of learnable parameters. Remarkably, the learnable parameters in the bi-encoder are less than 4% of the LLM's total parameter count.
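For completeness, Eq. (11) amounts to the following weighted sum, shown with the default weights reported in Appendix B; the function name is ours.

```python
def overall_loss(l_ci, l_ri_ph, l_ri_hi, l_fi, alpha=1.0, beta=0.3, gamma=0.1):
    """Eq. (11): weighted sum of the four imitation losses.
    Default weights follow Appendix B (alpha=1, beta=0.3, gamma=0.1)."""
    return l_ci + alpha * l_ri_ph + beta * l_ri_hi + gamma * l_fi
```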
4 Experiments

In this section, we evaluate D2LLM's effectiveness on three tasks: Natural Language Inference (NLI), Semantic Textual Similarity (STS), and Information Retrieval (IR), with a focus on Chinese. The former two are symmetric search tasks, and the last is asymmetric. Details on datasets and evaluation metrics are provided in Appendix A. For NLI and STS, we use 0.3 million training samples, and 0.9 million for IR. The proposed method (denoted as D2LLM¹) is compared with five state-of-the-art baselines: BGE (Xiao et al., 2023), RocketQAv2 (Ren et al., 2021a), SGPT (Muennighoff, 2022), Udever (Zhang et al., 2023), and LLaRA (Li et al., 2023a). BGE and RocketQAv2 are BERT-style bi-encoders; BGE uses pretraining and finetuning, while RocketQAv2 distills from a cross-encoder, updating both the student and teacher. The remaining three finetune GPT-style LLMs into bi-encoders. For more details, please refer to Appendix C. The total number of parameters for each method is shown in parentheses after it. To ensure a fair comparison, all baseline methods have a number of tunable parameters (denoted as #Param. in Tables 1-2) equal to, or greater than, those in D2LLM, with the exception of SGPT, which regards the use of BitFit finetuning (Zaken et al., 2022) as its notable advantage.

¹The implementation details are presented in Appendix B.
Table 1: Results for NLI, with the best-performing method and the second-best results marked in bold and
underlined, respectively.
Dataset | OCNLI | CMNLI
Method | #Data | #Param. | ACC AP Prec. Recall | ACC AP Prec. Recall
BGE-ft(326M) 0.3M 326M 0.5463 0.5689 0.5466 0.5702 0.6097 0.6656 0.6174 0.6278
RocketQAv2(326M) 0.3M 326M 0.5603 0.5633 0.5123 0.5723 0.6164 0.6779 0.5966 0.6905
SGPT(7B) 0.3M 0.4M 0.5977 0.6165 0.6029 0.5994 0.6598 0.7259 0.6643 0.6727
Udever(7B) 0.3M 89M 0.6412 0.6811 0.6478 0.6698 0.7234 0.7819 0.7077 0.7306
LLaRA(7B) 0.3M 89M 0.6612 0.7115 0.6618 0.6889 0.7150 0.7815 0.7125 0.7381
D2LLM(7B) 0.3M 89M 0.7889 0.8145 0.7736 0.8149 0.8014 0.8759 0.7960 0.8241
Improv.(%) N/A N/A 19.31% 14.48% 17.30% 21.66% 10.78% 12.02% 11.72% 11.65%
BGE(326M) 100M 326M 0.7266 0.7646 0.7362 0.7191 0.7675 0.8580 0.7891 0.7381
LLM-be(7B) N/A N/A 0.5219 0.5083 0.5155 0.5955 0.5619 0.6175 0.5624 0.6762
LLM-ce(7B) N/A N/A 0.8776 0.9609 0.8409 0.9493 0.8347 0.9417 0.8263 0.9303
Table 2: Results for STS, with the best-performing method and the second-best results marked in bold and
underlined, respectively.
Dataset | ATEC | BQ | LCQMC | PAWSX | STSB | AFQMC | QBQTC
Method | #Data | #Param. | Pear. Spear. | Pear. Spear. | Pear. Spear. | Pear. Spear. | Pear. Spear. | Pear. Spear. | Pear. Spear.
BGE-ft(326M) 0.3M 326M 0.3827 0.4126 0.5388 0.5982 0.5683 0.6531 0.2982 0.3127 0.6648 0.6717 0.3492 0.3774 0.1982 0.2049
RocketQAv2(326M) 0.3M 326M 0.1971 0.2362 0.3815 0.3962 0.5368 0.6089 0.1687 0.1558 0.5662 0.5894 0.1945 0.2381 0.2325 0.2180
SGPT(7B) 0.3M 0.4M 0.3045 0.3173 0.5135 0.5241 0.4715 0.4767 0.1842 0.1653 0.5973 0.5842 0.3033 0.3077 0.1717 0.1736
Udever(7B) 0.3M 89M 0.3328 0.3602 0.5389 0.5531 0.5369 0.5819 0.2041 0.2063 0.6509 0.6601 0.3177 0.3246 0.2088 0.2102
LLaRA(7B) 0.3M 89M 0.343 0.3575 0.5233 0.5369 0.5698 0.5997 0.2113 0.2063 0.6910 0.7001 0.3046 0.3238 0.2127 0.2254
D2LLM(7B) 0.3M 89M 0.3731 0.3994 0.5487 0.5674 0.6210 0.6589 0.3038 0.2883 0.7273 0.7194 0.3676 0.3858 0.2749 0.2850
D2LLM-ft(7B) 0.3M 89M 0.4603 0.4759 0.5589 0.5705 0.6233 0.6475 0.2145 0.2573 0.7346 0.7729 0.3891 0.3966 0.2756 0.2933
Improv.(%) N/A N/A 20.27% 15.34% 3.71% -4.63% 9.39% 0.89% 1.88% -7.80% 6.31% 10.40% 11.43% 5.09% 18.54% 30.12%
BGE(326M) 100M 326M 0.4716 0.4785 0.6001 0.6224 0.6924 0.7249 0.3001 0.3584 0.7765 0.7763 0.4100 0.4253 0.2203 0.2424
LLM-be(7B) N/A N/A 0.2339 0.2178 0.3049 0.3007 0.4484 0.4507 0.1803 0.1676 0.5761 0.5767 0.1762 0.1837 0.1153 0.1147
LLM-ce(7B) N/A N/A 0.3670 0.4152 0.4432 0.4770 0.6224 0.7164 0.3125 0.4365 0.7453 0.7680 0.3643 0.3986 0.3355 0.3491
LLM-ce-ft(7B) 0.3M 89M 0.4816 0.4898 0.5868 0.5991 0.6205 0.7147 0.1978 0.4002 0.7873 0.8172 0.4284 0.4254 0.3414 0.3516
All methods, except SGPT, incorporate the hard negatives from the previous section, as SGPT's original training does not involve such negatives. It is worth noting that BGE relies on extensive data for pretraining and finetuning, while other methods only use the small dataset mentioned above for finetuning. As the pretrained BGE model was not accessible, we opted for Chinese-roberta-large-326M (Cui et al., 2021) and finetuned it using BGE's method with our dataset, referring to this version as BGE-ft. The original BGE is still retained as a reference. We also use Chinese-roberta-large as the base for RocketQAv2. For the other methods, we select Qwen-7B-Chat (Bai et al., 2023) as the base model. Additionally, the performance of the cross-encoder teacher model based on Qwen-7B-Chat (i.e., LLM-ce) is presented, as well as a bi-encoder (i.e., LLM-be) that generates sentence embeddings by mean-pooling token embeddings from the last layer of Qwen-7B-Chat, in order to gauge D2LLM's improvement over the untrained LLM-be and its proximity to the teacher LLM-ce's performance.

4.1 Results on NLI

We first investigate the performance of all methods for NLI. Table 1 shows that D2LLM outshines all competitors trained on the 0.3M sample set across all metrics and all testing datasets. Notably, it surpasses LLaRA, the second-best method, by a significant margin of 14.39%² on average and exceeds BGE, which was finetuned on 100M relevant samples, by 6.45%. Furthermore, D2LLM effectively narrows the gap between the intact bi-encoder LLMs (LLM-be) and cross-encoder LLMs (LLM-ce). While the original LLM-be falls short as a bi-encoder due to the mismatch between text generation and embedding, the cross-encoder-based teacher model LLM-ce excels by leveraging LLMs' ability to synthesize information from sentence pairs. Our distillation approach successfully transfers knowledge from the teacher to the student, transforming the initially ineffective LLM-be into a proficient NLI instrument. Additionally, SGPT, Udever, and LLaRA outperform the base LLM-be, highlighting that contrastive finetuning boosts LLM-be's capacity, albeit not as effectively as D2LLM's distillation. Among these, LLaRA and Udever benefit more from hard negatives in contrastive finetuning than SGPT, which relies on in-batch negatives, with LLaRA's token-level contrastive learning proving more advantageous than Udever's sentence-level approach. Furthermore, a comparison between BGE and BGE-ft reveals that BGE's impressive performance is likely attributed to its extensive pretraining and finetuning datasets. Finally, the performance of RocketQAv2 falls short of expectations, likely due to having fewer hard negatives per sample in our experiments (i.e., 8) compared to the original setting (i.e., 32 or 127); the efficacy of listwise distillation in RocketQAv2 depends on the hard negatives. Despite its outstanding performance, D2LLM still exhibits poor …

²We compute the relative gain in this paper, which is defined as the ratio of the difference between a new value and a reference value to the reference value itself.
Table 3: Ablation Study on the NLI Task.
Dataset | OCNLI | CMNLI | Average
Method | ACC AP Prec. Recall | ACC AP Prec. Recall | difference
−CI+CL 0.7572 0.7791 0.7411 0.7887 0.7755 0.8471 0.7692 0.8018 -3.59%
−RI_PH 0.7293 0.7658 0.7106 0.7589 0.7567 0.8129 0.7433 0.7885 -6.57%
−RI_HI 0.7375 0.7726 0.7390 0.7711 0.7721 0.8194 0.7609 0.7990 -4.92%
−FI 0.7666 0.8012 0.7518 0.8006 0.7939 0.8542 0.7771 0.8056 -2.17%
−PMA+mean 0.7734 0.7997 0.7586 0.8022 0.7954 0.8611 0.7823 0.8087 -1.71%
−PMA+[EOS] 0.7739 0.8023 0.7604 0.8025 0.7958 0.8642 0.7845 0.8107 -1.51%
−IEM+cos 0.7461 0.7886 0.7224 0.8025 0.7867 0.8377 0.7682 0.7921 -3.83%
D2LLM-1.8B 0.6907 0.7399 0.6769 0.6261 0.7102 0.7947 0.7107 0.5840 -14.76%
D2LLM 0.7889 0.8145 0.7736 0.8149 0.8014 0.8759 0.7960 0.8241 -
Figure 2: Runtime Analysis.
Guilherme Moraes Rosa, Luiz Bonifacio, Vitor Jeronymo, Hugo Abonizio, Marzieh Fadaee, Roberto Lotufo, and Rodrigo Nogueira. 2022b. No parameter left behind: How distillation and model size affect zero-shot retrieval. arXiv preprint arXiv:2206.02873.

Hongjin Su, Weijia Shi, Jungo Kasai, Yizhong Wang, Yushi Hu, Mari Ostendorf, Wen-tau Yih, Noah A. Smith, Luke Zettlemoyer, and Tao Yu. 2022. One embedder, any task: Instruction-finetuned text embeddings. arXiv preprint arXiv:2212.09741.

Jingtao Zhan, Jiaxin Mao, Yiqun Liu, Jiafeng Guo, Min Zhang, and Shaoping Ma. 2021. Optimizing dense retrieval model training with hard negatives. In Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 1503–1512.

Hang Zhang, Yeyun Gong, Yelong Shen, Jiancheng Lv, Nan Duan, and Weizhu Chen. 2022a. Adversarial retriever-ranker for dense text retrieval. In International Conference on Learning Representations.

Shunyu Zhang, Yaobo Liang, Ming Gong, Daxin Jiang, and Nan Duan. 2022b. Multi-view document representation learning for open-domain dense retrieval. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 5990–6000.

Xin Zhang, Zehan Li, Yanzhao Zhang, Dingkun Long, Pengjun Xie, Meishan Zhang, and Min Zhang. 2023. Language models are universal embedders. arXiv preprint arXiv:2310.08232.

Yutao Zhu, Huaying Yuan, Shuting Wang, Jiongnan Liu, Wenhan Liu, Chenlong Deng, Zhicheng Dou, and Ji-Rong Wen. 2023. Large language models for information retrieval: A survey. arXiv preprint arXiv:2308.07107.
A Datasets

We consider three tasks to verify the usefulness of all methods, including Natural Language Inference (NLI), Semantic Textual Similarity (STS), and Information Retrieval (IR). The training and testing data for each task are listed below. Note that all testing data are taken from the comprehensive benchmark CMTEB (Chinese Massive Text Embedding Benchmark).

• NLI: For the NLI task, models are tasked to discern the presence of an entailment relationship between pairs of sentences. The training data for this task involves the 0.3M data from SNLI-zh³ and NLI-zh⁴, specifically including the datasets named ATEC, BQ, LCQMC, and PAWSX. The testing datasets are OCNLI and CMNLI. Performance metrics include Accuracy, Average Precision (AP), Precision, and Recall.

• STS: For the STS task, the objective is to predict the degree of similarity between sentence pairs, with a higher predicted score indicating greater similarity. The training data is the same as for the above NLI task. The testing data involves ATEC, BQ, LCQMC, PAWSX, STSB, AFQMC, and QBQTC. Pearson and Spearman correlation coefficients serve as evaluation metrics.

• IR: For the IR task, each dataset comprises a corpus, a set of queries, and an associated mapping of each query to the relevant documents within the corpus. The aim is to accurately identify these pertinent documents for each query. The training data are randomly sampled from T2Ranking⁵, DuReader⁶, cMedQA2⁷, and mMARCO⁸. Concretely, we sample 50% from T2Ranking, 80% from DuReader, 80% from cMedQA2, and 35% from mMARCO, and finally compose a training dataset of 0.9M. The testing data involves T2Retrieval, DuRetrieval, CovidRetrieval, CmedqaRetrieval, MedicalRetrieval, and MMarcoRetrieval. To evaluate the retrieval effectiveness, each method retrieves and ranks the top-10 passages for each query, and MRR@10 and Recall@10 are then utilized as the metrics (see the sketch after this list).

³https://huggingface.co/datasets/shibing624/snli-zh
⁴https://huggingface.co/datasets/shibing624/nli_zh
⁵https://github.com/THUIR/T2Ranking
⁶https://github.com/baidu/DuReader
⁷https://github.com/zhangsheng93/cMedQA2
⁸https://huggingface.co/datasets/unicamp-dl/mmarco
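For reference, the hedged sketch below computes MRR@10 and Recall@10 for a single query from a ranked list of passage IDs; corpus-level scores average these values over all queries. The function names and signatures are illustrative assumptions, not code from CMTEB.

```python
def mrr_at_k(ranked_ids, relevant_ids, k=10):
    """Reciprocal rank of the first relevant passage within the top-k (0 if none)."""
    for rank, pid in enumerate(ranked_ids[:k], start=1):
        if pid in relevant_ids:
            return 1.0 / rank
    return 0.0

def recall_at_k(ranked_ids, relevant_ids, k=10):
    """Fraction of this query's relevant passages retrieved in the top-k."""
    retrieved = set(ranked_ids[:k]) & set(relevant_ids)
    return len(retrieved) / max(len(relevant_ids), 1)

# Corpus-level MRR@10 / Recall@10 are the means of these per-query values.
```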
B Implementation Details

Our D2LLM model is built upon PyTorch and DeepSpeed, using Qwen-7B-Chat as the teacher and the base LLM for the student due to its effectiveness with Chinese data. The model uses a batch size of 32, with each query having 8 hard negatives assigned. The PMA module features 32 heads, and the IEM includes two single-layer MLPs with ReLU activations—the first with an input size of 8192 and output of 512, and the second with consistent 512 dimensions for both. We set α = 1, β = 0.3, and γ = 0.1 in all our experiments, unless otherwise specified, based on the observations in Appendix F.3. The AdamW optimizer is used with a learning rate of 1e-4, including a warm-up over 0.2 epochs, and training is halted early upon model convergence. LoRA adjustments are made with a rank of 8, while mixed-precision training and gradient checkpointing minimize memory usage. Training runs on 8 NVIDIA A100 GPUs with 80GB each.
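Pooling by Multihead Attention (PMA) can be realized with a single learnable query vector attending over the token embeddings; the sketch below uses the 32 heads mentioned above and assumes Qwen-7B-Chat's 4096-dimensional hidden states, so that the query and passage vectors concatenate to the IEM's 8192-dimensional input. The class and argument names are ours, and details of the original PMA block may differ.

```python
import torch
import torch.nn as nn

class PMA(nn.Module):
    """Sketch of Pooling by Multihead Attention: one learnable query attends
    over a sentence's token embeddings to produce a single vector."""

    def __init__(self, emb_dim=4096, num_heads=32):
        super().__init__()
        self.seed = nn.Parameter(torch.randn(1, 1, emb_dim))  # learnable query
        self.attn = nn.MultiheadAttention(emb_dim, num_heads, batch_first=True)

    def forward(self, token_embeds, padding_mask=None):
        # token_embeds: (B, L, emb_dim) last-layer states; padding_mask: (B, L), True = pad.
        query = self.seed.expand(token_embeds.size(0), -1, -1)
        pooled, _ = self.attn(query, token_embeds, token_embeds,
                              key_padding_mask=padding_mask)
        return pooled.squeeze(1)  # (B, emb_dim) aggregated embedding y^agg

# Two PMA outputs (query and passage) concatenated give the IEM's 8192-dim input.
```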
For the consumption of computing resources, note that the number of tunable parameters in D2LLM, which amounts to 89 million including the modifications for PMA and IEM, remains on par with Udever and LLaRA. The consistent parameter count across these methods is due to our choice of using the LoRA adaptation technique for all models, and the foundational model is the same for D2LLM, Udever, and LLaRA, leading to a comparable computational cost for these LLM-based approaches. Besides, to optimize GPU memory utilization, we engage in a two-stage approach: initially, we utilize the teacher LLM to infer and store logits, relevant for the contrastive and rank imitation losses, as well as the similarity matrices necessary for feature imitation. These logits and similarity matrices, with relatively small memory footprints, correlate only with the selected positives and hard negatives for each sample. Next, during the subsequent training phase, the student model can learn from these pre-saved components without the simultaneous GPU presence of the teacher model. In practice, D2LLM training culminates in approximately 10 and 22 hours of training time for the symmetric (i.e., NLI and STS) and asymmetric (i.e., IR) search tasks, respectively. By contrast, the training durations for LLM-based bi-encoders, such as LLaRA, hover around 9 and 20 hours for the same tasks. This indicates that the resource usage is nearly equivalent between these methods, with D2LLM introducing only a minimal additional computational burden. To implement D2LLM with diminished resource utilization, a reduction in the number of tunable parameters could be beneficial, potentially by lowering the rank value in LoRA or by adopting other parameter-efficient finetuning methods like QLoRA (Dettmers et al., 2023), and investigating these resource-conserving alternatives will be a focus of our future work.
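The two-stage procedure can be approximated as follows: run the teacher once offline to dump the per-query logits and pair-embedding similarity matrices, then let the student's training loop read the cached tensors. The file layout, key names, and the `teacher_scores` callable are assumptions for illustration.

```python
import torch

@torch.no_grad()
def cache_teacher_outputs(teacher_scores, dataset, path="teacher_cache.pt"):
    """Stage 1: store teacher logits and pair-embedding similarity matrices for the
    selected positives and hard negatives of every query (small memory footprint)."""
    cache = {}
    for qid, query, passages in dataset:  # passages = positives + hard negatives
        logits, pair_embs = teacher_scores(query, passages)  # assumed teacher API
        sims = torch.nn.functional.normalize(pair_embs, dim=-1)
        cache[qid] = {"logits": logits.cpu(), "sims": (sims @ sims.T).cpu()}
    torch.save(cache, path)

# Stage 2: the student's training loop loads `teacher_cache.pt` and evaluates the
# CI/RI/FI losses against these tensors, so the 7B teacher never shares the GPU.
```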
C Baselines

The five benchmark methods are summarized below:

• BGE (Xiao et al., 2023) is a BERT-style bi-encoder. It involves three training stages. First, the model is pretrained with massive data using MAE. Next, it is finetuned with unlabeled and labeled data separately.

• RocketQAv2 (Ren et al., 2021b) is also a BERT-style bi-encoder. It is distilled from a BERT-style cross-encoder via dynamic listwise distillation. This technique enables joint update of both the teacher and student. Particularly, in our experiments, we initialize both the student and teacher as a pretrained Chinese-roberta-large model⁹.

• SGPT (Muennighoff, 2022) exploits both bi- and cross-encoder architectures to enhance GPT-style LLMs for semantic search. Here, we only utilize the bi-encoder variant due to its efficiency. It incorporates BitFit finetuning of the LLM and position-weighted mean pooling to generate an overall embedding for a query or a passage.

• Udever (Zhang et al., 2023) also aims to modify GPT-style LLMs for text embedding. Udever introduces a special token appended to the end of the sentences and trains this token to summarize the sentences via sentence-level contrastive learning.

• LLaRA (Li et al., 2023a) is similar to Udever, but employs token-level contrastive learning to further refine sentence embeddings. In particular, for Udever and LLaRA, we set the rank of the low-rank adapters (LoRA) to 40, in order to guarantee that the number of trainable parameters in these two methods is the same as that in D2LLM.

⁹https://huggingface.co/hfl/chinese-roberta-wwm-ext-large

D Teacher Model Finetuning

To finetune the teacher model for a specific task, we incorporate all positive sentence pairs and select an equal number of hard negatives (cf. Section 3.2.1), maintaining a balanced ratio of 1:1 between positives and negatives. Depending on the task at hand, we utilize prompts suited for either symmetric or asymmetric searches to structure the finetuning dataset, composing inputs of prompted sentence pairs and outputs of binary responses "yes" or "no", as sketched below. We typically opt for LoRA finetuning, where we set the rank within LoRA to 32. This specific rank setting is chosen to align the number of learnable parameters with those used in other comparable methods.
gate. It strikes a balance between accuracy and
• SGPT (Muennighoff, 2022) exploits both bi and efficiency, making it suitable for practical appli-
cross-encoder architectures to enhance GPT-style cations. Nonetheless, it’s worth noting that BGE
LLMs for semantic search. Here, we only utilize achieves the highest performance, a likely result of
the bi-encoder variant due to its efficiency. It its extensive training on data pertinent to IR.
incorporates BitFit finetuning of the LLM and
position-weighted mean pooling to generate an F Ablation Studies
overall embedding for a query or a passage.
In this section, we conduct a series of ablation stud-
9
https://huggingface.co/hfl/chinese-roberta-wwm-ext- ies to further show the competency of the proposed
large D2LLM.
Table 4: Results for IR, with the best-performing method and the second-best results marked in bold and underlined,
respectively.
Dataset | T2Retrieval | DuRetrieval | mMARCORetrieval | cMedQARetrieval | CovidRetrieval | MedicalRetrieval
Method | #Data | #Param. | MRR Recall | MRR Recall | MRR Recall | MRR Recall | MRR Recall | MRR Recall
BGE-ft(326M) 0.9M 326M 0.8739 0.7659 0.9125 0.8546 0.6908 0.8363 0.2208 0.2638 0.5828 0.7363 0.4561 0.5490
RocketQAv2(326M) 0.9M 326M 0.6670 0.5579 0.6502 0.5079 0.5012 0.6920 0.1995 0.2321 0.4032 0.5822 0.2979 0.3851
SGPT(7B) 0.9M 0.4M 0.7408 0.6026 0.7762 0.6805 0.5516 0.7307 0.2454 0.3249 0.4411 0.6109 0.3266 0.4121
Udever(7B) 0.9M 89M 0.8653 0.7358 0.8905 0.8213 0.6327 0.7898 0.3145 0.3903 0.5124 0.6742 0.4346 0.5210
LLaRA(7B) 0.9M 89M 0.8412 0.7362 0.8777 0.8083 0.6511 0.8003 0.3221 0.3886 0.5250 0.6793 0.4612 0.5632
D2LLM(7B) 0.9M 89M 0.8893 0.7719 0.9162 0.8608 0.6723 0.8221 0.3501 0.4028 0.5639 0.7121 0.4991 0.6021
Improv.(%) N/A N/A 1.76% 0.78% 0.41% 0.73% -2.68% -1.70% 8.69% 3.20% -3.24% -3.29% 8.22% 6.91%
BGE(326M) 100M 326M 0.9094 0.8084 0.9345 0.8851 0.7583 0.8934 0.4349 0.4452 0.6587 0.8246 0.5504 0.6660
LLM-be(7B) N/A N/A 0.1843 0.1098 0.3331 0.2178 0.0541 0.1011 0.0456 0.0547 0.2630 0.4131 0.0462 0.076
LLM-ce(7B) N/A N/A - - - - - - - - - - - -
F.1 Impact of Losses and Modules

Our analysis focuses on the efficacy of various losses and modules detailed in Section 3. Below are the specific modifications we tested, with the corresponding results shown in Table 3:

• −CI+CL: We replaced the proposed CI loss in Section 3.2.1 with the standard contrastive loss. The CI loss amounts to the standard contrastive loss by setting the score $s_{ij}^{\mathcal{T}}$ given by the teacher to 1. As expected, the resulting performance deteriorates by 3.59%, since the standard contrastive loss is sensitive to positives concealed within the hard negative set $D_H^-$.

• −RI_PH: Omitting the rank imitation loss for positives and hard negatives (Equation (7)) led to a 6.57% reduction in performance. This underscores the value of the student model mirroring the teacher in ranking these critical pairs.

• −RI_HI: The removal of the rank imitation loss for hard and easy negatives (Equation (8)) resulted in a 4.92% performance drop. This supports our initial argument (Section 3.2.2) that distinguishing these sample sets is key for robust student model training.

• −FI: Excluding the feature imitation loss incurred a 2.17% loss in performance, highlighting the role of feature distillation in transferring a broader spectrum of knowledge from the teacher to the student model.

• −PMA+mean: Replacing Pooling by Multi-head Attention (PMA) with mean pooling led to a 1.71% decrease in performance. This result emphasizes the superior flexibility of the learnable PMA compared to the static mean pooling.

• −PMA+[EOS]: Forgoing PMA in favor of using the [EOS] token as a sentence-wide embedding, and applying contrastive finetuning, caused a 1.51% performance downturn. The shift in training objective strays from the original purpose of the [EOS] token in the pretrained LLM, thus not fully capitalizing on the LLM's capabilities.

• −IEM+cos: Substituting the Interaction Emulation Module (IEM) with cosine similarity, akin to original bi-encoders, led to a 3.83% decline in performance. This change buttresses our assertion that the MLP is integral to modeling complex sentence relationships more effectively than cosine similarity alone.

• D2LLM-1.8B: Scaling down the teacher LLM from 7B to 1.8B exhibited a 14.76% decrease in performance. This suggests that the capacity of the teacher LLM is a key determinant in the effectiveness of the student D2LLM, with the findings indicating that larger teacher models engender more proficient student models.

Table 5: Performance on individual or mixed data types via IEM and cosine similarity.
Dataset | OCNLI | CMNLI | T2Retrieval
Method | ACC AP | ACC AP | MRR Recall
D2LLM-cos-sym 0.7461 0.7886 0.7867 0.8377 0.2791 0.2021
D2LLM-cos-asym 0.4913 0.4704 0.5003 0.5082 0.8072 0.7037
D2LLM-cos-mixed 0.7035 0.7524 0.7593 0.8024 0.7771 0.6802
D2LLM-sym 0.7905 0.8226 0.8084 0.8839 0.2622 0.1650
D2LLM-asym 0.5138 0.4893 0.5144 0.5003 0.8346 0.7218
D2LLM-dual 0.7834 0.8017 0.7825 0.8778 0.8321 0.7059

F.2 Impact of Single and Dual Branches in the IEM

We further investigated whether it is essential and beneficial to include two branches in the IEM. For this purpose, we selected two representative datasets from those listed in Appendix A: SNLI-zh for symmetric semantic relationships and T2Ranking for asymmetric semantic relationships. These datasets were employed to train a dual-branch model, designated as D2LLM-dual. For comparative analysis, we trained two separate models, D2LLM-sym and D2LLM-asym, each specializing in one of the two dataset types.
Figure 3: Effect of the hyperparameters α, β, and γ on OCNLI and CMNLI (y-axis: Accuracy; x-axis: value of each hyperparameter).

Figure 4: Effect of the hyperparameters α, β, and γ on T2Retrieval (y-axis: Recall; x-axis: value of each hyperparameter).
We assessed the models' capabilities on test datasets that were aligned with either symmetric relations (OCNLI and CMNLI) or asymmetric relations (T2Retrieval). The results are summarized in Table 5.

It can be observed that D2LLM-dual achieves performance on par with D2LLM-sym for symmetric relationship tasks and with D2LLM-asym for asymmetric tasks. In stark contrast, D2LLM-sym experiences a notable performance drop in asymmetric contexts, while D2LLM-asym struggles similarly with symmetric tasks. These findings highlight the efficacy of the dual-branch IEM in D2LLM-dual, which is adept at tackling both symmetric and asymmetric semantic challenges. Thus, D2LLM-dual emerges as a versatile and unified model, proficiently capturing the nuanced interactions between sentences regardless of the semantic relationship type.

F.3 Weight Sensitivity Analysis

We also investigate the impact of the three hyperparameters $\alpha$, $\beta$, and $\gamma$, which represent the balance weights of $\mathcal{L}_{PH}^{RI}$, $\mathcal{L}_{HI}^{RI}$, and $\mathcal{L}^{FI}$, respectively. In practice, to determine $(\alpha, \beta, \gamma)$, a grid search is obviously impractical. Instead, we opt for a sequential method to choose the weight parameters, as sketched below. Specifically, we first obtain a locally optimal value for $\alpha$ focusing solely on the losses $\mathcal{L}^{CI}$ and $\mathcal{L}_{PH}^{RI}$. Next, by slightly expanding the tentative range of $\alpha$, we further incorporate the loss component $\mathcal{L}_{HI}^{RI}$ and adjust $\beta$ to identify an optimal pair $(\alpha, \beta)$. We continue in a similar fashion to identify the optimal $(\alpha, \beta, \gamma)$ within the vicinity of the previously determined $(\alpha, \beta)$. Although this approach may not guarantee globally optimal parameters, it substantially conserves resources and has proven to yield satisfactory performance. We conduct experiments based on the NLI and IR tasks and report the performance for OCNLI, CMNLI, and T2Retrieval.

We first depict ACC and Recall as functions of $\alpha$ in Figure 3 and Figure 4. It can be observed that a low $\alpha$ value correlates with suboptimal performance, indicating that the student model is inadequately leveraging the teacher's ranking capacity. As $\alpha$ is incremented to 1, we observe a progressive enhancement in performance. However, surpassing this threshold results in a decline, which suggests an optimal value at $\alpha = 1$.

Regarding $\beta$, this parameter determines the attention on the loss term $\mathcal{L}_{HI}^{RI}$. The empirical results, depicted in Figure 3 and Figure 4, reveal that a $\beta$ value of 0.3 yields the most significant augmentation in model performance.

Finally, we change the value of $\gamma$ to discern the importance of feature imitation. Figure 3 and Figure 4 indicate that higher $\gamma$ leads to enhanced model performance up to a certain point. Nevertheless, an excessive prominence on it disrupts the equilibrium among the other loss functions, thereby impairing overall performance.

Based on the above analysis, we note that the optimal hyperparameter combination remains similar across different tasks, suggesting that D2LLM generally does not require laborious task-specific hyperparameter adjustments, thereby aiding in resource conservation. Experimental results for both the NLI and IR tasks reveal that D2LLM's performance is relatively stable across a certain parameter range for the weight parameters $(\alpha, \beta, \gamma)$. We set $\alpha = 1$, $\beta = 0.3$, and $\gamma = 0.1$ in all our experiments, unless otherwise specified.
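The sequential tuning strategy can be written as a simple coordinate-wise search; the candidate grids below mirror the ranges shown in Figures 3 and 4, but the exact grids, the neighborhood-expansion rule, and the `evaluate` callback are assumptions rather than the authors' exact procedure.

```python
def sequential_weight_search(evaluate,
                             alpha_grid=(0.4, 0.6, 0.8, 1.0, 1.2, 1.4),
                             beta_grid=(0.1, 0.2, 0.3, 0.4, 0.5),
                             gamma_grid=(0.01, 0.05, 0.1, 0.5, 1.0)):
    """evaluate(alpha, beta, gamma) -> validation score (higher is better)."""
    # Step 1: tune alpha with only L^CI + alpha * L^RI_PH active.
    alpha = max(alpha_grid, key=lambda a: evaluate(a, 0.0, 0.0))
    # Step 2: add L^RI_HI and tune beta around the chosen alpha.
    beta = max(beta_grid, key=lambda b: evaluate(alpha, b, 0.0))
    # Step 3: add L^FI and tune gamma in the vicinity of (alpha, beta).
    gamma = max(gamma_grid, key=lambda g: evaluate(alpha, beta, g))
    return alpha, beta, gamma
```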
G Discussion on IEM

Recall that the IEM can be expressed as:

$$y_{ij}^{\mathcal{S}} = f_2\big(f_1([y_i^{\mathrm{agg}}, y_j^{\mathrm{agg}}])\big). \qquad (12)$$

In other words, the IEM operates by accepting concatenated representations of the query and passage and determining their relevancy. Fundamentally, through $f_1$, it tries to emulate the cross-attention mechanism present in the teacher model, fostering an information exchange between the passage and query. Depending on whether the relationship between $[y_i^{\mathrm{agg}}, y_j^{\mathrm{agg}}]$ is symmetric or asymmetric, the IEM directs the input to the relevant branch of MLP $f_2 \in \{f_2^{\mathrm{sym}}, f_2^{\mathrm{asym}}\}$, which accordingly outputs a "yes" or "no" representation. This differentiation allows the IEM to adeptly handle both types of semantic search, mirroring the diverse prompts $P \in \{P_{\mathrm{sym}}, P_{\mathrm{asym}}\}$ utilized by the teacher model during inference.

To evaluate the IEM's performance against traditional cosine similarity measures, we engaged in a comprehensive experimental suite, the details of which are provided in Appendix F.2. Within this analysis, we focused on the SNLI-zh and T2Ranking datasets to represent symmetric and asymmetric semantic relationships, respectively. Models trained on these datasets include a dual-branch version (D2LLM-dual) and two specialized models (D2LLM-sym and D2LLM-asym). Additionally, we introduced cosine similarity into this comparative study, resulting in models denoted as D2LLM-cos-mixed, D2LLM-cos-sym, and D2LLM-cos-asym. The results in Table 5 demonstrate that D2LLM-dual exhibits superior capability in handling both symmetric and asymmetric search tasks compared to the D2LLM-cos-mixed model. This can be attributed to the dual-branch structure of the IEM, which allows each semantic relationship type to be learned independently during training, thereby minimizing cross-interference. Conversely, cosine similarity, as implicitly mentioned in (Vulić and Mrkšić, 2018) and (Muennighoff, 2022), struggles to distinguish between these relationship types accurately. Moreover, the enhanced performance of D2LLM-sym and D2LLM-asym, compared to their cosine-based counterparts, further substantiates the proficiency of the learnable MLP components $f_1$ and $f_2$ within the IEM at navigating semantic nuances more adeptly than cosine similarity.

Nevertheless, it is essential to acknowledge the limitations of the IEM in comparison with cross-encoders (i.e., the teacher model LLM-ce or LLM-ce-ft). As shown in Tables 1, 2 and 4, the IEM does not yet match the performance of cross-encoder models, which is primarily due to its mechanism of facilitating information exchange only after the separate derivation of query and passage embeddings. This approach overlooks the nuanced relationship modeling achieved through the continuous cross-attention layers in the cross-encoders.
H Efficiency Enhancement in Real-world Applications

In real-world applications, especially with larger datasets, bi-encoders, despite their capability to pre-compute vectors of passages, may face significant expenses when performing full-scale vector retrieval. To ensure efficient search, various technologies (Jegou et al., 2010; Ge et al., 2013) have been put forward, of which a representative solution is quantization-based Approximate Nearest Neighbor (ANN) methods. These methods compress the vector space through quantization to enhance retrieval efficiency and reduce storage demands while only marginally sacrificing accuracy, thereby markedly accelerating search speeds. These techniques are equally applicable to D2LLM. By applying quantization techniques to both the embeddings produced by the PMA and the network parameters of the IEM (which functions as the distance metric), we can slightly compromise accuracy to significantly enhance search speed. This improvement in efficiency can, in turn, contribute to better ranking performance in subsequent procedures.
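As one concrete instance of quantization-based ANN, the sketch below builds an IVF-PQ index over pre-computed PMA passage embeddings with Faiss and retrieves candidates that the IEM could then re-score. Faiss is a standard library, but its use here, along with the chosen dimensions and parameters, is our illustration rather than part of D2LLM.

```python
import faiss  # pip install faiss-cpu

def build_pq_index(passage_embs, nlist=1024, m=64, nbits=8):
    """passage_embs: (N, d) float32 NumPy array of PMA embeddings; m must divide d."""
    d = passage_embs.shape[1]
    quantizer = faiss.IndexFlatL2(d)                    # coarse quantizer
    index = faiss.IndexIVFPQ(quantizer, d, nlist, m, nbits)
    index.train(passage_embs)
    index.add(passage_embs)
    return index

def retrieve(index, query_embs, top_k=100, nprobe=16):
    """query_embs: (Q, d) float32 NumPy array; returns (Q, top_k) candidate passage ids."""
    index.nprobe = nprobe                               # number of inverted lists to scan
    _, ids = index.search(query_embs, top_k)
    return ids  # candidates can then be re-scored by the IEM for the final ranking
```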
I Bad Case Study

For the NLI task, we present two bad cases, one representing an entailment pair, and the other a contradiction pair. In addition to the labels, we have provided the student model's prediction and the teacher logits. The specific results are as follows:
NLI bad case #1:
Sentence A: There is a tennis court in this city, and
they are the first batch of customers.
Sentence B: There are customers who came to the
tennis court before them.
Student’s Prediction: 0.5959
Teacher’s logits: 0.9141
Label: 0 (Contradiction)
NLI bad case #2:
Sentence A: The operator said that the two parties
were bargaining.
Sentence B: There were at least three people.
Student’s Prediction: 0.3541
Teacher’s logits: 0.4688
Label: 1 (Entailment)
Similarly for the IR task, we provide two bad
cases, both of which are positive examples. We
present the student’s rankings of these cases and
the teacher logits. The specific results are as fol-
lows:
IR bad case #1:
Query: What are the four major artifacts of Asia?
Passage: One of the artifacts: motorcycle. Motor-
cycle can be said to be a status symbol in India.
Student’s Rank: 32
Teacher’s logits: 0.081
Label: 1 (Correct match)
IR bad case #2:
Query: Why does my phone always show that there
are messages that cannot be opened?
Passage: If you have a smart phone, you can try
flashing it.
Student’s Rank: 46
Teacher’s logits: 0.135
Label: 1 (Correct match)
We suggest that these "bad cases" may stem from
the teacher model’s initial misjudgment, which in-
advertently hinders the student model’s learning
process. To address this, we propose two potential
strategies to refine the teacher model’s predictions:
First, for scenarios with limited resources, applying
In-Context Learning (Min et al., 2022) by includ-
ing examples of these errors during training could
enhance the teacher’s ability to provide more ac-
curate logits. Second, with sufficient resources,
fine-tuning the teacher model can improve its per-
formance while being mindful to avoid catastrophic
forgetting (Kirkpatrick et al., 2017) to ensure it re-
tains its generalized learning capabilities. Both strategies signify promising areas for our future explorations.