
D2LLM: Decomposed and Distilled Large Language Models for Semantic Search

Zihan Liao* (East China Normal University), [email protected]
Hang Yu* (Ant Group), [email protected]
Jianguo Li† (Ant Group), [email protected]
Jun Wang† (East China Normal University), [email protected]
Wei Zhang† (East China Normal University), [email protected]

* Equal contribution. This work was done when Zihan Liao was a research intern at Ant Group.
† Corresponding authors.

Abstract

The key challenge in semantic search is to create models that are both accurate and efficient in pinpointing relevant sentences for queries. While BERT-style bi-encoders excel in efficiency with pre-computed embeddings, they often miss subtle nuances in search tasks. Conversely, GPT-style LLMs with cross-encoder designs capture these nuances but are computationally intensive, hindering real-time applications. In this paper, we present D2LLMs—Decomposed and Distilled LLMs for semantic search—which combine the best of both worlds. We decompose a cross-encoder into an efficient bi-encoder integrated with Pooling by Multihead Attention and an Interaction Emulation Module, achieving nuanced understanding and pre-computability. Knowledge from the LLM is distilled into this model using contrastive, rank, and feature imitation techniques. Our experiments show that D2LLM surpasses five leading baselines in terms of all metrics across three tasks, particularly improving NLI task performance by at least 6.45%. The source code is available at https://github.com/codefuse-ai/D2LLM.

1 Introduction

Semantic search has become an integral part of natural language processing, tasked with sifting through extensive texts to find passages that best match a user's query based on underlying semantic links. It transcends non-semantic techniques, such as TF-IDF and BM25, by resolving lexical mismatches and enabling more precise text matching. As a result, semantic search has significant impacts across various fields, including information retrieval (Zhu et al., 2023), question answering (Allam and Haggag, 2012), dialogue systems (Chen et al., 2017), item recommendation (Hu et al., 2020), fact checking (Thorne et al., 2018), and retrieval-augmented generation (Gao et al., 2024).

The major challenge of semantic search lies in devising a model that is both accurate and efficient in pinpointing the most relevant passages for any given query. The current go-to models, particularly the compact BERT-style bi-encoders or dual encoders (Wang et al., 2022; Xiao et al., 2023; Li et al., 2023b), independently convert queries and passages into vectors and judge their relevance through measures such as cosine similarity. This process is praised for efficiency—enabling pre-computation and on-the-fly querying of passage vectors. However, this streamlined method comes at an accuracy cost. Within the rigidity of the bi-encoder's similarity space, subtle but critical nuances may be lost, such as when differentiating between symmetric search tasks (e.g., finding similar questions to "What are the symptoms of the flu?") and asymmetric search tasks (e.g., matching that same query to a comprehensive answer detailing symptoms). The bi-encoders' constrained interaction mode limits their comprehension of the distinct informational roles that queries and passages play. Additionally, bi-encoders are bound to a laborious and multi-stage training process, starting with pretraining on massive datasets of weakly-supervised text pairs and ending with finetuning on diverse and extensive labeled datasets (Wang et al., 2023). This process is heavily data-intensive and usually limited by the variety of data available. Moreover, the small size of bi-encoders often means they excel within their training domain but fall short when generalizing to new, unseen domains (Rosa et al., 2022a,b; Su et al., 2022).

On the flip side, GPT-style large language models (LLMs) with cross-encoder designs overcome these limitations by jointly processing queries and passages, thereby forming a single, interactive input. This method enables a granular understanding of textual relationships, as it involves concatenating
the query and passage, with directive prompts such as "Are these questions similar?" or "Does this passage answer the question?" to guide the model through both symmetric and asymmetric search tasks. Updated LLMs arrive pre-loaded with a broad spectrum of world knowledge (Hu and Shu, 2023), eliminating the need for domain-specific pretraining and facilitating rapid adaptation. The remarkable zero-shot learning ability (Wei et al., 2021; Kojima et al., 2022) ensures their robust performance even for novel domains. However, this accurate analysis incurs a toll on computational efficiency; it precludes the caching of passage vectors and necessitates fresh inference for each new query-passage pairing, which can hinder the practicality of cross-encoder LLMs in situations demanding real-time, voluminous processing.

In this paper, we seek to bring the best of both worlds together with the introduction of D2LLMs, which stands for Decomposed and Distilled LLMs for semantic search. Our proposed framework begins by decomposing an LLM-based cross-encoder into a bi-encoder coupled with an Interaction Emulation Module (IEM). The bi-encoder, equipped with Pooling by Multihead Attention (PMA) of token embeddings resulting from a pretrained LLM, efficiently generates separate embeddings for queries and passages, allowing passage vectors to be pre-stored while ensuring the model's adaptability. The IEM goes further, intricately mapping the relationships between queries and passages. It features dedicated branches for handling symmetric and asymmetric search tasks. We then distill the high-level knowledge from the original LLM-based cross-encoder (the teacher) into our decomposed model (the student) through a series of teacher-guided methodologies, including contrastive imitation, rank imitation, and feature imitation. Our contributions can be summarized as:

• We introduce D2LLM, a new semantic search solution that combines the speed of bi-encoders with the accuracy of LLM-based cross-encoders. This method breaks down the complex cross-encoder into a more manageable student model comprising a bi-encoder, a PMA, and an IEM.
• We transfer the teacher's linguistic expertise to the student through a comprehensive knowledge distillation strategy, encompassing contrastive imitation, rank imitation, and feature imitation techniques.
• Our empirical results reveal that D2LLM outperforms five leading methods in three benchmark tasks, with a particularly notable 14.39% improvement over the second-best LLaRA and a 6.45% lead over the heavily finetuned benchmark BGE model in the NLI task.

2 Related Works

Classical Models  Classical semantic search techniques leveraging more compact language models can generally be categorized into bi-encoders, cross-encoders, and bi-encoders distilled from cross-encoders. The first approach, bi-encoders, often finetunes BERT-style models through contrastive learning, with a focus on negative sample mining, as exemplified by SBERT (Reimers and Gurevych, 2019), ANCE (Xiong et al., 2020), DPR (Karpukhin et al., 2020), SimCSE (Gao et al., 2021), ME-BERT (Luan et al., 2021), RocketQA (Qu et al., 2021), ADORE (Zhan et al., 2021), and DiffCSE (Chuang et al., 2022). However, finetuning alone does not always ensure adaptability and generalization, leading to the integration of self or weakly supervised pretraining as seen in ICT (Lee et al., 2019b), Condenser (Gao and Callan, 2021), Cocondenser (Gao and Callan, 2022), Contriever (Izacard et al., 2022), OpenAI Embeddings (Neelakantan et al., 2022), E5 (Wang et al., 2022), GTE (Li et al., 2023b), and BGE (Xiao et al., 2023). Despite this, these models still face challenges in capturing complex query-passage relationships. While multiview embeddings have been suggested (Zhang et al., 2022b), cross-encoders (Rosa et al., 2022a,b)—our second category—address this more effectively but are not well-suited for real-time use due to the computational cost of recomputing passage representations. The third category, most related to our proposed D2LLM, aims to distill the effectiveness of cross-encoders into bi-encoders. Previous work in this area has primarily used straightforward distillation strategies, such as pseudo-labeling (Qu et al., 2021; Ren et al., 2021a; Izacard and Grave, 2021) or KL divergence loss (Yang and Seo, 2020), to align the student model with the teacher model. More advanced techniques like AR2 (Zhang et al., 2022a) and RocketQAv2 (Ren et al., 2021b) have attempted joint optimization of student and teacher models using adversarial training and dynamic distillation. However, these approaches face challenges when applied to LLMs: scalability issues due to the co-training of two models, loss functions
that may not capture the full breadth of knowledge in LLMs, and the failure to adequately model the detailed interactions between queries and passages. Our D2LLM framework overcomes these challenges by focusing on refining the student model through distillation with a static teacher model, which can be pre-finetuned for specific tasks. We enhance the distillation process by employing a combination of contrastive, rank, and feature imitation losses. Crucially, we integrate an Interaction Emulation Module into the student model to better understand and replicate the nuanced interplay between queries and passages, thus solving the problems of previous distillation approaches.

LLMs  Classical semantic search methods often suffer from limited size, leading them to excel in one specific task like text similarity, classification, or information retrieval, as with SimCSE (Gao et al., 2021) and Contriever (Izacard et al., 2022). Yet, larger models and improved pretraining techniques have been shown to significantly boost both the accuracy and applicability of PLM-based dense retrieval (Izacard et al., 2022; Wang et al., 2022; Xiao et al., 2023). This has led to the application of LLMs to semantic search, though this field remains underexplored. Current approaches to transforming LLMs into bi-encoders include finetuning-only methods like SGPT (Muennighoff, 2022), RepLLaMa (Ma et al., 2023), Udever (Zhang et al., 2023), and LLaRA (Li et al., 2023a), as well as those combining continued pretraining with finetuning, such as CPT (Neelakantan et al., 2022) and GTR (Ni et al., 2022). Typically, these methods introduce a special token at the end of the passage, trained with contrastive loss to emulate the role of BERT's [CLS] token. However, this adaptation deviates from the original training objective of LLMs, which is next-token prediction, potentially underutilizing LLMs' capabilities and resulting in suboptimal performance compared to smaller, specialized bidirectional encoders like GTE (Li et al., 2023b) and BGE (Xiao et al., 2023). The proposed D2LLM tackles this issue by decomposing the teacher LLM into a student bi-encoder combined with an Interaction Emulation Module that more faithfully mirrors the LLM's handling of query-passage interactions. Through a detailed distillation process, D2LLM effectively transfers knowledge from the teacher LLM, producing a student model that excels beyond the aforementioned models finetuned on the same datasets.

3 D2LLM

The architecture of D2LLM is depicted in Figure 1. Here, the teacher model operates as an LLM-based cross-encoder, adeptly handling joint query-passage inputs to leverage the LLM's robust semantic comprehension for more accurate outcomes. On the other hand, the student model features a bi-encoder for pre-computing passage vectors to ensure operational efficiency, complemented by an Interaction Emulation Module that considers the complex query-passage interplay. Through an elaborate knowledge distillation procedure, we aim to cultivate a student model that emulates the teacher's accuracy while retaining efficiency. Subsequently, we detail the process of decomposing the teacher model to assemble the student model and elucidate the knowledge distillation methodology.

3.1 Decomposition

Before delving into the construction of the student model, we first introduce how an LLM plays the role of the teacher.

3.1.1 Teacher Model T

The teacher model aims to accurately determine whether a query X_i and a passage X_j are a compatible match. We utilize a cross-encoder architecture for this purpose since it leverages the LLM's strength in making informed decisions by synthesizing different parts of the combined inputs. To tap into the LLM's zero-shot learning capabilities, we employ prompt engineering, crafting specific prompts P that guide the LLM in analyzing the query-passage pairs. As indicated in the upper left panel of Figure 1, we design prompts P for symmetric and asymmetric searches (P_sym and P_asym). The chosen prompt P ∈ {P_sym, P_asym} is then concatenated with the query-passage pair (X_i, X_j), prompting the LLM to generate a "yes" or "no" response. This process is represented in Figure 1(a), where the LLM generates an embedding for the response (i.e., the dark purple square):

y_ij^T = LLM(X_i, X_j, P).    (1)

The embedding y_ij^T functions as a classification token that labels the query-passage pair (X_i, X_j) as related or not, with the prompt P aiding in adapting to different search task types (symmetric versus asymmetric). To determine the probability of "yes" or "no", we extract the corresponding rows from the weight matrix W^T in the last layer of the original LLM, compute the logits z_ij^T, and finally apply a softmax function to calculate the probability of "yes" (i.e., score) s_ij^T, that is,

z_ij^T = W^T[“yes”, “no”] y_ij^T,    (2)
s_ij^T = softmax(z_ij^T).    (3)

In practice, it is more convenient to compute the logits for all tokens within the LLM and subsequently extract the probabilities s_ij^T for "yes" and "no" by inputting the relevant logits into the softmax function. Beyond prompt engineering, the teacher model can also be finetuned for improved performance.
Prompt for Symmetric Search: "#Q and #P will each describe an event or problem, they may not be related. Based solely on these descriptions and your understanding of the world, judge whether #P is absolutely correct about the event in #Q, or whether #Q is absolutely correct about the event or problem in #P, please answer yes or no."

Prompt for Asymmetric Search: "#Q will describe a query, #P will describe a passage, they may not be related. Based solely on these descriptions and your understanding of the world, judge whether #P can correctly answer the question asked in #Q. Please answer yes or no."

(a) Teacher Model    (b) Student Model

Figure 1: Architecture of D2LLM: The teacher model is decomposed into three segments corresponding to the input of the 1 query, 2 passage, and 3 prompt, represented in light red, light blue, and light purple. Its output, represented by a dark purple square, is the classification token embedding, which, after a linear layer, yields logits. The student model maintains the query and passage components (1 and 2) from the teacher but adds 3 the PMA and IEM to capture the interplay between the query and passage, as well as their combined interaction with the prompt. It also outputs classification token embeddings respectively for symmetric and asymmetric search and then derives logits via a linear layer.
3.1.2 Student Model S

As shown in Figure 1(a), the teacher model can be decomposed into three components processing the input of 1 the query, 2 the passage, and 3 the prompt, respectively. The third component goes beyond handling the prompt; it integrates and examines the interactions between the query and passage, as well as their collective interplay with the prompt, to ultimately determine their match. Mirroring the teacher, the student model retains the same structure for handling 1 the query and 2 the passage. Innovatively, for 3, it assimilates the prompt information and the query-passage-prompt interactions via the Pooling by Multihead Attention (PMA) and the Interaction Emulation Module (IEM).

PMA: The PMA module (Lee et al., 2019a) synthesizes information from tokens within the query X_i and the passage X_j, producing a distinct embedding vector for each. For a query of length L, represented as X_i = [x_i(1), . . . , x_i(L)] with x_i(k) being the k-th token, and corresponding hidden states Y_i = [y_i(1), . . . , y_i(L)] from the LLM's last layer, PMA aggregates token information as:

y_i^agg = PMA_q(Y_i) = LN(h + FFN(h)),    (4)
h = LN(MHA(q, Y_i, Y_i) + q),    (5)

where LN, FFN, and MHA(Q, K, V) respectively denote layer normalization, feedforward network, and multihead attention with query Q, key K, and value V. The PMA's query q is a learnable vector that functions as an anchor, extracting information from the L tokens based on their similarity to the query q for semantic search. This attention-based pooling offers greater flexibility than traditional mean/min/max pooling by allowing dynamic weight adjustments, as opposed to fixed weights for all instances of X_i and X_j.
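As an illustration of the PMA just described, here is a minimal PyTorch sketch of Eqs. (4)-(5). The hidden size and head count are placeholders (Appendix B reports 32 heads, and 4096 matches a 7B-scale backbone); the official implementation may differ in details.

```python
# Minimal PyTorch sketch of Pooling by Multihead Attention (Eqs. (4)-(5)).
import torch
import torch.nn as nn

class PMA(nn.Module):
    def __init__(self, dim=4096, num_heads=32):
        super().__init__()
        self.query = nn.Parameter(torch.randn(1, 1, dim))   # learnable anchor q
        self.mha = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))
        self.ln1 = nn.LayerNorm(dim)
        self.ln2 = nn.LayerNorm(dim)

    def forward(self, token_states, padding_mask=None):
        # token_states: (batch, L, dim) hidden states Y_i from the LLM's last layer.
        q = self.query.expand(token_states.size(0), -1, -1)
        attn_out, _ = self.mha(q, token_states, token_states,
                               key_padding_mask=padding_mask)
        h = self.ln1(attn_out + q)                        # Eq. (5)
        y_agg = self.ln2(h + self.ffn(h)).squeeze(1)      # Eq. (4): aggregated embedding
        return y_agg
```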
Note that, unlike BERT-style models which typically use a special token [CLS] for sentence embedding, GPT-style models lack an equivalent mechanism. Alternative methods for GPT-style models, like CPT (Neelakantan et al., 2022) and LLaRA (Li et al., 2023a), have tried using the last token or the [EOS] token as a substitute for the [CLS] token. However, such usage diverges from the LLMs' intended next-token prediction function, often to their detriment. Alternatively, SGPT (Muennighoff, 2022) introduces position-weighted pooling, but may impose undue constraints by presuming that token relevance is a function of their positions. PMA is favored as it does not conflict with LLMs' inherent nature, and its learnability ensures adaptability.

IEM: After the PMA generates individual vectors for the query and passage, the IEM implicitly encodes the prompt (be it symmetric or asymmetric) and captures the query-passage interaction. As depicted in Figure 1(b)'s right panel, we concatenate the query and passage embeddings and input them into a Multi-Layer Perceptron (MLP) designed with two branches to handle the respective prompt nuances, which can be expressed as:

y_ij^S = f_2(f_1([y_i^agg, y_j^agg])),    (6)

where both f_1 and f_2 are MLPs. f_1 extracts elementary features from the combined embeddings, while f_2 ∈ {f_2^sym, f_2^asym} is tailored to the two branches, processing symmetric and asymmetric searches. After deriving y_ij^S, the student model computes its logits z_ij^S and score s_ij^S in a manner akin to the teacher model (cf. Eqs. (2)-(3)), but with a learnable linear layer W^S. We underscore that the MLP operates as a flexible similarity metric, enhancing the description of the query-passage relationship beyond the commonly used cosine similarity in bi-encoders. To maintain efficiency, lightweight MLPs are utilized. Please refer to Appendix G for more discussion on the IEM.

3.2 Distillation

Knowledge distillation aims to impart the teacher model's capabilities to the student model. To accomplish this, we focus on three specific training objectives: contrastive, rank, and feature imitation.

3.2.1 Contrastive Imitation

For a given query X_i, we curate a set of positive samples D^+ (relevant passages) and negative samples D^− (irrelevant passages). The negative sample set includes in-batch negatives and hard negatives, the latter sourced from the top-k results using BM25 and bi-encoder evaluations (Ma et al., 2023; Li et al., 2023a,b), thus forming D^− = D_I^− ∪ D_H^−. Note that a few hard negatives may potentially be latent positives, but our Contrastive Imitation can address this circumstance robustly, which will be discussed later. The resulting Contrastive Imitation (CI) loss is:

L^CI = − (1/|D^+|) Σ_{j∈D^+} log [ exp(s_ij^T z_ij^S / τ) / Σ_{k∈D^−} exp((1 − s_ik^T) z_ik^S / τ) ],

where τ is the temperature parameter, s_ij^T is the teacher's probability score for a "yes" between pairs (i, j), and z_ij^S is the student's corresponding logit (unnormalized probability). The CI loss diverges from traditional contrastive loss by leveraging the teacher's scores s^T to account for varying relevance among samples, assigning higher weights to easy negatives than hard ones. Even if true positives incorrectly appear in D^−, the CI loss remains unaffected, ensuring a more robust training environment than standard contrastive loss. It also emphasizes positive samples with higher teacher scores, indicating their criticality.

3.2.2 Rank Imitation

While the Contrastive Imitation (CI) loss effectively handles true positives and easy negatives, it does not adequately address the gradations among samples. This is where Rank Imitation (RI) steps in, focusing on distinguishing between positive and hard negative samples, as well as discerning easy from hard negatives, thus enabling the student to replicate the teacher's subtle ranking nuances.

To synchronize the student's and the teacher's ranking of positive and hard negative samples, we aim to maximize the Pearson correlation (Huang et al., 2022) between their logits. The RI loss dedicated to this alignment is:

L^RI_PH = 1 − corr(z_i^T, z_i^S),    (7)

where z_i^T = [z_ij^T] for j ∈ D^+ ∪ D_H^− signifies a vector of the teacher's logits for the combined set of positive and hard negative samples, and likewise for z_i^S. We intentionally exclude in-batch negatives from this measure as they are generally easy negatives and lack the comparative relevance needed for meaningful ranking against the query X_i.

On the other hand, differentiating between hard and easy negatives is critical since hard negatives have some connection to the query, unlike easy negatives. To emphasize this, we introduce an additional RI loss for these two groups of samples:

L^RI_HI = − (1 / (|D_H^−||D_I^−|)) Σ_{j∈D_H^−} Σ_{k∈D_I^−} λ_jk log(σ(z_ij^S − z_ik^S)),    (8)

where λ_jk is the rank comparison metric between a hard negative j and an in-batch negative k as determined by the teacher. The metric utilized is the normalized discounted cumulative gain (NDCG) (Järvelin and Kekäläinen, 2002). It gives a non-zero λ_jk only when z_ij^T − z_ik^T > 0 in the teacher model. The design of this loss ensures that only when a sample is deemed an easy negative by the teacher does the student assign it a lower score than it would to a hard negative (i.e., z_ij^S − z_ik^S > 0). This approach allows the student to effectively differentiate between easy and hard negatives under the teacher's guidance, even when hard negatives are interspersed with in-batch negatives.
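The sketch below illustrates how the CI and RI losses above could be computed in PyTorch for a single query. The function signatures, tensor shapes, and the default temperature are illustrative assumptions rather than the released training code; the NDCG-based weights λ_jk are assumed to be supplied as a precomputed matrix derived from the teacher's logits.

```python
# Illustrative PyTorch sketch of the Contrastive Imitation (CI) and Rank
# Imitation (RI) losses; shapes and defaults are simplified assumptions.
import torch
import torch.nn.functional as F

def contrastive_imitation(z_s_pos, s_t_pos, z_s_neg, s_t_neg, tau=0.05):
    # z_s_pos, s_t_pos: (P,) student logits / teacher "yes" scores for positives.
    # z_s_neg, s_t_neg: (N,) the same quantities for all negatives (in-batch + hard).
    pos = s_t_pos * z_s_pos / tau                               # teacher-weighted positives
    neg = torch.logsumexp((1.0 - s_t_neg) * z_s_neg / tau, 0)   # shared denominator
    return -(pos - neg).mean()

def rank_imitation_ph(z_t, z_s):
    # Eq. (7): 1 - Pearson correlation between teacher and student logits over
    # the positives and hard negatives of one query.
    return 1.0 - torch.corrcoef(torch.stack([z_t, z_s]))[0, 1]

def rank_imitation_hi(z_s_hard, z_s_inbatch, lam):
    # Eq. (8): lam[j, k] is the teacher-derived (e.g., NDCG-based) weight, zero
    # unless the teacher ranks hard negative j above in-batch negative k.
    diff = z_s_hard.unsqueeze(1) - z_s_inbatch.unsqueeze(0)     # (H, I) pairwise margins
    return -(lam * F.logsigmoid(diff)).mean()
```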
3.2.3 Feature Imitation

CI and RI aim to align the student's output with that of the teacher, emphasizing output distillation. Working in tandem, feature distillation can also provide substantial benefits by leveraging the rich information encompassed in the teacher LLM. Directly aligning the embeddings of the classification token between the teacher and student models (i.e., y_ij^T and y_ij^S) presents challenges due to the distinct architectures of the third component (see 3 in Figure 1). However, the relative relationships between embeddings for different query-passage pairs are less susceptible to such architecture variations (Liu et al., 2019). For instance, given two positive samples X_j and X_k as well as a negative sample X_m for the same query X_i, y_ij^T is often closer to y_ik^T than to y_im^T in the feature space of the teacher, in order to produce a higher score for the pairs (i, j) and (i, k) than for (i, m), and likewise for the student. To leverage this robustness, Feature Imitation (FI) first computes a similarity matrix for all query-passage pairs within a batch for the teacher as:

r_ijk^T = sim(y_ij^T, y_ik^T), ∀ j, k ∈ D^+ ∪ D_H^−,    (9)

where sim denotes cosine similarity, and the process is repeated for the student to obtain r_ijk^S. Note that the above similarity metric is evaluated between y_ij^T and y_ik^T, not the passage embeddings y_j^agg and y_k^agg alone. The goal of FI is to minimize the ℓ2 norm of the difference between the teacher's and student's similarity matrices for all positive and hard negative sample combinations (i.e., r_i^T = [r_ijk^T] and r_i^S = [r_ijk^S] for all j and k):

L^FI = ‖r_i^T − r_i^S‖_2^2.    (10)

This approach guides the student to mimic the relational patterns in the teacher's representations, resulting in a deeper form of knowledge transfer.

3.2.4 Overall Loss

The collective loss is defined as a weighted sum of the above individual losses:

L = L^CI + α L^RI_PH + β L^RI_HI + γ L^FI,    (11)

with the weights α, β, and γ. This loss is used to train the student, including the PMA, the IEM, the linear layer W^S, and the bi-encoder (i.e., 1 and 2 in Figure 1(b)). Note that the bi-encoder is trained using parameter-efficient finetuning methods, such as LoRA (Hu et al., 2021) and QLoRA (Dettmers et al., 2023). These strategies can enhance performance without imposing a significant increase in the number of learnable parameters. Remarkably, the learnable parameters in the bi-encoder are less than 4% of the LLM's total parameter count.

4 Experiments

In this section, we evaluate D2LLM's effectiveness on three tasks: Natural Language Inference (NLI), Semantic Textual Similarity (STS), and Information Retrieval (IR), with a focus on Chinese. The former two are symmetric search tasks, and the last is asymmetric. Details on datasets and evaluation metrics are provided in Appendix A. For NLI and STS, we use 0.3 million training samples, and 0.9 million for IR. The proposed method (denoted as D2LLM¹) is compared with five state-of-the-art baselines: BGE (Xiao et al., 2023), RocketQAv2 (Ren et al., 2021a), SGPT (Muennighoff, 2022), Udever (Zhang et al., 2023), and LLaRA (Li et al., 2023a). BGE and RocketQAv2 are BERT-style bi-encoders; BGE uses pretraining and finetuning, while RocketQAv2 distills from a cross-encoder, updating both the student and teacher. The remaining three finetune GPT-style LLMs into bi-encoders. For more details, please refer to Appendix C. The total number of parameters for each method is shown in parentheses after it. To ensure a fair comparison, all baseline methods have a number of tunable parameters (denoted as #Param. in Tables 1-2) equal to, or greater than, those in D2LLM, with the exception of SGPT, which regards the use of BitFit finetuning (Zaken et al., 2022) as its notable advantage.

¹ The implementation details are presented in Appendix B.
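Before turning to the results, the following sketch illustrates how the Feature Imitation loss (Eqs. (9)-(10)) and the combined objective (Eq. (11)) could be computed for a single query. It is a simplified illustration under assumed tensor shapes, not the released training code; the default weights α=1, β=0.3, γ=0.1 follow the settings reported in Appendix B.

```python
# Sketch of the Feature Imitation loss (Eqs. (9)-(10)) and the overall
# objective (Eq. (11)) for one query; shapes are illustrative only.
import torch
import torch.nn.functional as F

def feature_imitation(y_t, y_s):
    # y_t, y_s: (M, d) classification-token embeddings y_ij of the teacher and
    # the student over the M candidates in D+ ∪ D_H^- for one query.
    r_t = F.cosine_similarity(y_t.unsqueeze(1), y_t.unsqueeze(0), dim=-1)  # teacher r_i
    r_s = F.cosine_similarity(y_s.unsqueeze(1), y_s.unsqueeze(0), dim=-1)  # student r_i
    return (r_t - r_s).pow(2).sum()

def overall_loss(l_ci, l_ri_ph, l_ri_hi, l_fi, alpha=1.0, beta=0.3, gamma=0.1):
    # Eq. (11); the default weights follow the settings reported in Appendix B.
    return l_ci + alpha * l_ri_ph + beta * l_ri_hi + gamma * l_fi
```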
Table 1: Results for NLI, with the best-performing method and the second-best results marked in bold and
underlined, respectively.
Method  #Data  #Param.  OCNLI-ACC  OCNLI-AP  OCNLI-Prec.  OCNLI-Recall  CMNLI-ACC  CMNLI-AP  CMNLI-Prec.  CMNLI-Recall
BGE-ft(326M) 0.3M 326M 0.5463 0.5689 0.5466 0.5702 0.6097 0.6656 0.6174 0.6278
RocketQAv2(326M) 0.3M 326M 0.5603 0.5633 0.5123 0.5723 0.6164 0.6779 0.5966 0.6905
SGPT(7B) 0.3M 0.4M 0.5977 0.6165 0.6029 0.5994 0.6598 0.7259 0.6643 0.6727
Udever(7B) 0.3M 89M 0.6412 0.6811 0.6478 0.6698 0.7234 0.7819 0.7077 0.7306
LLaRRA(7B) 0.3M 89M 0.6612 0.7115 0.6618 0.6889 0.7150 0.7815 0.7125 0.7381
D2LLM(7B) 0.3M 89M 0.7889 0.8145 0.7736 0.8149 0.8014 0.8759 0.7960 0.8241
Improv.(%) N/A N/A 19.31% 14.48% 17.30% 21.66% 10.78% 12.02% 11.72% 11.65%
BGE(326M) 100M 326M 0.7266 0.7646 0.7362 0.7191 0.7675 0.8580 0.7891 0.7381
LLM-be(7B) N/A N/A 0.5219 0.5083 0.5155 0.5955 0.5619 0.6175 0.5624 0.6762
LLM-ce(7B) N/A N/A 0.8776 0.9609 0.8409 0.9493 0.8347 0.9417 0.8263 0.9303

Table 2: Results for STS, with the best-performing method and the second-best results marked in bold and
underlined, respectively.
Method  #Data  #Param.  ATEC-Pear.  ATEC-Spear.  BQ-Pear.  BQ-Spear.  LCQMC-Pear.  LCQMC-Spear.  PAWSX-Pear.  PAWSX-Spear.  STSB-Pear.  STSB-Spear.  AFQMC-Pear.  AFQMC-Spear.  QBQTC-Pear.  QBQTC-Spear.
BGE-ft(326M) 0.3M 326M 0.3827 0.4126 0.5388 0.5982 0.5683 0.6531 0.2982 0.3127 0.6648 0.6717 0.3492 0.3774 0.1982 0.2049
RocketQAv2(326M) 0.3M 326M 0.1971 0.2362 0.3815 0.3962 0.5368 0.6089 0.1687 0.1558 0.5662 0.5894 0.1945 0.2381 0.2325 0.2180
SGPT(7B) 0.3M 0.4M 0.3045 0.3173 0.5135 0.5241 0.4715 0.4767 0.1842 0.1653 0.5973 0.5842 0.3033 0.3077 0.1717 0.1736
Udever(7B) 0.3M 89M 0.3328 0.3602 0.5389 0.5531 0.5369 0.5819 0.2041 0.2063 0.6509 0.6601 0.3177 0.3246 0.2088 0.2102
LLaRRA(7B) 0.3M 89M 0.343 0.3575 0.5233 0.5369 0.5698 0.5997 0.2113 0.2063 0.6910 0.7001 0.3046 0.3238 0.2127 0.2254
D2LLM(7B) 0.3M 89M 0.3731 0.3994 0.5487 0.5674 0.6210 0.6589 0.3038 0.2883 0.7273 0.7194 0.3676 0.3858 0.2749 0.2850
D2LLM-ft(7B) 0.3M 89M 0.4603 0.4759 0.5589 0.5705 0.6233 0.6475 0.2145 0.2573 0.7346 0.7729 0.3891 0.3966 0.2756 0.2933
Improv.(%) N/A N/A 20.27% 15.34% 3.71% -4.63% 9.39% 0.89% 1.88% -7.80% 6.31% 10.40% 11.43% 5.09% 18.54% 30.12%
BGE(326M) 100M 326M 0.4716 0.4785 0.6001 0.6224 0.6924 0.7249 0.3001 0.3584 0.7765 0.7763 0.4100 0.4253 0.2203 0.2424
LLM-be(7B) N/A N/A 0.2339 0.2178 0.3049 0.3007 0.4484 0.4507 0.1803 0.1676 0.5761 0.5767 0.1762 0.1837 0.1153 0.1147
LLM-ce(7B) N/A N/A 0.3670 0.4152 0.4432 0.4770 0.6224 0.7164 0.3125 0.4365 0.7453 0.7680 0.3643 0.3986 0.3355 0.3491
LLM-ce-ft(7B) 0.3M 89M 0.4816 0.4898 0.5868 0.5991 0.6205 0.7147 0.1978 0.4002 0.7873 0.8172 0.4284 0.4254 0.3414 0.3516

All methods, except SGPT, incorporate the hard negatives from the previous section, as SGPT's original training does not involve such negatives. It's worth noting that BGE relies on extensive data for pretraining and finetuning, while other methods only use the small dataset mentioned above for finetuning. As the pretrained BGE model was not accessible, we opted for Chinese-roberta-large-326M (Cui et al., 2021) and finetuned it using BGE's method with our dataset, referring to this version as BGE-ft. The original BGE is still retained as a reference. We also use Chinese-roberta-large as the base for RocketQAv2. For the other methods, we select Qwen-7B-Chat (Bai et al., 2023) as the base model. Additionally, the performance of the cross-encoder teacher model based on Qwen-7B-Chat (i.e., LLM-ce) is presented, as well as a bi-encoder (i.e., LLM-be) that generates sentence embeddings by mean-pooling token embeddings from the last layer of Qwen-7B-Chat, in order to gauge D2LLM's improvement over the untrained LLM-be and its proximity to the teacher LLM-ce's performance.

4.1 Results on NLI

We first investigate the performance of all methods for NLI. Table 1 shows that D2LLM outshines all competitors trained on the 0.3M sample set across all metrics and all testing datasets. Notably, it surpasses LLaRA, the second-best method, by a significant margin of 14.39%² on average and exceeds BGE, which was finetuned on 100M relevant samples, by 6.45%. Furthermore, D2LLM effectively narrows the gap between the intact bi-encoder LLMs (LLM-be) and cross-encoder LLMs (LLM-ce). While the original LLM-be falls short as a bi-encoder due to the mismatch between text generation and embedding, the cross-encoder-based teacher model LLM-ce excels by leveraging LLMs' ability to synthesize information from sentence pairs. Our distillation approach successfully transfers knowledge from the teacher to the student, transforming the initially ineffective LLM-be into a proficient NLI instrument. Additionally, SGPT, Udever, and LLaRA outperform the base LLM-be, highlighting that contrastive finetuning boosts LLM-be's capacity, albeit not as effectively as D2LLM's distillation. Among these, LLaRA and Udever benefit more from hard negatives in contrastive finetuning than the in-batch-negatives-based SGPT, with LLaRA's token-level contrastive learning proving more advantageous than Udever's sentence-level approach. Furthermore, a comparison between BGE and BGE-ft reveals that BGE's impressive performance is likely attributed to its extensive pretraining and finetuning datasets. Finally, the performance of RocketQAv2 falls short of expectations, likely due to having fewer hard negatives per sample in our experiments (i.e., 8) compared to the original setting (i.e., 32 or 127); the efficacy of listwise distillation in RocketQAv2 depends on the hard negatives. Despite its outstanding performance, D2LLM still exhibits poor performance on certain samples; please refer to Appendix I for details.

² We compute the relative gain in this paper, which is defined as the ratio of the difference between a new value and a reference value to the reference value itself.
Table 3: Ablation Study on the NLI Task.
Method  OCNLI-ACC  OCNLI-AP  OCNLI-Prec.  OCNLI-Recall  CMNLI-ACC  CMNLI-AP  CMNLI-Prec.  CMNLI-Recall  Average difference
−CI+CL 0.7572 0.7791 0.7411 0.7887 0.7755 0.8471 0.7692 0.8018 -3.59%
−RI_PH 0.7293 0.7658 0.7106 0.7589 0.7567 0.8129 0.7433 0.7885 -6.57%
−RI_HI 0.7375 0.7726 0.7390 0.7711 0.7721 0.8194 0.7609 0.7990 -4.92%
−FI 0.7666 0.8012 0.7518 0.8006 0.7939 0.8542 0.7771 0.8056 -2.17%
−PMA+mean 0.7734 0.7997 0.7586 0.8022 0.7954 0.8611 0.7823 0.8087 -1.71%
−PMA+[EOS] 0.7739 0.8023 0.7604 0.8025 0.7958 0.8642 0.7845 0.8107 -1.51%
−IEM+cos 0.7461 0.7886 0.7224 0.8025 0.7867 0.8377 0.7682 0.7921 -3.83%
D2LLM-1.8B 0.6907 0.7399 0.6769 0.6261 0.7102 0.7947 0.7107 0.5840 -14.76%
D2LLM 0.7889 0.8145 0.7736 0.8149 0.8014 0.8759 0.7960 0.8241 -
Figure 2: Runtime Analysis.

4.2 Results on STS and IR

Moving on to the STS task, Table 2 shows that D2LLM outperforms other methods trained on the same dataset in most cases (10 out of 14). However, BGE-ft occasionally exceeds D2LLM, and the original BGE maintains a consistent lead. Notably, even the teacher model, LLM-ce, falls behind BGE, shedding light on D2LLM's less-than-optimal results. However, it's important to point out that the teacher model LLM-ce was not specifically finetuned for STS. To address this, we finetune the teacher model with the same training set for the STS domain using LoRA (see Appendix D for more details), resulting in the variant LLM-ce-ft. The finetuning on merely 0.3M data yields an average improvement of 7.17% for the teacher, an effect shown in the last two rows of Table 2. Building upon LLM-ce-ft, we train the student model, denoted as D2LLM-ft, which shows an increase of 1.69% over the original D2LLM. Furthermore, D2LLM-ft now significantly outshines other methods trained on the same 0.3M sample set, by a margin of at least 17.42% on average. This confirms that, despite the initial underperformance on a task, the strong adaptability of LLMs means that finetuning with a relatively small dataset can substantially enhance both the teacher's and subsequently the student's performance. On the other hand, while BGE shows comparable or marginally better results, it does so at the expense of finetuning on a massive 100M data corpus specific to semantic search tasks, suggesting that adopting a smaller model like BGE (326M parameters) demands a large quantity of pertinent data. Regarding the IR task, its results are similar to those of the prior two tasks, and so we defer their detailed discussion to Appendix E. Similar to the NLI task, D2LLM struggles with some specific IR samples, as detailed in Appendix I.

4.3 Runtime Analysis

In this section, we evaluate the runtime of all methods, which can be further divided into the time for query vectorization, relevance scoring, and passage ranking. The results of all methods for the dataset T2Retriever are depicted in Figure 2. For bi-encoder-based methods, since passage embedding vectors are pre-computed and stored in a database, we exclude that preprocessing time and instead measure only the runtime for query vectorization and cosine similarity calculation. The proposed D2LLM follows a similar vectorization step, but computes similarity using a multi-layer perceptron (MLP). For the cross-encoder-based method LLM-ce, each passage must be concatenated with the query and processed individually through the model, resulting in a significantly higher runtime for relevance scoring compared to bi-encoders. D2LLM demonstrates only a marginally increased relevance-scoring time relative to bi-encoders, attributable to the more complex computations within the MLP compared to cosine similarity. In summary, D2LLM markedly improves upon the cross-encoder-based teacher model's efficiency while enhancing the accuracy of the bi-encoder benchmark, effectively balancing efficiency and accuracy.
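To make the runtime comparison concrete, the sketch below outlines how D2LLM-style scoring could be served: passage vectors are pre-computed once, and each query requires only one backbone forward pass plus a lightweight IEM evaluation against the stored vectors. The `encode`, `pma`, and `iem` callables are hypothetical components in the spirit of the earlier sketches, not the released implementation.

```python
# Illustrative retrieval-time flow: passage vectors are pre-computed once and
# each query needs only one backbone forward pass plus the lightweight IEM.
import torch

@torch.no_grad()
def precompute_passage_vectors(passages, encode, pma):
    # encode(text) -> (L, d) last-layer token states of the student's backbone.
    return torch.stack([pma(encode(p).unsqueeze(0)).squeeze(0) for p in passages])

@torch.no_grad()
def rank_passages(query, passage_vecs, encode, pma, iem, top_k=10):
    q_vec = pma(encode(query).unsqueeze(0)).squeeze(0)          # one query forward pass
    pairs = torch.cat([q_vec.expand(passage_vecs.size(0), -1),  # [y_q^agg, y_p^agg]
                       passage_vecs], dim=-1)
    scores = iem(pairs).squeeze(-1)                             # scores from the IEM head
    return torch.topk(scores, k=min(top_k, passage_vecs.size(0)))
```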
4.4 Ablation Studies

Due to the space limitation, we only briefly overview our ablation studies here. For a detailed account, please see Appendix F.1. We first explore the individual contributions of different losses and modules to the performance of D2LLM. Insights gleaned from Table 3 reveal that: 1) The integration of contrastive, rank, and feature imitation processes is critical to D2LLM's success; these processes effectively distill diverse facets of knowledge from the teacher model. 2) PMA proves indispensable and cannot be aptly substituted with mean pooling or the use of the [EOS] token within the LLM; these alternatives are either overly simplistic or misaligned with the inherent design of the LLM. 3) The IEM is necessary for capturing the intricate dynamics of query-passage relationships, which cosine similarity alone fails to encapsulate adequately. 4) The size of the teacher LLM positively influences the performance of the student D2LLM, with larger teachers leading to more capable students. Furthermore, we find in Appendix F.2 that the IEM with dual branches can handle both symmetric and asymmetric semantic tasks, outperforming single-branch variants that struggle outside their specialized domains. Finally, we perform a sensitivity analysis of the weights in the overall loss (11) in Appendix F.3, and use the selected values in all our experiments.

5 Conclusion

This paper presents D2LLM, an innovative model distilling knowledge from an LLM teacher to construct an efficient student for semantic search. With a deep and nuanced understanding of its teacher, D2LLM utilizes specially designed modules and losses to encapsulate the teacher's prowess in a more compact form. Our experimental findings reveal that D2LLM successfully synthesizes the high accuracy associated with cross-encoders and the operational efficiency of bi-encoders.

6 Ethical Considerations

Our research is of a fundamental nature and is not anticipated to have significant social implications. We have utilized exclusively open-source datasets, which ensures transparency and adherence to ethical standards in data usage. Moreover, the accessibility of these datasets facilitates replicability and scrutiny by the broader research community, aligning with ethical research practices. However, we acknowledge the responsibility that accompanies the development of any semantic search technology, given its potential influence on information access and dissemination. We encourage ongoing dialogue and ethical considerations in the application of our research findings.

7 Limitations

While D2LLM presents a significant advancement in semantic search, it is not without limitations. Firstly, the IEM offers greater flexibility than cosine similarity in capturing semantic nuances, but it may still omit some intricate details grasped by the original teacher LLM. While enhancing the IEM's complexity could improve performance, this can come at the expense of efficiency. Secondly, the model incorporates three weight parameters in the combined loss function (11), and optimally tuning these hyperparameters from the data remains a complex yet potentially rewarding challenge. Finally, training D2LLM demands substantial computational resources due to its reliance on large language models, which could be prohibitive for those with limited computational means.

8 Acknowledgements

We would like to thank Ant Group for their support for this work. This work was supported in part by the National Natural Science Foundation of China (No. 62072182 and No. 92270119).

References

Ali Mohamed Nabil Allam and Mohamed Hassan Haggag. 2012. The question answering systems: A survey. International Journal of Research and Reviews in Information Sciences (IJRRIS), 2(3).

Jinze Bai, Shuai Bai, Shusheng Yang, Shijie Wang, Sinan Tan, Peng Wang, Junyang Lin, Chang Zhou, and Jingren Zhou. 2023. Qwen-vl: A frontier large vision-language model with versatile abilities. arXiv preprint arXiv:2308.12966.

Hongshen Chen, Xiaorui Liu, Dawei Yin, and Jiliang Tang. 2017. A survey on dialogue systems: Recent advances and new frontiers. ACM SIGKDD Explorations Newsletter, 19(2):25–35.

Yung-Sung Chuang, Rumen Dangovski, Hongyin Luo, Yang Zhang, Shiyu Chang, Marin Soljačić, Shang-Wen Li, Scott Yih, Yoon Kim, and James Glass. 2022. Diffcse: Difference-based contrastive learning for sentence embeddings. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 4207–4218.

Yiming Cui, Wanxiang Che, Ting Liu, Bing Qin, and Ziqing Yang. 2021. Pre-training with whole word masking for chinese bert. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 29:3504–3514.

Tim Dettmers, Artidoro Pagnoni, Ari Holtzman, and Luke Zettlemoyer. 2023. Qlora: Efficient finetuning of quantized llms. In Thirty-seventh Conference on Neural Information Processing Systems.

Luyu Gao and Jamie Callan. 2021. Is your language model ready for dense representation fine-tuning. arXiv preprint arXiv:2104.08253.

Luyu Gao and Jamie Callan. 2022. Unsupervised corpus aware language model pre-training for dense passage retrieval. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2843–2853.
Tianyu Gao, Xingcheng Yao, and Danqi Chen. 2021. Empirical Methods in Natural Language Process-
Simcse: Simple contrastive learning of sentence em- ing, EMNLP 2020, pages 6769–6781. Association
beddings. In 2021 Conference on Empirical Meth- for Computational Linguistics (ACL).
ods in Natural Language Processing, EMNLP 2021,
pages 6894–6910. Association for Computational James Kirkpatrick, Razvan Pascanu, Neil Rabinowitz,
Linguistics (ACL). Joel Veness, Guillaume Desjardins, Andrei A Rusu,
Kieran Milan, John Quan, Tiago Ramalho, Ag-
Yunfan Gao, Yun Xiong, Xinyu Gao, Kangxiang Jia, nieszka Grabska-Barwinska, et al. 2017. Over-
Jinliu Pan, Yuxi Bi, Yi Dai, Jiawei Sun, Qianyu Guo, coming catastrophic forgetting in neural networks.
Meng Wang, and Haofen Wang. 2024. Retrieval- Proceedings of the national academy of sciences,
augmented generation for large language models: A 114(13):3521–3526.
survey.
Takeshi Kojima, Shixiang Shane Gu, Machel Reid, Yu-
Tiezheng Ge, Kaiming He, Qifa Ke, and Jian Sun. taka Matsuo, and Yusuke Iwasawa. 2022. Large lan-
2013. Optimized product quantization. IEEE trans- guage models are zero-shot reasoners. Advances in
actions on pattern analysis and machine intelligence, neural information processing systems, 35:22199–
36(4):744–755. 22213.
Edward J Hu, Phillip Wallis, Zeyuan Allen-Zhu,
Yuanzhi Li, Shean Wang, Lu Wang, Weizhu Chen, Juho Lee, Yoonho Lee, Jungtaek Kim, Adam Ko-
et al. 2021. Lora: Low-rank adaptation of large lan- siorek, Seungjin Choi, and Yee Whye Teh. 2019a.
guage models. In International Conference on Learn- Set transformer: A framework for attention-based
ing Representations. permutation-invariant neural networks. In Interna-
tional conference on machine learning, pages 3744–
Linmei Hu, Siyong Xu, Chen Li, Cheng Yang, Chuan 3753. PMLR.
Shi, Nan Duan, Xing Xie, and Ming Zhou. 2020.
Graph neural news recommendation with unsuper- Kenton Lee, Ming-Wei Chang, and Kristina Toutanova.
vised preference disentanglement. In Proceedings 2019b. Latent retrieval for weakly supervised open
of the 58th annual meeting of the association for domain question answering. In Proceedings of the
computational linguistics, pages 4255–4264. 57th Annual Meeting of the Association for Compu-
tational Linguistics, pages 6086–6096.
Zhiting Hu and Tianmin Shu. 2023. Language mod-
els, agent models, and world models: The law for Chaofan Li, Zheng Liu, Shitao Xiao, and Yingxia
machine reasoning and planning. arXiv preprint Shao. 2023a. Making large language models a bet-
arXiv:2312.05230. ter foundation for dense retrieval. arXiv preprint
arXiv:2312.15503.
Tao Huang, Shan You, Fei Wang, Chen Qian, and Chang
Xu. 2022. Knowledge distillation from a stronger Zehan Li, Xin Zhang, Yanzhao Zhang, Dingkun Long,
teacher. Advances in Neural Information Processing Pengjun Xie, and Meishan Zhang. 2023b. Towards
Systems, 35:33716–33727. general text embeddings with multi-stage contrastive
learning. arXiv preprint arXiv:2308.03281.
Gautier Izacard, Mathilde Caron, Lucas Hosseini, Sebas-
tian Riedel, Piotr Bojanowski, Armand Joulin, and Yufan Liu, Jiajiong Cao, Bing Li, Chunfeng Yuan,
Edouard Grave. 2022. Unsupervised dense informa- Weiming Hu, Yangxi Li, and Yunqiang Duan. 2019.
tion retrieval with contrastive learning. Transactions Knowledge distillation via instance relationship
on Machine Learning Research. graph. In Proceedings of the IEEE/CVF Conference
on Computer Vision and Pattern Recognition, pages
Gautier Izacard and Edouard Grave. 2021. Distilling 7096–7104.
knowledge from reader to retriever for question an-
swering. In ICLR 2021-9th International Conference
Yi Luan, Jacob Eisenstein, Kristina Toutanova, and
on Learning Representations.
Michael Collins. 2021. Sparse, dense, and attentional
Kalervo Järvelin and Jaana Kekäläinen. 2002. Cu- representations for text retrieval. Transactions of the
mulated gain-based evaluation of ir techniques. Association for Computational Linguistics, 9:329–
ACM Transactions on Information Systems (TOIS), 345.
20(4):422–446.
Xueguang Ma, Liang Wang, Nan Yang, Furu Wei, and
Herve Jegou, Matthijs Douze, and Cordelia Schmid. Jimmy Lin. 2023. Fine-tuning llama for multi-stage
2010. Product quantization for nearest neighbor text retrieval. arXiv preprint arXiv:2310.08319.
search. IEEE transactions on pattern analysis and
machine intelligence, 33(1):117–128. Sewon Min, Xinxi Lyu, Ari Holtzman, Mikel Artetxe,
Mike Lewis, Hannaneh Hajishirzi, and Luke Zettle-
Vladimir Karpukhin, Barlas Oğuz, Sewon Min, Patrick moyer. 2022. Rethinking the role of demonstrations:
Lewis, Ledell Wu, Sergey Edunov, Danqi Chen, and What makes in-context learning work? In Proceed-
Wen Tau Yih. 2020. Dense passage retrieval for open- ings of the 2022 Conference on Empirical Methods in
domain question answering. In 2020 Conference on Natural Language Processing, pages 11048–11064.
Niklas Muennighoff. 2022. Sgpt: Gpt sentence James Thorne, Andreas Vlachos, Christos
embeddings for semantic search. arXiv preprint Christodoulopoulos, and Arpit Mittal. 2018.
arXiv:2202.08904. Fever: a large-scale dataset for fact extraction and
verification. In Proceedings of the 2018 Conference
Arvind Neelakantan, Tao Xu, Raul Puri, Alec Rad- of the North American Chapter of the Association
ford, Jesse Michael Han, Jerry Tworek, Qiming Yuan, for Computational Linguistics: Human Language
Nikolas Tezak, Jong Wook Kim, Chris Hallacy, et al. Technologies, Volume 1 (Long Papers), pages
2022. Text and code embeddings by contrastive pre- 809–819.
training. arXiv preprint arXiv:2201.10005.
Ivan Vuli and Nikola Mrki. 2018. Specialising word
Jianmo Ni, Chen Qu, Jing Lu, Zhuyun Dai, Gustavo Her- vectors for lexical entailment. In North American
nandez Abrego, Ji Ma, Vincent Zhao, Yi Luan, Keith Chapter of the Association for Computational Lin-
Hall, Ming-Wei Chang, et al. 2022. Large dual en- guistics.
coders are generalizable retrievers. In Proceedings
of the 2022 Conference on Empirical Methods in Liang Wang, Nan Yang, Xiaolong Huang, Binxing
Natural Language Processing, pages 9844–9855. Jiao, Linjun Yang, Daxin Jiang, Rangan Majumder,
and Furu Wei. 2022. Text embeddings by weakly-
Yingqi Qu, Yuchen Ding, Jing Liu, Kai Liu, Ruiyang
supervised contrastive pre-training. arXiv preprint
Ren, Wayne Xin Zhao, Daxiang Dong, Hua Wu, and
arXiv:2212.03533.
Haifeng Wang. 2021. Rocketqa: An optimized train-
ing approach to dense passage retrieval for open- Liang Wang, Nan Yang, Xiaolong Huang, Linjun Yang,
domain question answering. In Proceedings of the Rangan Majumder, and Furu Wei. 2023. Improving
2021 Conference of the North American Chapter of text embeddings with large language models. arXiv
the Association for Computational Linguistics: Hu- preprint arXiv:2401.00368.
man Language Technologies, pages 5835–5847.
Jason Wei, Maarten Bosma, Vincent Zhao, Kelvin Guu,
Nils Reimers and Iryna Gurevych. 2019. Sentence-bert:
Adams Wei Yu, Brian Lester, Nan Du, Andrew M
Sentence embeddings using siamese bert-networks.
Dai, and Quoc V Le. 2021. Finetuned language mod-
In Proceedings of the 2019 Conference on Empirical
els are zero-shot learners. In International Confer-
Methods in Natural Language Processing and the 9th
ence on Learning Representations.
International Joint Conference on Natural Language
Processing (EMNLP-IJCNLP), pages 3982–3992. Shitao Xiao, Zheng Liu, Peitian Zhang, and Niklas
Ruiyang Ren, Shangwen Lv, Yingqi Qu, Jing Liu, Muennighof. 2023. C-pack: Packaged resources to
Wayne Xin Zhao, Qiaoqiao She, Hua Wu, Haifeng advance general chinese embedding. arXiv preprint
Wang, and Ji-Rong Wen. 2021a. Pair: Leverag- arXiv:2309.07597.
ing passage-centric similarity relation for improving
Lee Xiong, Chenyan Xiong, Ye Li, Kwok-Fung Tang,
dense passage retrieval. In Findings of the Associ-
Jialin Liu, Paul N Bennett, Junaid Ahmed, and
ation for Computational Linguistics: ACL-IJCNLP
Arnold Overwijk. 2020. Approximate nearest neigh-
2021, pages 2173–2183.
bor negative contrastive learning for dense text re-
Ruiyang Ren, Yingqi Qu, Jing Liu, Wayne Xin Zhao, trieval. In International Conference on Learning
Qiaoqiao She, Hua Wu, Haifeng Wang, and Ji-Rong Representations.
Wen. 2021b. Rocketqav2: A joint training method
for dense passage retrieval and passage re-ranking. Sohee Yang and Minjoon Seo. 2020. Is retriever
In Proceedings of the 2021 Conference on Empiri- merely an approximator of reader? arXiv preprint
cal Methods in Natural Language Processing, pages arXiv:2010.10999.
2825–2835.
Elad Ben Zaken, Yoav Goldberg, and Shauli Ravfogel.
Guilherme Rosa, Luiz Bonifacio, Vitor Jeronymo, Hugo 2022. Bitfit: Simple parameter-efficient fine-tuning
Abonizio, Marzieh Fadaee, Roberto Lotufo, and for transformer-based masked language-models. In
Rodrigo Nogueira. 2022a. In defense of cross- Proceedings of the 60th Annual Meeting of the As-
encoders for zero-shot retrieval. arXiv preprint sociation for Computational Linguistics (Volume 2:
arXiv:2212.06121. Short Papers), pages 1–9.

Guilherme Moraes Rosa, Luiz Bonifacio, Vitor Jingtao Zhan, Jiaxin Mao, Yiqun Liu, Jiafeng Guo, Min
Jeronymo, Hugo Abonizio, Marzieh Fadaee, Roberto Zhang, and Shaoping Ma. 2021. Optimizing dense
Lotufo, and Rodrigo Nogueira. 2022b. No parameter retrieval model training with hard negatives. In Pro-
left behind: How distillation and model size affect ceedings of the 44th International ACM SIGIR Con-
zero-shot retrieval. arXiv preprint arXiv:2206.02873. ference on Research and Development in Information
Retrieval, pages 1503–1512.
Hongjin Su, Weijia Shi, Jungo Kasai, Yizhong Wang,
Yushi Hu, Mari Ostendorf, Wen-tau Yih, Noah A Hang Zhang, Yeyun Gong, Yelong Shen, Jiancheng Lv,
Smith, Luke Zettlemoyer, and Tao Yu. 2022. One Nan Duan, and Weizhu Chen. 2022a. Adversarial
embedder, any task: Instruction-finetuned text em- retriever-ranker for dense text retrieval. In Interna-
beddings. arXiv preprint arXiv:2212.09741. tional Conference on Learning Representations.
Shunyu Zhang, Yaobo Liang, Ming Gong, Daxin Jiang, from T2Ranking5 , DuReader6 , cMedQA27 , and
and Nan Duan. 2022b. Multi-view document repre- mMARCO8 . Concretely, we sample 50% from
sentation learning for open-domain dense retrieval.
T2Ranking, 80% from DuReader, 80% from
In Proceedings of the 60th Annual Meeting of the
Association for Computational Linguistics (Volume cMedQA2, and 35% from mMARCO, and fi-
1: Long Papers), pages 5990–6000. nally compose a training dataset of 0.9M. The
testing data involves T2Retrieval, DuRetrieval,
Xin Zhang, Zehan Li, Yanzhao Zhang, Dingkun Long, CovidRetrieval, CmedqaRetrieval, MedicalRe-
Pengjun Xie, Meishan Zhang, and Min Zhang. 2023. trieval, and MMarcoRetrieval. To evaluate the
Language models are universal embedders. arXiv
retrieval effectiveness, each method retrieves and
preprint arXiv:2310.08232.
ranks the top-10 passages for each query, and
Yutao Zhu, Huaying Yuan, Shuting Wang, Jiongnan
MRR@10 and Recall@10 are then utilized as
Liu, Wenhan Liu, Chenlong Deng, Zhicheng Dou, the metrics.
and Ji-Rong Wen. 2023. Large language models
for information retrieval: A survey. arXiv preprint B Implementation Details
arXiv:2308.07107.
Our D2LLM model is built upon PyTorch and
DeepSpeed, using Qwen-7B-Chat as the teacher
A Datasets and the base LLM for the student due to its effec-
tiveness with Chinese data. The model uses a batch
We consider three tasks to verify the usefulness
size of 32, with each query having 8 hard negatives
of all methods, including Natural Language Infer-
assigned. The PMA module features 32 heads,
ence (NLI), Semantic Textual Similarity (STS), and
and the IEM includes two single-layer MLPs with
Information Retrieval (IR). The training and test-
ReLU activations—the first with an input size of
ing data for each task are listed below. Note that
8192 and output of 512, and the second with con-
all testing data are taken from the comprehensive
sistent 512 dimensions for both. We set α = 1,
benchmark CMTEB (Chinese Massive Text Em-
β = 0.3, and γ = 0.1 in all our experiments, un-
bedding Benchmark).
less otherwise specified, based on the observations
• NLI: For the NLI task, models are tasked to dis- in Appendix F.3. The AdamW optimizer is used
cern the presence of an entailment relationship with a learning rate of 1e-4, including a warm-up
between pairs of sentences. The training data over 0.2 epochs, and training is halted early upon
for this task involves the 0.3M data from SNLI- model convergence. LoRA adjustments are made
zh3 , NLI-zh4 , specifically including the datasets with a rank of 8, while mixed-precision training
named ATEC, BQ, LCQMC, and PAWSX. The and gradient checkpointing minimize memory us-
testing datasets are OCNLI and CMNLI. Perfor- age. Training runs on 8 NVIDIA A100 GPUs with
mance metrics include Accuracy, Average Preci- 80GB each.
sion (AP), Precision, and Recall. For the consumption of computing resources,
• STS: For the STS task, the objective is to pre- note that the number of tunable parameters in
dict the degree of similarity between sentence D2LLM, which amounts to 89 million including
pairs, with a higher predicted score indicating the modifications for PMA and IEM, remains on
greater similarity. The training data is the same par with Udever and LLaRA. The consistent pa-
as the above NLI task. The testing data involves rameter count across these methods is due to our
ATEC, BQ, LCQMC, PAWSX, STSB, AFQMC, choice of using the LoRA adaptation technique for
and QBQTC. Pearson and Spearman correlation all models, and the foundational model is the same
coefficients serve as evaluation metrics. for D2LLM, Udever, and LLaRA, leading to a com-
• IR: For the IR task, each dataset comprised parable computational cost for these LLM-based
a corpus, a set of queries, and an associated approaches. Besides, to optimize GPU memory
mapping of each query to the relevant docu- utilization, we engage in a two-stage approach: ini-
ments within the corpus. The aim was to accu- tially, we utilize the Teacher LLM to infer and
rately identify these pertinent documents for each store logits, relevant for contrastive and rank imita-
query. The training data are randomly sampled 5
https://github.com/THUIR/T2Ranking
6
https://github.com/baidu/DuReader
3 7
https://huggingface.co/datasets/shibing624/snli-zh https://github.com/zhangsheng93/cMedQA2
4 8
https://huggingface.co/datasets/shibing624/nli_zh https://huggingface.co/datasets/unicamp-dl/mmarco
Regarding the consumption of computing resources, note that the number of tunable parameters in D2LLM, which amounts to 89 million including the modifications for PMA and IEM, remains on par with Udever and LLaRA. The consistent parameter count across these methods is due to our choice of the LoRA adaptation technique for all models, and the foundational model is the same for D2LLM, Udever, and LLaRA, leading to a comparable computational cost for these LLM-based approaches. In addition, to optimize GPU memory utilization, we adopt a two-stage approach: initially, we use the teacher LLM to infer and store the logits relevant for the contrastive and rank imitation losses, as well as the similarity matrices necessary for feature imitation. These logits and similarity matrices have relatively small memory footprints, since they correlate only with the selected positives and hard negatives for each sample. During the subsequent training phase, the student model can then learn from these pre-saved components without the simultaneous GPU presence of the teacher model. In practice, D2LLM training culminates in approximately 10 and 22 hours of training time for the symmetric (i.e., NLI and STS) and asymmetric (i.e., IR) search tasks, respectively. By contrast, the training durations for LLM-based bi-encoders, such as LLaRA, hover around 9 and 20 hours for the same tasks. This indicates that the resource usage is nearly equivalent between these methods, with D2LLM introducing only a minimal additional computational burden. To implement D2LLM with diminished resource utilization, a reduction in the number of tunable parameters could be beneficial, potentially by lowering the rank value in LoRA or by adopting other parameter-efficient finetuning methods like QLoRA (Dettmers et al., 2023); investigating these resource-conserving alternatives will be a focus of our future work.

C Baselines

The five benchmark methods are summarized below:

• BGE (Xiao et al., 2023) is a BERT-style bi-encoder. It involves three training stages. First, the model is pretrained on massive data using MAE. Next, it is finetuned with unlabeled and labeled data separately.
• RocketQAv2 (Ren et al., 2021b) is also a BERT-style bi-encoder. It is distilled from a BERT-style cross-encoder via dynamic listwise distillation. This technique enables joint updates of both the teacher and the student. In our experiments, we initialize both the student and the teacher as a pretrained Chinese-roberta-large model [9].
• SGPT (Muennighoff, 2022) exploits both bi- and cross-encoder architectures to enhance GPT-style LLMs for semantic search. Here, we only utilize the bi-encoder variant due to its efficiency. It incorporates BitFit finetuning of the LLM and position-weighted mean pooling to generate an overall embedding for a query or a passage.
• Udever (Zhang et al., 2023) also aims to modify GPT-style LLMs for text embedding. Udever introduces a special token appended to the end of the sentences and trains this token to summarize the sentences via sentence-level contrastive learning.
• LLaRA (Li et al., 2023a) is similar to Udever, but employs token-level contrastive learning to further refine sentence embeddings. In particular, for Udever and LLaRA, we set the rank of the low-rank adapters (LoRA) to 40, in order to guarantee that the number of trainable parameters in these two methods is the same as that in D2LLM.

[9] https://huggingface.co/hfl/chinese-roberta-wwm-ext-large

D Teacher Model Finetuning

To finetune the teacher model for a specific task, we incorporate all positive sentence pairs and select an equal number of hard negatives (cf. Section 3.2.1), maintaining a balanced 1:1 ratio between positives and negatives. Depending on the task at hand, we utilize prompts suited for either symmetric or asymmetric searches to structure the finetuning dataset, composing inputs of prompted sentence pairs and outputs of the binary responses "yes" or "no". We typically opt for LoRA finetuning, where we set the rank within LoRA to 32. This specific rank setting is chosen to align the number of learnable parameters with those used in other comparable methods.

E Results on IR

In this section, we check the performance of all methods on the IR task. As shown in Table 4, D2LLM again outperforms the other methods trained on the same dataset in the majority of cases (8 out of 12 instances), despite not having a teacher model finetuned specifically for this task. Although we lack performance data for the teacher model LLM-ce, whose cross-encoder design is impractically slow for real-world retrieval tasks, D2LLM proves to be an effective surrogate. It strikes a balance between accuracy and efficiency, making it suitable for practical applications. Nonetheless, it is worth noting that BGE achieves the highest performance, a likely result of its extensive training on data pertinent to IR.

F Ablation Studies

In this section, we conduct a series of ablation studies to further show the competency of the proposed D2LLM.
large D2LLM.
Table 4: Results for IR, with the best-performing method and the second-best results marked in bold and underlined, respectively.

Method             #Data  #Param.  T2Retrieval      DuRetrieval      mMARCORetrieval  cMedQARetrieval  CovidRetrieval   MedicalRetrieval
                                   MRR     Recall   MRR     Recall   MRR     Recall   MRR     Recall   MRR     Recall   MRR     Recall
BGE-ft(326M)       0.9M   326M     0.8739  0.7659   0.9125  0.8546   0.6908  0.8363   0.2208  0.2638   0.5828  0.7363   0.4561  0.5490
RocketQAv2(326M)   0.9M   326M     0.6670  0.5579   0.6502  0.5079   0.5012  0.6920   0.1995  0.2321   0.4032  0.5822   0.2979  0.3851
SGPT(7B)           0.9M   0.4M     0.7408  0.6026   0.7762  0.6805   0.5516  0.7307   0.2454  0.3249   0.4411  0.6109   0.3266  0.4121
Udever(7B)         0.9M   89M      0.8653  0.7358   0.8905  0.8213   0.6327  0.7898   0.3145  0.3903   0.5124  0.6742   0.4346  0.5210
LLaRA(7B)          0.9M   89M      0.8412  0.7362   0.8777  0.8083   0.6511  0.8003   0.3221  0.3886   0.5250  0.6793   0.4612  0.5632
D2LLM(7B)          0.9M   89M      0.8893  0.7719   0.9162  0.8608   0.6723  0.8221   0.3501  0.4028   0.5639  0.7121   0.4991  0.6021
Improv.(%)         N/A    N/A      1.76%   0.78%    0.41%   0.73%    -2.68%  -1.70%   8.69%   3.20%    -3.24%  -3.29%   8.22%   6.91%
BGE(326M)          100M   326M     0.9094  0.8084   0.9345  0.8851   0.7583  0.8934   0.4349  0.4452   0.6587  0.8246   0.5504  0.6660
LLM-be(7B)         N/A    N/A      0.1843  0.1098   0.3331  0.2178   0.0541  0.1011   0.0456  0.0547   0.2630  0.4131   0.0462  0.076
LLM-ce(7B)         N/A    N/A      -       -        -       -        -       -        -       -        -       -        -       -
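Table 4 reports MRR and Recall for each retrieval dataset. For reference, the snippet below shows one standard way these ranking metrics can be computed from ranked candidate lists; it is a generic sketch under simplifying assumptions (a single cutoff k, one ranked list per query) rather than the CMTEB evaluation code.

```python
# Generic MRR and Recall@k over ranked retrieval results; illustrative only.
from typing import Dict, List, Set


def mrr(results: Dict[str, List[str]], relevant: Dict[str, Set[str]], k: int = 10) -> float:
    """Mean reciprocal rank of the first relevant document within the top k."""
    total = 0.0
    for qid, ranked in results.items():
        for rank, doc_id in enumerate(ranked[:k], start=1):
            if doc_id in relevant.get(qid, set()):
                total += 1.0 / rank
                break
    return total / max(len(results), 1)


def recall_at_k(results: Dict[str, List[str]], relevant: Dict[str, Set[str]], k: int = 10) -> float:
    """Fraction of relevant documents retrieved within the top k, averaged over queries."""
    total = 0.0
    for qid, ranked in results.items():
        gold = relevant.get(qid, set())
        if gold:
            total += len(set(ranked[:k]) & gold) / len(gold)
    return total / max(len(results), 1)


if __name__ == "__main__":
    runs = {"q1": ["d3", "d1", "d7"]}
    gold = {"q1": {"d1", "d9"}}
    print(mrr(runs, gold), recall_at_k(runs, gold))  # 0.5 0.5
```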
F.1 Impact of Losses and Modules

Our analysis focuses on the efficacy of the various losses and modules detailed in Section 3. Below are the specific modifications we tested, with the corresponding results shown in Table 3:

• −CI+CL: We replaced the proposed CI loss in Section 3.2.1 with the standard contrastive loss. The CI loss amounts to the standard contrastive loss when the score s_ij^T given by the teacher is set to 1. As expected, the resulting performance deteriorates by 3.59%, since the standard contrastive loss is sensitive to positives concealed within the hard negative set D_H^-.
• −RI_PH: Omitting the rank imitation loss for positives and hard negatives (Equation (7)) led to a 6.57% reduction in performance. This underscores the value of the student model mirroring the teacher in ranking these critical pairs.
• −RI_HI: The removal of the rank imitation loss for hard and easy negatives (Equation (8)) resulted in a 4.92% performance drop. This supports our initial argument (Section 3.2.2) that distinguishing these sample sets is key for robust student model training.
• −FI: Excluding the feature imitation loss incurred a 2.17% loss in performance, highlighting the role of feature distillation in transferring a broader spectrum of knowledge from the teacher to the student model.
• −PMA+mean: Replacing Pooling by Multihead Attention (PMA) with mean pooling led to a 1.71% decrease in performance. This result emphasizes the superior flexibility of the learnable PMA compared to static mean pooling.
• −PMA+[EOS]: Forgoing PMA in favor of using the [EOS] token as a sentence-wide embedding, and applying contrastive finetuning, caused a 1.51% performance downturn. The shifted training objective strays from the original purpose of the [EOS] token in the pretrained LLM, thus not fully capitalizing on the LLM's capabilities.
• −IEM+cos: Substituting the Interaction Emulation Module (IEM) with cosine similarity, akin to original bi-encoders, led to a 3.83% decline in performance. This change buttresses our assertion that the MLP is integral to modeling complex sentence relationships more effectively than cosine similarity alone.
• D2LLM-1.8B: Scaling down the teacher LLM from 7B to 1.8B exhibited a 14.76% decrease in performance. This suggests that the capacity of the teacher LLM is a key determinant of the effectiveness of the student D2LLM, with the findings indicating that larger teacher models engender more proficient student models.

Table 5: Performance on individual or mixed data types via IEM and cosine similarity.

Model            OCNLI           CMNLI           T2Retrieval
                 ACC     AP      ACC     AP      MRR     Recall
D2LLM-cos-sym    0.7461  0.7886  0.7867  0.8377  0.2791  0.2021
D2LLM-cos-asym   0.4913  0.4704  0.5003  0.5082  0.8072  0.7037
D2LLM-cos-mixed  0.7035  0.7524  0.7593  0.8024  0.7771  0.6802
D2LLM-sym        0.7905  0.8226  0.8084  0.8839  0.2622  0.1650
D2LLM-asym       0.5138  0.4893  0.5144  0.5003  0.8346  0.7218
D2LLM-dual       0.7834  0.8017  0.7825  0.8778  0.8321  0.7059
Figure 3: Effect of the hyperparameters α, β, and γ on OCNLI and CMNLI. [Plots of Accuracy against the value of each hyperparameter; graphics omitted.]

Figure 4: Effect of the hyperparameters α, β, and γ on T2Retrieval. [Plots of Recall against the value of each hyperparameter; graphics omitted.]

F.2 Impact of Single and Dual Branches in the IEM

We further investigated whether it is essential and beneficial to include two branches in the IEM. For this purpose, we selected two representative datasets from those listed in Appendix A: SNLI-zh for symmetric semantic relationships and T2Ranking for asymmetric semantic relationships. These datasets were employed to train a dual-branch model, designated as D2LLM-dual. For comparative analysis, we trained two separate models, D2LLM-sym and D2LLM-asym, each specializing in one of the two dataset types. We assessed the models' capabilities on test datasets aligned with either symmetric relations (OCNLI and CMNLI) or asymmetric relations (T2Retrieval). The results are summarized in Table 5.

It can be observed that D2LLM-dual achieves performance on par with D2LLM-sym for symmetric relationship tasks and with D2LLM-asym for asymmetric tasks. In stark contrast, D2LLM-sym experiences a notable performance drop in asymmetric contexts, while D2LLM-asym struggles similarly with symmetric tasks. These findings highlight the efficacy of the dual-branch IEM in D2LLM-dual, which is adept at tackling both symmetric and asymmetric semantic challenges. Thus, D2LLM-dual emerges as a versatile and unified model, proficiently capturing the nuanced interactions between sentences regardless of the semantic relationship type.

F.3 Weight Sensitivity Analysis

We also investigate the impact of the three hyperparameters α, β, and γ, which represent the balance weights of L_RI_PH, L_RI_HI, and L_FI, respectively (the contrastive imitation loss L_CI is kept unweighted). In practice, a full grid search over (α, β, γ) is impractical. Instead, we opt for a sequential method to choose the weight parameters. Specifically, we first obtain a locally optimal value for α, focusing solely on the losses L_CI and L_RI_PH. Next, by slightly expanding the tentative range of α, we further incorporate the loss component L_RI_HI and adjust β to identify an optimal pair (α, β). We continue in a similar fashion to identify the optimal (α, β, γ) within the vicinity of the previously determined (α, β). Although this approach may not guarantee globally optimal parameters, it substantially conserves resources and has proven to yield satisfactory performance. We conduct experiments based on the NLI and IR tasks and report the performance for OCNLI, CMNLI, and T2Retrieval.

We first depict ACC and Recall as functions of α in Figure 3 and Figure 4. It can be observed that a low α value correlates with suboptimal performance, indicating that the student model is inadequately leveraging the teacher's ranking capacity. As α is incremented to 1, we observe a progressive enhancement in performance. However, surpassing this threshold results in a decline, which suggests an optimal value at α = 1.

Regarding β, this parameter determines the attention on the loss term L_RI_HI. The empirical results, depicted in Figure 3 and Figure 4, reveal that a β value of 0.3 yields the most significant augmentation in model performance.

Finally, we vary the value of γ to discern the importance of feature imitation. Figure 3 and Figure 4 indicate that a higher γ leads to enhanced model performance up to a certain point. Nevertheless, an excessive emphasis on it disrupts the equilibrium among the other loss functions, thereby impairing overall performance.

Based on the above analysis, we note that the optimal hyperparameter combination remains similar across different tasks, suggesting that D2LLM generally does not require laborious task-specific hyperparameter adjustments, thereby aiding in resource conservation. Experimental results for both the NLI and IR tasks reveal that D2LLM's performance is relatively stable across a certain range of the weight parameters (α, β, γ). We set α = 1, β = 0.3, and γ = 0.1 in all our experiments, unless otherwise specified.
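The sequential procedure above can be summarized in a few lines of code. The candidate grids (chosen here to echo the ranges swept in Figures 3 and 4), the evaluate callback, and the assumed overall objective L = L_CI + α·L_RI_PH + β·L_RI_HI + γ·L_FI are illustrative; the sketch mirrors the description in this appendix rather than our exact tuning scripts.

```python
# Sequential (coordinate-wise) search for the loss weights, mirroring Appendix F.3.
# evaluate(alpha, beta, gamma) is assumed to train/validate a student with these
# weights and return a validation score (higher is better).
from typing import Callable, Sequence


def sequential_weight_search(
    evaluate: Callable[[float, float, float], float],
    alpha_grid: Sequence[float] = (0.4, 0.6, 0.8, 1.0, 1.2, 1.4),
    beta_grid: Sequence[float] = (0.1, 0.2, 0.3, 0.4, 0.5),
    gamma_grid: Sequence[float] = (0.01, 0.05, 0.1, 0.5, 1.0),
):
    # Step 1: tune alpha with beta = gamma = 0 (only L_CI and L_RI_PH active).
    alpha = max(alpha_grid, key=lambda a: evaluate(a, 0.0, 0.0))
    # Step 2: keep alpha near its local optimum and tune beta (adds L_RI_HI).
    beta = max(beta_grid, key=lambda b: evaluate(alpha, b, 0.0))
    # Step 3: fix (alpha, beta) and tune gamma (adds L_FI).
    gamma = max(gamma_grid, key=lambda g: evaluate(alpha, beta, g))
    return alpha, beta, gamma
```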
G Discussion on IEM

Recall that the IEM can be expressed as:

y^S_{ij} = f_2(f_1([y^{agg}_i, y^{agg}_j])).        (12)

In other words, the IEM operates by accepting concatenated representations of the query and passage and determining their relevancy. Fundamentally, through f_1, it tries to emulate the cross-attention mechanism present in the teacher model, fostering an information exchange between the passage and query. Depending on whether the relationship between [y^{agg}_i, y^{agg}_j] is symmetric or asymmetric, the IEM directs the input to the relevant branch of the MLP f_2 ∈ {f_2^sym, f_2^asym}, which accordingly outputs a "yes" or "no" representation. This differentiation allows the IEM to adeptly handle both types of semantic search, mirroring the diverse prompts P ∈ {P_sym, P_asym} utilized by the teacher model during inference.

To evaluate the IEM's performance against traditional cosine similarity measures, we engaged in a comprehensive experimental suite, the details of which are provided in Appendix F.2. Within this analysis, we focused on the SNLI-zh and T2Ranking datasets to represent symmetric and asymmetric semantic relationships, respectively. Models trained on these datasets include a dual-branch version (D2LLM-dual) and two specialized models (D2LLM-sym and D2LLM-asym). Additionally, we introduced cosine similarity into this comparative study, resulting in models denoted as D2LLM-cos-mixed, D2LLM-cos-sym, and D2LLM-cos-asym. The results in Table 5 demonstrate that D2LLM-dual exhibits superior capability in handling both symmetric and asymmetric search tasks compared to the D2LLM-cos-mixed model. This can be attributed to the dual-branch structure of the IEM, which allows each semantic relationship type to be learned independently during training, thereby minimizing cross-interference. Conversely, cosine similarity, as implicitly noted in (Vulić and Mrkšić, 2018) and (Muennighoff, 2022), struggles to distinguish between these relationship types accurately. Moreover, the enhanced performance of D2LLM-sym and D2LLM-asym, compared to their cosine-based counterparts, further substantiates the proficiency of the learnable MLP components f_1 and f_2 within the IEM at navigating semantic nuances more adeptly than cosine similarity.

Nevertheless, it is essential to acknowledge the limitations of the IEM in comparison with cross-encoders (i.e., the teacher model LLM-ce or LLM-ce-ft). As shown in Tables 1, 2, and 4, the IEM does not yet match the performance of cross-encoder models, which is primarily due to its mechanism of facilitating information exchange only after the separate derivation of query and passage embeddings. This approach overlooks the nuanced relationship modeling achieved through the continuous cross-attention layers in the cross-encoders.

H Efficiency Enhancement in Real-world Applications

In real-world applications, especially with larger datasets, bi-encoders, despite their capability to pre-compute passage vectors, may face significant expenses when performing full-scale vector retrieval. To ensure efficient search, various technologies (Jegou et al., 2010; Ge et al., 2013) have been put forward, of which a representative solution is quantization-based Approximate Nearest Neighbor (ANN) methods. These methods compress the vector space through quantization to enhance retrieval efficiency and reduce storage demands while only marginally sacrificing accuracy, thereby markedly accelerating search speeds. These techniques are equally applicable to D2LLM. By applying quantization to both the embeddings produced by the PMA and the network parameters of the IEM (which functions as the distance metric), we can slightly compromise accuracy to significantly enhance search speed. This improvement in efficiency can, in turn, contribute to better ranking performance in subsequent procedures.
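To illustrate the retrieval side of this idea, the sketch below builds a product-quantized ANN index with Faiss over pre-computed PMA embeddings and then uses the IEM only to re-rank the returned shortlist. The embedding dimensionality, the index parameters, and the iem_score callback are assumptions made for illustration; quantizing the IEM's own parameters (e.g., with standard post-training int8 quantization) is a separate step not shown here.

```python
# ANN retrieval over PMA embeddings with product quantization (Faiss IVF-PQ),
# followed by IEM re-ranking of the shortlist. Illustrative sketch only.
import numpy as np
import faiss


def build_ivfpq_index(passage_embs: np.ndarray, nlist: int = 1024, m: int = 64):
    """passage_embs: float32 array of shape (num_passages, d), e.g. d = 4096."""
    d = passage_embs.shape[1]
    quantizer = faiss.IndexFlatL2(d)                      # coarse quantizer
    index = faiss.IndexIVFPQ(quantizer, d, nlist, m, 8)   # m sub-vectors, 8 bits each
    index.train(passage_embs)                             # learn coarse + PQ codebooks
    index.add(passage_embs)                               # store compressed codes only
    return index


def search_and_rerank(index, query_emb, passage_embs, iem_score,
                      shortlist: int = 100, nprobe: int = 16):
    """ANN shortlist from the compressed index, then exact IEM re-ranking."""
    index.nprobe = nprobe                                 # inverted lists to visit
    _, ids = index.search(query_emb[None, :].astype(np.float32), shortlist)
    candidates = [i for i in ids[0] if i != -1]
    scores = [iem_score(query_emb, passage_embs[i]) for i in candidates]
    order = np.argsort(scores)[::-1]
    return [candidates[i] for i in order]
```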
I Bad Case Study

For the NLI task, we present two bad cases, one representing an entailment pair and the other a contradiction pair. In addition to the labels, we have provided the student model's prediction and the teacher's logits. The specific results are as follows:
NLI bad case #1:
Sentence A: There is a tennis court in this city, and
they are the first batch of customers.
Sentence B: There are customers who came to the
tennis court before them.
Student’s Prediction: 0.5959
Teacher’s logits: 0.9141
Label: 0 (Contradiction)
NLI bad case #2:
Sentence A: The operator said that the two parties
were bargaining.
Sentence B: There were at least three people.
Student’s Prediction: 0.3541
Teacher’s logits: 0.4688
Label: 1 (Entailment)
Similarly for the IR task, we provide two bad
cases, both of which are positive examples. We
present the student’s rankings of these cases and
the teacher logits. The specific results are as fol-
lows:
IR bad case #1:
Query: What are the four major artifacts of Asia?
Passage: One of the artifacts: motorcycle. Motor-
cycle can be said to be a status symbol in India.
Student’s Rank: 32
Teacher’s logits: 0.081
Label: 1 (Correct match)
IR bad case #2:
Query: Why does my phone always show that there
are messages that cannot be opened?
Passage: If you have a smart phone, you can try
flashing it.
Student’s Rank: 46
Teacher’s logits: 0.135
Label: 1 (Correct match)
We suggest that these "bad cases" may stem from
the teacher model’s initial misjudgment, which in-
advertently hinders the student model’s learning
process. To address this, we propose two potential
strategies to refine the teacher model’s predictions:
First, for scenarios with limited resources, applying
In-Context Learning (Min et al., 2022) by includ-
ing examples of these errors during training could
enhance the teacher’s ability to provide more ac-
curate logits. Second, with sufficient resources,
fine-tuning the teacher model can improve its per-
formance while being mindful to avoid catastrophic
forgetting (Kirkpatrick et al., 2017) to ensure it re-
tains its generalized learning capabilities. Both strategies signify promising areas for our future explorations.
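To make the first strategy concrete, the snippet below sketches how previously misjudged pairs could be prepended as in-context demonstrations when querying the teacher for a judgment. The prompt wording and the build_icl_prompt helper are hypothetical; the actual templates P_sym and P_asym used by the teacher are a separate design choice.

```python
# Hypothetical sketch of in-context learning for the teacher: prepend a few
# previously misjudged (sentence pair, gold label) examples before the new query
# so that the teacher's "yes"/"no" logits become better calibrated.
def build_icl_prompt(error_examples, sentence_a: str, sentence_b: str) -> str:
    lines = ["Judge whether the two sentences express an entailment. Answer yes or no."]
    for ex_a, ex_b, gold in error_examples:          # gold is "yes" or "no"
        lines += [f"Sentence A: {ex_a}", f"Sentence B: {ex_b}", f"Answer: {gold}", ""]
    lines += [f"Sentence A: {sentence_a}", f"Sentence B: {sentence_b}", "Answer:"]
    return "\n".join(lines)


# Example usage with one of the bad cases above as a demonstration:
prompt = build_icl_prompt(
    [("There is a tennis court in this city, and they are the first batch of customers.",
      "There are customers who came to the tennis court before them.", "no")],
    "The operator said that the two parties were bargaining.",
    "There were at least three people.",
)
```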
