
LLM4Tag: Automatic Tagging System for Information Retrieval via Large Language Models

Ruiming Tang, Chenxu Zhu, Bo Chen, Weipeng Zhang, Menghui Zhu, Xinyi Dai, and Huifeng Guo
Huawei Noah's Ark Lab, China

arXiv:2502.13481v1 [cs.IR] 19 Feb 2025
ABSTRACT
Tagging systems play an essential role in various information retrieval applications such as search engines and recommender systems. Recently, Large Language Models (LLMs) have been applied in tagging systems due to their extensive world knowledge, semantic understanding, and reasoning capabilities. Despite achieving remarkable performance, existing methods still have limitations, including difficulties in retrieving relevant candidate tags comprehensively, challenges in adapting to emerging domain-specific knowledge, and the lack of reliable tag confidence quantification. To address these three limitations, we propose an automatic tagging system, LLM4Tag. First, a graph-based tag recall module is designed to effectively and comprehensively construct a small-scale, highly relevant candidate tag set. Subsequently, a knowledge-enhanced tag generation module is employed to generate accurate tags with long-term and short-term knowledge injection. Finally, a tag confidence calibration module is introduced to generate reliable tag confidence scores. Extensive experiments over three large-scale industrial datasets show that LLM4Tag significantly outperforms the state-of-the-art baselines, and LLM4Tag has been deployed online for content tagging to serve hundreds of millions of users.

CCS CONCEPTS
• Information Retrieval → Tagging Systems.

KEYWORDS
Tagging Systems; Large Language Models; Information Retrieval

ACM Reference Format:
Ruiming Tang, Chenxu Zhu, Bo Chen, Weipeng Zhang, Menghui Zhu, Xinyi Dai, and Huifeng Guo. 2025. LLM4Tag: Automatic Tagging System for Information Retrieval via Large Language Models. In KDD ’2025, August 03–07, 2025, Toronto, ON, Canada. ACM, New York, NY, USA, 9 pages. https://doi.org/10.1145/nnnnnnn.nnnnnnn

1 INTRODUCTION
Tagging is the process of assigning tags, such as keywords or labels, to digital content, products, or users to facilitate organization, retrieval, and analysis. Tags serve as descriptors that summarize key attributes or themes, enabling efficient categorization and searchability, and they play a crucial role in information retrieval systems such as search engines, recommender systems, content management, and social networks [3, 11, 17, 33]. For information retrieval systems, tags are widely used in various stages, including content distribution strategies, ranking algorithms, and operational decision-making processes [2, 7]. Therefore, tagging systems must not only deliver high accuracy and coverage, but also provide interpretability and strong confidence.

Before the era of Large Language Models (LLMs), the mainstream tagging methods mainly included statistics-based methods (i.e., TF-IDF-based [24], LDA-based [8]), supervised classification-based methods (i.e., CNN-based [9, 32], RNN-based [21, 27]), and pre-trained model-based methods (i.e., BERT-based [12, 23, 30]). However, limited by their model capacity, these methods cannot achieve satisfactory results, especially for complex contents. Besides, they heavily rely on annotated training data, resulting in limited generalization and transferability.

The rise of LLMs, with their extensive world knowledge, powerful semantic understanding, and reasoning capabilities, has significantly enhanced the effectiveness of tagging systems. LLM4TC [5] employs LLMs directly as tagging classifiers and leverages annotated data to fine-tune LLMs. TagGPT [15] further introduces a match-based recall to filter out a small-scale tag set to address the limited input length of LLMs. ICXML [35] proposes an in-context learning algorithm to guide LLMs to further improve performance.

However, existing LLM-enhanced tagging algorithms exhibit several critical limitations that require improvement (shown in Figure 1):

(L1) Constrained by the input length and inference efficiency of LLMs, existing methods adopt simple match-based recall to filter out a small-scale candidate tag set [15, 35], which is prone to missing relevant tags, thereby reducing accuracy.
(L2) General-purpose LLMs pre-trained on publicly available corpora exhibit limitations in comprehending emerging domain-specific knowledge within information retrieval, leading to lower accuracy in challenging cases [5, 18].

(L3) Due to hallucination and uncertainty [13, 14], LLMs cannot accurately quantify tag confidence, which is crucial for information retrieval applications.

Figure 1: LLM-enhanced tagging systems and their three limitations. (L1) Simple match-based recall is prone to missing relevant tags; (L2) the emerging domain-specific knowledge may not align with the pre-trained knowledge of LLMs; (L3) LLMs cannot accurately quantify tag confidence.

To address the three limitations of existing approaches, we propose an automatic tagging system called LLM4Tag, which consists of three key modules. Specifically, to improve the completeness of candidate tags (L1), we propose a graph-based tag recall module designed to construct small-scale, highly relevant candidate tag sets from a massive tag repository efficiently and comprehensively. To enhance the domain-specific knowledge and the adaptability to emerging information of general-purpose LLMs (L2), a knowledge-enhanced tag generation module that integrates long-term supervised knowledge injection and short-term retrieved knowledge injection is designed to generate accurate tags. Moreover, a tag confidence calibration module is introduced to generate reliable tag confidence scores, ensuring more robust and trustworthy results (L3).

To summarize, the main contributions of this paper can be highlighted as follows:
• We propose an LLM-enhanced tagging framework, LLM4Tag, characterized by completeness, continuous knowledge evolution, and quantifiability.
• To address the limitations of existing approaches, LLM4Tag integrates three key modules: graph-based tag recall, knowledge-enhanced tag generation, and tag confidence calibration, ensuring the generation of accurate and reliable tags.
• LLM4Tag achieves state-of-the-art performance on three large-scale industrial datasets, with detailed analysis that provides a deeper understanding of model performance. Moreover, LLM4Tag has been deployed online for content tagging, serving hundreds of millions of users.

2 RELATED WORK
In this section, we briefly review traditional tagging systems and LLM-enhanced tagging systems.

2.1 Traditional Tagging Systems
Traditional tagging systems [6, 11, 22] generally employ multi-label classification models, which utilize human-annotated tags as ground-truth labels and employ content descriptions as input for prediction. Qaiser et al. [24] utilize TF-IDF to categorize tags, while Diaz et al. [8] employ LDA to automatically tag resources based on the most likely tags derived from the identified latent topics. The advent of deep learning has led to the proposal of RNN-based [21] and CNN-based [32] methods for multi-label learning, which are directly applied to tagging systems [9, 27]. Hasegawa et al. [12] further adopt the BERT pre-training technique in their tagging systems. Recently, with the growing popularity of pre-trained Small Language Models (SLMs), numerous pre-trained embedding models, such as BGE [30], GTE [19], and CONAN [16], have been proposed and directly employed in tagging systems through domain-knowledge fine-tuning.

Nonetheless, the capabilities of these models are constrained by their limited model capacity, particularly in the presence of complex content. Additionally, they depend excessively on annotated training data, resulting in sub-optimal generalization and transferability.

2.2 LLM-Enhanced Tagging Systems
With Large Language Models (LLMs) achieving remarkable breakthroughs in natural language processing [1, 4, 10, 26] and information retrieval systems [20, 34], LLM-enhanced tagging systems have received much attention and are being actively explored [5, 15, 25, 29, 35]. Wang et al. [29] employ LLMs as a direct tagging classifier, while Sun et al. [25] introduce clue and reasoning prompts to further enhance performance. LLM4TC [5] undertakes studies on diverse LLM architectures and leverages annotated samples to fine-tune the LLMs. TagGPT [15] introduces an early match-based recall mechanism to generate candidate tags from a large-scale tag repository with textual clues from multimodal data. ICXML [35] proposes a two-stage framework through in-context learning to guide LLMs to align with the tag space.

However, the aforementioned works suffer from three critical limitations (mentioned in Section 1): (1) difficulties in comprehensively retrieving relevant candidate tags, (2) challenges in adapting to emerging domain-specific knowledge, and (3) the lack of reliable tag confidence quantification. To this end, we propose LLM4Tag, an automatic tagging system, to address the aforementioned limitations.

3 METHODOLOGY
In this section, we present our proposed LLM4Tag framework in detail. We start by providing an overview of the proposed framework and then give detailed descriptions of the three modules in LLM4Tag.
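To make the overall flow concrete before the module descriptions, the following is a minimal, illustrative sketch (not the authors' implementation) of how the three modules could be orchestrated. All function names, the content dictionary layout, and the data class are placeholders introduced here; only the 0.5 confidence threshold mirrors a setting later reported in Section 4.1.4.

```python
# Hypothetical end-to-end orchestration of the three LLM4Tag modules.
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class TaggedContent:
    content_id: str
    tags: Dict[str, float]  # tag -> calibrated confidence score


def tag_content(content: dict,
                recall_candidate_tags,   # module 1: graph-based tag recall
                generate_tags,           # module 2: knowledge-enhanced generation
                calibrate_confidence,    # module 3: tag confidence calibration
                confidence_threshold: float = 0.5) -> TaggedContent:
    # 1. Retrieve a small, highly relevant candidate tag set from the tag graph.
    candidates: List[str] = recall_candidate_tags(content)

    # 2. Let the knowledge-enhanced LLM select tags from the candidates.
    selected: List[str] = generate_tags(content, candidates)

    # 3. Score each selected tag and drop low-confidence ones (self-calibration).
    scored = {t: calibrate_confidence(content, t) for t in selected}
    kept = {t: s for t, s in scored.items() if s >= confidence_threshold}

    return TaggedContent(content_id=content["id"],
                         tags=dict(sorted(kept.items(), key=lambda kv: -kv[1])))
```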
Figure 2: The overall framework of LLM4Tag, consisting of three modules: the graph-based tag recall module, the knowledge-enhanced tag generation module, and the tag confidence calibration module.

3.1 Overview
As illustrated in Figure 2, our proposed LLM4Tag framework consists of three major modules: (1) Graph-based Tag Recall, (2) Knowledge-enhanced Tag Generation, and (3) Tag Confidence Calibration, which respectively provide completeness, continual knowledge evolution, and quantifiability.

Graph-based Tag Recall module is responsible for retrieving a small-scale, highly relevant candidate tag set from a massive tag repository. Based on a scalable content-tag graph constructed dynamically, graph-based tag recall is utilized to fetch dozens of relevant tags for each content.

Knowledge-enhanced Tag Generation module is designed to accurately generate tags for each content via Large Language Models (LLMs). To address the lack of domain-specific and emerging knowledge in general-purpose LLMs, this module implements a scheme integrating the injection of both long-term and short-term domain knowledge, thereby achieving continual knowledge evolution.

Tag Confidence Calibration module aims to generate a quantifiable and reliable confidence score for each tag, thus alleviating the issues of hallucination and uncertainty in LLMs. Furthermore, the confidence score can be employed as a relevance metric for downstream information retrieval tasks.

3.2 Graph-based Tag Recall
Given the considerable magnitude of tags (millions) in industrial information retrieval systems, the direct integration of the whole tag repository into LLMs is impractical due to the constrained context window and inference efficiency of LLMs. Existing approaches [15, 35] adopt simple match-based tag recall to filter out a small-scale candidate tag set based on small language models (SLMs), such as BGE [30]. However, they are prone to missing relevant tags due to the limited capabilities of SLMs. To address this issue and improve the comprehensiveness of the retrieved candidate tags, we construct a semantic graph globally and propose a graph-based tag recall module.

Firstly, we initialize an undirected graph $\mathcal{G}$ with contents and tags as:
$\mathcal{G} = \{\mathcal{V}, \mathcal{E}\}$,  (1)
where the vertex set $\mathcal{V} = \{\mathcal{C}, \mathcal{T}\}$ is the set of existing content vertices $\mathcal{C}$ and all tag vertices $\mathcal{T}$. As for the edge set $\mathcal{E}$, it contains two types of edges, called Deterministic Edges and Similarity Edges.

Deterministic Edges only connect a content vertex $c$ and a tag vertex $t$, formulated as $e^{d}_{c\text{-}t}$, and indicate that content $c$ is labeled with tag $t$ based on historical annotation data. To ease the high sparsity of the deterministic edges in the graph $\mathcal{G}$, we further introduce semantic similarity-based edges (Similarity Edges) that connect not only a content vertex $c$ and a tag vertex $t$, but also different content vertices, formulated as $e^{s}_{c\text{-}t}$ and $e^{s}_{c\text{-}c}$, respectively.

Specifically, for the $i$-th vertex $v^{i} \in \mathcal{V}$ in graph $\mathcal{G}$, we summarize all textual information (i.e., title and category for a content, tag description for a tag) as $text^{i}$ and vectorize it with an encoder to get a semantic representation $\boldsymbol{r}^{i}$:
$\boldsymbol{r}^{i} = \mathrm{Encoder}(text^{i})$,  (2)
where Encoder is a small language model, such as BGE [30]. Then the similarity distance of two different vertices $v^{i}, v^{j}$ can be computed as:
$\mathrm{Dis}(v^{i}, v^{j}) = \dfrac{\boldsymbol{r}^{i} \cdot \boldsymbol{r}^{j}}{\|\boldsymbol{r}^{i}\|\,\|\boldsymbol{r}^{j}\|}$.  (3)
After obtaining the similarity estimations, we can use a threshold-based method to determine the similarity edge construction, i.e.,
• $e^{s}_{c\text{-}t}$ connects the content $c$ and the tag $t$ if the similarity distance between them exceeds $\delta_{c\text{-}t}$.
• $e^{s}_{c\text{-}c}$ connects two similar contents when their similarity distance exceeds $\delta_{c\text{-}c}$.
In this way, we can construct a basic content-tag graph with deterministic/similarity edges.
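The sketch below illustrates how this graph construction (Eqs. (1)–(3)) might look in code, assuming a sentence-transformers-style encoder (the paper uses BGE) and the thresholds reported in Section 4.1.4. The graph library, helper names, model checkpoint, and the brute-force pairwise loop are illustrative choices rather than the paper's implementation; a production system would use approximate nearest-neighbor search instead of comparing all pairs.

```python
# Illustrative content-tag graph construction with deterministic and similarity edges.
import numpy as np
import networkx as nx
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("BAAI/bge-base-zh-v1.5")  # assumed encoder checkpoint


def build_graph(contents, tags, labeled_pairs, delta_ct=0.5, delta_cc=0.8):
    """contents/tags: {vertex_id: text}; labeled_pairs: [(content_id, tag_id), ...]."""
    g = nx.Graph()

    # Each vertex carries its semantic representation r_i = Encoder(text_i), Eq. (2).
    ids = list(contents) + list(tags)
    texts = [contents.get(v, tags.get(v)) for v in ids]
    reps = encoder.encode(texts, normalize_embeddings=True)
    for v, r in zip(ids, reps):
        g.add_node(v, rep=r, kind="content" if v in contents else "tag")

    # Deterministic edges from historical annotation data.
    g.add_edges_from(labeled_pairs, kind="deterministic")

    # Similarity distance, Eq. (3): cosine similarity (dot product of normalized vectors).
    def sim(u, v):
        return float(np.dot(g.nodes[u]["rep"], g.nodes[v]["rep"]))

    # Threshold-based similarity edges (content-tag and content-content).
    for c in contents:
        for t in tags:
            if sim(c, t) > delta_ct:
                g.add_edge(c, t, kind="similarity")
        for c2 in contents:
            if c2 != c and sim(c, c2) > delta_cc:
                g.add_edge(c, c2, kind="similarity")
    return g
```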
Then, when a new content $c$ appears that needs to be tagged, we dynamically insert it into this graph by adding similarity edges. Next, we define two types of meta-paths (i.e., the C2T meta-path and the C2C2T meta-path) and adopt a meta-path-based approach to recall candidate tags.

C2T Meta-Path: Based on the given content $c$, we first recall the tags that are connected directly to $c$ as candidate tags. The meta-path can be defined as:
$p_{C2T} = c \xrightarrow{s} t$,  (4)
where $\xrightarrow{s}$ denotes a similarity edge.

C2C2T Meta-Path: C2C2T contains two sub-procedures: C2C and C2T. C2C is aimed at discovering similar contents, while C2T further attempts to recall the deterministic tags from these similar contents. The meta-path can be formulated as:
$p_{C2C2T} = c \xrightarrow{s} c \xrightarrow{d} t$,  (5)
where $\xrightarrow{d}$ denotes a deterministic edge and $\xrightarrow{s}$ denotes a similarity edge.

With these two types of meta-paths, we can generate a more comprehensive candidate tag set for content $c$ as
$\Phi(c) = \Phi^{C2T}(c) \cup \Phi^{C2C2T}(c)$,  (6)
where $\Phi^{C2T}(c)$ is retrieved by the C2T meta-path and $\Phi^{C2C2T}(c)$ is retrieved by the C2C2T meta-path. Notably, the final tagging results of LLM4Tag for the content $c$ will also be added to the graph as deterministic edges, enabling dynamic scalability of the graph.

Compared to simple match-based tag recall, our graph-based tag recall leverages semantic similarity to construct a global content-tag graph and incorporates a meta-path-based multi-hop recall mechanism to enhance candidate tag completeness, which will be demonstrated in Sec 4.3.

3.3 Knowledge-enhanced Tag Generation
After obtaining the candidate tag set, we can directly use Large Language Models (LLMs) to select the most appropriate tags. However, due to the diversity and industry-specific nature of information retrieval applications, domain-specific knowledge varies significantly across different scenarios. That is, the same content and tags may have distinct definitions and interpretations depending on the specific application context. Furthermore, domain-specific knowledge emerges continually and at an expeditious pace. As a result, general-purpose LLMs have difficulty in understanding emerging domain-specific information, such as newly listed products, emerging hot news, or newly added tags, leading to lower accuracy on challenging cases.

To address the lack of emerging domain-specific information in LLMs, we devise a knowledge-enhanced tag generation scheme that takes into account both long-term and short-term domain-specific knowledge through two key components, namely Long-term Supervised Knowledge Injection (LSKI) and Short-term Retrieved Knowledge Injection (SRKI).

3.3.1 Long-term Supervised Knowledge Injection. For long-term domain-specific knowledge, we first construct a training dataset $\mathcal{D}$ and adopt a basic prompt template $Template_{b}$ for tag generation (shown in Figure 3):
$\mathcal{D} = \{(x_i, y_i)\}_{i=1}^{N}$, $x_i = Template_{b}(c_i, \Phi(c_i))$,  (7)
where $N$ is the size of the training dataset. Notably, to ensure the comprehensiveness of domain-specific knowledge, we employ the principle of diversity for sample selection and obtain correct answers $y_i$ by combining LLM generation with human expert annotations.

Figure 3: Prompt template for basic tag generation in the advertisement creatives tagging scenario: "You are an advertising tag bot. Considering an advertisement creative with {image} and {product description}, select the most relevant tag from {candidate tag set}."

After obtaining the training set, we leverage the causal language modeling objective for LLM Supervised Fine-Tuning (SFT):
$\max_{\Theta} \sum_{i=1}^{N} \sum_{j=1}^{|y_i|} \log P_{\Theta}\left(y_{i,j} \mid x_i, y_{i,<j}\right)$,  (8)
where $\Theta$ is the parameter of the LLM, $y_{i,j}$ is the $j$-th token of the textual output $y_i$, and $y_{i,<j}$ denotes the tokens before $y_{i,j}$ in the $i$-th sample.

By adopting this approach, we can effectively integrate the domain-specific knowledge from information retrieval systems into LLMs, thus improving the tagging performance.

3.3.2 Short-term Retrieved Knowledge Injection. Although LSKI effectively provides domain-specific knowledge, continuously incorporating short-term knowledge through LLM fine-tuning is highly resource-intensive, especially given the rapid emergence of new domain knowledge. Additionally, this approach suffers from poor timeliness, making it more challenging to adapt to rapidly evolving content in information retrieval systems, particularly for emerging hot topics.

Therefore, we further introduce short-term retrieved knowledge injection (SRKI). Specifically, we derive two retrieved knowledge injection methods: retrieved in-context learning injection and retrieved augmented generation injection.

Retrieved In-Context Learning Injection. We first construct a retrievable sample knowledge base (including contents and their correct/incorrect annotated tags) and continuously append newly emerging samples. Then, given the target content $c$, this component retrieves $n$ relevant samples from the sample knowledge base. This approach not only leverages the few-shot in-context learning capability of LLMs but also enables them to quickly adapt to emerging domain knowledge, enhancing tagging accuracy for challenging cases.

Retrieved Augmented Generation Injection. Given the content $c$ and the candidate tag set $\Phi(c)$, this component retrieves a relevant descriptive corpus from web search and a domain knowledge base. It can retrieve extensive information that assists LLMs in understanding unknown domain-specific knowledge or new knowledge, such as the definition of terminology in the content/tag or some manually defined tagging rules.
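As a rough illustration of the two injection paths, the sketch below builds one supervised fine-tuning sample in the spirit of Eq. (7) and assembles a retrieval-enhanced prompt in the spirit of Figures 3 and 4. The template wording is paraphrased rather than the production prompt, and the retriever objects (`sample_kb`, `corpus_retriever`) and their `search` methods are hypothetical.

```python
# Illustrative LSKI sample construction and SRKI prompt assembly (names are hypothetical).

def build_sft_sample(content: str, candidate_tags: list[str], gold_tags: list[str]) -> dict:
    """One (x_i, y_i) pair for long-term supervised knowledge injection, Eq. (7)."""
    x = (f"Given a content which is {content}, "
         f"select the proper tags from {', '.join(candidate_tags)}.")
    y = ", ".join(gold_tags)  # answers obtained via LLM generation plus expert review
    return {"prompt": x, "response": y}


def build_srki_prompt(content: str, candidate_tags: list[str],
                      sample_kb, corpus_retriever, n_samples: int = 3) -> str:
    """Short-term retrieved knowledge injection: few-shot samples plus retrieved corpus."""
    # Retrieved in-context learning injection: similar contents with annotated tags.
    shots = sample_kb.search(content, k=n_samples)            # hypothetical API
    shot_text = "; ".join(f"({s.content} -> {s.tags})" for s in shots)

    # Retrieved augmented generation injection: descriptive corpus for content/tags.
    notes = corpus_retriever.search(content, candidate_tags)   # hypothetical API
    note_text = " ".join(notes)

    return (f"Background knowledge: {note_text}\n"
            f"Similar contents and their tags: {shot_text}\n"
            f"Given a content which is {content}, "
            f"select the proper tags from {', '.join(candidate_tags)}.")
```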
After obtaining the retrieved knowledge, we design a prompt template, $Template_{r}$ (shown in Figure 4), to integrate the knowledge with the content $c$ and the candidate tag set $\Phi(c)$, providing in-context guidance for LLMs to predict the most appropriate tags for content $c$:
$\Gamma(c) = \mathrm{LLM}(Template_{r}(c, \Phi(c), R(c))) = \{t_1^{c}, t_2^{c}, \cdots, t_m^{c}\}$,  (9)
where $R(c)$ is the retrieved knowledge above, and $m$ is the number of appropriate tags generated by the LLM.

Figure 4: Prompt template for retrieval-enhanced tag generation in the advertisement creatives tagging scenario: "You are an advertising tag bot. Considering an advertisement creative with {image} and {product description}. Domain-specific knowledge: (1) Similar advertisement creatives and corresponding tags: {(ad1, tag1), (ad2, tag2), ...}; (2) Information related to the advertisement and tags: {definition of terminology, related tagging rules, ...}. Now you need to select the most relevant tag from {candidate tag set}."

3.4 Tag Confidence Calibration
After tag generation, there still exist two serious problems for real-world applications: (1) hallucination due to the uncertainty of LLMs, which leads to generating irrelevant or wrong tags; and (2) the necessity of assigning a quantifiable relevance score to each tag for the sake of downstream usage in information retrieval systems (e.g., recall and marketing).

To handle these two problems, the tag confidence calibration module is adopted. Specifically, given a target content $c$ and a certain tag $t^{c} \in \Gamma(c)$, we derive a prompt template, $Template_{c}$ (shown in Figure 5), to leverage the reasoning ability of LLMs for a tag confidence judgment task, i.e., whether $c$ and $t^{c}$ are relevant. Then we extract the probability of the answer token in the LLM result to get a confidence score $\mathrm{Conf}(c, t^{c})$:
$\boldsymbol{s} = \mathrm{LLM}(Template_{c}(c, t^{c})) \in \mathbb{R}^{V}$,  $\mathrm{Conf}(c, t^{c}) = \dfrac{\exp(\boldsymbol{s}[\text{"Yes"}])}{\exp(\boldsymbol{s}[\text{"Yes"}]) + \exp(\boldsymbol{s}[\text{"No"}])} \in (0, 1)$,  (10)
where $\boldsymbol{s}$ is the score vector over all tokens and $V$ is the vocabulary size of the LLM.

Figure 5: Prompt template for tag confidence judgment in the advertisement creatives tagging scenario: "You are an advertising tag relevance judgment bot. Given an advertisement creative with {image} and {product description} and a tag with {tag description}, judge whether the advertisement is relevant to the tag. Answer with 'Yes' or 'No'."

After obtaining the confidence score $\mathrm{Conf}(c, t)$, we implement self-calibration for the results by eliminating those tags with low confidence, achieving better performance by mitigating the hallucination problem. Furthermore, this confidence score can be directly used as a relevance metric for downstream tasks.

Tag Confidence Training. In order to make the confidence score more consistent with the requirements of information retrieval, we construct a confidence training dataset $\mathcal{D}'$ as:
$\mathcal{D}' = \{(x_i', y_i')\}_{i=1}^{M}$, $x_i' = Prompt_{c}(c_i, t_i)$, $y_i' \in \{\text{"Yes"}, \text{"No"}\}$,  (11)
where $y_i'$ is annotated by experts and $M$ is the size of the training dataset. Then we leverage the causal language modeling objective, which is the same as Equation (8), to perform supervised fine-tuning. In that case, the confidence score predicted by this module aligns with the requirements of the information retrieval systems, thereby facilitating the calibration of incorrect tags.
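A minimal sketch of the confidence computation in Eq. (10) and the subsequent self-calibration is given below, assuming a Hugging Face causal language model fine-tuned for the judgment task (the paper uses PanGu-7B). The model name and prompt wording are placeholders, and the 0.5 threshold mirrors the deployment setting reported in Section 4.1.4.

```python
# Illustrative Yes/No confidence scoring and self-calibration for generated tags.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("your-finetuned-judgment-model")  # placeholder
model = AutoModelForCausalLM.from_pretrained("your-finetuned-judgment-model")


@torch.no_grad()
def tag_confidence(content_desc: str, tag_desc: str) -> float:
    prompt = (f"Given a content with {content_desc} and a tag with {tag_desc}, "
              f"judge whether the content is relevant to the tag. Answer with Yes or No.\n")
    inputs = tokenizer(prompt, return_tensors="pt")
    logits = model(**inputs).logits[0, -1]  # scores s over the vocabulary at the answer position

    yes_id = tokenizer.encode("Yes", add_special_tokens=False)[0]
    no_id = tokenizer.encode("No", add_special_tokens=False)[0]
    # Softmax restricted to the two answer tokens, as in Eq. (10).
    return torch.softmax(logits[[yes_id, no_id]], dim=0)[0].item()


def calibrate(content_desc: str, generated_tags: dict, threshold: float = 0.5) -> list:
    """Self-calibration: drop low-confidence tags and rank the rest by confidence."""
    scored = {t: tag_confidence(content_desc, d) for t, d in generated_tags.items()}
    kept = [(t, s) for t, s in scored.items() if s >= threshold]
    return sorted(kept, key=lambda ts: -ts[1])
```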
4 EXPERIMENTS
In this section, we conduct extensive experiments to answer the following research questions:
RQ1 How does LLM4Tag perform in comparison to existing tagging algorithms?
RQ2 How effective is the graph-based tag recall module?
RQ3 Does the injection of domain-specific knowledge enhance the tagging performance?
RQ4 What is the impact of the tag confidence calibration module?

4.1 Experimental Settings
4.1.1 Dataset. We conducted experiments on a mainstream information distribution platform with hundreds of millions of users and sampled three representative industrial datasets from online logs to ensure consistency in data distribution, covering two types of tasks: (1) a multi-tag task (Browser News) and (2) single-tag tasks (Advertisement Creatives and Search Query).
• Browser News dataset includes popular news articles and user-generated videos, primarily in the form of text, images, and short videos. This is a multi-tag task, wherein the objective is to select multiple appropriate tags for each content from a massive tag repository (more than 100,000 tags). Around 30,000 contents are randomly collected and expert-annotated as the testing dataset.
• Advertisement Creatives dataset includes ad creatives, including cover images, copywriting, and product descriptions from advertisers. The task for this dataset is a single-tag task, where we need to select the tag most relevant to the advertisement from a well-designed tag repository (more than 1,000 tags); around 10,000 advertisements are randomly collected and expert-annotated as the testing dataset.
• Search Query dataset primarily consists of user search queries from a web search engine, used for user intent classification. The task for this dataset is also a single-tag task, where the most probable intent needs to be selected as the tag for each query. The size of the tag repository is about 1,000, and 2,000 queries are collected and manually tagged as the testing dataset.

Table 1: Performance comparison of different methods. Note that the multi-tag task (Browser News) and the single-tag tasks (Advertisement Creatives and Search Query) use different metrics. The best result is given in bold, and the second-best value is underlined. "RI" indicates the relative improvement of LLM4Tag over the corresponding baseline.

Model   | Browser News: Acc@1 / Acc@2 / Acc@3 / RI | Advertisement Creatives: Precision / Recall / F1 / RI | Search Query: Precision / Recall / F1 / RI
BGE     | 0.7427 / 0.6584 / 0.5976 / 29.8% | 0.7817 / 0.7396 / 0.7601 / 18.5% | 0.6364 / 0.5122 / 0.5676 / 56.2%
GTE     | 0.7292 / 0.6507 / 0.5941 / 31.3% | 0.7369 / 0.7026 / 0.7194 / 25.0% | 0.6129 / 0.4634 / 0.5278 / 67.9%
CONAN   | 0.7568 / 0.6814 / 0.6266 / 25.5% | 0.7491 / 0.7194 / 0.7339 / 22.6% | 0.6056 / 0.5244 / 0.5621 / 57.9%
TagGPT  | 0.8351 / 0.7813 / 0.7424 / 9.5%  | 0.8454 / 0.7997 / 0.8219 / 9.4%  | 0.8421 / 0.7805 / 0.8101 / 9.7%
ICXML   | 0.8398 / 0.7883 / 0.7560 / 8.4%  | 0.8492 / 0.8025 / 0.8252 / 9.0%  | 0.8600 / 0.7840 / 0.8202 / 8.3%
LLM4TC  | 0.8602 / 0.8069 / 0.8235 / 3.7%  | 0.8726 / 0.8245 / 0.8479 / 6.1%  | 0.9028 / 0.8025 / 0.8497 / 4.5%
LLM4Tag | 0.9041 / 0.8511 / 0.8273 / –     | 0.9138 / 0.8857 / 0.8995 / –     | 0.9325 / 0.8485 / 0.8885 / –

4.1.2 Baselines. To evaluate the superiority and effectiveness of our proposed model, we compare LLM4Tag with two classes of existing models:
• Traditional Methods encode the contents and tags with pre-trained language models and select the most relevant tags for each content according to cosine distance. Here we compare three different pre-trained language models. BGE [30] pre-trains the models with RetroMAE on large-scale pair data using contrastive learning. GTE [19] further proposes multi-stage contrastive learning to train the text embedding. CONAN [16] maximizes the utilization of more and higher-quality negative examples to pre-train the model.
• LLM-Enhanced Methods utilize large language models to assist tag generation. TagGPT [15] proposes a zero-shot automated tag extraction system through prompt engineering via LLMs. ICXML [35] introduces a two-stage tag generation framework involving generation-based label shortlisting and label reranking through in-context learning. LLM4TC [5] further leverages fine-tuning with domain knowledge to improve the performance of tag generation.

4.1.3 Evaluation Metrics. For the multi-tag task, due to the excessive number of tags (millions), we cannot annotate all the correct tags and thus only directly judge whether the results generated by the model are correct or not. In this case, we define Acc@k to evaluate the performance:
$\mathrm{Acc@}k = \dfrac{1}{N'} \sum_{i=1}^{N'} \sum_{j=1}^{k'} \dfrac{\mathbb{I}(T_i[j])}{k'}$, $k' = \min(k, \mathrm{len}(T_i))$, $\mathbb{I}(T_i[j]) = \begin{cases} 1, & T_i[j] \text{ is right}, \\ 0, & \text{otherwise}, \end{cases}$  (12)
where $T_i[j]$ is the $j$-th generated tag of the $i$-th content and $N'$ is the size of the test dataset. It is worth noting that some contents do not have $k$ proper tags; thus we allow the number of generated tags to be less than $k$.
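A small sketch of Acc@k (Eq. (12)) follows, under the assumption that each generated tag has already been judged correct or incorrect by an annotator; the data layout is an illustrative choice.

```python
# Illustrative Acc@k computation over per-content correctness judgments.

def acc_at_k(results: list[list[bool]], k: int) -> float:
    """results[i][j] = True iff the j-th generated tag of content i was judged correct."""
    total = 0.0
    for judgments in results:
        k_prime = min(k, len(judgments))  # contents may have fewer than k generated tags
        if k_prime == 0:
            continue
        total += sum(judgments[:k_prime]) / k_prime
    return total / len(results)


# Example: three contents with 3, 2, and 1 generated tags respectively.
results = [[True, True, False], [True, False], [True]]
print(round(acc_at_k(results, 3), 4))  # (2/3 + 1/2 + 1/1) / 3 ≈ 0.7222
```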
For the single-tag tasks, we adopt Precision, Recall, and F1 following previous works [5, 15]. Higher values of these metrics indicate better performance.

Moreover, we report the Relative Improvement (RI) to represent the relative improvement our model achieves over the compared models. Here we calculate the average RI over all the above metrics.

4.1.4 Implementation Details. For the LLM, we select Huawei's large language model PanGu-7B [28, 31]. For the graph-based tag recall module, we choose BGE [30] as the encoder model. $\delta_{c\text{-}t}$ and $\delta_{c\text{-}c}$ are set to 0.5 and 0.8, respectively. Besides, we set maximum recall numbers for the different meta-paths: 15 for the C2T meta-path and 5 for the C2C2T meta-path. For the knowledge-enhanced tag generation module, the training dataset for long-term supervised knowledge injection contains approximately 10,000 annotated samples, and the tuning is performed every two weeks. As for short-term retrieved knowledge injection, the retrievable database is updated in real time, and we retrieve at most 3 relevant samples/segments for in-context learning injection and augmented generation injection, respectively. For the tag confidence calibration module, we eliminate tags with confidence scores less than 0.5 and rank the remaining tags in order of their confidence scores as the result.

4.2 Result Comparison & Deployment (RQ1)
Table 1 summarizes the performance of different methods on the three industrial datasets, from which we have the following observations:
• Leveraging large language models (LLMs) benefits model performance. TagGPT, ICXML, and LLM4TC utilize LLMs to assist tag generation, achieving better performance than the methods based on small language models (SLMs), such as BGE, GTE, and CONAN. This phenomenon indicates that the world knowledge and reasoning capabilities of LLMs enable better content understanding and tag generation, significantly improving tagging effectiveness.
• Introducing domain knowledge can significantly improve performance. Although LLMs benefit from general world knowledge, there remains a significant gap compared with domain-specific knowledge. Therefore, LLM4TC, which injects domain knowledge by fine-tuning the LLMs, achieves better performance than the other baselines in all metrics, which validates the importance of domain knowledge injection.
• The superior performance of LLM4Tag. We can observe from Table 1 that LLM4Tag yields the best performance on all datasets consistently and significantly, validating the superior effectiveness of our proposed LLM4Tag. Concretely, LLM4Tag beats the best baseline by 3.7%, 6.1%, and 4.5% on the three datasets, respectively. This performance improvement is attributed to the advanced nature of LLM4Tag, including more comprehensive graph-based tag recall, deeper domain-specific knowledge injection, and more reliable confidence calibration.
• Notably, LLM4Tag has been deployed online and covers all the traffic. We randomly resampled the online data, and the online report shows consistency between the improvements in the online metrics and those observed in the offline evaluation. Now, LLM4Tag has been deployed in the content tagging systems of these three online applications, serving hundreds of millions of users daily.

4.3 The Effectiveness of Graph-based Tag Recall Module (RQ2)
In this subsection, we compare our proposed graph-based tag recall module with match-based recall to validate the effectiveness of candidate tag retrieval on the Browser News dataset. For fairness, both methods use the same pre-trained language model BGE to encode contents and tags, and the number of candidate tags is fixed at 20. We define two metrics to evaluate the performance: #Right means the average number of correct tags among the candidate tags, and HR#k means the proportion of cases where at least $k$ correct tags are hit in the candidate tag set.

Table 2: Performance comparison between different recall types on the Browser News dataset.

Recall Type  | #Right | HR#1   | HR#2   | HR#3
Match-based  | 4.48   | 0.9586 | 0.8841 | 0.7643
Ours         | 5.37   | 0.9745 | 0.9212 | 0.8425

As shown in Table 2, we find that our graph-based recall method can significantly improve the quality of candidate tags. The metrics #Right and HR#3 increase by 19.8% and 10.2%, respectively, which demonstrates that our method yields a more complete and comprehensive candidate tag set via the meta-path-based multi-hop recall mechanism. Moreover, the lift in HR#1 illustrates that our method can recall the correct tags even when the match-based method encounters challenges in hard cases and fails to select the relevant tags.

Besides, to verify the effectiveness and interpretability of our proposed graph-based tag recall, we randomly select some cases in our deployed tagging scenario and visualize the recall results in Figure 6. It can be observed that, when match-based recall fails to select the correct tags for some challenging cases, our method effectively retrieves accurate tags via C2C2T meta-path multi-hop traversal in the graph, thus avoiding missing correct tags due to the limited capabilities of SLMs.

Figure 6: Online cases verifying the effectiveness of graph-based tag recall. Case A: match-based recall yields "Green Tea", "Instant Food", and "Baking Snacks", while graph-based recall recovers "Green Beans"; Case B: match-based recall yields "Candy and Chocolate", "Instant Food", and "Baking Snacks", while graph-based recall recovers "Nuts"; Case C: match-based recall yields "Instant Food", "Porridge", and "Fruit", while graph-based recall recovers "Oat".

4.4 The Effectiveness of Knowledge-enhanced Tag Generation (RQ3)
In order to systematically evaluate the contribution of domain knowledge-enhanced tag generation (KETG) in our framework, we designed the following variants:
• LLM4Tag (w/o KETG) removes both long-term supervised knowledge injection (LSKI) and short-term retrieved knowledge injection (SRKI), and selects tags using native LLMs.
• LLM4Tag (w/o LSKI) removes LSKI and only maintains SRKI to inject the short-term domain-specific knowledge.
• LLM4Tag (w/o SRKI) removes SRKI and only maintains LSKI to inject the long-term domain-specific knowledge.
• LLM4Tag (Ours) incorporates both LSKI and SRKI to inject the long/short-term domain-specific knowledge.

Figure 7: Ablation study on the effectiveness of the knowledge-enhanced tag generation module in LLM4Tag (Acc@1, Acc@2, and Acc@3 for w/o KETG, w/o SRKI, w/o LSKI, and Ours).

Figure 7 presents the comparative results on the Browser News dataset, revealing three key findings:
• The complete framework achieves optimal performance, demonstrating the synergistic value of combining supervised fine-tuning in long-term supervised knowledge injection with non-parametric short-term retrieved knowledge injection.
• The removal of either component of the knowledge-enhanced tag generation module causes measurable degradation. Among them, the removal of long-term knowledge results in a greater decline, indicating that long-term knowledge may cover a broader range of domain-specific knowledge and highlighting the importance of SFT in model knowledge injection.
• The most basic variant (w/o KETG) exhibits the lowest performance, highlighting the crucial role of domain adaptation in specialized tagging tasks within information retrieval systems.

4.5 The Effectiveness of Tag Confidence Calibration (RQ4)
To validate the effectiveness of the tag confidence calibration module, we evaluate model performance on the Browser News dataset and use different confidence thresholds to achieve different pruning rates. Here we define a metric Coverage@k to evaluate the coverage rate of the final results as:
$\mathrm{Coverage@}k = \dfrac{1}{N'} \sum_{i=1}^{N'} \mathbb{I}(|T_i| \geq k)$, $\mathbb{I}(|T_i| \geq k) = \begin{cases} 1, & |T_i| \geq k, \\ 0, & \text{otherwise}, \end{cases}$  (13)
where $T_i$ is the set of result tags of the $i$-th content and $N'$ is the size of the testing dataset.

As shown in Figure 8, our experimental results indicate that when we increase the pruning rate by setting a larger confidence threshold, Acc@k is significantly boosted while Coverage@k continues to decrease, which demonstrates the effectiveness of our proposed tag confidence calibration module. Additionally, as the pruning rate increases, the accuracy gains gradually slow down. This characteristic allows us to set an appropriate confidence threshold in practical deployment scenarios to achieve a balance between prediction accuracy and tag coverage.

Figure 8: Model accuracy vs. tag coverage for different pruning rates (Acc@k and Coverage@k for k = 1, 2, 3 plotted against the pruning rate).

Furthermore, we randomly select some cases in our deployed tagging scenario and visualize them with confidence scores in Figure 9. We find that in Cases A, B, and C, irrelevant tags such as "Freight Train," "Bulldog," and "Religious Culture" receive low confidence scores and are calibrated away by our model, and in Case D, the weakly relevant tag "Seals", which is a non-primary entity in the figure, receives a medium confidence score and is ranked low in the final results, which further demonstrates the superiority of the tag confidence calibration module.

Figure 9: Online cases of the tag confidence calibration module; tags with low confidence are highlighted in red. Case A (Title: "Update of lego city!"): Lego 0.93046, Building Blocks 0.97737, Freight Train 0.04380. Case B (Title: "The winner of a bullfight."): Buffalo 0.88720, Bullfight 0.84662, Bulldog 0.13784. Case C (Title: "How much can a vampire hunter's kit cost?"): Antiques 0.92117, Repair Tools 0.31911, Religious Culture 0.03445. Case D (Title: "A group of orca is hunting."): Orca 0.90375, Hunting 0.89805, Seals 0.55549.

5 CONCLUSION
In this work, we propose an automatic tagging system based on Large Language Models (LLMs), named LLM4Tag, with three key modules, characterized by completeness, continuous knowledge evolution, and quantifiability. Firstly, the graph-based tag recall module is designed to construct a small-scale, relevant, and comprehensive candidate tag set from a massive tag repository. Next, the knowledge-enhanced tag generation module is proposed to generate accurate tags with knowledge injection. Finally, the tag confidence calibration module is employed to generate reliable tag confidence scores. The significant improvements in offline evaluations have demonstrated its superiority, and LLM4Tag has been deployed online for content tagging.

REFERENCES
[1] Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. 2023. GPT-4 technical report. arXiv preprint arXiv:2303.08774 (2023).
[2] Sajad Ahmadian, Milad Ahmadian, and Mahdi Jalili. 2022. A deep learning based trust- and tag-aware recommender system. Neurocomputing 488 (2022), 557–571.
[3] Kerstin Bischoff, Claudiu S Firan, Wolfgang Nejdl, and Raluca Paiu. 2008. Can all tags be used for search?. In Proceedings of the 17th ACM Conference on Information and Knowledge Management. 193–202.
[4] Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. 2020. Language models are few-shot learners. Advances in Neural Information Processing Systems 33 (2020), 1877–1901.
[5] Youngjin Chae and Thomas Davidson. 2023. Large language models for text classification: From zero-shot learning to fine-tuning. Open Science Foundation (2023).
[6] Keunwoo Choi, George Fazekas, and Mark Sandler. 2016. Automatic tagging using deep convolutional neural networks. arXiv preprint arXiv:1606.00298 (2016).
[7] Antonina Dattolo, Felice Ferrara, and Carlo Tasso. 2010. The role of tags for recommendation: a survey. In 3rd International Conference on Human System Interaction. IEEE, 548–555.
[8] Ernesto Diaz-Aviles, Mihai Georgescu, Avaré Stewart, and Wolfgang Nejdl. 2010. LDA for on-the-fly auto tagging. In Proceedings of the Fourth ACM Conference on Recommender Systems. 309–312.
[9] Ashraf Elnagar, Omar Einea, and Ridhwan Al-Debsi. 2019. Automatic text tagging of Arabic news articles using ensemble deep learning models. In Proceedings of the 3rd International Conference on Natural Language and Speech Processing. 59–66.
[10] Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, et al. 2025. DeepSeek-R1: Incentivizing reasoning capability in LLMs via reinforcement learning. arXiv preprint arXiv:2501.12948 (2025).
[11] Manish Gupta, Rui Li, Zhijun Yin, and Jiawei Han. 2010. Survey on social tagging techniques. ACM SIGKDD Explorations Newsletter 12, 1 (2010), 58–72.
[12] Tokutaka Hasegawa and Shun Shiramatsu. 2021. BERT-based tagging method for social issues in web articles. In Proceedings of Sixth International Congress on Information and Communication Technology: ICICT 2021, London, Volume 1. Springer, 897–909.
[13] Yuheng Huang, Jiayang Song, Zhijie Wang, Shengming Zhao, Huaming Chen, Felix Juefei-Xu, and Lei Ma. 2023. Look before you leap: An exploratory study of uncertainty measurement for large language models. arXiv preprint arXiv:2307.10236 (2023).
[14] Ziwei Ji, Nayeon Lee, Rita Frieske, Tiezheng Yu, Dan Su, Yan Xu, Etsuko Ishii, Ye Jin Bang, Andrea Madotto, and Pascale Fung. 2023. Survey of hallucination in natural language generation. Comput. Surveys 55, 12 (2023), 1–38.
[15] Chen Li, Yixiao Ge, Jiayong Mao, Dian Li, and Ying Shan. 2023. TagGPT: Large language models are zero-shot multimodal taggers. arXiv preprint arXiv:2304.03022 (2023).
[16] Shiyu Li, Yang Tang, Shizhe Chen, and Xi Chen. 2024. Conan-embedding: General text embedding with more and better negative samples. arXiv:2408.15710 [cs.CL] https://arxiv.org/abs/2408.15710
[17] Xin Li, Lei Guo, and Yihong Eric Zhao. 2008. Tag-based social interest discovery. In Proceedings of the 17th International Conference on World Wide Web. 675–684.
[18] Yangning Li, Shirong Ma, Xiaobin Wang, Shen Huang, Chengyue Jiang, Hai-Tao Zheng, Pengjun Xie, Fei Huang, and Yong Jiang. 2024. EcomGPT: Instruction-tuning large language models with chain-of-task tasks for e-commerce. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 38. 18582–18590.
[19] Zehan Li, Xin Zhang, Yanzhao Zhang, Dingkun Long, Pengjun Xie, and Meishan Zhang. 2023. Towards general text embeddings with multi-stage contrastive learning. arXiv preprint arXiv:2308.03281 (2023).
[20] Jianghao Lin, Xinyi Dai, Yunjia Xi, Weiwen Liu, Bo Chen, Hao Zhang, Yong Liu, Chuhan Wu, Xiangyang Li, Chenxu Zhu, et al. 2023. How can recommender systems benefit from large language models: A survey. ACM Transactions on Information Systems (2023).
[21] Pengfei Liu, Xipeng Qiu, and Xuanjing Huang. 2016. Recurrent neural network for text classification with multi-task learning. arXiv preprint arXiv:1605.05101 (2016).
[22] Gilad Mishne. 2006. AutoTag: a collaborative approach to automated tag assignment for weblog posts. In Proceedings of the 15th International Conference on World Wide Web. 953–954.
[23] Şükrü Ozan and D Emre Taşar. 2021. Auto-tagging of short conversational sentences using natural language processing methods. In 2021 29th Signal Processing and Communications Applications Conference (SIU). IEEE, 1–4.
[24] Shahzad Qaiser and Ramsha Ali. 2018. Text mining: use of TF-IDF to examine the relevance of words to documents. International Journal of Computer Applications 181, 1 (2018), 25–29.
[25] Xiaofei Sun, Xiaoya Li, Jiwei Li, Fei Wu, Shangwei Guo, Tianwei Zhang, and Guoyin Wang. 2023. Text classification via large language models. arXiv preprint arXiv:2305.08377 (2023).
[26] Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. 2023. LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023).
[27] Peilu Wang, Yao Qian, Frank K Soong, Lei He, and Hai Zhao. 2015. A unified tagging solution: Bidirectional LSTM recurrent neural network with word embedding. arXiv preprint arXiv:1511.00215 (2015).
[28] Yunhe Wang, Hanting Chen, Yehui Tang, Tianyu Guo, Kai Han, Ying Nie, Xutao Wang, Hailin Hu, Zheyuan Bai, Yun Wang, et al. 2023. PanGu-π: Enhancing language model architectures via nonlinearity compensation. arXiv preprint arXiv:2312.17276 (2023).
[29] Zhiqiang Wang, Yiran Pang, and Yanbin Lin. 2023. Large language models are zero-shot text classifiers. arXiv preprint arXiv:2312.01044 (2023).
[30] Shitao Xiao, Zheng Liu, Peitian Zhang, and Niklas Muennighoff. 2023. C-Pack: Packaged resources to advance general Chinese embedding. arXiv:2309.07597 [cs.CL]
[31] Wei Zeng, Xiaozhe Ren, Teng Su, Hui Wang, Yi Liao, Zhiwei Wang, Xin Jiang, ZhenZhang Yang, Kaisheng Wang, Xiaoda Zhang, et al. 2021. PanGu-α: Large-scale autoregressive pretrained Chinese language models with auto-parallel computation. arXiv preprint arXiv:2104.12369 (2021).
[32] Ye Zhang and Byron Wallace. 2015. A sensitivity analysis of (and practitioners' guide to) convolutional neural networks for sentence classification. arXiv preprint arXiv:1510.03820 (2015).
[33] Zi-Ke Zhang, Tao Zhou, and Yi-Cheng Zhang. 2011. Tag-aware recommender systems: a state-of-the-art survey. Journal of Computer Science and Technology 26, 5 (2011), 767–777.
[34] Yutao Zhu, Huaying Yuan, Shuting Wang, Jiongnan Liu, Wenhan Liu, Chenlong Deng, Haonan Chen, Zheng Liu, Zhicheng Dou, and Ji-Rong Wen. 2023. Large language models for information retrieval: A survey. arXiv preprint arXiv:2308.07107 (2023).
[35] Yaxin Zhu and Hamed Zamani. 2023. ICXML: An in-context learning framework for zero-shot extreme multi-label classification. arXiv preprint arXiv:2311.09649 (2023).