LLM4Tag: Automatic Tagging System For Information Retrieval
Huifeng Guo
[email protected]
Huawei Noah’s Ark Lab, China
ABSTRACT
Tagging systems play an essential role in various information retrieval applications such as search engines and recommender systems. Recently, Large Language Models (LLMs) have been applied in tagging systems due to their extensive world knowledge, semantic understanding, and reasoning capabilities. Despite achieving remarkable performance, existing methods still have limitations, including difficulties in retrieving relevant candidate tags comprehensively, challenges in adapting to emerging domain-specific knowledge, and the lack of reliable tag confidence quantification. To address these three limitations, we propose an automatic tagging system, LLM4Tag. First, a graph-based tag recall module is designed to effectively and comprehensively construct a small-scale, highly relevant candidate tag set. Subsequently, a knowledge-enhanced tag generation module is employed to generate accurate tags with long-term and short-term knowledge injection. Finally, a tag confidence calibration module is introduced to generate reliable tag confidence scores. Extensive experiments over three large-scale industrial datasets show that LLM4Tag significantly outperforms the state-of-the-art baselines, and LLM4Tag has been deployed online for content tagging to serve hundreds of millions of users.

CCS CONCEPTS
• Information Retrieval → Tagging Systems.

KEYWORDS
Tagging Systems; Large Language Models; Information Retrieval

ACM Reference Format:
Ruiming Tang, Chenxu Zhu, Bo Chen, Weipeng Zhang, Menghui Zhu, Xinyi Dai, and Huifeng Guo. 2025. LLM4Tag: Automatic Tagging System for Information Retrieval via Large Language Models. In KDD '2025, August 03-07, 2025, Toronto, ON, Canada. ACM, New York, NY, USA, 9 pages. https://doi.org/10.1145/nnnnnnn.nnnnnnn

1 INTRODUCTION
Tagging is the process of assigning tags, such as keywords or labels, to digital content, products, or users to facilitate organization, retrieval, and analysis. Tags serve as descriptors that summarize key attributes or themes, enabling efficient categorization and searchability, and they play a crucial role in information retrieval systems such as search engines, recommender systems, content management, and social networks [3, 11, 17, 33]. In information retrieval systems, tags are widely used at various stages, including content distribution strategies, ranking algorithms, and operational decision-making processes [2, 7]. Therefore, tagging systems must not only achieve high accuracy and coverage, but also provide interpretability and reliable confidence.

Before the era of Large Language Models (LLMs), mainstream tagging methods mainly included statistics-based methods (e.g., TF-IDF-based [24], LDA-based [8]), supervised classification-based methods (e.g., CNN-based [9, 32], RNN-based [21, 27]), and pre-trained model-based methods (e.g., BERT-based [12, 23, 30]). However, limited by their model capacity, these methods cannot achieve satisfactory results, especially for complex contents. Besides, they heavily rely on annotated training data, resulting in limited generalization and transferability.

The rise of LLMs, with their extensive world knowledge, powerful semantic understanding, and reasoning capabilities, has significantly enhanced the effectiveness of tagging systems. LLM4TC [5] employs LLMs directly as tagging classifiers and leverages annotated data to fine-tune LLMs. TagGPT [15] further introduces a match-based recall to filter out a small-scale tag set to address the limited input length of LLMs. ICXML [35] proposes an in-context learning algorithm to guide LLMs to further improve performance.

However, existing LLM-enhanced tagging algorithms exhibit several critical limitations that require improvement (shown in Figure 1):
(L1) Constrained by the input length and inference efficiency of LLMs, existing methods adopt simple match-based recall to filter out a small-scale candidate tag set [15, 35], which is prone to missing relevant tags, thereby reducing accuracy.
(L2) General-purpose LLMs pre-trained on publicly available corpora exhibit limitations in comprehending emerging domain-specific knowledge within information retrieval, leading to lower accuracy in challenging cases [5, 18].
(L3) Due to hallucination and uncertainty [13, 14], LLMs cannot accurately quantify tag confidence, which is crucial for information retrieval applications.

Figure 1: LLM-enhanced tagging systems and their three limitations. (L1) Simple match-based recall is prone to missing relevant tags; (L2) The emerging domain-specific knowledge may not align with the pre-trained knowledge of LLMs; (L3) LLMs cannot accurately quantify tag confidence.

To address the three limitations of existing approaches, we propose an automatic tagging system called LLM4Tag, which consists of three key modules. Specifically, to improve the completeness of candidate tags (L1), we propose a graph-based tag recall module designed to construct a small-scale, highly relevant candidate tag set from a massive tag repository efficiently and comprehensively. To enhance the domain-specific knowledge and the adaptability to emerging information of general-purpose LLMs (L2), a knowledge-enhanced tag generation module that integrates long-term supervised knowledge injection and short-term retrieved knowledge injection is designed to generate accurate tags. Moreover, a tag confidence calibration module is introduced to generate reliable tag confidence scores, ensuring more robust and trustworthy results (L3).

To summarize, the main contributions of this paper can be highlighted as follows:
• We propose an LLM-enhanced tagging framework, LLM4Tag, characterized by completeness, continual knowledge evolution, and quantifiability.
• To address the limitations of existing approaches, LLM4Tag integrates three key modules: graph-based tag recall, knowledge-enhanced tag generation, and tag confidence calibration, ensuring the generation of accurate and reliable tags.
• LLM4Tag achieves state-of-the-art performance on three large-scale industrial datasets, with detailed analysis that provides a deeper understanding of model performance. Moreover, LLM4Tag has been deployed online for content tagging, serving hundreds of millions of users.

2 RELATED WORK
In this section, we briefly review traditional tagging systems and LLM-enhanced tagging systems.

2.1 Traditional Tagging Systems
Traditional tagging systems [6, 11, 22] generally employ multi-label classification models, which utilize human-annotated tags as ground-truth labels and employ content descriptions as input for prediction. Qaiser et al. [24] utilize TF-IDF to categorize tags, while Diaz et al. [8] employ LDA to automatically tag resources based on the most likely tags derived from the identified latent topics. The advent of deep learning has led to the proposal of RNN-based [21] and CNN-based [32] methods for multi-label learning, which are directly applied to tagging systems [9, 27]. Hasegawa et al. [12] further adopt the BERT pre-training technique in their tagging systems. Recently, with the growing popularity of pre-trained Small Language Models (SLMs), numerous pre-trained embedding models, such as BGE [30], GTE [19], and CONAN [16], have been proposed and directly employed in tagging systems through domain knowledge fine-tuning.

Nonetheless, the capabilities of these models are constrained by their limited model capacity, particularly in the presence of complex content. Additionally, they depend excessively on annotated training data, resulting in sub-optimal generalization and transferability.

2.2 LLM-Enhanced Tagging Systems
With Large Language Models (LLMs) achieving remarkable breakthroughs in natural language processing [1, 4, 10, 26] and information retrieval systems [20, 34], LLM-enhanced tagging systems have received much attention and have been actively explored recently [5, 15, 25, 29, 35]. Wang et al. [29] employ LLMs as a direct tagging classifier, while Sun et al. [25] introduce clue and reasoning prompts to further enhance performance. LLM4TC [5] studies diverse LLM architectures and leverages annotated samples to fine-tune the LLMs. TagGPT [15] introduces an early match-based recall mechanism to generate candidate tags from a large-scale tag repository with textual clues from multimodal data. ICXML [35] proposes a two-stage framework through in-context learning to guide LLMs to align with the tag space.

However, the aforementioned works suffer from three critical limitations (mentioned in Section 1): (1) difficulties in comprehensively retrieving relevant candidate tags, (2) challenges in adapting to emerging domain-specific knowledge, and (3) the lack of reliable tag confidence quantification. To this end, we propose LLM4Tag, an automatic tagging system, to address the aforementioned limitations.

3 METHODOLOGY
In this section, we present our proposed LLM4Tag framework in detail. We start by providing an overview of the proposed framework and then give detailed descriptions of the three modules in LLM4Tag.
Figure 2: The overall framework of LLM4Tag, consisting of three modules: graph-based tag recall module, knowledge-enhanced tag generation module, and tag confidence calibration module.
3.1 Overview
As illustrated in Figure 2, our proposed LLM4Tag framework consists of three major modules: (1) Graph-based Tag Recall, (2) Knowledge-enhanced Tag Generation, and (3) Tag Confidence Calibration, which respectively provide completeness, continual knowledge evolution, and quantifiability.

Graph-based Tag Recall module is responsible for retrieving a small-scale, highly relevant candidate tag set from a massive tag repository. Based on a scalable content-tag graph constructed dynamically, graph-based tag recall is utilized to fetch dozens of relevant tags for each content.

Knowledge-enhanced Tag Generation module is designed to accurately generate tags for each content via Large Language Models (LLMs). To address the lack of domain-specific and emerging knowledge in general-purpose LLMs, this module implements a scheme integrating the injection of both long-term and short-term domain knowledge, thereby achieving continual knowledge evolution.

Tag Confidence Calibration module aims to generate a quantifiable and reliable confidence score for each tag, thus alleviating the issues of hallucination and uncertainty in LLMs. Furthermore, the confidence score can be employed as a relevance metric for downstream information retrieval tasks.

3.2 Graph-based Tag Recall
Given the considerable magnitude of tags (millions) in industrial information retrieval systems, the direct integration of the whole tag repository into LLMs is impractical due to the constrained context window and inference efficiency of LLMs. Existing approaches [15, 35] adopt simple match-based tag recall to filter out a small-scale candidate tag set based on small language models (SLMs), such as BGE [30]. However, they are prone to missing relevant tags due to the limited capabilities of SLMs. To address this issue and improve the comprehensiveness of the retrieved candidate tags, we construct a semantic graph globally and propose a graph-based tag recall module.

Firstly, we initialize an undirected graph $\mathcal{G}$ with contents and tags as:
$\mathcal{G} = \{\mathcal{V}, \mathcal{E}\}$,   (1)
where the vertex set $\mathcal{V} = \{\mathcal{C}, \mathcal{T}\}$ is the set of existing content vertices $\mathcal{C}$ and all tag vertices $\mathcal{T}$. As for the edge set $\mathcal{E}$, it contains two types of edges, called Deterministic Edges and Similarity Edges.

Deterministic Edges only connect a content vertex $c$ and a tag vertex $t$, formulated as $e^{d}_{c\text{-}t}$, which indicates that content $c$ is labeled with tag $t$ based on historical annotation data. To ease the high sparsity of the deterministic edges in the graph $\mathcal{G}$, we further introduce semantic similarity-based edges (Similarity Edges) that connect not only a content vertex $c$ and a tag vertex $t$, but also different content vertices, formulated as $e^{s}_{c\text{-}t}$ and $e^{s}_{c\text{-}c}$, respectively.

Specifically, for the $i$-th vertex $v^{i} \in \mathcal{V}$ in graph $\mathcal{G}$, we summarize all textual information (i.e., title and category for a content, description for a tag) as $text^{i}$ and vectorize it with an encoder to get a semantic representation $\boldsymbol{r}^{i}$:
$\boldsymbol{r}^{i} = \mathrm{Encoder}(text^{i})$,   (2)
where Encoder is a small language model, such as BGE [30]. Then the similarity distance of two different vertices $v^{i}, v^{j}$ can be computed as:
$\mathrm{Dis}(v^{i}, v^{j}) = \dfrac{\boldsymbol{r}^{i} \cdot \boldsymbol{r}^{j}}{\|\boldsymbol{r}^{i}\| \, \|\boldsymbol{r}^{j}\|}$.   (3)

After obtaining the similarity estimations, we can use a threshold-based method to determine the similarity edge construction, i.e.,
• $e^{s}_{c\text{-}t}$ connects the content $c$ and the tag $t$ if the similarity distance between them exceeds $\delta_{c\text{-}t}$.
• $e^{s}_{c\text{-}c}$ connects two similar contents when their similarity distance exceeds $\delta_{c\text{-}c}$.
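As a concrete illustration, a minimal Python sketch of this threshold-based edge construction is given below, assuming an off-the-shelf BGE-style sentence encoder; the encoder checkpoint, data structures, and the brute-force pairwise loop are assumptions for exposition rather than the deployed implementation (a production system over millions of vertices would use approximate nearest-neighbor search).

```python
# Minimal sketch of similarity-edge construction (Eqs. (1)-(3)); encoder
# checkpoint and data structures are illustrative assumptions.
from itertools import combinations
import numpy as np
from sentence_transformers import SentenceTransformer  # assumed encoder backend

DELTA_CT, DELTA_CC = 0.5, 0.8  # thresholds delta_{c-t}, delta_{c-c} (Sec. 4.1.4)

def build_similarity_edges(contents: dict, tags: dict):
    """contents/tags: vertex id -> summarized text (title/category or tag description)."""
    encoder = SentenceTransformer("BAAI/bge-base-en-v1.5")
    ids = list(contents) + list(tags)
    texts = [contents.get(v) or tags.get(v) for v in ids]
    # Normalized embeddings, so a dot product equals the cosine similarity of Eq. (3).
    reps = dict(zip(ids, encoder.encode(texts, normalize_embeddings=True)))

    edges = []
    for u, v in combinations(ids, 2):          # brute force here; ANN search in practice
        sim = float(np.dot(reps[u], reps[v]))
        if u in contents and v in tags and sim > DELTA_CT:
            edges.append((u, v, "sim"))        # similarity edge e^s_{c-t}
        elif u in contents and v in contents and sim > DELTA_CC:
            edges.append((u, v, "sim"))        # similarity edge e^s_{c-c}
    return edges
```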
In this way, we can construct a basic content-tag graph with deterministic/similarity edges. Then, when a new content $c$ appears
that needs to be tagged, we dynamically insert it into this graph by adding similarity edges. Next, we define two types of meta-paths (i.e., the C2T meta-path and the C2C2T meta-path) and adopt a meta-path-based approach to recall candidate tags.

C2T Meta-Path: Based on the given content $c$, we first recall the tags that are connected directly to $c$ as candidate tags. The meta-path can be defined as:
$p_{C2T} = c \xrightarrow{s} t$,   (4)
where $\xrightarrow{s}$ is the similarity edge.

C2C2T Meta-Path: C2C2T contains two sub-procedures: C2C and C2T. C2C is aimed at discovering similar contents, while C2T further attempts to recall the deterministic tags from these similar contents. The meta-path can be formulated as:
$p_{C2C2T} = c \xrightarrow{s} c \xrightarrow{d} t$,   (5)
where $\xrightarrow{d}$ is the deterministic edge and $\xrightarrow{s}$ is the similarity edge.

With these two types of meta-paths, we can generate a more comprehensive candidate tag set for content $c$ as
$\Phi(c) = \Phi_{C2T}(c) \cup \Phi_{C2C2T}(c)$,   (6)
where $\Phi_{C2T}(c)$ is retrieved by the C2T meta-path and $\Phi_{C2C2T}(c)$ is retrieved by the C2C2T meta-path. Notably, the final tagging results of LLM4Tag for the content $c$ will also be added to the graph as deterministic edges, enabling dynamic scalability of the graph.

Compared to simple match-based tag recall, our graph-based tag recall leverages semantic similarity to construct a global content-tag graph and incorporates a meta-path-based multi-hop recall mechanism to enhance candidate tag completeness, which will be demonstrated in Section 4.3.
mechanism to enhance candidate tags completeness, which will be corporating short-term knowledge through LLM fine-tuning is
demonstrated in Sec 4.3. highly resource-intensive, especially given the rapid emergence of
new domain knowledge. Additionally, this approach suffers from
3.3 Knowledge-enhanced Tag Generation poor timeliness, making it more challenging to adapt to rapidly
After obtaining the candidate tag set, we can directly use the Large evolving content in information retrieval systems, particularly for
Language Models (LLMs) to select the most appropriate tags. How- emerging hot topics.
ever, due to the diversity and industry-specific nature of the infor- Therefore, we further introduce a short-term retrieved knowl-
mation retrieval system applications, domain-specific knowledge edge injection (SRKI). Specifically, we derive two retrieved knowl-
varies significantly across different scenarios. That is, the same edge injection methods: retrieved in-context learning injection and
content and tags may have distinct definitions and interpretations retrieved augmented generation injection.
depending on the specific application context. Furthermore, the Retrieved In-Context Learning Injection. We first construct
domain-specific knowledge is emerged continually and constantly a retrievable sample knowledge base (including contents and their
at an expeditious pace. As a result, the general-purpose LLMs have correct/incorrect annotated tags) and continuously append newly
difficulty in understanding the emerging domain-specific informa- emerging samples. Then, given the target content 𝑐, this composi-
tion, such as newly listed products, emerging hot news, or newly tion retrieves 𝑛 relevant samples from the sample knowledge base.
added tags, leading to a lower accuracy on challenging cases. This approach not only leverages the few-shot in-context learning
To address the lack of emerging domain-specific information in capability of LLMs but also enables them to quickly adapt to emerg-
LLMs, we devise a knowledge-enhanced tag generation scheme that ing domain knowledge, enhancing tagging accuracy for challenging
takes into account both long-term and short-term domain-specific cases.
knowledge by two key components, namely Long-term Supervised Retrieved Augmented Generation Injection. Given the con-
Knowledge Injection (LSKI), Short-term Retrieved Knowledge Injection tent 𝑐 and the candidate tag set Φ(𝑐), this composition retrieves
(SRKI). relevant descriptive corpus from web search and domain knowl-
edge base. It can retrieve extensive information that assists LLMs in
3.3.1 Long-term Supervised Knowledge Injection. For long-term understanding unknown domain-specific knowledge or new knowl-
domain-specific knowledge, we first construct a training dataset D edge, such as the definition of terminology in the content/tag or
and adopt a basic prompt template as 𝑇 𝑒𝑚𝑝𝑙𝑎𝑡𝑒𝑏 for tag generation some manually defined tagging rules.
Retrieval Enhanced Tag Generation Prompt Template
You are an advertising tag bot. Considering an advertisement creative with {image} and {product description}. Domain-specific knowledge: (1) Similar advertisement creatives and corresponding tags: {(ad1, tag1), (ad2, tag2), ...}; (2) Information related to the advertisement and tags: {definition of terminology, related tagging rules, ...}. Now you need to select the most relevant tag from {candidate tag set}.

Figure 4: Prompt template for retrieval-enhanced tag generation in the advertisement creatives tagging scenario.

After obtaining the retrieved knowledge, we design a prompt template, $Template_r$ (shown in Figure 4), to integrate the knowledge with the content $c$ and the candidate tag set $\Phi(c)$, providing in-context guidance for the LLM to predict the most appropriate tags for content $c$ as:
$\Gamma(c) = \mathrm{LLM}(Template_r(c, \Phi(c), R(c))) = \{t^{c}_{1}, t^{c}_{2}, \cdots, t^{c}_{m}\}$,   (9)
where $R(c)$ is the retrieved knowledge above, and $m$ is the number of appropriate tags generated by the LLM.
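For illustration, the following sketch assembles the retrieval-enhanced prompt of Figure 4 from the two SRKI sources; the retriever interfaces and content fields are assumptions made for the example.

```python
# Sketch of assembling Template_r (Figure 4 / Eq. (9)); retriever interfaces
# and field names are illustrative assumptions.
def build_template_r(content, candidate_tags, icl_retriever, rag_retriever, n_shots=3):
    # (1) retrieved in-context learning injection: similar contents and their tags
    shots = icl_retriever(content, k=n_shots)             # [(content_text, tags), ...]
    shot_block = "; ".join(f"({c}, {t})" for c, t in shots)
    # (2) retrieved augmented generation injection: definitions, tagging rules, ...
    passages = rag_retriever(content, candidate_tags, k=n_shots)
    knowledge_block = " ".join(passages)
    return (
        "You are an advertising tag bot. Considering an advertisement creative "
        f"with {content['image_caption']} and {content['description']}. "
        "Domain-specific knowledge: "
        f"(1) Similar advertisement creatives and corresponding tags: {shot_block}; "
        f"(2) Information related to the advertisement and tags: {knowledge_block}. "
        f"Now you need to select the most relevant tag from {', '.join(sorted(candidate_tags))}."
    )
```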
3.4 Tag Confidence Calibration
After tag generation, there still exist two serious problems for real-world applications: (1) hallucination due to the uncertainty of LLMs, which leads to generating irrelevant or wrong tags; (2) the necessity of assigning a quantifiable relevance score to each tag for the sake of downstream usage in information retrieval systems (e.g., recall and marketing).

Tag Confidence Judgment Prompt Template
You are an advertising tag relevance judgment bot. Given an advertisement creative with {image} and {product description} and a tag with {tag description}, judge whether the advertisement is relevant to the tag. Answer with "Yes" or "No".

Figure 5: Prompt template for tag confidence judgment in the advertisement creatives tagging scenario.

To handle these two problems, the tag confidence calibration module is adopted. Specifically, given a target content $c$ and a certain tag $t^{c} \in \Gamma(c)$, we derive a prompt template, $Template_c$ (shown in Figure 5), to leverage the reasoning ability of LLMs for a tag confidence judgment task, i.e., whether $c$ and $t^{c}$ are relevant. Then we extract the probability of the token in the LLM result to get a confidence score $\mathrm{Conf}(c, t^{c})$:
$\boldsymbol{s} = \mathrm{LLM}(Template_c(c, t^{c})) \in \mathbb{R}^{V}$,
$\mathrm{Conf}(c, t^{c}) = \dfrac{\exp(\boldsymbol{s}[\text{"Yes"}])}{\exp(\boldsymbol{s}[\text{"Yes"}]) + \exp(\boldsymbol{s}[\text{"No"}])} \in (0, 1)$,   (10)
where $\boldsymbol{s}$ is the score over all tokens, and $V$ is the vocabulary size of the LLM.

After obtaining the confidence score $\mathrm{Conf}(c, t)$, we implement self-calibration for the results by eliminating those tags with low confidence, achieving better performance by mitigating the hallucination problem. Furthermore, this confidence score can be directly used as a relevance metric for downstream tasks.
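A minimal sketch of the confidence computation in Eq. (10) with a Hugging Face-style causal LM is given below; the model, tokenizer, and the exact handling of the "Yes"/"No" tokens are assumptions (some tokenizers prepend a space marker to these tokens, which would need to be accounted for).

```python
# Sketch of Eq. (10): restrict the next-token distribution to "Yes"/"No".
import torch

def tag_confidence(model, tokenizer, judgment_prompt: str) -> float:
    inputs = tokenizer(judgment_prompt, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits           # [1, seq_len, vocab_size]
    next_token_logits = logits[0, -1]             # scores s over the vocabulary
    yes_id = tokenizer.convert_tokens_to_ids("Yes")
    no_id = tokenizer.convert_tokens_to_ids("No")
    # Softmax restricted to the "Yes"/"No" tokens, as in Eq. (10).
    conf = torch.softmax(next_token_logits[[yes_id, no_id]], dim=0)[0]
    return conf.item()
```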
Tag Confidence Training. In order to make the confidence score more consistent with the requirements of information retrieval, we construct a confidence training dataset $\mathcal{D}'$ as:
$\mathcal{D}' = \{(x'_i, y'_i)\}_{i=1}^{M}$, $\quad x'_i = Template_c(c_i, t_i)$, $\quad y'_i \in \{\text{"Yes"}, \text{"No"}\}$,   (11)
where $y'_i$ is annotated by experts and $M$ is the size of the training dataset. Then we leverage the causal language modeling objective, which is the same as Equation (8), to perform supervised fine-tuning. In that case, the confidence score predicted by this module aligns with the requirements of the information retrieval systems, thereby facilitating the calibration of incorrect tags.

4 EXPERIMENTS
In this section, we conduct extensive experiments to answer the following research questions:
RQ1 How does LLM4Tag perform in comparison to existing tagging algorithms?
RQ2 How effective is the graph-based tag recall module?
RQ3 Does the injection of domain-specific knowledge enhance the tagging performance?
RQ4 What is the impact of the tag confidence calibration module?

4.1 Experimental Settings
4.1.1 Dataset. We conducted experiments on a mainstream information distribution platform with hundreds of millions of users and sampled three representative industrial datasets from online logs to ensure consistency in data distribution, containing two types of tasks: (1) a multi-tag task (Browser News) and (2) single-tag tasks (Advertisement Creatives and Search Query).
• Browser News dataset includes popular news articles and user-generated videos, primarily in the form of text, images, and short videos. This is a multi-tag task, wherein the objective is to select multiple appropriate tags for each content from a massive tag repository (more than 100,000 tags). Around 30,000 contents are randomly collected as the testing dataset through expert annotation.
• Advertisement Creatives dataset consists of ad creatives, including cover images, copywriting, and product descriptions from advertisers. The task for this dataset is a single-tag task, where we need to select the tag most relevant to the advertisement from a well-designed tag repository (more than 1,000 tags); we collect around 10,000 advertisements randomly as the testing dataset through expert annotation.
• Search Query dataset primarily consists of user search queries from a web search engine, used for user intent classification. The task for this dataset is also a single-tag task, where the most probable intent needs to be selected as the tag for each query. The size of the tag repository is about 1,000, and 2,000 queries are collected and manually tagged as the testing dataset.
Table 1: Performance comparison of different methods. Note that different tasks, the multi-tag task (Browser News) and the single-tag tasks (Advertisement Creatives and Search Query), have different metrics. The best result is given in bold, and the second-best value is underlined. "RI" indicates the relative improvement of LLM4Tag over the corresponding baseline.
4.1.2 Baselines. To evaluate the superiority and effectiveness of our proposed model, we compare LLM4Tag with two classes of existing models:
• Traditional Methods encode the contents and tags by leveraging pre-trained language models and select the most relevant tags for each content according to cosine distance. Here we compare three different pre-trained language models. BGE [30] pre-trains the model with RetroMAE on large-scale pair data using contrastive learning. GTE [19] further proposes multi-stage contrastive learning to train the text embedding. CONAN [16] maximizes the utilization of more and higher-quality negative examples to pre-train the model.
• LLM-Enhanced Methods utilize large language models to assist tag generation. TagGPT [15] proposes a zero-shot automated tag extraction system through prompt engineering via LLMs. ICXML [35] introduces a two-stage tag generation framework, involving generation-based label shortlisting and label reranking through in-context learning. LLM4TC [5] further leverages fine-tuning with domain knowledge to improve the performance of tag generation.

4.1.3 Evaluation Metrics. For the multi-tag task, due to the excessive number of tags (millions), we cannot annotate all the correct tags and thus only directly judge whether the results generated by the model are correct or not. In this case, we define Acc@k to evaluate the performance:
$\mathrm{Acc@}k = \dfrac{1}{N'} \sum_{i=1}^{N'} \sum_{j=1}^{k'} \dfrac{\mathbb{I}(T_i[j])}{k'}$, $\quad k' = \min(k, \mathrm{len}(T_i))$,   (12)
$\mathbb{I}(T_i[j]) = \begin{cases} 1, & T_i[j] \text{ is right}, \\ 0, & \text{otherwise}, \end{cases}$
where $T_i[j]$ is the $j$-th generated tag of the $i$-th content and $N'$ is the size of the test dataset. It is worth noting that some contents do not have $k$ proper tags, thus we allow the number of generated tags to be less than $k$.
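The Acc@k metric translates directly into code; the sketch below assumes each test content comes with its generated tags and expert right/wrong judgments, and that every content has at least one generated tag.

```python
# Direct transcription of the Acc@k metric in Eq. (12).
def acc_at_k(results, k):
    """results: list per content of [(tag, is_right), ...] in generation order."""
    total = 0.0
    for judged_tags in results:
        k_prime = min(k, len(judged_tags))          # k' = min(k, len(T_i))
        hits = sum(1 for _, is_right in judged_tags[:k_prime] if is_right)
        total += hits / k_prime
    return total / len(results)
```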
For the single-tag task, we adopt Precision, Recall, and F1, following previous works [5, 15]. Higher values of these metrics indicate better performance.

Moreover, we report the Relative Improvement (RI) to represent the relative improvement our model achieves over the compared models. Here we calculate the average RI over all of the above metrics.

4.1.4 Implementation Details. For the LLM, we select Huawei's large language model PanGu-7B [28, 31]. For the graph-based tag recall module, we choose BGE [30] as the encoder model. $\delta_{c\text{-}t}$ and $\delta_{c\text{-}c}$ are set as 0.5 and 0.8, respectively. Besides, we set maximum recall numbers for the different meta-paths: 15 for the C2T meta-path and 5 for the C2C2T meta-path. For the knowledge-enhanced tag generation module, the training dataset for long-term supervised knowledge injection contains approximately 10,000 annotated samples, and the tuning is performed every two weeks. As for the short-term retrieved knowledge injection, the retrievable database is updated in real time, and we retrieve at most 3 relevant samples/segments for in-context learning injection and augmented generation injection, respectively. For the tag confidence calibration module, we eliminate tags with confidence scores less than 0.5 and rank the remaining tags in order of confidence score as the result.
4.2 Result Comparison & Deployment (RQ1)
Table 1 summarizes the performance of the different methods on three industrial datasets, from which we have the following observations:
• Leveraging large language models (LLMs) brings benefit to model performance. TagGPT, ICXML, and LLM4TC utilize LLMs to assist tag generation, achieving better performance than the small language model (SLM) baselines, such as BGE, GTE, and CONAN. This phenomenon indicates that the world knowledge and reasoning capabilities of LLMs enable better content understanding and tag generation, significantly improving tagging effectiveness.
• Introducing domain knowledge can significantly improve performance. Although LLMs benefit from general world knowledge, there remains a significant gap compared with domain-specific knowledge. Therefore, LLM4TC injects domain knowledge by fine-tuning the LLMs and achieves better performance
than other baselines in all metrics, which validates the importance of domain knowledge injection.
• The superior performance of LLM4Tag. We can observe from Table 1 that LLM4Tag yields the best performance on all datasets consistently and significantly, validating the superior effectiveness of our proposed LLM4Tag. Concretely, LLM4Tag beats the best baseline by 3.7%, 6.1%, and 4.5% on the three datasets, respectively.
• Notably, LLM4Tag has been deployed online and covers all the traffic. We randomly resampled the online data, and the online report shows consistency between the improvements in the online metrics and those observed in the offline evaluation. Now, LLM4Tag has been deployed in the content tagging systems of these three online applications, serving hundreds of millions of users daily.

[Figure: case study comparing match-based recall and graph-based recall on two cases (Case A and Case B).]

... down. This characteristic allows us to set an appropriate confidence threshold in practical deployment scenarios to achieve a balance between prediction accuracy and tag coverage.