
Aligning Large Language Models with Recommendation Knowledge

Yuwei Cao¹*, Nikhil Mehta², Xinyang Yi², Raghunandan Keshavan³, Lukasz Heldt³, Lichan Hong², Ed H. Chi², and Maheswaran Sathiamoorthy⁴

¹University of Illinois Chicago   ²Google DeepMind   ³Google
¹[email protected]   ²{nikhilmehta, xinyang}@google.com   ⁴[email protected]

*Work done when interning at Google.

Abstract

Large language models (LLMs) have recently been used as backbones for recommender systems. However, their performance often lags behind conventional methods in standard tasks like retrieval. We attribute this to a mismatch between LLMs' knowledge and the knowledge crucial for effective recommendations. While LLMs excel at natural language reasoning, they cannot model complex user-item interactions inherent in recommendation tasks. We propose bridging the knowledge gap and equipping LLMs with recommendation-specific knowledge to address this. Operations such as Masked Item Modeling (MIM) and Bayesian Personalized Ranking (BPR) have found success in conventional recommender systems. Inspired by this, we simulate these operations through natural language to generate auxiliary-task data samples that encode item correlations and user preferences. Fine-tuning LLMs on such auxiliary-task data samples and incorporating more informative recommendation-task data samples facilitates the injection of recommendation-specific knowledge into LLMs. Extensive experiments across retrieval, ranking, and rating prediction tasks on LLMs such as FLAN-T5-Base and FLAN-T5-XL show the effectiveness of our technique in domains such as Amazon Toys & Games, Beauty, and Sports & Outdoors. Notably, our method outperforms conventional and LLM-based baselines, including the current SOTA, by significant margins in retrieval, showcasing its potential for enhancing recommendation quality.

1 Introduction

Large language models (LLMs) exhibit strong generalization abilities through zero-shot learning, in-context learning (Brown et al., 2020), fine-tuning, and instruction tuning (Wei et al., 2022). Encouraged by this, recent studies explore the use of LLMs as backbones in recommendation (Kang et al., 2023; Geng et al., 2022; Zhang et al., 2023; Bao et al., 2023). Despite their great potential, LLMs are inferior to supervised recommenders (He et al., 2017; Rendle et al., 2009) in recommendation tasks such as rating prediction under zero-shot and few-shot in-context learning settings (Kang et al., 2023). We hypothesize that this stems from a gap between LLMs' knowledge and recommendation knowledge: LLMs are proficient at natural language reasoning, while recommendation involves modeling complex user-item interactions. In this work, we propose to mitigate this gap by fine-tuning LLMs with data samples that encode recommendation knowledge.

Recent works (Geng et al., 2022; Zhang et al., 2023; Bao et al., 2023) show that certain recommendation knowledge can be introduced into LLMs through instruction tuning. As shown in Figure 1(a), their training data samples, which we refer to as recommendation-task data samples, primarily help LLMs understand the recommendation tasks by providing instructions on what to do (e.g., "Pick an item for the user from the following candidates."). In terms of modeling the target recommendation domain, however, they present raw user and item features for personalization (e.g., the user's ID or the IDs of the items they recently interacted with), which are insufficient for LLMs to fully comprehend the target domain.

Considering the aforementioned limitations of using LLMs as recommenders, we propose a novel approach to generate additional fine-tuning data samples for LLMs that effectively encode recommendation knowledge, particularly focusing on item correlations within the target domain. We refer to these generated data samples as auxiliary-task data samples, as they are used as auxiliary tasks in addition to the recommendation tasks.
a) Recommendation-task data samples of the existing studies:

1) P5 Retrieval (Sequential Recommendation)
Input: I find the purchase history list of user_15466: 4110 -> 4467 -> 4468 -> 4472. I wonder what is the next item to recommend to the user. Can you help me decide?
Output: 1581

2) P5 Ranking (Direct Recommendation)
Input: Pick the most suitable item from the following list and recommend to user_250: 4915, 1823, 3112, 3821, 3773, 520, …
Output: 520

3) P5 Rating Prediction
Input: What star rating do you think user_23 will give item_7391?
Output: 5.0

4) InstructRec Ranking (type <P1, I0, T3>)
Input: The user has purchased these items: <historical interactions>. Please respond to this user by selecting items from the candidates: <candidate items>.
Output: <target item>

5) TALLRec Rating Prediction
Input: Given the user's historical interactions, please determine whether the user will enjoy the target new movie by answering "Yes" or "No". User's liked items: GodFather. User's disliked items: Star Wars. Target new movie: Iron Man.
Output: No.

b) Our recommendation-task and auxiliary-task data samples:

1) Retrieval
Input: A user has purchased the following products: Item ID: I811, Title: Women's Gel-Excite Running Shoes; Item ID: I1014, Title: Women's Dry-fit Tempo Shorts; … What would the user buy next?
Output: I10145

2) Ranking
Input: A user has purchased the following products: Item ID: I811, Title: Women's Gel-Excite Running Shoes; … Which of the following candidate items would you recommend the user to buy next? Candidate items are: I8, I92, I10145, …
Output: I10145

3) Rating Prediction
Input: A user likes the following products: Item ID: I811, Title: Women's Gel-Excite Running Shoes; … The user dislikes the following products: … Predict whether the user would like the following item. Answer yes or no. Item ID: I1014, Title: Women's Dry-fit Tempo Shorts;
Output: Yes

4) Masked Item Modeling (MIM)
Input: A user has purchased the following products: Item ID: I811, Title: Women's Gel-Excite Running Shoes; [masked item]; … [masked item]; … What are the masked items, in chronological order?
Output: Item ID: I1014, Title: Women's Dry-fit Tempo Shorts; Item ID: I10145, …

5) Masked Language Modeling (MLM)
Input: Item ID: I811, Title: Women's Gel-Excite Running Shoes; Item ID: I1014, Title: Women's Dry-fit Tempo Shorts;
Output: <B> ID: I8 <S> : Women <S> Shoes; Item ID: I1014, Title <S> Shorts; <E>

6) Bayesian Personalized Ranking (BPR)
Input: A user has purchased the following products: Item ID: I811, Title: Women's Gel-Excite Running Shoes; … Which of the following two products would the user buy next? Item ID: I123, Title: Golf Club Cleaner Brush; or Item ID: I1014, Title: Women's Dry-fit Tempo Shorts?
Output: Item ID: I1014, Title: Women's Dry-fit Tempo Shorts

Figure 1: Data samples adopted by the existing studies and this work. (a) shows the recommendation-task data
samples of the existing studies. Specifically, (a1)-(a3) demonstrate the retrieval, ranking, and rating prediction data
samples of P5 (Geng et al., 2022); (a4) shows a ranking (type <P1, I0, T3>) data sample of InstructRec (Zhang et al.,
2023); (a5) is a rating prediction data sample of TALLRec (Bao et al., 2023). (b) shows our recommendation-task
(blue boxes) and auxiliary-task (purple boxes) data samples (we present more samples in Appendix C).

While developing the auxiliary tasks, our key inspiration comes from the classical operations that are typically used to train conventional recommender systems, namely, masked item modeling (MIM) (Sun et al., 2019) and Bayesian Personalized Ranking (BPR) (Rendle et al., 2009). Our key innovation lies in converting the MIM and BPR tasks into natural language tasks that can be used to train the LLMs. We also incorporate the masked language modeling (MLM) (Devlin et al., 2019) task for the user's past interactions to supplement the MIM task with fine-grained item correlations.

Our contributions can be summarized as follows:

• We propose a novel method to align LLMs with new recommendation domains, i.e., supplementing the fine-tuning of the LLMs with auxiliary-task data samples that mimic, with natural language prompts, the classical operations used in training conventional recommender systems.

• We propose recommendation-task data samples that are more informative as compared to the existing work (Geng et al., 2022). Specifically, we reduce the complexity of the input/output spaces by eliminating the user IDs. We further enhance the user sequences by providing item titles.

• We fine-tune the open-source 3B FLAN-T5-XL and 223M FLAN-T5-Base with our proposed recommendation-task and auxiliary-task data samples in a simple multi-task learning framework. Experiments on various recommendation tasks, i.e., retrieval, ranking, and rating prediction, across three target domains, i.e., Amazon Toys & Games, Beauty, and Sports & Outdoors, show the effectiveness of our proposed method and its components. For retrieval, our model outperforms both conventional and LLM-based baselines, including the current SOTA, by large margins.

2 Related Work

Recommender Systems. Recommender systems help users discover items of interest. As a practical approach, Collaborative Filtering (CF) (Mao et al., 2021) explores historical user-item interactions, assuming that users with similar behaviors have similar preferences for items. Among various CF methods, Matrix Factorization (MF) methods (Rendle et al., 2009; Mao et al., 2021), which project users and items into a shared vector space and estimate a user's preference for an item through the inner product of their vectors, are widely adopted. Context-aware approaches (Cheng et al., 2016) further include additional information, such as user and contextual features, to improve recommendation quality. However, CF fails to capture the sequential patterns in users' behaviors, which leads to the rise of sequential recommendation. Sequential recommenders based on Convolutional Neural Networks (CNNs) (Tang and Wang, 2018), Gated Recurrent Units (GRUs) (Hidasi et al., 2016), and self-attention (Sun et al., 2019; Zhang et al., 2019; Kang and McAuley, 2018; Zhou et al., 2020; Rajput et al., 2023) have become prevalent in the era of deep learning. Notably, leveraging a T5-like backbone, Rajput et al. 2023 formalize recommendation as generative retrieval, i.e., autoregressively decoding the identifiers of the target items, and achieve the current SOTA. While structurally resembling LLMs, their model lacks LLMs' pre-training knowledge and the accompanying natural language reasoning potential. Our proposed approach adopts self-attention for sequential recommendation, specifically harnessing LLMs as backbones. We compare against various baselines from all the classes discussed above.

LLMs for Recommendation. LLMs have recently been explored for recommendation tasks due to their ability to understand, generate, and reason with natural language. Several studies focus on incorporating LLMs' natural language capabilities into existing recommendation techniques. E.g., Hou et al. 2022 and Cao et al. 2023 encode item contents (title, description, etc.) with BERT (Devlin et al., 2019), which enables learning semantically informed embeddings even for zero-shot items. Moreover, pre-trained LLM backbones have also been used for recommendation through zero-shot learning (Kang et al., 2023), in-context learning (Kang et al., 2023), fine-tuning (Cui et al., 2022; Kang et al., 2023), and instruction tuning (Geng et al., 2022; Zhang et al., 2023; Bao et al., 2023). Besides helping classic recommendation tasks, LLMs also enable novel recommendation use cases. Geng et al. 2022 leverage LLMs to explain the recommendation results. Gao et al. 2023; Wang and Lim 2023 utilize GPT-3 (Brown et al., 2020) for conversational recommendation. Christakopoulou et al. 2023 extract persistent user interests with LLMs for deeper user understanding. Carranza et al. 2023 generate private synthetic representations of the original data with LLMs for privacy-preserving recommendation.

Recommendation as Instruction-following. The success of instruction tuning, i.e., fine-tuning on data described via instructions (Mishra et al., 2022; Wei et al., 2022), has inspired attempts that instruction-tune LLM backbones for recommendation tasks. Geng et al. 2022 formalize various recommendation tasks as natural language instructions and fine-tune a unified recommender with a T5 (Raffel et al., 2020) backbone. Zhang et al. 2023 further supplement the tuning data with user preferences/intentions deduced by GPT-3.5¹ to accommodate instructions of free forms. Bao et al. 2023 explore instruction tuning LLMs with limited data. In contrast to the existing studies, our work focuses on introducing new recommendation knowledge into LLMs, which we believe is the key to improving recommenders with LLM backbones. We create auxiliary tasks that improve the recommendation tasks, including retrieval, ranking, and rating prediction. Our proposed recommendation-task and auxiliary-task data samples include raw user purchase sequences in addition to natural language instructions. These data samples supplement each other in encoding the target recommendation domain knowledge. We experiment under restricted settings. Compared to the previous studies (Zhang et al., 2023), we consider larger candidate pools (e.g., our retrieval and ranking experiments consider the entire dataset and 99 hard negatives, respectively). Unlike Bao et al. 2023, we fully train all models to maximize their performances.

¹ https://platform.openai.com/docs/models/overview

3 Methodology

We propose designing data samples that encode recommendation knowledge to align LLMs with the target recommendation domain. Sections 3.1 and 3.2 discuss our auxiliary-task and recommendation-task data, respectively. Section 3.3 introduces a simple multi-task learning framework that we use to fine-tune LLMs.

3.1 Auxiliary-task Data Generation

Conventional recommenders acquire recommendation knowledge via classic operations such as masked item modeling (Sun et al., 2019) and BPR loss reduction (Rendle et al., 2009). We mimic these operations with natural language prompts. In addition, we sample sub-sequences of the raw user purchase sequences. The resulting data, which we refer to as auxiliary-task data samples, encode item correlations contained in users' preferences².

² As a side note, we also explored encoding item correlations contained in item contents (categories, descriptions, etc.). Observing no noticeable performance increase, we present our approach and results in Appendix D.
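For concreteness, the two sequence operations used throughout this section can be sketched in a few lines of Python: enumerating the size-w sliding windows that Section 3.1.1 applies to long sequences, and sampling the random sub-sequences of length 2 ≤ Ls ≤ w that Section 3.1.2 uses for MLM data. This is an illustrative sketch under our reading of the paper; the function names and the toy item IDs are assumptions, not the authors' released code.

```python
import random

def sliding_windows(sequence, w=20):
    """Enumerate the sub-sequences [i_k, ..., i_{k+w-1}] covered by a
    sliding window of size w, i.e., 1 <= k <= max(1, L - w + 1)."""
    L = len(sequence)
    if L <= w:
        return [sequence]
    return [sequence[k:k + w] for k in range(L - w + 1)]

def random_subsequence(sequence, w=20):
    """Sample a random contiguous sub-sequence of length 2 <= Ls <= w,
    as used for the MLM data samples."""
    Ls = random.randint(2, min(w, len(sequence)))
    start = random.randrange(len(sequence) - Ls + 1)
    return sequence[start:start + Ls]

# Toy purchase history (item IDs are hypothetical).
history = ["I811", "I1014", "I123", "I42", "I7"]
print(sliding_windows(history, w=3))
print(random_subsequence(history, w=3))
```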
3.1.1 Masked Item Modeling (MIM)

Conventional sequential recommenders (Sun et al., 2019) learn item correlations from users' interaction sequences. Specifically, they predict randomly masked items in the sequences by jointly conditioning on the unmasked items. We mimic this process, which we refer to as masked item modeling (MIM), with natural language prompts.

MIM applies a Cloze objective (Sun et al., 2019). At each training step, random items in the input user sequence are replaced with a special token "[mask]", and the model learns to recover the masked items based on the surrounding context. An example of the masking process:

$$\text{Input: } [i_1, i_2, i_3, i_4, i_5] \xrightarrow{\text{random masking}} [i_1, [\text{mask}]_1, i_3, [\text{mask}]_2, i_5]$$
$$\text{Label: } [\text{mask}]_1 = i_2, \quad [\text{mask}]_2 = i_4 \tag{1}$$

The MIM loss is computed as follows in conventional sequential recommenders:

$$\mathcal{L}_{\text{MIM}} = \frac{1}{|S_u^m|} \sum_{i_m \in S_u^m} -\log P(i_m \mid S_u'), \tag{2}$$

where $S_u'$ is the masked version of user sequence $S_u$ and $S_u^m$ stands for the masked items in $S_u$. $P(\cdot)$, the probability of observing $i_m$ given $S_u'$, is calculated from deep bidirectional self-attention (Devlin et al., 2019).

Our natural language imitation of the MIM loss (Equation 2) is described in Figure 1(b4). Given purchase sequence $[i_1, i_2, i_3, i_4, i_5]$, we generate prompts, e.g., Input: "A user has purchased the following products: Item ID: [ID]i1, Title: [Title]i1; [masked item]; Item ID: [ID]i3, Title: [Title]i3; [masked item]; Item ID: [ID]i5, Title: [Title]i5. What are the masked items, in chronological order?", and Output: "Item ID: [ID]i2, Title: [Title]i2; Item ID: [ID]i4, Title: [Title]i4;". To accommodate long sequences, we introduce a sliding window $w$, and each prompt considers one sub-sequence $[i_k, i_{k+1}, ..., i_{k+w-1}]$, where $1 \leq k \leq \max(1, L-w+1)$ and $L$ is the total length of the user sequence. The resulting MIM data samples encode the correlations between the masked items and the rest of the sequences.

3.1.2 Masked Language Modeling (MLM)

In addition to MIM, which considers a single item for each mask, we also mask out and recover consecutive spans of tokens to encode fine-grained item correlations contained in the users' purchase sequences. This process resembles masked language modeling (MLM) (Devlin et al., 2019).

As shown in Figure 1(b5), given a user sequence, we sample a sub-sequence by randomly deciding a starting item and a sub-sequence length $L_s$, where $2 \leq L_s \leq w$ and $w$ is the sliding window for accommodating long sequences. These sub-sequences, referred to as MLM data samples, supplement the MIM data samples: through span corruption (Raffel et al., 2020), i.e., masking and recovering consecutive spans of tokens, LLMs learn to model more fine-grained correlations across multiple continuous items from the MLM data samples.

3.1.3 Bayesian Personalized Ranking (BPR)

Besides correlating similar items, we explore contrasting dissimilar items. The BPR loss (Rendle et al., 2009) is adopted by conventional recommenders (Rendle and Freudenthaler, 2014; Koren et al., 2009; Cheng et al., 2016) for personalized ranking, i.e., learning users' preferences for some items over others. Inspired by this, we imitate BPR loss reduction with natural language prompts for training LLMs.

The objective of BPR loss reduction in conventional recommenders is:

$$\mathcal{L}_{\text{BPR}} = \mathbb{E}_{(u, i^+) \sim p_{\text{pos}}} \left[ -\log \sigma\big(s(u, i^+) - s(u, i^-)\big) \right], \tag{3}$$

where $(u, i^+)$ is a pair of a user $u$ and an item $i^+$ sampled from the distribution of positive pairs $p_{\text{pos}}$, i.e., $u$ interacted with $i^+$. $i^-$ is a randomly sampled negative item that $u$ has not interacted with. The similarity between $u$ and $i^+$, denoted by $s(u, i^+)$, is calculated by taking the dot product of their representations. $\sigma(\cdot)$ is the sigmoid function.

Figure 1(b6) shows our natural language imitation. We elicit user preferences by generating prompts with binary choices that contrast a positive item and a negative item. Each prompt takes the form of a binary decision, e.g., Input: "A user has purchased ... Which of the following two products would the user buy next? Item ID: [ID]i−, Title: [Title]i−; Item ID: [ID]i+, Title: [Title]i+.", and Output: "Item ID: [ID]i+, Title: [Title]i+". Following Section 3.1.1, we adopt a sliding window $w$ to accommodate long user sequences, and the positive item is always the one next to the sliding window. These BPR data samples encode dissimilarities between the purchased items and the rest of the items in the dataset.
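The prompt construction in Sections 3.1.1 and 3.1.3 can likewise be summarized in a short sketch. The template wording below follows Figures 1(b4) and 1(b6); everything else (function names, the 20% masking ratio default taken from the implementation details in Section 4.1, the toy items) is our illustrative assumption rather than the authors' released generator.

```python
import random

def mim_sample(items, mask_ratio=0.2):
    """Turn one (sub-)sequence into a natural-language MIM sample:
    randomly mask items and ask for them back in chronological order."""
    n_mask = max(1, int(len(items) * mask_ratio))
    masked_pos = set(random.sample(range(len(items)), n_mask))
    parts, targets = [], []
    for pos, (item_id, title) in enumerate(items):
        if pos in masked_pos:
            parts.append("[masked item]")
            targets.append(f"Item ID: {item_id}, Title: {title}")
        else:
            parts.append(f"Item ID: {item_id}, Title: {title}")
    prompt = ("A user has purchased the following products: "
              + "; ".join(parts)
              + ". What are the masked items, in chronological order?")
    return prompt, "; ".join(targets)

def bpr_sample(history, positive, negative):
    """Turn a window plus a positive/negative item pair into a
    binary-choice BPR sample; candidate order is randomized."""
    hist = "; ".join(f"Item ID: {i}, Title: {t}" for i, t in history)
    candidates = [positive, negative]
    random.shuffle(candidates)
    choices = " or ".join(f"Item ID: {i}, Title: {t}" for i, t in candidates)
    prompt = (f"A user has purchased the following products: {hist}. "
              f"Which of the following two products would the user buy next? {choices}?")
    return prompt, f"Item ID: {positive[0]}, Title: {positive[1]}"

# Toy usage with hypothetical items.
seq = [("I811", "Running Shoes"), ("I1014", "Tempo Shorts"), ("I42", "Yoga Mat")]
print(mim_sample(seq))
print(bpr_sample(seq[:2], positive=("I42", "Yoga Mat"), negative=("I123", "Golf Brush")))
```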
[Figure 2 depicts the framework: retrieval, ranking, and rating prediction data samples (recommendation tasks) and MIM, MLM, and BPR data samples (auxiliary tasks) are generated for training; the LLM backbone is fine-tuned on them via multi-task learning; evaluation uses recommendation-task data samples generated from the test set.]

Figure 2: Fine-tuning and evaluation framework.

3.2 Recommendation-task Data Generation

As shown in Figure 1(a), the existing recommenders with LLM backbones adopt prompts that primarily convey the recommendation tasks by providing directions on how to perform them. Such information is essential, yet insufficient for representing the target recommendation domain.

We propose prompts that help LLMs comprehend the target recommendation domain in addition to the recommendation tasks. Specifically, we reduce the complexity of the input/output spaces. In contrast to Geng et al. 2022, we eliminate user IDs and represent the users by their historical purchases. Consequently, we relieve LLMs from memorizing a substantial volume of user IDs (e.g., Amazon Sports & Outdoors has 35,598 users). Moreover, compared to Geng et al. 2022, who represent user sequences solely by item IDs, we include both the IDs and the titles of the items, which makes it easier for LLMs to recognize the items. Notably, ranking candidates and items in the output are represented solely by their IDs to reduce the length of the prompts and maintain a smaller output space. Figures 1(b1)-(b3) show examples of our retrieval, ranking, and rating prediction recommendation-task data samples. The raw item IDs (e.g., '0000031852') are mapped into shorter ones (e.g., 'I123')³ to reduce input/output space complexity. To fully present the users' historical purchases to LLMs, we adopt a sliding window w similar to Section 3.1.1.

³ We adopt random mapping, i.e., similar-looking IDs may not imply any connection or semantic similarity. We acknowledge that using semantic-rich IDs (Rajput et al., 2023) could enhance performance and leave the exploration to the future.

3.3 Fine-tuning and Evaluation Framework

As shown in Figure 2, we adopt a simple framework to fine-tune the LLM backbones and evaluate the resulting model. We first generate recommendation-task and auxiliary-task data samples using the training set. Next, we tune the LLM backbone with these data samples in a multi-task learning manner. Finally, we evaluate the recommendation tasks using the recommendation-task data samples generated from the test set.

4 Experiments

We evaluate the proposed method and compare it with conventional as well as LLM-based recommenders. We aim to answer the following research questions: RQ1. Can our method introduce knowledge into LLMs from new recommendation domains? RQ2. How does our model perform compared to the conventional as well as LLM-based recommenders in retrieval, ranking, and rating prediction? RQ3. How beneficial are the individual proposed tasks? RQ4. What is the effect of varying the size of the backbone LLM?

4.1 Experimental Setting

Datasets. We experiment on three real-world datasets: Amazon Toys & Games, Beauty, and Sports & Outdoors⁴. Following Zhou et al. 2020; Geng et al. 2022, we keep 5-core data and apply leave-one-out evaluation, i.e., for each user purchase sequence (where the interactions are sorted by timestamp in ascending order), the last, the second to last, and the prior interactions are used for testing, validation, and training, respectively. We present data statistics in Appendix B.

⁴ https://nijianmo.github.io/amazon/

Recommendation Tasks. We evaluate on three established recommendation tasks: retrieval, which retrieves the ground truth item that a user interacted with from the entire dataset; ranking, which chooses the ground truth item that a user interacted with from a candidate pool of size 100 (1 positive item and 99 negative items sampled based on popularity); and rating prediction, which classifies an interaction as either "like" or "dislike" (interactions with ratings > 3 are considered "like"). We leave the exploration and evaluation of novel recommendation tasks (e.g., explanation generation) to the future, due to a lack of ground-truth data.

Evaluation Metrics. For retrieval and ranking, we report top-k Hit Ratio (HR@k) and Normalized Discounted Cumulative Gain (NDCG@k), where k is set to 5/10 and 1/5/10, respectively. For rating prediction, we report Area Under the Receiver Operating Characteristic Curve (AUC-ROC).
                 Toys & Games                          Beauty                                Sports & Outdoors
Methods          NDCG@5  NDCG@10  HR@5    HR@10      NDCG@5  NDCG@10  HR@5    HR@10      NDCG@5  NDCG@10  HR@5    HR@10
Caser¹           0.0107  0.0141   0.0166  0.0270     0.0131  0.0176   0.0205  0.0347     0.0072  0.0097   0.0116  0.0194
HGN¹             0.0221  0.0277   0.0321  0.0497     0.0206  0.0266   0.0325  0.0512     0.0120  0.0159   0.0189  0.0313
GRU4Rec¹         0.0059  0.0084   0.0097  0.0176     0.0099  0.0137   0.0164  0.0283     0.0086  0.0110   0.0129  0.0204
BERT4Rec¹        0.0071  0.0099   0.0116  0.0203     0.0124  0.0170   0.0203  0.0347     0.0075  0.0099   0.0115  0.0191
FDSA¹            0.0140  0.0189   0.0228  0.0381     0.0163  0.0208   0.0267  0.0407     0.0122  0.0156   0.0182  0.0288
SASRec¹          0.0306  0.0374   0.0463  0.0675     0.0249  0.0318   0.0387  0.0605     0.0154  0.0192   0.0233  0.0350
S³-Rec¹          0.0294  0.0376   0.0443  0.0700     0.0244  0.0327   0.0387  0.0647     0.0161  0.0204   0.0251  0.0385
TIGER²           0.0371  0.0432   0.0521  0.0712     0.0321  0.0384   0.0454  0.0648     0.0181  0.0225   0.0264  0.0400
P5²              0.0050  0.0066   0.0070  0.0121     0.0107  0.0136   0.0163  0.0254     0.0041  0.0052   0.0061  0.0095
P5-XL            0.0023  0.0031   0.0035  0.0061     0.0036  0.0050   0.0063  0.0104     0.0029  0.0035   0.0040  0.0060
FLAN-T5-Base     0.0000  2e-5     0.0000  5e-5       0.0000  0.0000   0.0000  0.0000     0.0000  9e-6     0.0000  3e-5
FLAN-T5-XL       2e-5    2e-5     5e-5    5e-5       0.0000  0.0000   0.0000  0.0000     0.0000  0.0000   0.0000  0.0000
ReAT [Ours]      0.0390  0.0461   0.0558  0.0776     0.0382  0.0442   0.0535  0.0722     0.0188  0.0232   0.0285  0.0422
UT [Ours]        0.0166  0.0202   0.0252  0.0362     0.0188  0.0231   0.0292  0.0425     0.0079  0.0101   0.0118  0.0187
UT+AT [Ours]     0.0392  0.0459   0.0563  0.0772     0.0329  0.0397   0.0482  0.0693     0.0178  0.0219   0.0268  0.0393
∆ (%)            +5.66   +6.71    +8.06   +8.99      +19.00  +15.10   +17.84  +11.42     +3.87   +3.11    +7.95   +5.50

Table 1: Retrieval results. ¹ marks results from Zhou et al. 2020; ² marks results from Rajput et al. 2023. ∆ compares the best [Ours] with the best baseline.

                 Toys & Games                                      Beauty                                            Sports & Outdoors
Methods          NDCG@5  NDCG@10  HR@1    HR@5    HR@10          NDCG@5  NDCG@10  HR@1    HR@5    HR@10          NDCG@5  NDCG@10  HR@1    HR@5    HR@10
BPR-MF¹          0.0641  0.0940   0.0233  0.1066  0.2003         0.0857  0.1224   0.0311  0.1426  0.2573         0.0848  0.1220   0.0314  0.1404  0.2563
BPR-MLP¹         0.0688  0.0988   0.0252  0.1142  0.2077         0.0848  0.1215   0.0317  0.1392  0.2542         0.0927  0.1296   0.0351  0.1520  0.2671
SimpleX¹         0.1244  0.1469   0.0268  0.1958  0.2662         0.1441  0.1711   0.0325  0.2247  0.3090         0.1505  0.1800   0.0331  0.2362  0.3290
P5-XL            0.0290  0.0444   0.0097  0.0494  0.0977         0.0298  0.0456   0.0110  0.0498  0.0992         0.0286  0.0436   0.0097  0.0486  0.0957
FLAN-T5-Base     0.0107  0.0127   0.0057  0.0156  0.0217         0.0097  0.0113   0.0052  0.0137  0.0189         0.0069  0.0082   0.0035  0.0102  0.0144
FLAN-T5-XL       0.0160  0.0312   0.0026  0.0315  0.0793         0.0152  0.0296   0.0022  0.0301  0.0753         0.0097  0.0193   0.0014  0.0192  0.0491
RaAT [Ours]      0.1714  0.2034   0.0956  0.2464  0.3453         0.1376  0.1691   0.0702  0.2036  0.3013         0.0933  0.1199   0.0424  0.1448  0.2272
UT [Ours]        0.1536  0.1867   0.0831  0.2233  0.3259         0.1236  0.1537   0.0609  0.1863  0.2798         0.0867  0.1137   0.0381  0.1362  0.2202
UT+AT [Ours]     0.1703  0.2064   0.0938  0.2443  0.3562         0.1441  0.1758   0.0742  0.2126  0.3112         0.0997  0.1281   0.0468  0.1526  0.2404
∆ (%)            +37.78  +40.50   +256.72 +25.84  +33.81         0.00    +2.75    +128.31 -5.38   +0.71          -33.75  -28.83   +33.33  -35.39  -26.93

Table 2: Ranking results. ¹ marks results from Geng et al. 2022. ∆ compares the best [Ours] with the best baseline.
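Under the leave-one-out protocol there is exactly one relevant item per user, so the HR@k and NDCG@k values reported above reduce to simple functions of the ground-truth item's rank in the model's top-k list. A minimal sketch of this reduction (our illustration, with hypothetical variable names, not the authors' evaluation code):

```python
import math

def hr_and_ndcg_at_k(ranked_items, ground_truth, k):
    """With one relevant item per user, HR@k is 1 if the item appears in
    the top-k list, and NDCG@k is 1/log2(rank+1) for its 1-based rank."""
    topk = ranked_items[:k]
    if ground_truth not in topk:
        return 0.0, 0.0
    rank = topk.index(ground_truth) + 1  # 1-based position
    return 1.0, 1.0 / math.log2(rank + 1)

# Toy example: the ground-truth item is ranked 3rd.
hr, ndcg = hr_and_ndcg_at_k(["I7", "I42", "I811", "I9"], "I811", k=5)
print(hr, ndcg)  # 1.0 0.5
```

Averaging these per-user values over all test users gives the reported numbers.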

Methods          Toys & Games   Beauty   Sports & Outdoors
History          66.59          64.80    62.78
DMF              51.82          51.23    51.38
Wide&Deep        70.93          67.10    67.60
P5-XL            51.04          50.63    50.36
FLAN-T5-Base     57.85          56.04    55.00
FLAN-T5-XL       55.23          53.77    52.01
RpAT [Ours]      71.16          68.27    65.87
UT [Ours]        70.79          67.45    65.35
UT+AT [Ours]     71.08          67.55    65.18
∆ (%)            +0.32          +1.74    -2.56

Table 3: Rating prediction AUC-ROC. ∆ compares the best [Ours] with the best baseline.

Models. We compare to non-LLM-based recommenders. For retrieval, we consider sequential recommenders including Caser (Tang and Wang, 2018), which leverages CNNs; HGN (Ma et al., 2019), which adopts hierarchical gating networks; GRU4Rec (Hidasi et al., 2016), which leverages GRUs (Cho et al., 2014); and BERT4Rec (Sun et al., 2019), FDSA (Zhang et al., 2019), SASRec (Kang and McAuley, 2018), S³-Rec (Zhou et al., 2020), and TIGER (Rajput et al., 2023), which leverage self-attention, with TIGER being the current SOTA. For ranking, we consider BPR-MF (Rendle et al., 2009), BPR-MLP (Cheng et al., 2016), and SimpleX (Mao et al., 2021), which are collaborative filtering-based methods. For rating prediction, we consider History, a naive method that always predicts based on how likely a user likes the training items they purchased; DMF (Xue et al., 2017), a neural matrix factorization model; and Wide&Deep (Cheng et al., 2016), a context-aware method. Besides, we also consider LLM-based methods including P5 (Geng et al., 2022), which fine-tunes T5 (Raffel et al., 2020) with multi-task recommendation prompts; P5-XL, which fine-tunes FLAN-T5-XL with P5 prompts; and FLAN-T5-Base/XL (Wei et al., 2022), which make zero-shot predictions with FLAN-T5-Base or FLAN-T5-XL. We query them with our proposed recommendation-task data samples generated from the test set⁵. Our own models include ReAT/RaAT/RpAT, which fine-tune FLAN-T5-XL with our proposed retrieval (Re), ranking (Ra), or rating prediction (Rp) task data samples along with the auxiliary-task (AT) data samples⁶; unified training (UT), which fine-tunes FLAN-T5-XL with a combination of our proposed Re, Ra, and Rp data samples; and unified training with auxiliary tasks (UT+AT), which fine-tunes FLAN-T5-XL with a combination of our proposed Re, Ra, Rp, MIM, and MLM data samples.

⁵ We acknowledge that our retrieval and ranking data samples (examples are shown in Figure 1 and Appendix C) utilize item IDs for matching prediction results, whereas the FLAN-T5-Base/XL models, when queried in the zero-shot setting, do not inherently predict item IDs. To address this discrepancy, text-based methods could be employed to extract item titles, descriptions, etc., from the FLAN-T5-Base/XL predictions to enhance their performance. However, employing such approaches requires an additional model for text matching, which falls beyond the scope of this work.

⁶ BPR data samples are used only by RaAT, as we observe that they help ranking but not retrieval and rating prediction. MIM/MLM data samples are used by ReAT, RaAT, and RpAT.

Implementation Details. We adopt the 3B FLAN-T5-XL (Wei et al., 2022) as the backbone. We also use the 223M FLAN-T5-Base for the ablation studies in Section 4.3. Meanwhile, it is crucial to emphasize that the proposed method is not tied to a specific backbone architecture and is easily adaptable to other LLMs, such as LLaMA (Touvron et al., 2023). We set the sliding window size w to 20. For the BPR data samples, we sample the negative items based on popularity. For the ranking and BPR data samples, the position of the positive item in the candidate pool is always determined randomly. For the MIM and MLM data samples, we adopt a masking ratio of 20%. To fully fine-tune the LLM backbone, we apply dynamic sampling for the BPR and MIM/MLM data samples (we present details about the dynamic sampling and the statistics of our data samples in Appendix C). To reduce cost, we validate on 3,000 users. Meanwhile, testing is performed on all users. We fine-tune FLAN-T5-XL and FLAN-T5-Base for 70,000 and 10,000 steps, with batch sizes 16 and 64, respectively. We set the learning rate to 0.001 and warm-up steps to 1,000. During prediction, we set the width of the beam search for retrieval and ranking to 20. For the unified models, i.e., UT and UT+AT, model selection is based on retrieval validation performance. We present the detailed settings of the P5-XL experiments in Appendix A. We cite the results of some baseline models from Zhou et al. 2020; Geng et al. 2022; Rajput et al. 2023. We implement DMF and Wide&Deep with RecBole⁷. We adopt the default configurations, except that the data split, mapping (ratings to "like"s or "dislike"s), and metric are adjusted to follow our experiment settings as reported earlier. The pseudo code for generating our proposed data samples can be found in Appendix C.

⁷ https://recbole.io
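To make the fine-tuning recipe concrete, here is a minimal sketch of one training step and beam-search prediction with Hugging Face transformers, under the hyperparameters listed above (learning rate 0.001, beam width 20). It is our illustrative reconstruction, not the authors' training code; dataset plumbing, the learning-rate warm-up, and multi-GPU details are omitted.

```python
import torch
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-xl")
model = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-xl")
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)

def training_step(input_texts, target_texts):
    """One multi-task step: the batch mixes recommendation-task and
    auxiliary-task samples; the loss is standard seq2seq cross-entropy."""
    inputs = tokenizer(input_texts, return_tensors="pt", padding=True, truncation=True)
    labels = tokenizer(target_texts, return_tensors="pt", padding=True, truncation=True).input_ids
    labels[labels == tokenizer.pad_token_id] = -100  # ignore padding in the loss
    loss = model(**inputs, labels=labels).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return loss.item()

def predict(input_text, k=10):
    """Retrieval/ranking prediction: beam search of width 20, keeping
    the top-k decoded item IDs for HR@k / NDCG@k evaluation."""
    inputs = tokenizer(input_text, return_tensors="pt")
    outputs = model.generate(**inputs, num_beams=20,
                             num_return_sequences=k, max_new_tokens=8)
    return tokenizer.batch_decode(outputs, skip_special_tokens=True)
```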
4.2 Overall Performance (RQ1 & RQ2)

Tables 1, 2, and 3 show the results of retrieval, ranking, and rating prediction, respectively. FLAN-T5-Base/XL exhibit suboptimal performance on retrieval and ranking. For retrieval, they show near-zero NDCGs and HRs. For ranking, they are significantly inferior to the conventional baselines. For rating prediction, they perform much higher than random guessing (50.00), outperforming DMF, but still fall behind History and Wide&Deep. This shows that FLAN-T5 models lack recommendation knowledge, which is unsurprising considering they were not trained on recommendation tasks during pre-training or instruction-tuning and are evaluated in a zero-shot setting. Moreover, we find that our proposed method effectively aligns LLMs with new recommendation domains (RQ1). In particular, by fine-tuning FLAN-T5-XL with our proposed data samples, our models significantly outperform FLAN-T5-XL on all three tasks across the datasets.

When compared to the baselines, our models show remarkable performance, especially on retrieval (RQ2). For retrieval, our ReAT outperforms TIGER, the current SOTA, by large margins across datasets and metrics. Additionally, it is essential to highlight that our method possesses the natural language reasoning potential of LLMs, which is absent in TIGER. For ranking, our RaAT greatly outperforms SimpleX, the best baseline, on Toys & Games. On Beauty, RaAT performs on par with SimpleX. On Sports & Outdoors, RaAT is inferior to the conventional recommenders on metrics such as NDCG/HR@10, yet still greatly outperforms the LLM-based baselines. Notably, the @1 performance of RaAT is always much higher than that of the conventional recommenders. For rating prediction, our RpAT outperforms Wide&Deep, the best baseline, on Toys & Games and Beauty, while it lags slightly behind it on Sports & Outdoors. These results verify that our method introduces substantial recommendation domain knowledge into LLMs for outperforming strong baselines. The relative ineffectiveness of our method on Sports & Outdoors for the ranking and rating prediction tasks could be due to the nature of the data. Specifically, our model, as a sequential recommender, relies on the sequential item correlations conveyed by the user sequences. Such signals may be relatively weak in Sports & Outdoors (e.g., the average sequence length of Sports & Outdoors is 8.32 ± 6.07, whereas those of Beauty and Toys & Games are 8.88 ± 8.16 and 8.63 ± 8.51, respectively, suggesting that Sports & Outdoors sequences are shorter and less diverse), causing our method to perform suboptimally. The best baselines, on the other hand, do not rely on such information. E.g., SimpleX is based on collaborative filtering and Wide&Deep is a context-based model. Therefore, their performances are not impacted.

Moreover, our UT greatly outperforms P5 and P5-XL across datasets and metrics. This shows that our proposed recommendation task prompts better preserve item correlations as compared to the P5 ones. Specifically, we enhance user sequence modeling by introducing helpful details such as item titles while excluding less informative details such as user IDs and explanation data. Additional results of P5-XL as well as a comparison between P5-XL and P5 can be found in Appendix A.

We also compare our UT+AT model with our task-specific models, i.e., ReAT/RaAT/RpAT. We show that our method allows fine-tuning a unified model that addresses all recommendation tasks without sacrificing per-task performance by much. For retrieval, UT+AT is slightly worse than ReAT but still outperforms all baselines, except that UT+AT performs comparably with TIGER on Sports & Outdoors. For ranking, UT+AT performs on par with or slightly better than our task-specific RaAT model. For rating prediction, UT+AT is slightly worse than RpAT.

4.3 Ablation Studies (RQ3 & RQ4)

#  Methods        NDCG@5  NDCG@10  HR@5    HR@10
1  TIGER          0.0371  0.0432   0.0521  0.0712
2  FLAN-T5-XL     2e-5    2e-5     5e-5    5e-5
3  2+retrieval    0.0182  0.0219   0.0273  0.0388
4  3+MLM          0.0306  0.0369   0.0443  0.0641
5  4+MIM          0.0390  0.0461   0.0558  0.0776
6  FLAN-T5-Base   0.0000  2e-5     0.0000  5e-5
7  6+retrieval    0.0149  0.0183   0.0219  0.0325
8  7+MLM          0.0219  0.0271   0.0334  0.0495
9  8+MIM          0.0242  0.0304   0.0376  0.0566

Table 4: Retrieval ablation study on Toys & Games. Rows 1, 2, 5 (equivalent to ReAT), and 6 are copied from Table 1.

#   Methods       NDCG@5  NDCG@10  HR@1    HR@5    HR@10
1   SimpleX       0.1244  0.1469   0.0268  0.1958  0.2662
2   FLAN-T5-XL    0.0160  0.0312   0.0026  0.0315  0.0793
3   2+ranking     0.1520  0.1864   0.0807  0.2218  0.3284
4   3+MLM         0.1580  0.1912   0.0854  0.2303  0.3333
5   4+MIM         0.1677  0.1976   0.0938  0.2391  0.3317
6   5+BPR         0.1714  0.2034   0.0956  0.2464  0.3453
7   FLAN-T5-Base  0.0107  0.0127   0.0057  0.0156  0.0217
8   7+ranking     0.1349  0.1654   0.0720  0.1957  0.2901
9   8+MLM         0.1481  0.1782   0.0820  0.2119  0.3051
10  9+MIM         0.1489  0.1811   0.0817  0.2141  0.3136
11  10+BPR        0.1534  0.1844   0.0844  0.2196  0.3153

Table 5: Ranking ablation study on Toys & Games. Rows 1, 2, 6 (equivalent to RaAT), and 7 are copied from Table 2.

#  Methods              AUC-ROC
1  Wide&Deep            70.93
2  FLAN-T5-XL           55.23
3  2+rating-prediction  70.38
4  3+MLM                71.08
5  4+MIM                71.16
6  FLAN-T5-Base         57.85
7  6+rating-prediction  69.17
8  7+MLM                67.31
9  8+MIM                68.24

Table 6: Rating-prediction ablation study on Toys & Games. Rows 1, 2, 5 (equivalent to RpAT), and 6 are copied from Table 3.

Tables 4, 5, and 6 show ablation studies on Toys & Games for retrieval, ranking, and rating prediction, respectively. We observe that all the proposed tasks are beneficial (RQ3). In Table 4 rows 2-5, successively adding our proposed retrieval, MLM, and MIM data samples into the fine-tuning data increases the retrieval performance. All three tasks are essential. E.g., row 4, which fine-tunes FLAN-T5-XL using retrieval and MLM data samples, performs on par with S³-Rec and worse than TIGER (row 1, the current SOTA). Further adding MIM data samples (row 5) surpasses TIGER. This shows that the item-level and token-level item correlations introduced by MIM and MLM are essential and complement each other. Similarly, in Table 5 rows 2-6, the ranking performance improves as we incorporate our proposed ranking, MLM, MIM, and BPR data samples into fine-tuning. Among these data samples, the ranking task data samples are the most helpful. BPR data samples, which contrast the positive items with the negative ones, provide the least assistance. For rating prediction, as shown in Table 6 rows 2-5, our proposed rating prediction data samples greatly increase the performance. MLM and MIM do help, but only marginally.

We also find that our proposed method is effective regardless of the size of the backbone model (RQ4).
In Tables 4, 5, and 6, we apply our method on FLAN-T5-Base and observe significant performance increases on all three recommendation tasks. In terms of overall performance, our best retrieval model with FLAN-T5-Base (Table 4 row 9) falls behind TIGER but still outperforms all baselines except TIGER, S³-Rec, and SASRec. In Table 5, our best ranking model with FLAN-T5-Base (row 11) outperforms SimpleX by large margins, though it falls behind our best ranking model with FLAN-T5-XL (row 6). In Table 6, our best rating prediction model with FLAN-T5-Base (row 7) is slightly inferior to the best model with FLAN-T5-XL (row 5) and Wide&Deep. The effectiveness of the individual tasks remains roughly consistent with the previous results with FLAN-T5-XL (except that MLM does not help rating prediction). E.g., in Table 5 rows 7-11, our ranking task, MLM, MIM, and BPR data samples all contribute to the ranking performance, with the ranking task data samples being the most beneficial and BPR the least beneficial.

5 Conclusion

We propose to align LLMs with the recommendation domain by fine-tuning them with data samples that encode recommendation knowledge. We propose auxiliary-task data samples that encode item correlations contained in users' preferences. We further design recommendation-task data samples that are more informative than the ones in existing studies. Experiments on retrieval, ranking, and rating prediction show that our method effectively introduces recommendation knowledge into FLAN-T5-Base/XL from three domains. Our method greatly outperforms both conventional and LLM-based baselines in retrieval, achieving the new SOTA.

6 Limitations

Our proposed method utilizes LLMs as the backbones. The substantial parameter size of the LLMs results in increased computational resource consumption and extended training and inference times compared to conventional recommenders. Nevertheless, adopting LLM backbones is beneficial due to their significant potential. In addition to the exceptional performance demonstrated in this study, we anticipate that future research will continue to augment existing recommendation tasks and address novel recommendation scenarios by leveraging the diverse capabilities of LLM backbones.

References

Keqin Bao, Jizhi Zhang, Yang Zhang, Wenjie Wang, Fuli Feng, and Xiangnan He. 2023. TALLRec: An effective and efficient tuning framework to align large language model with recommendation. In Proceedings of the 17th ACM Conference on Recommender Systems.

Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. 2020. Language models are few-shot learners. Advances in Neural Information Processing Systems, 33:1877–1901.

Yuwei Cao, Liangwei Yang, Chen Wang, Zhiwei Liu, Hao Peng, Chenyu You, and Philip S Yu. 2023. Multi-task item-attribute graph pre-training for strict cold-start item recommendation. In Proceedings of the 17th ACM Conference on Recommender Systems.

Aldo Gael Carranza, Rezsa Farahani, Natalia Ponomareva, Alex Kurakin, Matthew Jagielski, and Milad Nasr. 2023. Privacy-preserving recommender systems with synthetic query generation using differentially private large language models. arXiv preprint arXiv:2305.05973.

Heng-Tze Cheng, Levent Koc, Jeremiah Harmsen, Tal Shaked, Tushar Chandra, Hrishi Aradhye, Glen Anderson, Greg Corrado, Wei Chai, Mustafa Ispir, et al. 2016. Wide & deep learning for recommender systems. In Proceedings of the 1st Workshop on Deep Learning for Recommender Systems, pages 7–10.

Kyunghyun Cho, Bart Van Merriënboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. 2014. Learning phrase representations using RNN encoder-decoder for statistical machine translation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing.

Konstantina Christakopoulou, Alberto Lalama, Cj Adams, Iris Qu, Yifat Amir, Samer Chucri, Pierce Vollucci, Fabio Soldo, Dina Bseiso, Sarah Scodel, et al. 2023. Large language models for user interest journeys. arXiv preprint arXiv:2305.15498.

Zeyu Cui, Jianxin Ma, Chang Zhou, Jingren Zhou, and Hongxia Yang. 2022. M6-Rec: Generative pretrained language models are open-ended recommender systems. arXiv preprint arXiv:2205.08084.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of NAACL-HLT 2019, pages 4171–4186.

Yunfan Gao, Tao Sheng, Youlin Xiang, Yun Xiong, Haofen Wang, and Jiawei Zhang. 2023. Chat-REC: Towards interactive and explainable LLMs-augmented recommender system. arXiv preprint arXiv:2303.14524.
Shijie Geng, Shuchang Liu, Zuohui Fu, Yingqiang Ge, and Yongfeng Zhang. 2022. Recommendation as language processing (RLP): A unified pretrain, personalized prompt & predict paradigm (P5). In Proceedings of the 16th ACM Conference on Recommender Systems, pages 299–315.

Xiangnan He, Lizi Liao, Hanwang Zhang, Liqiang Nie, Xia Hu, and Tat-Seng Chua. 2017. Neural collaborative filtering. In Proceedings of the 26th International Conference on World Wide Web, pages 173–182.

Balázs Hidasi, Alexandros Karatzoglou, Linas Baltrunas, and Domonkos Tikk. 2016. Session-based recommendations with recurrent neural networks. In Proceedings of the 4th International Conference on Learning Representations.

Yupeng Hou, Shanlei Mu, Wayne Xin Zhao, Yaliang Li, Bolin Ding, and Ji-Rong Wen. 2022. Towards universal sequence representation learning for recommender systems. In Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, pages 585–593.

Wang-Cheng Kang and Julian McAuley. 2018. Self-attentive sequential recommendation. In 2018 IEEE International Conference on Data Mining (ICDM), pages 197–206. IEEE.

Wang-Cheng Kang, Jianmo Ni, Nikhil Mehta, Maheswaran Sathiamoorthy, Lichan Hong, Ed Chi, and Derek Zhiyuan Cheng. 2023. Do LLMs understand user preferences? Evaluating LLMs on user rating prediction. arXiv preprint arXiv:2305.06474.

Yehuda Koren, Robert Bell, and Chris Volinsky. 2009. Matrix factorization techniques for recommender systems. Computer, 42(8):30–37.

Chen Ma, Peng Kang, and Xue Liu. 2019. Hierarchical gating networks for sequential recommendation. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pages 825–833.

Kelong Mao, Jieming Zhu, Jinpeng Wang, Quanyu Dai, Zhenhua Dong, Xi Xiao, and Xiuqiang He. 2021. SimpleX: A simple and strong baseline for collaborative filtering. In Proceedings of the 30th ACM International Conference on Information & Knowledge Management, pages 1243–1252.

Swaroop Mishra, Daniel Khashabi, Chitta Baral, and Hannaneh Hajishirzi. 2022. Cross-task generalization via natural language crowdsourcing instructions. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics.

Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J Liu. 2020. Exploring the limits of transfer learning with a unified text-to-text transformer. The Journal of Machine Learning Research, 21(1):5485–5551.

Shashank Rajput, Nikhil Mehta, Anima Singh, Raghunandan H Keshavan, Trung Vu, Lukasz Heldt, Lichan Hong, Yi Tay, Vinh Q Tran, Jonah Samost, et al. 2023. Recommender systems with generative retrieval. In Advances in Neural Information Processing Systems.

Steffen Rendle and Christoph Freudenthaler. 2014. Improving pairwise learning for item recommendation from implicit feedback. In Proceedings of the 7th ACM International Conference on Web Search and Data Mining, pages 273–282.

Steffen Rendle, Christoph Freudenthaler, Zeno Gantner, and Lars Schmidt-Thieme. 2009. BPR: Bayesian personalized ranking from implicit feedback. In Proceedings of the Twenty-Fifth Conference on Uncertainty in Artificial Intelligence.

Fei Sun, Jun Liu, Jian Wu, Changhua Pei, Xiao Lin, Wenwu Ou, and Peng Jiang. 2019. BERT4Rec: Sequential recommendation with bidirectional encoder representations from transformer. In Proceedings of the 28th ACM International Conference on Information and Knowledge Management, pages 1441–1450.

Jiaxi Tang and Ke Wang. 2018. Personalized top-N sequential recommendation via convolutional sequence embedding. In Proceedings of the Eleventh ACM International Conference on Web Search and Data Mining, pages 565–573.

Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. 2023. LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971.

Lei Wang and Ee-Peng Lim. 2023. Zero-shot next-item recommendation using large pretrained language models. arXiv preprint arXiv:2304.03153.

Jason Wei, Maarten Bosma, Vincent Y Zhao, Kelvin Guu, Adams Wei Yu, Brian Lester, Nan Du, Andrew M Dai, and Quoc V Le. 2022. Finetuned language models are zero-shot learners. In Proceedings of the 10th International Conference on Learning Representations.

Hong-Jian Xue, Xinyu Dai, Jianbing Zhang, Shujian Huang, and Jiajun Chen. 2017. Deep matrix factorization models for recommender systems. In IJCAI, volume 17, pages 3203–3209. Melbourne, Australia.

Junjie Zhang, Ruobing Xie, Yupeng Hou, Wayne Xin Zhao, Leyu Lin, and Ji-Rong Wen. 2023. Recommendation as instruction following: A large language model empowered recommendation approach. arXiv preprint arXiv:2305.07001.

Tingting Zhang, Pengpeng Zhao, Yanchi Liu, Victor S Sheng, Jiajie Xu, Deqing Wang, Guanfeng Liu, Xiaofang Zhou, et al. 2019. Feature-level deeper self-attention network for sequential recommendation. In IJCAI, pages 4320–4326.

Kun Zhou, Hui Wang, Wayne Xin Zhao, Yutao Zhu, Sirui Wang, Fuzheng Zhang, Zhongyuan Wang, and Ji-Rong Wen. 2020. S3-Rec: Self-supervised learning for sequential recommendation with mutual information maximization. In Proceedings of the 29th ACM International Conference on Information & Knowledge Management, pages 1893–1902.
Dataset             # Users   # Items   # Interactions   Sparsity (%)
Toys & Games        19,412    11,924    167,597          99.93
Beauty              22,363    12,101    198,502          99.93
Sports & Outdoors   35,598    18,357    296,337          99.95

Table 7: Statistics of the datasets.

A P5-XL Experimental Setting and Additional Results

A.1 Experimental Setting

We generate P5 prompts using the source code provided by the P5 authors⁸. However, for a fair comparison, we update the data pre-processing to be consistent with our method and the other baselines. Specifically, we apply random instead of sequential indexing when mapping the item IDs. As pointed out by Rajput et al. 2023, the sequential indexing of items (e.g., the purchase sequence of the first user in Toys & Games is mapped into '1, 2, 3, 4, 5, 6, 7') in the original P5 pre-processing leads to data leakage (e.g., given the train items, i.e., '1, 2, 3, 4, 5, 6', the LLM can easily infer the test item, i.e., '7'). Therefore, we adopt random mapping (i.e., consecutive or similar-looking IDs may not imply any connection), which is consistent with our method. In addition, the original P5 pre-processing adopts a leave-one-out split for retrieval and ranking, while splitting the dataset by 0.8:0.1:0.1 for the training, validation, and testing of rating prediction. This could result in data leakage, as the test interactions of one task might be included in the training set of another task. We instead adopt a leave-one-out data split for all three recommendation tasks, which is consistent with our proposed method as well as the other baselines.

⁸ https://github.com/jeykigung/P5

For a fair comparison, we apply the same backbone (FLAN-T5-XL), fine-tuning steps (70,000), batch size (16), and learning rate (0.001) as adopted by our proposed method. Following the original P5 code, we fine-tune a unified model with prompts of their proposed five task families (rating, sequential recommendation, explanation, review, and direct recommendation; the sequential recommendation and direct recommendation families are weighted 5 times higher than the rest). In Tables 1, 2, and 3, we adopt prompt templates 2-1, 2-7, and 1-4 for evaluating the retrieval, ranking, and rating prediction performance of the P5-XL model, as these templates better suit the forms of the recommendation tasks (introduced in the second subsection of Section 4.1) than the other templates.

A.2 P5-XL vs. P5

Please note that the retrieval results of P5 in Table 1 are cited from Rajput et al. 2023 rather than the original P5 paper (Geng et al., 2022). This is because the original P5 experiments cannot be reproduced upon fixing the information leakage issues discussed in the previous section. Meanwhile, Rajput et al. 2023 does not report the ranking and rating prediction performances of P5. To fully evaluate P5, we train a P5-XL model following the experimental setting detailed in the previous section, and report its performance on all three tasks in Tables 1 to 3.

P5-XL performs worse than P5 in Table 1, which is likely owing to the differences in their training data. Specifically, P5 was only trained on retrieval prompts (as indicated in Appendix D of Rajput et al. 2023), while, following the original P5 paper, P5-XL is trained on all five task families of P5 prompts, including explanation generation and review summarization tasks. We hypothesize that these additional data samples are very different from the evaluated tasks (retrieval, ranking, and rating prediction), causing negative transfer to the evaluated tasks.

A.3 Additional Results

In Table 8, we report the ranking results of P5-XL evaluated with prompt template 5-5. We can tell that P5-XL (5-5) falls slightly behind P5-XL. Our proposed UT greatly outperforms both P5-XL and P5-XL (5-5), which again verifies that our proposed recommendation task prompts are more informative than the P5 ones.
                 Toys & Games                                      Beauty                                            Sports & Outdoors
Methods          NDCG@5  NDCG@10  HR@1    HR@5    HR@10          NDCG@5  NDCG@10  HR@1    HR@5    HR@10          NDCG@5  NDCG@10  HR@1    HR@5    HR@10
P5-XL            0.0290  0.0444   0.0097  0.0494  0.0977         0.0298  0.0456   0.0110  0.0498  0.0992         0.0286  0.0436   0.0097  0.0486  0.0957
P5-XL (5-5)      0.0274  0.0428   0.0089  0.0467  0.0948         0.0289  0.0443   0.0093  0.0497  0.0982         0.0275  0.0426   0.0091  0.0470  0.0943
UT [Ours]        0.1536  0.1867   0.0831  0.2233  0.3259         0.1236  0.1537   0.0609  0.1863  0.2798         0.0867  0.1137   0.0381  0.1362  0.2202

Table 8: Additional P5-XL ranking results. Rows 1 and 3 are copied from Table 2.

NDCG NDCG HR HR Input: What’s the title of I1014? Output: Women’s Dry-fit Tempo Shorts
Methods
@5 @10 @5 @10 Input: What’s the brand of I1014? Output: Nike
UT [Ours] 0.0079 0.0101 0.0118 0.0187 Input: What’s the price of I1014? Output: $31.8
UT+IE [Ours] 0.0076 0.0097 0.0121 0.0185 …

Table 9: Retrieval results on Sports & Outdoors with Figure 3: Item embedding (IE) data samples.
(UT+IE) or without (UT) IE data samples. Row 1 is
copied from Table 1.

[Figure: overview of the proposed pipeline. Step 1: Generate Recommendation & Auxiliary Task Data Samples; Step 2: Multi-task Fine-tuning (LLM Backbone); Step 3: Prediction.]

C.1 Pseudo Code for Data Sample Generation

Algorithm 1 presents the pseudo code for generating our proposed recommendation-task and auxiliary-task data samples.

C.2 Statistics of the Data Samples

Table 10 presents the statistics of our proposed recommendation-task and auxiliary-task data samples. Consider the recommendation-task data samples: their training data samples are generated by sliding a window of size w = 20 over the training split of each user sequence. The validation data samples consider only 3,000 users per dataset for cost-efficient validation. We test on all users; the counts of the testing data samples therefore equal the total numbers of users in the datasets. The auxiliary-task data samples, on the other hand, are generated using only the training splits. Notably, during training, we apply dynamic sampling, which decides the negative items in the BPR data samples, as well as the masked items/tokens in the MIM/MLM data samples, on the fly. Such dynamic sampling helps to fully fine-tune the LLM backbones.
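
To make the dynamic sampling concrete, the following is a minimal Python sketch of drawing a fresh BPR negative and fresh MIM masks at every training step; the helper names, the 20% mask rate, and the toy data are illustrative assumptions rather than our exact implementation.

import random

# A minimal sketch of dynamic sampling, assuming items are plain ID
# strings. Helper names and the 20% mask rate are illustrative
# assumptions, not our exact settings.

def sample_bpr_negative(item_pool, user_items, rng):
    """Draw one negative item that the user has not interacted with."""
    neg = rng.choice(item_pool)
    while neg in user_items:
        neg = rng.choice(item_pool)
    return neg

def mask_items(sequence, rng, mask_rate=0.2):
    """Replace a random subset of items with the [masked item] token."""
    masked, targets = [], []
    for item in sequence:
        if rng.random() < mask_rate:
            masked.append("[masked item]")
            targets.append(item)
        else:
            masked.append(item)
    return masked, targets

rng = random.Random()
seq = ["I9762", "I8123", "I158", "I5324", "I7522"]
pool = [f"I{i}" for i in range(12000)]

# Because sampling happens inside the training loop, each epoch sees
# different negatives and masks for the same user sequence.
for epoch in range(2):
    neg = sample_bpr_negative(pool, set(seq), rng)
    masked_seq, targets = mask_items(seq, rng)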

C.3 Examples of the Data Samples

In Table 11, we present examples of our proposed data samples. These data samples are generated with the training data split of an Amazon - Toys & Games user whose ID is 'A12HF3UBDV34RR'. Note that, to fully fine-tune the LLM backbone, we apply dynamic sampling for the BPR and MIM/MLM data samples and decide the negative items and masked items/tokens on the fly. Here, we only present the BPR, MIM, and MLM data samples resulting from a single sampling.

D Mimicking Item Embedding

Our proposed data samples introduced in the main paper encode item correlations encompassed in users' preferences. We also explore encoding item correlations encompassed in item contents, i.e., categories, descriptions, etc.

We observe that conventional context-aware recommenders commonly integrate item contents to help the model better understand the items and achieve enhanced performance. E.g., Hou et al. (2022) embed the concatenations of item content fields with BERT (Devlin et al., 2019). The learned item embeddings, X ∈ R^{N×d}, where N is the number of items and d is the dimension of the vector space, serve as initial representations of the items.

We mimic this item embedding (IE) process with natural language prompts. As shown in Figure 3, by asking questions about the properties of an item in the input and answering them in the output, we can generate item embedding data samples such as 'Input: What's the brand of I1014? Output: Nike'. We repeat this question-answering process for the various available item content fields, including title, categories, brand, price, attributes, and descriptions. These data samples represent knowledge about the items, but in natural language rather than as numerical vectors. We expect that tuning LLMs with IE data samples can help them comprehend the items in the target recommendation domain and enhance their performance.

To evaluate the IE data samples, we tune a UT+IE model, which augments the fine-tuning data of our UT model with IE data samples (the rest of the experimental settings of UT+IE and UT remain the same). We present its retrieval performance on Sports & Outdoors in Table 9. We observe no noticeable performance increase when incorporating the IE data samples. The reason might be that the raw item content fields are noisy. E.g., the description field is long and can contain noise such as hashtags and URLs. It has been shown (Cao et al., 2023) that pre-processing the raw fields to extract fine-grained features helps to enhance context-aware recommenders. Inspired by this, in the future, we plan to improve the IE data samples by refining the item content fields.
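
As a concrete illustration of the question-answering scheme in Figure 3, below is a minimal sketch of generating IE samples from an item's raw content fields; the question phrasings follow Figure 3, while the dictionary layout and function name are assumptions made for illustration.

# Sketch of IE data-sample generation, assuming each item is a
# dictionary of raw content fields. Question phrasings follow Figure 3;
# the data layout and function name are illustrative assumptions.

QUESTIONS = {
    "title": "What's the title of {iid}?",
    "brand": "What's the brand of {iid}?",
    "price": "What's the price of {iid}?",
    "categories": "What are the categories of {iid}?",
    "description": "What's the description of {iid}?",
}

def make_ie_samples(iid, item):
    """Yield one (input, output) pair per available content field."""
    for field, question in QUESTIONS.items():
        value = item.get(field)
        if value:  # skip fields missing for this item
            yield question.format(iid=iid), str(value)

item = {"title": "Women's Dry-fit Tempo Shorts", "brand": "Nike", "price": "$31.8"}
for inp, out in make_ie_samples("I1014", item):
    print(f"Input: {inp} Output: {out}")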

Task               Toys & Games                    Beauty                          Sports & Outdoors
                   # Train   # Valid   # Test      # Train   # Valid   # Test      # Train   # Valid   # Test
Retrieval          30,761    3,000     19,412      36,582    3,000     22,363      47,320    3,000     35,598
Ranking            30,761    3,000     19,412      36,582    3,000     22,363      47,320    3,000     35,598
Rating prediction  30,761    3,000     19,412      36,582    3,000     22,363      47,320    3,000     35,598
MIM                DS        0         0           DS        0         0           DS        0         0
MLM                DS        0         0           DS        0         0           DS        0         0
BPR                DS        0         0           DS        0         0           DS        0         0

Table 10: Statistics of our proposed data samples. DS stands for dynamic sampling.
Task Data sample
Retrieval Input: A user has purchased the following Amazon products (arranged in chronological order, from earliest to most recent): Item ID: I9762, Title:
Winstonia’s 8 Wheels Combo Set Nail Art Polymer Slices Fimo Decal Pieces Accessories - Butterflies, Bows, Animals, Fruit, Flowers, Dragonflies,
Cupcakes, Hearts; Item ID: I8123, Title: MASH Rhinestones 2400 Piece 12 Color Nail Art Nailart Manicure Wheels; Item ID: I158, Title: Aveeno Clear
Complexion Daily Moisturizer, 4 Ounce; Item ID: I5324, Title: Bdellium Tools Professional Antibacterial Makeup Brush Studio Line - Precision Kabuki
Airbrushed Effect 957; Item ID: I7522, Title: Bdellium Tools Professional Makeup Brush Green Bambu Series Smoky Eyes 5pc. Brush Set; Item ID:
I7647, Title: real Techniques Stippling Brush; Item ID: I7811, Title: Maybelline New York Color Sensational High Shine Lipcolor, Coral Lustre 840, 0.12
Ounce; Item ID: I9440, Title: Bed Head BH313 Orange Crush 1-inch Styler; Item ID: I5046, Title: Herstyler Baby Curl Curling Iron, Purple; What would
the user buy next?
Output: I3977
Ranking Input: A user has purchased the following Amazon products (arranged in chronological order, from earliest to most recent): Item ID: I9762, Title:
Winstonia’s 8 Wheels Combo Set Nail Art Polymer Slices Fimo Decal Pieces Accessories - Butterflies, Bows, Animals, Fruit, Flowers, Dragonflies,
Cupcakes, Hearts; Item ID: I8123, Title: MASH Rhinestones 2400 Piece 12 Color Nail Art Nailart Manicure Wheels; Item ID: I158, Title: Aveeno Clear
Complexion Daily Moisturizer, 4 Ounce; Item ID: I5324, Title: Bdellium Tools Professional Antibacterial Makeup Brush Studio Line - Precision Kabuki
Airbrushed Effect 957; Item ID: I7522, Title: Bdellium Tools Professional Makeup Brush Green Bambu Series Smoky Eyes 5pc. Brush Set; Item ID:
I7647, Title: real Techniques Stippling Brush; Item ID: I7811, Title: Maybelline New York Color Sensational High Shine Lipcolor, Coral Lustre 840, 0.12
Ounce; Item ID: I9440, Title: Bed Head BH313 Orange Crush 1-inch Styler; Item ID: I5046, Title: Herstyler Baby Curl Curling Iron, Purple; Which of the
following candidate items would you recommend the user to buy next? Candidate items are: I10537, I11849, I2647, I10506, I377, I8136, I3598, I2316,
I114, I10379, I6767, I2801, I4687, I3446, I7222, I5925, I4608, I2226, I2279, I11708, I4376, I8771, I6502, I8650, I7006, I11350, I6716, I4690, I11303,
I3446, I8704, I4001, I9816, I1498, I6896, I1598, I7653, I2086, I12019, I3235, I12052, I27, I5786, I9936, I697, I10050, I447, I10898, I2093, I2618, I2044,
I2618, I6924, I2769, I8117, I10772, I9252, I4668, I6982, I2234, I9894, I9441, I6514, I5519, I8620, I710, I10212, I8654, I7648, I11054, I1419, I10958,
I334, I576, I1537, I8278, I3181, I189, I3510, I7974, I6010, I11187, I6465, I9596, I9356, I311, I2313, I7117, I9249, I643, I6732, I8803, I5499, I2434,
I3977, I10691, I10707, I5553, I7999, I8672.
Output: I3977
Rating prediction Input: A user likes the following Amazon products: Item ID: I7522, Title: Bdellium Tools Professional Makeup Brush Green Bambu Series Smoky Eyes
5pc. Brush Set; Item ID: I7811, Title: Maybelline New York Color Sensational High Shine Lipcolor, Coral Lustre 840, 0.12 Ounce; The user dislikes the
following Amazon products: Item ID: I7647, Title: real Techniques Stippling Brush; Item ID: I9440, Title: Bed Head BH313 Orange Crush 1-inch Styler;
Item ID: I5046, Title: Herstyler Baby Curl Curling Iron, Purple; Predict whether the user would like the following item. Answer yes or no. Item ID: I3977,
Title: L’Oreal Paris HiP Studio Secrets Professional Color Truth Cream Eyeliner, Brown, 0.159 Ounce
Output: no
MIM Input: A user has purchased the following Amazon products (arranged in chronological order, from earliest to most recent): Item ID: I9762, Title:
Winstonia’s 8 Wheels Combo Set Nail Art Polymer Slices Fimo Decal Pieces Accessories - Butterflies, Bows, Animals, Fruit, Flowers, Dragonflies,
Cupcakes, Hearts; [masked item]; Item ID: I158, Title: Aveeno Clear Complexion Daily Moisturizer, 4 Ounce; Item ID: I5324, Title: Bdellium Tools
Professional Antibacterial Makeup Brush Studio Line - Precision Kabuki Airbrushed Effect 957; Item ID: I7522, Title: Bdellium Tools Professional
Makeup Brush Green Bambu Series Smoky Eyes 5pc. Brush Set; [masked item]; Item ID: I7811, Title: Maybelline New York Color Sensational High
Shine Lipcolor, Coral Lustre 840, 0.12 Ounce; Item ID: I9440, Title: Bed Head BH313 Orange Crush 1-inch Styler; Item ID: I5046, Title: Herstyler Baby
Curl Curling Iron, Purple; Item ID: I3977, Title: L’Oreal Paris HiP Studio Secrets Professional Color Truth Cream Eyeliner, Brown, 0.159 Ounce; What
are the masked items, in chronological order?
Output: Item ID: I8123, Title: MASH Rhinestones 2400 Piece 12 Color Nail Art Nailart Manicure Wheels; Item ID: I7647, Title: real Techniques
Stippling Brush;
MLM Input: Item ID: I7811, Title: Maybelline New York Color Sensational High Shine Lipcolor, Coral Lustre 840, 0.12 Ounce; Item ID: I9440, Title: Bed
Head BH313 Orange Crush 1-inch Styler;
BPR Input: A user has purchased the following Amazon products (arranged in chronological order, from earliest to most recent): Item ID: I9762, Title:
Winstonia’s 8 Wheels Combo Set Nail Art Polymer Slices Fimo Decal Pieces Accessories - Butterflies, Bows, Animals, Fruit, Flowers, Dragonflies,
Cupcakes, Hearts; Item ID: I8123, Title: MASH Rhinestones 2400 Piece 12 Color Nail Art Nailart Manicure Wheels; Item ID: I158, Title: Aveeno Clear
Complexion Daily Moisturizer, 4 Ounce; Item ID: I5324, Title: Bdellium Tools Professional Antibacterial Makeup Brush Studio Line - Precision Kabuki
Airbrushed Effect 957; Item ID: I7522, Title: Bdellium Tools Professional Makeup Brush Green Bambu Series Smoky Eyes 5pc. Brush Set; Item ID:
I7647, Title: real Techniques Stippling Brush; Item ID: I7811, Title: Maybelline New York Color Sensational High Shine Lipcolor, Coral Lustre 840, 0.12
Ounce; Item ID: I9440, Title: Bed Head BH313 Orange Crush 1-inch Styler; Item ID: I5046, Title: Herstyler Baby Curl Curling Iron, Purple; Which of the
following two items would the user buy next? Item ID: I4168, Title: Sulfur Soap with Lanolin; Item ID: I3977, Title: L’Oreal Paris HiP Studio Secrets
Professional Color Truth Cream Eyeliner, Brown, 0.159 Ounce;
Output: Item ID: I3977, Title: L’Oreal Paris HiP Studio Secrets Professional Color Truth Cream Eyeliner, Brown, 0.159 Ounce;
Table 11: Examples of our proposed data samples.
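
To show how such a sample is assembled, here is a small sketch that renders the Retrieval input string of Table 11 from a user's (item ID, title) history; the prompt wording mirrors the table, and the function and variable names are illustrative assumptions.

# Sketch of rendering the Retrieval data sample of Table 11 from a
# purchase history. The prompt wording mirrors the table; the function
# and variable names are illustrative assumptions.

def render_retrieval_sample(history, next_item_id):
    """history: list of (item_id, title) pairs in chronological order."""
    parts = " ".join(f"Item ID: {iid}, Title: {title};" for iid, title in history)
    prompt = (
        "A user has purchased the following Amazon products (arranged in "
        "chronological order, from earliest to most recent): "
        f"{parts} What would the user buy next?"
    )
    return {"input": prompt, "output": next_item_id}

sample = render_retrieval_sample(
    [("I9762", "Winstonia's 8 Wheels Combo Set Nail Art Polymer Slices"),
     ("I8123", "MASH Rhinestones 2400 Piece 12 Color Nail Art Manicure Wheels")],
    "I3977",
)
print(sample["input"])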
Algorithm 1: Generate Data Samples

Input: Raw interactions, data sample templates for recommendation and auxiliary tasks, data_split ∈ {Train, Valid, Test}, window size w, candidate pool size c
Output: Data samples D

1   I ← a set of unique items (shuffled and mapped to short IDs)
2   S ← a list of chronologically ordered user purchase sequences
3   D ← {}
4   for s ∈ S do
5       if data_split = Train then
6           s_sub ← all subsequences of the training split of s, each of length up to w
7       if data_split = Valid then
8           s_sub ← a subsequence of s that ends with the validation item; preceding items beyond w are truncated
9       if data_split = Test then
10          s_sub ← a subsequence of s that ends with the test item; preceding items beyond w are truncated
11      for ss ∈ s_sub do
12          for task ∈ {Retrieval, Ranking, Rating prediction} do
13              if task = Ranking then
14                  neg ← sample c − 1 negative items from I \ s
15              Generate a data sample d with ss, the task template, and neg (for Ranking only)
16              Add d to D
17          if data_split = Train then
18              for task ∈ {MIM, MLM, BPR} do
19                  if task = BPR then
20                      neg ← sample 1 negative item from I \ s
21                  Generate a data sample d with ss, the task template, and neg (for BPR only)
22                  Add d to D
23  return D
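
For readers who prefer running code, the following is a minimal Python sketch of the Train branch of Algorithm 1; the render stand-in for the task templates, the toy sequences, and the small candidate pool size are assumptions made for illustration.

import random

W = 20  # window size w
C = 5   # candidate pool size c (kept small for this toy example)

def render(task, subseq, neg=None):
    """Stand-in for the task templates; a real implementation would emit
    the natural-language prompts shown in Table 11."""
    return {"task": task, "sequence": subseq, "negatives": neg}

def generate_train_samples(user_sequences, seed=0):
    """Sketch of Algorithm 1 for data_split = Train (lines 5-6, 11-22)."""
    rng = random.Random(seed)
    items = {i for s in user_sequences for i in s}  # line 1: unique items
    samples = []                                    # line 3
    for s in user_sequences:                        # line 4
        # Line 6: subsequences of the training split, each of length up to W.
        subseqs = [s[max(0, end - W):end] for end in range(2, len(s) + 1)]
        for ss in subseqs:                          # line 11
            pool = sorted(items - set(s))           # items the user never bought
            for task in ("Retrieval", "Ranking", "Rating prediction"):
                neg = None
                if task == "Ranking":               # lines 13-14
                    neg = rng.sample(pool, min(C - 1, len(pool)))
                samples.append(render(task, ss, neg))
            for task in ("MIM", "MLM", "BPR"):      # lines 17-18
                neg = None
                if task == "BPR":                   # lines 19-20
                    neg = rng.choice(pool)
                samples.append(render(task, ss, neg))
    return samples

seqs = [["I1", "I2", "I3", "I4"], ["I5", "I2", "I6"]]
print(len(generate_train_samples(seqs)))  # 5 subsequences x 6 tasks = 30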