Aligning Large Language Models With Recommendation Knowledge
Figure 1: Data samples adopted by the existing studies and this work. (a) shows the recommendation-task data
samples of the existing studies. Specifically, (a1)-(a3) demonstrate the retrieval, ranking, and rating prediction data
samples of P5 (Geng et al., 2022); (a4) shows a ranking (type <P1, I0, T3>) data sample of InstructRec (Zhang et al.,
2023); (a5) is a rating prediction data sample of TALLRec (Bao et al., 2023). (b) shows our recommendation-task
(blue boxes) and auxiliary-task (purple boxes) data samples (we present more samples in Appendix C).
are typically used to train conventional recommender systems, namely, masked item modeling (MIM) (Sun et al., 2019) and Bayesian Personalized Ranking (BPR) (Rendle et al., 2009). Our key innovation lies in converting the MIM and BPR tasks into natural language tasks that can be used to train the LLMs. We also incorporate the masked language modeling (MLM) (Devlin et al., 2019) task for the user's past interactions to supplement the MIM task with fine-grained item correlations. Our contributions can be summarized as follows:

• We propose a novel method to align LLMs with new recommendation domains, i.e., supplementing the fine-tuning of the LLMs with auxiliary-task data samples that mimic, with natural language prompts, the classical operations used to train conventional recommender systems.

• We propose recommendation-task data samples that are more informative than those of the existing work (Geng et al., 2022). Specifically, we reduce the complexity of the input/output spaces by eliminating the user IDs. We further enhance the user sequences by providing item titles.

• We fine-tune the open-source 3B FLAN-T5-XL and 223M FLAN-T5-Base with our proposed recommendation-task and auxiliary-task data samples in a simple multi-task learning framework. Experiments on various recommendation tasks, i.e., retrieval, ranking, and rating prediction, across three target domains, i.e., Amazon Toys & Games, Beauty, and Sports & Outdoors, show the effectiveness of our proposed method and its components. For retrieval, our model outperforms both conventional and LLM-based baselines, including the current SOTA, by large margins.

2 Related Work

Recommender Systems. Recommender systems help users discover items of interest. As a practical approach, Collaborative Filtering (CF) (Mao et al., 2021) explores historical user-item interactions, assuming that users with similar behaviors have similar preferences for items. Among various CF methods, the widely adopted Matrix Factorization (MF) methods (Rendle et al., 2009; Mao et al., 2021) project users and items into a shared vector space and estimate a user's preference for an item through the inner product of their vectors. Context-aware approaches (Cheng et al., 2016) further include additional information, such as user and contextual features, to improve recommendation quality. However, CF fails to capture the sequential patterns in users' behaviors, which leads to the rise of sequential recommendation.
Sequential recommenders based on Convolutional Neural Networks (CNNs) (Tang and Wang, 2018), Gated Recurrent Units (GRUs) (Hidasi et al., 2016), and self-attention (Sun et al., 2019; Zhang et al., 2019; Kang and McAuley, 2018; Zhou et al., 2020; Rajput et al., 2023) have become prevalent in the era of deep learning. Notably, leveraging a T5-like backbone, Rajput et al. 2023 formalize recommendation as generative retrieval, i.e., autoregressively decoding the identifiers of the target items, and achieve the current SOTA. While their model structurally resembles LLMs, it lacks their pre-training knowledge and the accompanying natural language reasoning potential. Our proposed approach adopts self-attention for sequential recommendation, specifically harnessing LLMs as backbones. We compare against various baselines from all the classes discussed above.

LLMs for Recommendation. LLMs have recently been explored for recommendation tasks due to their ability to understand, generate, and reason with natural language. Several studies focus on incorporating LLMs' natural language capabilities into existing recommendation techniques. E.g., Hou et al. 2022 and Cao et al. 2023 encode item contents (title, description, etc.) with BERT (Devlin et al., 2019), which enables learning semantically informed embeddings even for zero-shot items. Moreover, pre-trained LLM backbones have also been used for recommendation through zero-shot learning (Kang et al., 2023), in-context learning (Kang et al., 2023), fine-tuning (Cui et al., 2022; Kang et al., 2023), and instruction tuning (Geng et al., 2022; Zhang et al., 2023; Bao et al., 2023). Besides helping classic recommendation tasks, LLMs also enable novel recommendation use cases. Geng et al. 2022 leverage LLMs to explain the recommendation results. Gao et al. 2023; Wang and Lim 2023 utilize GPT-3 (Brown et al., 2020) for conversational recommendation. Christakopoulou et al. 2023 extract persistent user interests with LLMs for deeper user understanding. Carranza et al. 2023 generate private synthetic representations of the original data with LLMs for privacy-preserving recommendation.

Recommendation as Instruction-following. The success of instruction tuning, i.e., fine-tuning on data described via instructions (Mishra et al., 2022; Wei et al., 2022), has inspired attempts to instruction-tune LLM backbones for recommendation tasks. Geng et al. 2022 formalize various recommendation tasks as natural language instructions and fine-tune a unified recommender with a T5 (Raffel et al., 2020) backbone. Zhang et al. 2023 further supplement the tuning data with user preferences/intentions deduced by GPT-3.5¹ to accommodate instructions of free forms. Bao et al. 2023 explore instruction tuning LLMs with limited data. In contrast to the existing studies, our work focuses on introducing new recommendation knowledge into LLMs, which we believe is the key to improving recommenders with LLM backbones. We create auxiliary tasks that improve the recommendation tasks, including retrieval, ranking, and rating prediction. Our proposed recommendation-task and auxiliary-task data samples include raw user purchase sequences in addition to natural language instructions. These data samples supplement each other in encoding the target recommendation domain knowledge. We experiment under restricted settings: compared to the previous studies (Zhang et al., 2023), we consider larger candidate pools (e.g., our retrieval and ranking experiments consider the entire dataset and 99 hard negatives, respectively), and unlike Bao et al. 2023, we fully train all models to maximize their performances.

3 Methodology

We propose designing data samples that encode recommendation knowledge to align LLMs with the target recommendation domain. Sections 3.1 and 3.2 discuss our auxiliary-task and recommendation-task data, respectively. Section 3.3 introduces a simple multi-task learning framework that we use to fine-tune LLMs.

3.1 Auxiliary-task Data Generation

Conventional recommenders acquire recommendation knowledge via classic operations such as masked item modeling (Sun et al., 2019) and BPR loss reduction (Rendle et al., 2009). We mimic these operations with natural language prompts. In addition, we sample sub-sequences of the raw user purchase sequences. The resulting data, which we refer to as auxiliary-task data samples, encode item correlations contained in users' preferences².

¹https://platform.openai.com/docs/models/overview
²As a side note, we also explored encoding item correlations contained in item contents (categories, descriptions, etc.). Observing no noticeable performance increase, we present our approach and results in Appendix D.
3.1.1 Masked Item Modeling (MIM)

Conventional sequential recommenders (Sun et al., 2019) learn item correlations from users' interaction sequences. Specifically, they predict randomly masked items in the sequences by jointly conditioning on the unmasked items. We mimic this process, which we refer to as masked item modeling (MIM), with natural language prompts.

MIM applies a Cloze objective (Sun et al., 2019). At each training step, random items in the input user sequence are replaced with a special token "[mask]", and the model learns to recover the masked items based on their surrounding context. An example of the masking process: a sequence such as "I125, I449, I265" may become "I125, [mask], I265", with "I449" as the recovery target.

As mentioned in Section 3.1, we additionally sample sub-sequences of the raw user purchase sequences; these sub-sequences encode fine-grained item correlations contained in the users' purchase sequences. This process resembles masked language modeling (MLM) (Devlin et al., 2019). As shown in Figure 1(b5), given a user sequence, we sample a sub-sequence by randomly deciding a starting item and a sub-sequence length Ls, where 2 ≤ Ls ≤ w and w is the sliding window for accommodating long sequences. These sub-sequences, referred to as MLM data samples, supplement the MIM data samples: through span corruption (Raffel et al., 2020), i.e., masking and recovering consecutive spans of tokens, LLMs learn to model more fine-grained correlations across multiple continuous items from the MLM data samples.
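To make the two auxiliary tasks concrete, the following is a minimal Python sketch of the generation process just described. The instruction wording, function names, and item IDs are illustrative assumptions rather than our exact templates; the actual pseudo code is given in Algorithm 1 in Appendix C, and the masking ratio and window size follow Section 4.1.

```python
import random

MASK_RATIO = 0.2  # masking ratio for MIM samples (Section 4.1)
WINDOW = 20       # sliding window size w (Section 4.1)

def mim_sample(seq):
    """Masked item modeling: mask random items; recover them from context."""
    seq = seq[-WINDOW:]  # the sliding window accommodates long sequences
    masked = [i for i in range(len(seq)) if random.random() < MASK_RATIO]
    if not masked:  # guarantee at least one masked position
        masked = [random.randrange(len(seq))]
    inputs = ", ".join("[mask]" if i in masked else seq[i] for i in range(len(seq)))
    targets = ", ".join(seq[i] for i in masked)
    # Illustrative instruction wording, not the exact template:
    return f"A user bought the following items: {inputs}. Recover the masked items.", targets

def mlm_sample(seq):
    """Sub-sequence sampling: random start and length Ls with 2 <= Ls <= w."""
    seq = seq[-WINDOW:]
    length = random.randint(2, len(seq))  # assumes len(seq) >= 2 (5-core data)
    start = random.randrange(len(seq) - length + 1)
    # The raw sub-sequence is emitted as plain text; the T5-style span-corruption
    # objective then masks and recovers consecutive token spans during fine-tuning.
    return ", ".join(seq[start:start + length])

print(mim_sample(["I125", "I449", "I265", "I310"]))
print(mlm_sample(["I125", "I449", "I265", "I310"]))
```

In our framework, these auxiliary samples are mixed with the recommendation-task samples during multi-task fine-tuning (Section 3.3).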
3.2 Recommendation-task Data Generation

As shown in Figure 1(a), the existing recommenders with LLM backbones adopt prompts that primarily convey the recommendation tasks by providing directions on how to perform them. Such information is essential, yet insufficient for representing the target recommendation domain.

We propose prompts that help LLMs comprehend the target recommendation domain in addition to the recommendation tasks. Specifically, we reduce the complexity of the input/output spaces. In contrast to Geng et al. 2022, we eliminate user IDs and represent the users by their historical purchases. Consequently, we relieve LLMs from memorizing a substantial volume of user IDs (e.g., Amazon Sports & Outdoors has 35,598 users). Moreover, compared to Geng et al. 2022, who represent user sequences solely by item IDs, we include both the IDs and the titles of the items, which makes it easier for LLMs to recognize the items. Notably, ranking candidates and items in the output are represented solely by their IDs to reduce the length of the prompts and maintain a smaller output space. Figures 1(b1)-(b3) show examples of our retrieval, ranking, and rating prediction recommendation-task data samples. The raw item IDs (e.g., '0000031852') are mapped into shorter ones (e.g., 'I123')³ to reduce input/output space complexity. To fully present the users' historical purchases to LLMs, we adopt a sliding window w similar to Section 3.1.1, as sketched below.

³We adopt random mapping, i.e., similar-looking IDs may not imply any connection or semantic similarity. We acknowledge that using semantic-rich IDs (Rajput et al., 2023) could enhance performance and leave the exploration to the future.
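The sketch below illustrates how such a retrieval-task sample could be assembled: raw IDs are randomly remapped to short IDs, history items carry both ID and title, and the output is the short ID alone. The instruction text, helper names, and catalog entries are hypothetical; the exact templates appear in Figure 1(b1) and Appendix C.

```python
import random

def build_id_map(raw_ids):
    """Map raw item IDs (e.g., '0000031852') to short IDs (e.g., 'I123').
    The assignment is shuffled, so similar-looking short IDs imply no
    connection or semantic similarity (footnote 3)."""
    shorts = [f"I{i}" for i in range(len(raw_ids))]
    random.shuffle(shorts)
    return dict(zip(raw_ids, shorts))

def retrieval_sample(history, target, id_map, titles, w=20):
    """History items appear as 'ID (title)' pairs; the target is the ID alone,
    keeping prompts short and the output space small. The wording below is an
    illustrative assumption, not the exact template."""
    shown = [f"{id_map[i]} ({titles[i]})" for i in history[-w:]]  # sliding window w
    prompt = ("A user bought the following items: " + ", ".join(shown) +
              ". Predict the next item the user will buy.")
    return prompt, id_map[target]

# Hypothetical catalog entries, for illustration only:
titles = {"0000031852": "Monopoly: Classic Edition", "B00004TFLB": "Uno Card Game"}
id_map = build_id_map(list(titles))
print(retrieval_sample(["0000031852"], "B00004TFLB", id_map, titles))
```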
3.3 Fine-tuning and Evaluation Framework

Figure 2: Fine-tuning and evaluation framework. [Diagram: an LLM backbone undergoes multi-task fine-tuning, followed by evaluation.]

As shown in Figure 2, we adopt a simple framework to fine-tune the LLM backbones and evaluate the resulting models.

4 Experiments

We evaluate the proposed method and compare it with conventional as well as LLM-based recommenders. We aim to answer the following research questions: RQ1. Can our method introduce knowledge into LLMs from new recommendation domains? RQ2. How does our model perform compared to the conventional as well as LLM-based recommenders in retrieval, ranking, and rating prediction? RQ3. How beneficial are the individual proposed tasks? RQ4. What is the effect of varying the size of the backbone LLM?

4.1 Experimental Setting

Datasets. We experiment on three real-world datasets: Amazon Toys & Games, Beauty, and Sports & Outdoors⁴. Following Zhou et al. 2020; Geng et al. 2022, we keep 5-core data and apply leave-one-out evaluation, i.e., for each user purchase sequence (where the interactions are sorted by timestamp in ascending order), the last, the second-to-last, and the prior interactions are used for testing, validation, and training, respectively (a minimal sketch of this split follows). We present data statistics in Appendix B.

⁴https://nijianmo.github.io/amazon/
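The following is a minimal sketch of this leave-one-out protocol, assuming each user's interactions are already sorted by timestamp in ascending order; the data structures are illustrative.

```python
def leave_one_out(user_sequences):
    """For each sorted purchase sequence, hold out the last interaction for
    testing and the second-to-last for validation; the rest are for training.
    5-core filtering guarantees at least five interactions per user."""
    train, valid, test = {}, {}, {}
    for user, seq in user_sequences.items():
        train[user] = seq[:-2]             # prior interactions
        valid[user] = (seq[:-2], seq[-2])  # (history, held-out item)
        test[user] = (seq[:-1], seq[-1])
    return train, valid, test
```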
Recommendation Tasks. We evaluate on three established recommendation tasks: retrieval, which retrieves the ground truth item that a user interacted with from the entire dataset; ranking, which chooses the ground truth item that a user interacted with from a candidate pool of size 100 (1 positive item and 99 negative items sampled based on popularity); and rating prediction, which classifies an interaction as either "like" or "dislike" (interactions with ratings > 3 are considered "like"s). We leave the exploration and evaluation of novel recommendation tasks (e.g., explanation generation) to the future, due to a lack of ground-truth data.

Evaluation Metrics. For retrieval and ranking, we report the top-k Hit Ratio (HR@k) and Normalized Discounted Cumulative Gain (NDCG@k), where k is set to 5/10 and 1/5/10, respectively. For rating prediction, we report the Area Under the Receiver Operating Characteristic Curve (AUC-ROC). A sketch of the retrieval/ranking metrics follows.
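As a reference for how these numbers are computed under leave-one-out evaluation (one relevant item per user), a minimal sketch follows; per-user scores are averaged over all test users.

```python
import math

def hr_at_k(ranked, target, k):
    """HR@k: 1 if the held-out item appears among the top-k predictions."""
    return 1.0 if target in ranked[:k] else 0.0

def ndcg_at_k(ranked, target, k):
    """NDCG@k: with a single relevant item, the ideal DCG is 1, so the score
    reduces to 1/log2(position + 2) for a hit at 0-indexed `position`."""
    for position, item in enumerate(ranked[:k]):
        if item == target:
            return 1.0 / math.log2(position + 2)
    return 0.0

print(hr_at_k(["I7", "I3", "I9"], "I3", k=5))    # 1.0
print(ndcg_at_k(["I7", "I3", "I9"], "I3", k=5))  # 1/log2(3) ~= 0.6309
```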
              Toys & Games                      Beauty                            Sports & Outdoors
Methods       NDCG@5  NDCG@10 HR@5    HR@10    NDCG@5  NDCG@10 HR@5    HR@10    NDCG@5  NDCG@10 HR@5    HR@10
Caser¹        0.0107  0.0141  0.0166  0.0270   0.0131  0.0176  0.0205  0.0347   0.0072  0.0097  0.0116  0.0194
HGN¹          0.0221  0.0277  0.0321  0.0497   0.0206  0.0266  0.0325  0.0512   0.0120  0.0159  0.0189  0.0313
GRU4Rec¹      0.0059  0.0084  0.0097  0.0176   0.0099  0.0137  0.0164  0.0283   0.0086  0.0110  0.0129  0.0204
BERT4Rec¹     0.0071  0.0099  0.0116  0.0203   0.0124  0.0170  0.0203  0.0347   0.0075  0.0099  0.0115  0.0191
FDSA¹         0.0140  0.0189  0.0228  0.0381   0.0163  0.0208  0.0267  0.0407   0.0122  0.0156  0.0182  0.0288
SASRec¹       0.0306  0.0374  0.0463  0.0675   0.0249  0.0318  0.0387  0.0605   0.0154  0.0192  0.0233  0.0350
S³-Rec¹       0.0294  0.0376  0.0443  0.0700   0.0244  0.0327  0.0387  0.0647   0.0161  0.0204  0.0251  0.0385
TIGER²        0.0371  0.0432  0.0521  0.0712   0.0321  0.0384  0.0454  0.0648   0.0181  0.0225  0.0264  0.0400
P5²           0.0050  0.0066  0.0070  0.0121   0.0107  0.0136  0.0163  0.0254   0.0041  0.0052  0.0061  0.0095
P5-XL         0.0023  0.0031  0.0035  0.0061   0.0036  0.0050  0.0063  0.0104   0.0029  0.0035  0.0040  0.0060
FLAN-T5-Base  0.0000  2e−5    0.0000  5e−5     0.0000  0.0000  0.0000  0.0000   0.0000  9e−6    0.0000  3e−5
FLAN-T5-XL    2e−5    2e−5    5e−5    5e−5     0.0000  0.0000  0.0000  0.0000   0.0000  0.0000  0.0000  0.0000
ReAT [Ours]   0.0390  0.0461  0.0558  0.0776   0.0382  0.0442  0.0535  0.0722   0.0188  0.0232  0.0285  0.0422
UT [Ours]     0.0166  0.0202  0.0252  0.0362   0.0188  0.0231  0.0292  0.0425   0.0079  0.0101  0.0118  0.0187
UT+AT [Ours]  0.0392  0.0459  0.0563  0.0772   0.0329  0.0397  0.0482  0.0693   0.0178  0.0219  0.0268  0.0393
∆ (%)         +5.66   +6.71   +8.06   +8.99    +19.00  +15.10  +17.84  +11.42   +3.87   +3.11   +7.95   +5.50

Table 1: Retrieval results. ¹ marks results from Zhou et al. 2020; ² marks results from Rajput et al. 2023. ∆ compares the best [Ours] with the best baseline.
Table 2: Ranking results. ¹ marks results from Geng et al. 2022. ∆ compares the best [Ours] with the best baseline.
Methods       Toys & Games  Beauty  Sports & Outdoors
History       66.59         64.80   62.78
DMF           51.82         51.23   51.38
Wide&Deep     70.93         67.10   67.60
P5-XL         51.04         50.63   50.36
FLAN-T5-Base  57.85         56.04   55.00
FLAN-T5-XL    55.23         53.77   52.01
RpAT [Ours]   71.16         68.27   65.87
UT [Ours]     70.79         67.45   65.35
UT+AT [Ours]  71.08         67.55   65.18
∆ (%)         +0.32         +1.74   -2.56

Table 3: Rating prediction AUC-ROC. ∆ compares the best [Ours] with the best baseline.

Models. We compare to non-LLM-based recommenders. For retrieval, we consider sequential recommenders including Caser (Tang and Wang, 2018), which leverages CNNs; HGN (Ma et al., 2019), which adopts hierarchical gating networks; GRU4Rec (Hidasi et al., 2016), which leverages GRUs (Cho et al., 2014); and BERT4Rec (Sun et al., 2019), FDSA (Zhang et al., 2019), SASRec (Kang and McAuley, 2018), S³-Rec (Zhou et al., 2020), and TIGER (Rajput et al., 2023), which leverage self-attention, with TIGER being the current SOTA. For ranking, we consider BPR-MF (Rendle et al., 2009), BPR-MLP (Cheng et al., 2016), and SimpleX (Mao et al., 2021), which are collaborative filtering-based methods. For rating prediction, we consider History, a naive method that always predicts based on how likely a user likes the training items they purchased; DMF (Xue et al., 2017), a neural matrix factorization model; and Wide&Deep (Cheng et al., 2016), a context-aware method. Besides, we also consider LLM-based methods, including P5 (Geng et al., 2022), which fine-tunes T5 (Raffel et al., 2020) with multi-task recommendation prompts; P5-XL, which fine-tunes FLAN-T5-XL with P5 prompts; and FLAN-T5-Base/XL (Wei et al., 2022), which make zero-shot predictions with FLAN-T5-Base or FLAN-T5-XL. We query them with our proposed recommendation-task data samples generated from the test set⁵. Our own models are ReAT/RaAT/RpAT, which fine-tune FLAN-T5-XL with our proposed retrieval (Re), ranking (Ra), or rating prediction (Rp) task data samples along with the auxiliary-task (AT) data samples⁶; unified training (UT), which fine-tunes FLAN-T5-XL with a combination of our proposed Re, Ra, and Rp data samples; and unified training w/ auxiliary tasks (UT+AT), which fine-tunes FLAN-T5-XL with a combination of our proposed Re, Ra, Rp, MIM, and MLM data samples.

⁵We acknowledge that our retrieval and ranking data samples (examples are shown in Figure 1 and Appendix C) utilize item IDs for matching prediction results, whereas the FLAN-T5-Base/XL models, when queried in the zero-shot setting, do not inherently predict item IDs. To address this discrepancy, text-based methods could be employed to extract item titles, descriptions, etc., from the FLAN-T5-Base/XL predictions to enhance their performance. However, employing such approaches requires an additional model for text matching, which falls beyond the scope of this work.
⁶BPR data samples are used only by RaAT, as we observe that they help ranking but not retrieval and rating prediction. MIM/MLM data samples are used by ReAT, RaAT, and RpAT.
Implementation Details. We adopt the 3B FLAN-T5-XL (Wei et al., 2022) as the backbone. We also use the 223M FLAN-T5-Base for the ablation studies in Section 4.3. Meanwhile, it is crucial to emphasize that the proposed method is not tied to a specific backbone architecture and is easily adaptable to other LLMs, such as LLaMA (Touvron et al., 2023). We set the sliding window size w to 20. For the BPR data samples, we sample the negative items based on popularity, as sketched below. For the ranking and BPR data samples, the position of the positive item in the candidate pool is always determined randomly. For the MIM and MLM data samples, we adopt a masking ratio of 20%. To fully fine-tune the LLM backbone, we apply dynamic sampling for the BPR and MIM/MLM data samples (we present details about the dynamic sampling and the statistics of our data samples in Appendix C). To reduce cost, we validate on 3,000 users. Meanwhile, testing is performed on all users. We fine-tune FLAN-T5-XL and FLAN-T5-Base for 70,000 and 10,000 steps, with batch sizes 16 and 64, respectively. We set the learning rate to 0.001 and the warm-up steps to 1,000. During prediction, we set the width of the beam search for retrieval and ranking to 20. For the unified models, i.e., UT and UT+AT, model selections are based on retrieval validation performance. We present the detailed settings of the P5-XL experiments in Appendix A. We cite the results of some baseline models from Zhou et al. 2020; Geng et al. 2022; Rajput et al. 2023. We implement DMF and Wide&Deep with RecBole⁷. We adopt the default configurations, except that the data split, mapping (ratings to "like"s or "dislike"s), and metrics are adjusted to follow our experiment settings as reported earlier. The pseudo code for generating our proposed data samples can be found in Appendix C.

⁷https://recbole.io
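To make the sampling choices above concrete, here is a minimal sketch of popularity-based negative sampling and of a pairwise BPR-style sample with a randomized positive position. The BPR prompt wording is an assumption on our part; the exact templates and the dynamic sampling procedure are given in Appendix C.

```python
import random
from collections import Counter

def popularity_negatives(all_sequences, exclude, n):
    """Draw n distinct negatives with probability proportional to item
    popularity, skipping items the user has already interacted with."""
    counts = Counter(item for seq in all_sequences.values() for item in seq)
    items, weights = zip(*counts.items())
    negatives = set()
    while len(negatives) < n:
        item = random.choices(items, weights=weights, k=1)[0]
        if item not in exclude:
            negatives.add(item)
    return list(negatives)

def bpr_sample(history, positive, negative):
    """Pairwise sample whose positive position is randomized, so the model
    cannot exploit candidate order. Illustrative wording only."""
    pair = [positive, negative]
    random.shuffle(pair)
    prompt = ("A user bought the following items: " + ", ".join(history) +
              f". Which of {pair[0]} and {pair[1]} is the user more likely to buy next?")
    return prompt, positive
```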
4.2 Overall Performance (RQ1 & RQ2)

Tables 1, 2, and 3 show the results of retrieval, ranking, and rating prediction, respectively. FLAN-T5-Base/XL exhibit suboptimal performance on retrieval and ranking. For retrieval, they show near-zero NDCGs and HRs. For ranking, they are significantly inferior to the conventional baselines. For rating prediction, they perform much higher than random guessing (50.00), outperforming DMF, but still fall behind History and Wide&Deep. This shows that the FLAN-T5 models lack recommendation knowledge, which is unsurprising considering they were not trained on recommendation tasks during pre-training or instruction tuning and are evaluated in a zero-shot setting. Moreover, we find that our proposed method effectively aligns LLMs with new recommendation domains (RQ1). In particular, by fine-tuning FLAN-T5-XL with our proposed data samples, our models significantly outperform FLAN-T5-XL on all three tasks across the datasets.

When compared to the baselines, our models show remarkable performance, especially on retrieval (RQ2). For retrieval, our ReAT outperforms TIGER, the current SOTA, by large margins across datasets and metrics. Additionally, it is essential to highlight that our method possesses the natural language reasoning potential of LLMs, which is absent in TIGER. For ranking, our RaAT greatly outperforms SimpleX, the best baseline, on Toys & Games. On Beauty, RaAT performs on par with SimpleX. On Sports & Outdoors, RaAT is inferior to the conventional recommenders on metrics such as NDCG/HR@10, yet still greatly outperforms the LLM-based baselines. Notably, the @1 performance of RaAT is always much higher than that of the conventional recommenders. For rating prediction, our RpAT outperforms Wide&Deep, the best baseline, on Toys & Games and Beauty, while lagging slightly behind it on Sports & Outdoors. These results verify that our method introduces substantial recommendation domain knowledge into LLMs, allowing them to outperform strong baselines. The relative ineffectiveness of our method on Sports & Outdoors for the ranking and rating prediction tasks could be due to the nature of the data. Specifically, our model, as a sequential recommender, relies on the sequential item correlations conveyed by the user sequences. Such signals may be relatively weak in Sports & Outdoors (e.g., the average sequence length of Sports & Outdoors is 8.32 ± 6.07, whereas those of Beauty and Toys & Games are 8.88 ± 8.16 and 8.63 ± 8.51, respectively, suggesting that Sports & Outdoors sequences are shorter and less diverse), causing our method to perform suboptimally. The best baselines, on the other hand, do not rely on such information. E.g., SimpleX is based on collaborative filtering and Wide&Deep is a context-based model. Therefore, their performances are not impacted.

Moreover, our UT greatly outperforms P5 and P5-XL across datasets and metrics. This shows that our proposed recommendation task prompts better preserve item correlations as compared to the P5 ones. Specifically, we enhance user sequence […]

#  Methods       NDCG@5  NDCG@10 HR@5    HR@10
1  TIGER         0.0371  0.0432  0.0521  0.0712
2  FLAN-T5-XL    2e−5    2e−5    5e−5    5e−5
3  2+retrieval   0.0182  0.0219  0.0273  0.0388
4  3+MLM         0.0306  0.0369  0.0443  0.0641
5  4+MIM         0.0390  0.0461  0.0558  0.0776
6  FLAN-T5-Base  0.0000  2e−5    0.0000  5e−5
7  6+retrieval   0.0149  0.0183  0.0219  0.0325
8  7+MLM         0.0219  0.0271  0.0334  0.0495
9  8+MIM         0.0242  0.0304  0.0376  0.0566

Table 4: Retrieval ablation study on Toys & Games. Rows 1, 2, 5 (equivalent to ReAT), and 6 are copied from Table 1.

#   Methods       NDCG@5  NDCG@10 HR@1    HR@5    HR@10
1   SimpleX       0.1244  0.1469  0.0268  0.1958  0.2662
2   FLAN-T5-XL    0.0160  0.0312  0.0026  0.0315  0.0793
3   2+ranking     0.1520  0.1864  0.0807  0.2218  0.3284
4   3+MLM         0.1580  0.1912  0.0854  0.2303  0.3333
5   4+MIM         0.1677  0.1976  0.0938  0.2391  0.3317
6   5+BPR         0.1714  0.2034  0.0956  0.2464  0.3453
7   FLAN-T5-Base  0.0107  0.0127  0.0057  0.0156  0.0217
8   7+ranking     0.1349  0.1654  0.0720  0.1957  0.2901
9   8+MLM         0.1481  0.1782  0.0820  0.2119  0.3051
10  9+MIM         0.1489  0.1811  0.0817  0.2141  0.3136
11  10+BPR        0.1534  0.1844  0.0844  0.2196  0.3153

Table 5: Ranking ablation study on Toys & Games.
References

Balázs Hidasi, Alexandros Karatzoglou, Linas Baltrunas, and Domonkos Tikk. 2016. Session-based recommendations with recurrent neural networks. In Proceedings of the 4th International Conference on Learning Representations.

Yupeng Hou, Shanlei Mu, Wayne Xin Zhao, Yaliang Li, Bolin Ding, and Ji-Rong Wen. 2022. Towards universal sequence representation learning for recommender systems. In Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, pages 585–593.

Wang-Cheng Kang and Julian McAuley. 2018. Self-attentive sequential recommendation. In 2018 IEEE International Conference on Data Mining (ICDM), pages 197–206. IEEE.

Wang-Cheng Kang, Jianmo Ni, Nikhil Mehta, Maheswaran Sathiamoorthy, Lichan Hong, Ed Chi, and Derek Zhiyuan Cheng. 2023. Do LLMs understand user preferences? Evaluating LLMs on user rating prediction. arXiv preprint arXiv:2305.06474.

Yehuda Koren, Robert Bell, and Chris Volinsky. 2009. Matrix factorization techniques for recommender systems. Computer, 42(8):30–37.

Chen Ma, Peng Kang, and Xue Liu. 2019. Hierarchical gating networks for sequential recommendation. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pages 825–833.

Kelong Mao, Jieming Zhu, Jinpeng Wang, Quanyu Dai, Zhenhua Dong, Xi Xiao, and Xiuqiang He. 2021. SimpleX: A simple and strong baseline for collaborative filtering. In Proceedings of the 30th ACM International Conference on Information & Knowledge Management, pages 1243–1252.

Swaroop Mishra, Daniel Khashabi, Chitta Baral, and Hannaneh Hajishirzi. 2022. Cross-task generalization via natural language crowdsourcing instructions. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics.

Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. 2020. Exploring the limits of transfer learning with a unified text-to-text transformer. The Journal of Machine Learning Research, 21(1):5485–5551.

Steffen Rendle, Christoph Freudenthaler, Zeno Gantner, and Lars Schmidt-Thieme. 2009. BPR: Bayesian personalized ranking from implicit feedback. In Proceedings of the Twenty-Fifth Conference on Uncertainty in Artificial Intelligence.

Fei Sun, Jun Liu, Jian Wu, Changhua Pei, Xiao Lin, Wenwu Ou, and Peng Jiang. 2019. BERT4Rec: Sequential recommendation with bidirectional encoder representations from transformer. In Proceedings of the 28th ACM International Conference on Information and Knowledge Management, pages 1441–1450.

Jiaxi Tang and Ke Wang. 2018. Personalized top-N sequential recommendation via convolutional sequence embedding. In Proceedings of the Eleventh ACM International Conference on Web Search and Data Mining, pages 565–573.

Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. 2023. LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971.

Lei Wang and Ee-Peng Lim. 2023. Zero-shot next-item recommendation using large pretrained language models. arXiv preprint arXiv:2304.03153.

Jason Wei, Maarten Bosma, Vincent Y. Zhao, Kelvin Guu, Adams Wei Yu, Brian Lester, Nan Du, Andrew M. Dai, and Quoc V. Le. 2022. Finetuned language models are zero-shot learners. In Proceedings of the 10th International Conference on Learning Representations.

Hong-Jian Xue, Xinyu Dai, Jianbing Zhang, Shujian Huang, and Jiajun Chen. 2017. Deep matrix factorization models for recommender systems. In IJCAI, volume 17, pages 3203–3209.

Junjie Zhang, Ruobing Xie, Yupeng Hou, Wayne Xin Zhao, Leyu Lin, and Ji-Rong Wen. 2023. Recommendation as instruction following: A large language model empowered recommendation approach. arXiv preprint arXiv:2305.07001.

Tingting Zhang, Pengpeng Zhao, Yanchi Liu, Victor S. Sheng, Jiajie Xu, Deqing Wang, Guanfeng Liu, Xiaofang Zhou, et al. 2019. Feature-level deeper self-attention network for sequential recommendation. In IJCAI, pages 4320–4326.

Kun Zhou, Hui Wang, Wayne Xin Zhao, Yutao Zhu, Sirui Wang, Fuzheng Zhang, Zhongyuan Wang, and Ji-Rong Wen. 2020. S3-Rec: Self-supervised learning for sequential recommendation with mutual information maximization. In Proceedings of the 29th ACM International Conference on Information & Knowledge Management, pages 1893–1902.
A P5-XL Experimental Setting and Additional Results

A.1 Experimental Setting

We generate P5 prompts using the source code provided by the P5 authors⁸. However, for a fair comparison, we update the data pre-processing to be consistent with our method and the other baselines. Specifically, we apply random instead of sequential indexing when mapping the item IDs. As pointed out by Rajput et al. 2023, the sequential indexing of items (e.g., the purchase sequence of the first user in Toys & Games is mapped into '1, 2, 3, 4, 5, 6, 7') in the original P5 pre-processing leads to data leakage (e.g., given the train items, i.e., '1, 2, 3, 4, 5, 6', the LLM can easily infer the test item, i.e., '7'). Therefore, we adopt random mapping (i.e., consecutive or similar-looking IDs may not imply any connection), which is consistent with our method (a sketch contrasting the two indexing schemes follows at the end of this subsection). In addition, the original P5 pre-processing adopts a leave-one-out split for retrieval and ranking, while splitting the dataset by 0.8:0.1:0.1 for the training, validation, and testing of rating prediction. This could result in data leakage, as the test interactions of one task might be included in the training set of another task. We instead adopt a leave-one-out data split for all three recommendation tasks, which is consistent with our proposed method as well as the other baselines.

For a fair comparison, we apply the same backbone (FLAN-T5-XL), fine-tuning steps (70,000), batch size (16), and learning rate (0.001) as adopted by our proposed method. Following the original P5 code, we fine-tune a unified model with prompts of their proposed five task families (rating, sequential recommendation, explanation, review, and direct recommendation; the sequential recommendation and direct recommendation families are weighted 5 times higher than the rest). In Tables 1, 2, and 3, we adopt prompt templates 2-1, 2-7, and 1-4 for evaluating the retrieval, ranking, and rating prediction performance of the P5-XL model, as these templates better suit the forms of the recommendation tasks (introduced in the second subsection of Section 4.1) than the other templates.

⁸https://github.com/jeykigung/P5
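To illustrate the difference between the two indexing schemes, a small sketch follows; it mirrors the '1, 2, ..., 7' example above, and the function names are our own.

```python
import random

def sequential_index(raw_ids):
    """Original P5 pre-processing: items are numbered in order of appearance.
    Given the train items '1, 2, 3, 4, 5, 6', the held-out test item is
    trivially predictable as '7' -- the information leakage described above."""
    return {raw: str(i + 1) for i, raw in enumerate(raw_ids)}

def random_index(raw_ids):
    """Our pre-processing: the same ID vocabulary, assigned in shuffled order,
    so consecutive IDs imply no connection and leak nothing about the split."""
    ids = [str(i + 1) for i in range(len(raw_ids))]
    random.shuffle(ids)
    return dict(zip(raw_ids, ids))
```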
A.2 P5-XL vs. P5

Please note that the retrieval results of P5 in Table 1 are cited from Rajput et al. 2023 rather than the original P5 paper (Geng et al., 2022). This is because the original P5 experiments cannot be reproduced upon fixing the information leakage issues as discussed in the previous section. Meanwhile, Rajput et al. 2023 does not report the ranking and rating prediction performances of P5. To fully evaluate P5, we train a P5-XL model following the experimental setting as detailed in the previous section, and report its performance on all three tasks in Tables 1 to 3.

P5-XL performs worse than P5 in Table 1, which is likely owing to the differences in their training data. Specifically, P5 was only trained on retrieval prompts (as indicated in Appendix D of Rajput et al. 2023), while P5-XL, following the original P5 paper, is trained on all five task families of P5 prompts, including the explanation generation and review summarization tasks. We hypothesize that these additional data samples are very different from the evaluated tasks (retrieval, ranking, and rating prediction), causing negative transfer to the evaluated tasks.

A.3 Additional Results

In Table 8, we report the ranking results of P5-XL evaluated with prompt template 5-5. We can tell that P5-XL (5-5) falls slightly behind P5-XL. Our proposed UT greatly outperforms both P5-XL and P5-XL (5-5), which again verifies that our proposed recommendation task prompts are more informative than the P5 ones.

B Dataset Statistics

Table 7 presents the statistics of the Amazon datasets, i.e., Toys & Games, Beauty, and Sports & Outdoors, that we used to evaluate our proposed method as well as all the baselines.

Dataset            # Users  # Items  # Interactions  Sparsity (%)
Toys & Games       19,412   11,924   167,597         99.93
Beauty             22,363   12,101   198,502         99.93
Sports & Outdoors  35,598   18,357   296,337         99.95

Table 7: Statistics of the datasets.

C Pseudo Code, Statistics, and Examples of the Proposed Data Samples

C.1 Pseudo Code for Data Sample Generation

Algorithm 1 presents the pseudo code for generating our proposed recommendation-task and auxiliary-task data samples.

C.2 Statistics of the Data Samples

Table 10 presents the statistics of our proposed recommendation-task and auxiliary-task data samples.
             Toys & Games                             Beauty                                   Sports & Outdoors
Methods      NDCG@5  NDCG@10 HR@1    HR@5    HR@10   NDCG@5  NDCG@10 HR@1    HR@5    HR@10   NDCG@5  NDCG@10 HR@1    HR@5    HR@10
P5-XL        0.0290  0.0444  0.0097  0.0494  0.0977  0.0298  0.0456  0.0110  0.0498  0.0992  0.0286  0.0436  0.0097  0.0486  0.0957
P5-XL (5-5)  0.0274  0.0428  0.0089  0.0467  0.0948  0.0289  0.0443  0.0093  0.0497  0.0982  0.0275  0.0426  0.0091  0.0470  0.0943
UT [Ours]    0.1536  0.1867  0.0831  0.2233  0.3259  0.1236  0.1537  0.0609  0.1863  0.2798  0.0867  0.1137  0.0381  0.1362  0.2202

Table 8: Additional P5-XL ranking results. Rows 1 and 3 are copied from Table 2.
Methods       NDCG@5  NDCG@10 HR@5    HR@10
UT [Ours]     0.0079  0.0101  0.0118  0.0187
UT+IE [Ours]  0.0076  0.0097  0.0121  0.0185

Table 9: Retrieval results on Sports & Outdoors with (UT+IE) or without (UT) IE data samples. Row 1 is copied from Table 1.

Input: What's the title of I1014? Output: Women's Dry-fit Tempo Shorts
Input: What's the brand of I1014? Output: Nike
Input: What's the price of I1014? Output: $31.8
…

Figure 3: Item embedding (IE) data samples.
[Figure: Step 1 — Generate Recommendation & Auxiliary-Task Data Samples]

We observe that the conventional context-aware […]
Table 10: Statistics of our proposed data samples. DS stands for dynamic sampling.