
Recommendation as Language Processing (RLP):

A Unified Pretrain, Personalized Prompt & Predict Paradigm (P5)


Shijie Geng, Shuchang Liu, Zuohui Fu, Yingqiang Ge, Yongfeng Zhang
Department of Computer Science, Rutgers University, NJ 08854, US
{sg1309,shuchang.syt.liu,zuohui.fu,yingqiang.ge,yongfeng.zhang}@rutgers.edu

arXiv:2203.13366v7 [cs.IR] 2 Jan 2023

ABSTRACT
For a long time, different recommendation tasks have typically required designing task-specific architectures and training objectives. As a result, it is hard to transfer the learned knowledge and representations from one task to another, which restricts the generalization ability of existing recommendation approaches; e.g., a sequential recommendation model can hardly be applied or transferred to a review generation method. To deal with such issues, considering that language can describe almost anything and language grounding is a powerful medium to represent various problems or tasks, we present a flexible and unified text-to-text paradigm called the "Pretrain, Personalized Prompt, and Predict Paradigm" (P5) for recommendation, which unifies various recommendation tasks in a shared framework. In P5, all data, such as user-item interactions, user descriptions, item metadata, and user reviews, are converted to a common format: natural language sequences. The rich information from natural language helps P5 capture deeper semantics for personalization and recommendation. Specifically, P5 learns different tasks with the same language modeling objective during pretraining. Thus, it serves as the foundation model for various downstream recommendation tasks, allows easy integration with other modalities, and enables instruction-based recommendation based on prompts. P5 advances recommender systems from shallow models to deep models to large models, and will revolutionize the technical form of recommender systems towards a universal recommendation engine. With adaptive personalized prompts for different users, P5 is able to make predictions in a zero-shot or few-shot manner and largely reduces the need for extensive fine-tuning. We conduct experiments on several recommendation benchmarks to show the effectiveness of P5. To help advance future research on Recommendation as Language Processing (RLP), Personalized Foundation Models (PFM), and Universal Recommendation Engines (URE), we release the source code, dataset, prompts, and pretrained P5 model at https://github.com/jeykigung/P5. P5 is also hosted on Hugging Face at https://huggingface.co/makitanikaze/P5.

KEYWORDS
Recommender Systems; Natural Language Processing; Multitask Learning; Personalized Prompt; Language Modeling; Unified Model

1 INTRODUCTION
Over the past decades, recommender systems have witnessed significant advancements and played an essential role in people's daily lives, helping with their micro-decisions and fulfilling their demands with outstanding accuracy. In retrospect, we can summarize the development trend of modern recommender systems as moving towards a more comprehensive system that accommodates diverse features and a wide spectrum of application scenarios.

On one hand, feature engineering and learning in recommender systems have evolved greatly from simple to complex. In the early days, recommender systems typically adopted logistic regression or collaborative filtering [25, 35, 50, 52], which utilize user-item interaction records to model users' behavioral patterns. Later on, contextual features such as user profiles and item metadata were further integrated into the system through more sophisticated models such as factorization machines [48] and GBDT [20]. Recently, deep neural network models [3, 5, 19, 74] have facilitated crossing and combination among even more diverse and sophisticated features. As a result, these models gain better representation ability compared with traditional feature-engineering-based approaches.

On the other hand, more recommendation tasks have emerged. Beyond classical rating prediction and direct user-item matching-based recommendation tasks, recent works are broadening the spectrum to new tasks and scenarios such as sequential recommendation [21, 60, 63, 80], conversational recommendation [8, 61, 76], explainable recommendation [17, 31, 62, 70, 75, 77], and so on. While approaches to the aforementioned recommendation tasks are often proposed separately, there is an evident trend of utilizing multiple recommendation tasks to jointly learn transferable representations [31, 56, 57, 72]. Although existing recommender systems have achieved great success, there is still a considerable gap between current solutions and the foreseeable intersection of the aforementioned trends: a comprehensive recommender system that can accommodate diverse features and different types of tasks. Since recommendation tasks usually share a common user-item pool and have overlapping contextual features, we believe it is promising to merge even more recommendation tasks into a unified framework so that they can implicitly transfer knowledge to benefit each other and enable generalization to other unseen tasks.

Inspired by the recent progress in multitask prompt-based training [1, 51, 67], in this work we propose a unified "Pretrain, Personalized Prompt & Predict Paradigm" (denoted as P5). We show that it is possible for P5 to learn multiple recommendation-related tasks together through a unified sequence-to-sequence framework by formulating these problems as prompt-based natural language tasks, where user-item information and corresponding features are integrated with personalized prompt templates as model inputs. P5 sheds light

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].
RecSys '22, September 18–23, 2022, Seattle, WA, USA
© 2022 Copyright held by the owner/author(s). Publication rights licensed to ACM.
ACM ISBN 978-1-4503-9278-5/22/09. $15.00
https://doi.org/10.1145/3523227.3546767

[Figure 1 contents, reconstructed. Left panel: Multi-task Pretraining with Personalized Prompt Collection, with one example per task family (prompt input followed by P5's target output). Right panel: Zero-shot Generalization to New Product & Personalized Prompt.]

Sequential Recommendation
Input: "I find the purchase history list of user_15466: 4110 -> 4467 -> 4468 -> 4472. I wonder what is the next item to recommend to the user. Can you help me decide?"
Target: 1581

Rating Prediction
Input: "What star rating do you think user_23 will give item_7391?"
Target: 5.0

Explanation Generation
Input: "Help Hong "Old boy" generate a 5-star explanation about this product: OtterBox Defender Case for iPhone 3G, 3GS (Black) [Retail Packaging]"
Target: "you can protect your prescious iphone more safe"

Review Summarization
Input: "Give a short sentence describing the following product review from Mom of 3 yo girl: First it came with the packaging open and then as soon as my son took it out it was so easily broken. Hopefully a little glue will fix it."
Target: "broke immediately"

Direct Recommendation
Input: "Pick the most suitable item from the following list and recommend to user_250 : \n 4915 , 1823 , 3112 , 3821 , 3773 , 520 , 7384 , 7469 , 9318 , 3876 , 1143 , 789 , 595 , 3824 , 3587 , 10396 , 2766 , 7498 , 2490 , 3232 , 9711 , 2975 , 1427 , 9923 , 3097 , 3594 , 6469 , 9460 , 6956 , 9154"
Target: 520

Zero-shot Generalization to New Product & Personalized Prompt
Input: "Predict user_14456 's preference about the new product ( 1 being lowest and 5 being highest ) : \n title : Hugg-A-Moon \n price : 13.22 \n brand : Hugg-A-Planet"
Target: 4.7

Figure 1: P5 pretrains an encoder–decoder Transformer model that takes in textual inputs and produces target responses. We train P5 on a multitask collection of personalized prompts. After multitask prompt-based pretraining on recommendation datasets, P5 achieves the capability of zero-shot generalization to unseen personalized prompts and new items.

on a promising technical route for unified and instruction-based recommendation. It has three main advantages:
1) P5 deeply immerses recommendation models into a full language environment, where all recommendation tasks are reformulated as NLP tasks with the help of personalized prompts. Since language grounding is sufficiently flexible and powerful to express various kinds of features in text templates, there is no need to design feature-specific encoders. As a result, P5 can exploit the abundant semantics and knowledge inside the training corpora;
2) P5 integrates multiple recommendation tasks into a shared text-to-text encoder–decoder architecture and trains them with the same language modeling loss rather than designing task-specific architectures and objective functions. In other words, P5 treats all personalized tasks as conditional text generation problems;
3) Trained with instruction-based prompts, P5 attains sufficient zero-shot performance when generalizing to novel personalized prompts or unseen items in other domains.
In our experiments, we study how P5 performs compared with task-specific approaches on all five task families, and we evaluate P5's zero-shot generalization ability. We also conduct several ablation studies to justify the design details of the P5 framework. Overall, our main contributions can be outlined as follows:
• To the best of our knowledge, this is the first work to propose a unified "Pretrain, Personalized Prompt & Predict Paradigm" which integrates various recommendation-related tasks into a shared conditional language generation framework.
• We create a collection of personalized prompts that covers five different recommendation task families.
• According to the experimental results, P5 achieves promising performance on the five task families when taking seen prompt templates as model inputs.
• P5 shows sufficient zero-shot generalization ability for novel personalized prompts and new items in unseen domains.

2 RELATED WORK
Unified Frameworks. Many prior works have sought to solve various tasks in a unified model. As early pioneers, T5 [47] and GPT-3 [2] unify NLP downstream tasks through a text-to-text encoder–decoder framework and autoregressive language modeling, respectively. They both allow effective knowledge sharing among different tasks based on a common pretrained language model. Following this trend, recent advances have focused on unifying large-scale language tasks [1, 51, 67] or cross-modality applications [6, 66, 71] through a shared sequence-to-sequence framework, where different types of tasks and modalities are all expressed in the format of natural language. However, the aforementioned methods do not consider personalization in their sequence-to-sequence models. Recently, a line of work [56, 57, 72] attempts to learn universal user representations that are easily transferable to downstream tasks. One limitation of these methods is that they still require additional finetuning on downstream datasets. In contrast, our P5 brings personalization into an encoder–decoder Transformer model that can generalize to a wide spectrum of recommendation-related application scenarios: tasks that naturally require personalization. Moreover, with the help of prompt-based pretraining, P5 acquires zero-shot generalization ability when transferring to unseen prompts and items.

[Figure 2 contents, reconstructed.]

(a) Rating / Review / Explanation raw data for Beauty
Raw data:
  user_id: 7641; user_name: stephanie; item_id: 2051
  item_title: SHANY Nail Art Set (24 Famouse Colors Nail Art Polish, Nail Art Decoration)
  review: "Absolutely great product. I bought this for my fourteen year old niece for Christmas and of course I had to try it out, then I tried another one, and another one and another one. So much fun! I even contemplated keeping a few for myself!"
  star_rating: 5; summary: "Perfect!"; explanation: "Absolutely great product"; feature_word: product
Prompt templates:
  Input: "Which star rating will user_{{user_id}} give item_{{item_id}}? (1 being lowest and 5 being highest)" -> Target: {{star_rating}}
  Input: "Based on the feature word {{feature_word}}, generate an explanation for user_{{user_id}} about this product: {{item_title}}" -> Target: {{explanation}}
  Input: "Give a short sentence describing the following product review from {{user_name}}: {{review}}" -> Target: {{summary}}

(b) Sequential Recommendation raw data for Beauty
Raw data:
  user_id: 7641; user_name: Victor
  purchase_history: 652 -> 460 -> 447 -> 653 -> 654 -> 655 -> 656 -> 8 -> 657
  next_item: 552
  candidate_items: 4885 , 4280 , 4886 , 1907 , 870 , 4281 , 4222 , 4887 , 2892 , 4888 , 2879 , 3147 , 2195 , 3148 , 3179 , 1951 , …… , 1982 , 552 , 2754 , 2481 , 1916 , 2822 , 1325
Prompt template:
  Input: "Here is the purchase history of user_{{user_id}}: {{purchase_history}} What to recommend next for the user?" -> Target: {{next_item}}

(c) Direct Recommendation raw data for Beauty
Raw data:
  user_id: 250; user_name: moriah rose
  target_item: 520; random_negative_item: 9711
  candidate_items: 4915 , 1823 , 3112 , 3821 , 3773 , 520 , 7384 , 7469 , 9318 , 3876 , 1143 , 789 , 595 , 3824 , 3587 , 10396 , …… , 2766 , 7498 , 2490 , 3232 , 9711 , 2975 , 1405 , 8051
Prompt template:
  Input: "Choose the best item from the candidates to recommend for {{user_name}}? \n {{candidate_items}}" -> Target: {{target_item}}

Figure 2: Building input–target pairs from raw data according to our designed personalized prompt templates, by simply substituting the fields in the prompts with the corresponding information in the raw data. The raw data for the five task families of P5 come from three separate sources. Specifically, the rating/review/explanation prompts (a) share the same raw data. Sequential recommendation (b) and direct recommendation (c) use similar raw data, but the former particularly requires the user interaction history. The complete collection of P5 personalized prompts is provided in the Appendix.
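The field substitution described in Figure 2 can be sketched in a few lines. This is a minimal illustration, not the released P5 code: `fill_prompt` and the toy `raw_datum` are hypothetical names, and the template mirrors the rating prompt in Figure 2(a).

```python
import re

def fill_prompt(template: str, raw: dict) -> str:
    """Replace every {{field}} placeholder with the matching raw-data value."""
    return re.sub(r"\{\{(\w+)\}\}", lambda m: str(raw[m.group(1)]), template)

# Toy raw datum mirroring Figure 2(a).
raw_datum = {"user_id": 7641, "item_id": 2051, "star_rating": 5}

input_template = ("Which star rating will user_{{user_id}} give "
                  "item_{{item_id}}? (1 being lowest and 5 being highest)")
target_template = "{{star_rating}}"

# One training pair for the conditional generation objective.
model_input = fill_prompt(input_template, raw_datum)
model_target = fill_prompt(target_template, raw_datum)
```

The same substitution applied to an unseen template yields the zero-shot testing prompts mentioned in the caption.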

Prompt Learning. The success of the GPT series, especially GPT-3 [2], marked the beginning of prompts' popularization on NLP tasks. Trained with huge language data from the Web, GPT-3 exhibited the capability of solving NLP tasks when provided a number of input-output examples as exemplar prompts. Besides exemplar prompts, many prompt design methods have proliferated following the "pretrain, prompt, and predict" paradigm [37]. One type of method [16, 23, 36, 40, 58] explored prompt search for proper discrete prompts. Meanwhile, another line of work [18, 28, 33, 38, 45, 81] exploited continuous vector embeddings as prompts. Compared with the aforementioned prompt types, instruction-based prompts contain detailed task descriptions and adhere more to the natural language format. Since instruction-based prompts are flexible and close to how humans communicate with each other, several pioneer works [11, 68] claim that learning from crowd-sourced NLP datasets is a promising route for general-purpose NLP systems. Recent works such as FLAN [67] and T0 [51] finetuned pretrained language models on large-scale NLP datasets verbalized via human-readable prompts. As a result, such multitask prompt-based tuning brings powerful models that exhibit strong zero-shot ability on unseen tasks. Inspired by the success of these approaches, we create a collection of personalized prompts and then train a sequence-to-sequence model on a variety of recommendation-related tasks verbalized according to the constructed personalized prompts.

NLP for Recommendation. Recommendation has been interacting with NLP techniques for a long time. Most works address four lines of research: 1) explainable recommendation [4, 10, 30–32, 75, 77], where NLP models help generate text explanations for a given recommendation; 2) sequential recommendation as language modeling [9, 60, 80], which considers user interaction histories as word token sequences; 3) text feature extraction [69, 74, 79], which aims to extract informative text encodings that can improve the performance of recommendation; and 4) conversational recommendation [8, 12–14, 22, 61, 76], which reasons about the intent of users and gives recommendations in an interactive dialog format. In our work, we explicitly cover the tasks of sequential recommendation and explanation generation, and additionally offer insights on how to formulate a unified NLP framework for other recommendation problems including rating prediction, top-k recommendation, and review summarization. Furthermore, pretrained with instruction-based prompts that share similarity with conversational recommendation, our P5 benefits from the natural language environment and improves the performance on a series of recommendation tasks.

Zero-shot and Cold Start Recommendation. Recommender systems' performances heavily rely on the available training data, but there are always zero-shot cases where the history records are limited. Evidence of performing well on such startup cases signals a good generalization ability of recommendation models.


[Figure 3 contents: the token embeddings for "what star rating do you think user _ 23 will give item _ 73 91 ?" are summed with position embeddings <p1>–<p16> and whole-word embeddings <w1>–<w11>; the bidirectional text encoder outputs contextualized representations <t1>–<t16>, from which the autoregressive text decoder generates the answer "5.0 </s>" starting from "<s>".]

Figure 3: An illustration of the P5 architecture. For the example prompt input "What star rating do you think user_23 will give item_7391?", P5 adopts an encoder–decoder framework: it first encodes the input with a bidirectional text encoder, and then generates the answer through a text decoder autoregressively. In contrast to task-specific recommendation models, our P5 relies on multitask prompt-based pretraining on a large-scale personalized prompt collection, which enables P5 to adapt to different task families and even generalize to novel personalized prompts.
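The whole-word indices shown in Figure 3 can be derived mechanically from the sub-word pieces. Below is a toy sketch of that bookkeeping; `subword_split` is a hypothetical stand-in for the SentencePiece tokenizer that reproduces only the splits shown in the paper (e.g., "item_7391" into "item", "_", "73", "91"), and `whole_word_ids` is an illustrative helper, not the actual P5 implementation.

```python
def subword_split(word: str):
    """Hypothetical sub-word splitter imitating the paper's examples:
    'user_23' -> ['user', '_', '23'], 'item_7391' -> ['item', '_', '73', '91']."""
    if "_" in word:
        head, tail = word.split("_", 1)
        return [head, "_"] + ([tail[:2], tail[2:]] if len(tail) > 2 else [tail])
    return [word]

def whole_word_ids(sentence: str):
    """Return (tokens, ids): the sub-word tokens and, for each token,
    the index of the original whole word it belongs to."""
    tokens, ids = [], []
    for w_idx, word in enumerate(sentence.split()):
        for piece in subword_split(word):
            tokens.append(piece)
            ids.append(w_idx)
    return tokens, ids

tokens, ids = whole_word_ids(
    "what star rating do you think user_23 will give item_7391 ?")
# The four pieces of "item_7391" share one whole-word id, so a single
# whole-word embedding (the "<w10>" of Figure 3) is added to all of them.
```

Looking up one embedding per distinct id and adding it to the token and position embeddings gives the three-way sum that the encoder consumes.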

One widely studied problem under this setting is cold-start recommendation, where users [26] or items [53] are new to the system with no previous interaction records. Solutions to this problem either learn to model content features [15, 29, 44, 55] so that inference can be made without interaction records, or learn to transfer representations from auxiliary domains [42, 56, 59, 72, 82]. Another line of work on zero-shot or few-shot recommendation discusses quick adaptation to a new domain instead of providing recommendations for cold-start cases only. Solutions typically follow meta learning [27, 64] or causal learning [34] frameworks that make the model robust to domain adaptation. In our work, we ask a P5 model pretrained on an auxiliary domain to solve tasks on target domains, where the users are known to P5 but the items have never been seen by the model before.

3 PERSONALIZED PROMPT COLLECTION
To facilitate multitask prompt-based pretraining for recommendation, we create a collection of personalized prompt templates. The collection covers five different task families: rating, sequential recommendation, explanation, review, and direct recommendation. Each of these task families contains multiple personalized prompts to help P5 discover various aspects of users and items. As mentioned in [51], a prompt is considered as consisting of an input template and a target template, along with a collection of associated metadata. In this work, we further define a personalized prompt as a prompt that includes personalized fields for different users and items. For example, a user's preference can be indicated through either an ID number or a description of the user such as name, gender, age, etc. Moreover, the expected model output of a given personalized prompt should also vary according to its item field, reflecting the change of a user's preferences towards different items. Such item fields can be represented by either item ID numbers or item metadata that contains detailed descriptions.

We design a basic P5 personalized prompt collection for each task family. For the rating prediction task family, we divide the prompts into three categories: 1) Given the information about a user and an item, directly predict the rating score ranging from 1 to 5; 2) Predict whether a user will rate an item a given score, where the expected output is yes or no; 3) Predict whether a user likes or dislikes an item. Here, we consider a star rating equal to or greater than 4 to indicate a like preference of the user, whereas lower scores indicate a dislike preference. For the sequential recommendation task family, we create three types of prompts: 1) Directly predict the next item based on the user interaction history; 2) Given the user interaction history, choose the possible next item from a candidate list, where only one item is positive; 3) Based on the user interaction history, predict whether a given item will be interacted with next by the user. For the explanation task family, we ask the P5 model to generate a textual explanation to justify a user's preference towards a given item. There are two prompt categories in this task family: 1) Directly generate an explanation sentence with user/item information; 2) Generate an explanation based on a feature word as a hint [31]. For each category, other auxiliary information such as the review headline and the star rating may also be included. For the review-related task family, we create two types of prompts: 1) Summarize the review comment to a shorter review title; 2) Predict the corresponding rating score based on the given review comment. For direct recommendation, we also create two types of prompts: 1) Predict whether to recommend an item to a user, where the answer should be yes or no; 2) Select the most suitable item from a list of candidate items to recommend to the user. We provide some example prompts in Figure 2, and the complete collection of personalized prompts is provided in the Appendix.

With the prompts, we can directly build input–target pairs from raw data. As illustrated in Figure 2, we can simply substitute the fields in braces with the corresponding information in the raw data and thus create training input–target pairs or zero-shot testing personalized prompts. The training data and pretraining tasks distill the rich semantics from diverse modalities into the user and item tokens for preference understanding and personalization. Note that we divide the raw data into three parts: rating/review/explanation share the same raw data, while sequential and direct recommendation differ in terms of whether to use the interaction history as input information. During pretraining, we mix the input–target pairs from different task families together to serve as the training data. To enhance P5's robustness and zero-shot generalization, for each raw datum, we only sample a portion rather than all of the personalized prompts in each task family. For the sequential and direct recommendation task families, we also randomly select a group of negative items for those prompts that require a candidate list.
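The negative-item selection for candidate-list prompts can be sketched as follows. This is a toy version under assumed conventions (uniform sampling, one positive per list); `build_candidates` is a hypothetical helper, and the pool size and list length are illustrative only.

```python
import random

def build_candidates(positive, item_pool, num_neg, rng=None):
    """One ground-truth item plus `num_neg` randomly sampled negatives,
    shuffled so the positive's position carries no signal."""
    rng = rng or random.Random(0)  # fixed seed for reproducibility
    negatives = rng.sample([i for i in item_pool if i != positive], num_neg)
    candidates = negatives + [positive]
    rng.shuffle(candidates)
    return candidates

# Mirroring Figure 2(c): target item 520 hidden among 99 negatives.
pool = list(range(1, 10001))
cands = build_candidates(positive=520, item_pool=pool, num_neg=99)
```

The verbalized candidate list is then substituted into the prompt's `{{candidate_items}}` field, with the positive item as the generation target.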

4 THE P5 PARADIGM AND MODEL

4.1 The P5 Architecture
The collection of personalized prompts introduced in the previous section makes it convenient to create a large amount of pretraining data that covers a wide range of recommendation-related tasks. Thanks to the prompt templates, all pretraining data share a unified format of input–target token sequences, which breaks the boundaries among different tasks. We claim that pretraining multiple recommendation tasks under a unified framework of conditional generation can facilitate all involved tasks together. By immersing P5 in the full language environment throughout the pretraining stage, we also expect its zero-shot generalization capability of understanding unseen personalized prompts with detailed item descriptions. That is the reason why P5 is called a unified "Pretrain, Personalized Prompt, and Predict Paradigm".

In terms of model architecture, P5 is established upon a basic encoder–decoder framework. We employ Transformer [65] blocks to build both the encoder and the decoder. Suppose the embeddings of an input token sequence are x = [x_1, ..., x_n]. As depicted in Figure 3, before feeding the embedding sequence into the bidirectional text encoder E(·), we add positional encodings P to the raw embeddings to capture their position information in the sequence. Furthermore, to make P5 aware of the personalized information contained in the input sequence, we also apply whole-word embeddings W to indicate whether consecutive sub-word tokens come from the same original word. For instance, if we directly represent the item with ID number 7391 as "item_7391", then the word will be split into 4 separate tokens (i.e., "item", "_", "73", "91") by the SentencePiece tokenizer [54]. With the assistance of the shared whole-word embedding "⟨w10⟩" (e.g., in Figure 3), P5 can better recognize the important fields with personalized information. An alternative is to represent each user/item by an independent extra token (e.g., "⟨item_7391⟩"). However, this may incur a huge amount of additional tokens when there is a large pool of users and items. Hence, in this paper, we adopt multiple sub-word units to represent a user or item.

Afterwards, the text encoder takes the sum of the aforementioned three embeddings e = [e_1, ..., e_n] and outputs their contextualized representations t = [t_1, ..., t_n] = E(e). The decoder D(·) then attends to both the previously generated tokens y_{<j} and the encoder output t, and predicts the probability distribution of future tokens: P_θ(y_j | y_{<j}, x) = D(y_{<j}, t). During the pretraining stage, P5 learns the model parameters θ by minimizing the negative log-likelihood of the label tokens y conditioned on the input text x in an end-to-end manner:

    L_θ^P5 = − Σ_{j=1}^{|y|} log P_θ(y_j | y_{<j}, x)    (1)

This same objective function is shared by all recommendation tasks under P5. As a result, we unify recommendation tasks with one model, one loss, and one data format.

4.2 Recommendation with Pretrained P5
After pretraining, P5 can directly perform different tasks with either seen or unseen personalized prompts. For rating, explanation, and review tasks, we simply use greedy decoding to generate answers. In contrast, sequential and direct recommendation tasks usually require an item list as target output. In view of this, for sequential recommendation, we apply beam search to generate a list of potential next items and evaluate it under the all-item setting. For direct recommendation, we predict the recommended items from a candidate set S = {S_1, ..., S_m}, where only one of the m candidates is positive. Here, we also use beam search to decode a list of potential target items with the highest scores and then conduct evaluations. Both of the above decoding processes can be written as:

    C = [C_1, ..., C_B] = Beam_Search(D, t, B)    (2)

where B denotes the beam size and C is the output item list.

5 EXPERIMENTS
In this section, we evaluate the performance of the proposed P5 approach on real-world data and compare it with various representative methods targeting different task families. Through the performance comparison and ablation studies, we aim to answer the following research questions regarding our unified "Pretrain, Personalized Prompt, and Predict Paradigm" (P5):

• RQ1: How does our unified P5 framework perform compared with task-specific methods on all five task families?
• RQ2: Does P5 have enough zero-shot generalization ability when transferring to unseen personalized prompts for either existing or new items?
• RQ3: How do scaling factors such as model size, number of task families, and number of prompts affect the performance of P5?
• RQ4: Which is a better way to implement personalization in P5: adopting an independent extra token for each user or item (e.g., "⟨user_23⟩") or the default setting, i.e., tokenizing each user or item into multiple sub-word units (e.g., "user", "_", "23")?
• RQ5: How long does it take for P5 to conduct pretraining? Is it efficient to make inference with the pretrained P5 model? We provide statistics on training and inference time in the Appendix.

5.1 Experimental Setup
Datasets. We conduct extensive experiments over four real-world datasets. The Amazon1 datasets are collected from the Amazon.com platform with user ratings and reviews on 29 categories of products. In this paper, we adopt three of them to evaluate our method, namely Sports & Outdoors, Beauty, and Toys & Games. Besides, the Yelp2 dataset contains a large number of user ratings and reviews for business recommendation. We follow [80] and use transaction records between January 1, 2019 and December 31, 2019. Due to the space limit and the fact that the results on Yelp show trends similar to the other datasets, we put the experimental results on the Yelp dataset in the Appendix. The detailed statistics of these datasets are presented in Table 1.

Table 1: Basic statistics of the experimental datasets.

Dataset        Sports    Beauty    Toys      Yelp
#Users         35,598    22,363    19,412    30,431
#Items         18,357    12,101    11,924    20,033
#Reviews       296,337   198,502   167,597   316,354
Sparsity (%)   0.0453    0.0734    0.0724    0.0519

1 https://nijianmo.github.io/amazon/
2 https://www.yelp.com/dataset
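The beam-search decoding of Eq. (2) in Section 4.2 can be sketched with a toy, model-free version. Here `next_scores` is a hypothetical stand-in for the decoder's next-token log-probabilities (in P5 it would come from the Transformer decoder D attending to the encoder output t), and the tokens are plain integers rather than item IDs.

```python
import math

def beam_search(next_scores, start, beam_size, steps):
    """Keep the `beam_size` highest-scoring partial sequences at each step.
    `next_scores(seq)` returns a {token: log_prob} map for the next position."""
    beams = [([start], 0.0)]
    for _ in range(steps):
        expanded = []
        for seq, score in beams:
            for tok, logp in next_scores(seq).items():
                expanded.append((seq + [tok], score + logp))
        beams = sorted(expanded, key=lambda b: b[1], reverse=True)[:beam_size]
    return [seq for seq, _ in beams]  # the output list C of Eq. (2)

# Toy distribution: token last+1 is always the most likely continuation.
def next_scores(seq):
    last = seq[-1]
    return {last + 1: math.log(0.6), last + 2: math.log(0.3),
            last + 3: math.log(0.1)}

C = beam_search(next_scores, start=0, beam_size=3, steps=2)
# C holds the 3 highest-probability length-2 continuations of token 0.
```

With B = 3 the top beam follows the most probable token at every step, while the remaining beams keep runner-up continuations, which is how a ranked item list is obtained for evaluation.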


Task splits. For rating, explanation, and review task families, we randomly split each dataset into training (80%), validation (10%), and testing (10%) sets, and ensure that there is at least one instance included in the training set for each user and item. To obtain the ground-truth explanations, following the natural language explanation works [30, 31], we first extract item feature words from the reviews with the help of the Sentires toolkit3 [77, 78], and then extract the sentences from reviews that comment on one or more item feature words as users' explanations of their preferences. In terms of the sequential recommendation task family, for each user interaction sequence, the last item is used as the test data, the item before the last one is used as the validation data, and the remaining data is used for training. To avoid data leakage during pretraining, we follow the training split of sequential recommendation to build the training set for the direct recommendation task family.

Implementation Details. Our P5 model utilizes the pretrained T5 checkpoints [47] as backbone. According to the size of the T5 backbone, we create two versions of P5, namely P5-small (P5-S) and P5-base (P5-B). For P5-small, there are 6 layers for both encoder and decoder, the model dimensionality is 512 with 8-headed attention, and the number of parameters is 60.75 million. For P5-base, encoder and decoder both have 12 Transformer blocks; the model has an embedding dimensionality of 768 with 12-headed attention, and the number of parameters is 223.28 million. For tokenization, we use the SentencePiece [54] tokenizer with a vocabulary size of 32,128 for parsing sub-word units. We pretrain P5 for 10 epochs with AdamW optimization [39] on four NVIDIA RTX A5000 GPUs. The batch size is set to 16 for P5-base and 32 for P5-small. We choose 1 × 10−3 as the peak learning rate and set the maximum length of input tokens to 512. The warmup strategy is used to adjust the learning rate during training; the warmup stage is set to the first 5% of all iterations. When negative sampling is needed for training, we use 1:1 positive vs. negative sampling for both P5 and the baselines. Our default pretrain–predict combination adopts the last prompt in each task family for zero-shot evaluation, while all remaining prompts are utilized for multitask prompted pretraining. For rating prediction, we use Gaussian sampling to convert the original integer scores to float numbers rounded to 1 decimal place. In this way, we can avoid overfitting the limited score types. After this change, the number of score classes increases from 5 to 41. For sequential recommendation, we set the beam size 𝐵 to 20. For direct recommendation, the beam size is also 20 and the candidate pool contains 100 items, which consist of one ground-truth item and 99 sampled negative ones that the user has not interacted with.

Metrics. For rating prediction, we adopt Root Mean Square Error (RMSE) and Mean Absolute Error (MAE). For sequential recommendation and direct recommendation tasks, we employ top-𝑘 Hit Ratio (HR@𝑘) and Normalized Discounted Cumulative Gain (NDCG@𝑘) to evaluate the performance and report HR@1, 5, 10 and NDCG@5, 10. For explanation generation and review summarization, we evaluate different methods with BLEU-4, as well as ROUGE-1, ROUGE-2, and ROUGE-L. RMSE and MAE are "the lower, the better", while all other metrics are "the higher, the better". For all tables in the following, bold numbers refer to the best performance, while underlined numbers indicate the second best performance.

Table 2: Performance comparison on rating prediction.

Methods     | Sports: RMSE MAE | Beauty: RMSE MAE | Toys: RMSE MAE
MF          | 1.0234 0.7935    | 1.1973 0.9461    | 1.0123 0.7984
MLP         | 1.1277 0.7626    | 1.3078 0.9597    | 1.1215 0.8097
P5-S (1-6)  | 1.0594 0.6639    | 1.3128 0.8428    | 1.0746 0.7054
P5-B (1-6)  | 1.0357 0.6813    | 1.2843 0.8534    | 1.0544 0.7177
P5-S (1-10) | 1.0522 0.6698    | 1.2989 0.8473    | 1.0550 0.7173
P5-B (1-10) | 1.0292 0.6864    | 1.2870 0.8531    | 1.0245 0.6931

5.2 Baselines for Multiple Tasks
To demonstrate P5's competence on a wide range of recommendation-related tasks, we gather a collection of representative approaches for different task families.
Rating Prediction and Direct Recommendation. These tasks take the user–item rating/interaction data, but no content or side information is provided. We aim to justify whether the models are able to provide accurate rating predictions or recommendation lists that align with the user preferences. We use MF [25] and MLP [5] under mean squared error loss as rating prediction baselines. For direct recommendation, we use BPR-MF [49], BPR-MLP [5], and a state-of-the-art contrastive learning-based collaborative filtering model, SimpleX [43], as baselines.
Sequential Recommendation. We adopt several representative sequential recommendation approaches as our baselines. Caser [63] treats sequential recommendation as a Markov Chain and employs convolutional neural networks to model user interests. HGN [41] adopts hierarchical gating networks to learn user behaviors from both long- and short-term perspectives. GRU4Rec [21] is originally proposed for session-based recommendation; it utilizes GRU [7] to model the user click history sequence. BERT4Rec [60] mimics BERT-style masked language modeling and learns a bidirectional representation for sequential recommendation. FDSA [73] focuses on feature transition patterns by modeling the feature sequence with a self-attention module. SASRec [24] adopts the self-attention mechanism in a sequential recommendation model, which reconciles the properties of Markov Chains and RNN-based approaches. S3-Rec [80] leverages self-supervised objectives to help the sequential recommendation model better discover the correlations among different items and their attributes. We use the implementation of S3-Rec and its baselines for comparison4.
Explanation Generation. For performance comparison, we consider several baselines for the task of explanation generation. Attn2Seq [10] learns to encode attributes into vectors, and then invokes an attention mechanism to generate reviews conditioned on the attribute vector. NRT [32] utilizes GRU [7] to generate explanations based on user and item IDs. PETER [31] is a simple and effective framework that utilizes user and item IDs to generate explanations; it is built upon a modified attention mask of the Transformer architecture. There is also a variant, PETER+, which takes a hint feature word to assist the explanation generation.
Review Related. For review summarization, we adopt pretrained T0 [51] and GPT-2 [46] checkpoints hosted by Hugging Face5 as baselines. For review preference prediction, we only use T0 to make comparisons because GPT-2 cannot perform this task.

3 https://github.com/evison/Sentires
4 https://github.com/RUCAIBox/CIKM2020-S3Rec
5 https://huggingface.co/
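The Gaussian score-smoothing step from the implementation details (integer ratings jittered and rounded to one decimal, giving 41 score classes on the grid 1.0, 1.1, …, 5.0) can be sketched as below. The noise scale `sigma` is an assumption; the paper does not report the exact value it uses.

```python
import random

def smooth_rating(score, sigma=0.3):
    """Perturb an integer rating with Gaussian noise, clip to [1, 5],
    and round to one decimal place (41 possible score classes)."""
    noisy = random.gauss(score, sigma)  # sigma is an assumed noise scale
    return round(min(max(noisy, 1.0), 5.0), 1)
```

Spreading the five integer labels over 41 nearby float targets is what the paper credits with reducing overfitting to the limited score vocabulary.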


Table 3: Performance comparison on sequential recommendation.


Methods | Sports: HR@5 NDCG@5 HR@10 NDCG@10 | Beauty: HR@5 NDCG@5 HR@10 NDCG@10 | Toys: HR@5 NDCG@5 HR@10 NDCG@10
Caser 0.0116 0.0072 0.0194 0.0097 0.0205 0.0131 0.0347 0.0176 0.0166 0.0107 0.0270 0.0141
HGN 0.0189 0.0120 0.0313 0.0159 0.0325 0.0206 0.0512 0.0266 0.0321 0.0221 0.0497 0.0277
GRU4Rec 0.0129 0.0086 0.0204 0.0110 0.0164 0.0099 0.0283 0.0137 0.0097 0.0059 0.0176 0.0084
BERT4Rec 0.0115 0.0075 0.0191 0.0099 0.0203 0.0124 0.0347 0.0170 0.0116 0.0071 0.0203 0.0099
FDSA 0.0182 0.0122 0.0288 0.0156 0.0267 0.0163 0.0407 0.0208 0.0228 0.0140 0.0381 0.0189
SASRec 0.0233 0.0154 0.0350 0.0192 0.0387 0.0249 0.0605 0.0318 0.0463 0.0306 0.0675 0.0374
S3 -Rec 0.0251 0.0161 0.0385 0.0204 0.0387 0.0244 0.0647 0.0327 0.0443 0.0294 0.0700 0.0376
P5-S (2-3) 0.0272 0.0169 0.0361 0.0198 0.0503 0.0370 0.0659 0.0421 0.0648 0.0567 0.0709 0.0587
P5-B (2-3) 0.0364 0.0296 0.0431 0.0318 0.0508 0.0379 0.0664 0.0429 0.0608 0.0507 0.0688 0.0534
P5-S (2-13) 0.0258 0.0159 0.0346 0.0188 0.0490 0.0358 0.0646 0.0409 0.0647 0.0566 0.0705 0.0585
P5-B (2-13) 0.0387 0.0312 0.0460 0.0336 0.0493 0.0367 0.0645 0.0416 0.0587 0.0486 0.0675 0.0536
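The HR@𝑘 and NDCG@𝑘 values above can be computed from a ranked list with a few lines. With a single held-out ground-truth item per sequence (as in the task splits), NDCG@𝑘 reduces to 1/log2(rank + 1); a minimal sketch:

```python
import math

def hit_at_k(ranked, target, k):
    """HR@k: 1 if the held-out item appears in the top-k list, else 0."""
    return 1.0 if target in ranked[:k] else 0.0

def ndcg_at_k(ranked, target, k):
    """With one relevant item, NDCG@k is 1 / log2(rank + 1) for the hit rank."""
    if target in ranked[:k]:
        return 1.0 / math.log2(ranked.index(target) + 2)
    return 0.0
```

Per-user scores are then averaged over the test set to produce the table entries.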

Table 4: Performance comparison on explanation generation (%).


Methods | Sports: BLEU-4 ROUGE-1 ROUGE-2 ROUGE-L | Beauty: BLEU-4 ROUGE-1 ROUGE-2 ROUGE-L | Toys: BLEU-4 ROUGE-1 ROUGE-2 ROUGE-L
Attn2Seq 0.5305 12.2800 1.2107 9.1312 0.7889 12.6590 1.6820 9.7481 1.6238 13.2245 2.9942 10.7398
NRT 0.4793 11.0723 1.1304 7.6674 0.8295 12.7815 1.8543 9.9477 1.9084 13.5231 3.6708 11.1867
PETER 0.7112 12.8944 1.3283 9.8635 1.1541 14.8497 2.1413 11.4143 1.9861 14.2716 3.6718 11.7010
P5-S (3-3) 1.0447 14.9048 2.1297 11.1778 1.2237 17.6938 2.2489 12.8606 2.2892 15.4505 3.6974 12.1718
P5-B (3-3) 1.0407 14.1589 2.1220 10.6096 0.9742 16.4530 1.8858 11.8765 2.3185 15.3474 3.7209 12.1312
PETER+ 2.4627 24.1181 5.1937 18.4105 3.2606 25.5541 5.9668 19.7168 4.7919 28.3083 9.4520 22.7017
P5-S (3-9) 1.4101 23.5619 5.4196 17.6245 1.9788 25.6253 6.3678 19.9497 4.1222 28.4088 9.5432 22.6064
P5-B (3-9) 1.4689 23.5476 5.3926 17.5852 1.8765 25.1183 6.0764 19.4488 3.8933 27.9916 9.5896 22.2178
P5-S (3-12) 1.3212 23.2474 5.3461 17.3780 1.9425 25.1474 6.0551 19.5601 4.2764 28.1897 9.1327 22.2514
P5-B (3-12) 1.4303 23.3810 5.3239 17.4913 1.9031 25.1763 6.1980 19.5188 3.5861 28.1369 9.7562 22.3056
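The ROUGE-L column above is based on the longest common subsequence between a generated explanation and the reference. A self-contained sketch of the sentence-level F1 is shown below; published evaluations typically rely on the official ROUGE toolkit rather than a hand-rolled version like this one.

```python
def rouge_l_f1(candidate, reference):
    """ROUGE-L F1 over whitespace tokens via a longest-common-subsequence DP."""
    c, r = candidate.split(), reference.split()
    dp = [[0] * (len(r) + 1) for _ in range(len(c) + 1)]
    for i in range(len(c)):
        for j in range(len(r)):
            if c[i] == r[j]:
                dp[i + 1][j + 1] = dp[i][j] + 1
            else:
                dp[i + 1][j + 1] = max(dp[i][j + 1], dp[i + 1][j])
    lcs = dp[len(c)][len(r)]
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(c), lcs / len(r)
    return 2 * precision * recall / (precision + recall)
```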

5.3 Performance Comparison on Different Task Families (RQ1)
In this section, we pretrain P5 with prompts from all five task families to verify its multitask learning ability. According to the default pretrain–predict task combination, we leave Prompt 1-10, Prompt 2-13, Prompt 3-12, Prompt 4-4, and Prompt 5-8 for zero-shot evaluation and pretrain P5 with the remaining personalized prompts. The performances of P5 and relevant baselines on the five task families are presented in Table 2 to Table 7. For each task family, we choose one or more seen prompts as a supplement to the aforementioned zero-shot unseen prompts to perform evaluations.

5.3.1 Rating Prediction. Prompt 1-6 and Prompt 1-10 are used for evaluating P5's performance on rating prediction. The performance comparison is presented in Table 2. We can see that when testing with seen Prompt 1-6, P5-B gets better MAE and slightly higher RMSE on all three datasets compared with MF. When testing with unseen Prompt 1-10, P5-B can achieve similar performance as with Prompt 1-6. Moreover, P5-S usually has better MAE but higher RMSE. It seems that P5 is overfitting these data, since the task complexity of rating prediction is relatively lower than that of other recommendation tasks. Overall, these results show that it is feasible to perform rating prediction with a conditional text generation framework.

5.3.2 Sequential Recommendation. As illustrated in Table 3, Prompt 2-3 and Prompt 2-13 are employed for the evaluation of sequential recommendation under the all-item setting, i.e., using all items as candidates rather than sampling 100 or 1,000 items for ranking. From the table, we can see that P5-B surpasses all competitive baselines with a relatively large gap on both seen (Prompt 2-3) and unseen (Prompt 2-13) prompts. On Toys, P5-S can get even better performance than P5-B, while on Beauty and Sports, P5-B achieves the advantage over P5-S. The results show that the P5 architecture is effective in modeling the user interaction history and conducting next-item prediction with the help of beam search.

Table 5: Performance on review preference prediction.

Methods    | Sports: RMSE MAE | Beauty: RMSE MAE | Toys: RMSE MAE
T0 (4-2)   | 0.6728 0.3140    | 0.6925 0.3324    | 0.8282 0.4201
T0 (4-4)   | 0.6503 0.2984    | 0.7066 0.3663    | 0.8148 0.4230
P5-S (4-2) | 0.7293 0.3529    | 0.6233 0.3051    | 0.6464 0.3125
P5-B (4-2) | 0.6487 0.2847    | 0.6449 0.3168    | 0.6785 0.3342
P5-S (4-4) | 0.7565 0.3395    | 0.6262 0.3113    | 0.6577 0.3174
P5-B (4-4) | 0.6563 0.2921    | 0.6515 0.3106    | 0.6730 0.3342

5.3.3 Explanation Generation. In Table 4, Prompt 3-9 and Prompt 3-12 are used to evaluate P5's performance on explanation generation under the feature-based setup, while Prompt 3-3 is used for direct explanation generation without providing a hint word. We can see that for Prompt 3-3, P5 achieves the best performances against all baselines. For feature-based prompts (Prompts 3-9 & 3-12), P5 can outperform PETER+ in most cases, especially for Beauty and Toys.

5.3.4 Review Related. We take Prompts 4-2 and 4-4 to compare P5's performance with T0 on review preference prediction, as shown in Table 5. We can see that P5-S achieves better RMSE and MAE on Beauty and Toys, while P5-B shows better performance

Table 6: Performance comparison on review summarization (%).


Methods | Sports: BLEU-2 ROUGE-1 ROUGE-2 ROUGE-L | Beauty: BLEU-2 ROUGE-1 ROUGE-2 ROUGE-L | Toys: BLEU-2 ROUGE-1 ROUGE-2 ROUGE-L
T0 (4-1) 2.1581 2.2695 0.5694 1.6221 1.2871 1.2750 0.3904 0.9592 2.2296 2.4671 0.6482 1.8424
GPT-2 (4-1) 0.7779 4.4534 1.0033 1.9236 0.5879 3.3844 0.6756 1.3956 0.6221 3.7149 0.6629 1.4813
P5-S (4-1) 2.4962 11.6701 2.7187 10.4819 2.1225 8.4205 1.6676 7.5476 2.4752 9.4200 1.5975 8.2618
P5-B (4-1) 2.6910 12.0314 3.2921 10.7274 1.9325 8.2909 1.4321 7.4000 1.7833 8.7222 1.3210 7.6134

Table 7: Performance comparison on direct recommendation.


Methods | Sports: HR@1 HR@5 NDCG@5 HR@10 NDCG@10 | Beauty: HR@1 HR@5 NDCG@5 HR@10 NDCG@10 | Toys: HR@1 HR@5 NDCG@5 HR@10 NDCG@10
BPR-MF 0.0314 0.1404 0.0848 0.2563 0.1220 0.0311 0.1426 0.0857 0.2573 0.1224 0.0233 0.1066 0.0641 0.2003 0.0940
BPR-MLP 0.0351 0.1520 0.0927 0.2671 0.1296 0.0317 0.1392 0.0848 0.2542 0.1215 0.0252 0.1142 0.0688 0.2077 0.0988
SimpleX 0.0331 0.2362 0.1505 0.3290 0.1800 0.0325 0.2247 0.1441 0.3090 0.1711 0.0268 0.1958 0.1244 0.2662 0.1469
P5-S (5-1) 0.0638 0.2096 0.1375 0.3143 0.1711 0.0600 0.2021 0.1316 0.3121 0.1670 0.0405 0.1538 0.0969 0.2405 0.1248
P5-B (5-1) 0.0245 0.0816 0.0529 0.1384 0.0711 0.0224 0.0904 0.0559 0.1593 0.0780 0.0187 0.0827 0.0500 0.1543 0.0729
P5-S (5-4) 0.0701 0.2241 0.1483 0.3313 0.1827 0.0862 0.2448 0.1673 0.3441 0.1993 0.0413 0.1411 0.0916 0.2227 0.1178
P5-B (5-4) 0.0299 0.1026 0.0665 0.1708 0.0883 0.0506 0.1557 0.1033 0.2350 0.1287 0.0435 0.1316 0.0882 0.2000 0.1102
P5-S (5-5) 0.0574 0.1503 0.1050 0.2207 0.1276 0.0601 0.1611 0.1117 0.2370 0.1360 0.0440 0.1282 0.0865 0.2011 0.1098
P5-B (5-5) 0.0641 0.1794 0.1229 0.2598 0.1488 0.0588 0.1573 0.1089 0.2325 0.1330 0.0386 0.1122 0.0756 0.1807 0.0975
P5-S (5-8) 0.0567 0.1514 0.1049 0.2196 0.1269 0.0571 0.1566 0.1078 0.2317 0.1318 0.0451 0.1322 0.0889 0.2023 0.1114
P5-B (5-8) 0.0726 0.1955 0.1355 0.2802 0.1627 0.0608 0.1564 0.1096 0.2300 0.1332 0.0389 0.1147 0.0767 0.1863 0.0997
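The 1-out-of-100 evaluation behind Table 7 pairs one ground-truth item with 99 sampled items the user has never interacted with. A minimal sketch of building such a candidate pool (the helper name and fixed seed are ours, for illustration):

```python
import random

def candidate_pool(ground_truth, interacted, all_items, n_negatives=99, seed=0):
    """One ground-truth item plus n_negatives sampled non-interacted items."""
    rng = random.Random(seed)
    negatives = [i for i in all_items if i not in interacted and i != ground_truth]
    pool = rng.sample(negatives, n_negatives) + [ground_truth]
    rng.shuffle(pool)  # hide the target's position before ranking
    return pool
```

The model then ranks the 100 candidates, and HR/NDCG are computed from the ground-truth item's position in that ranking.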

on Sports. Additionally, we take Prompt 4-1 to evaluate P5's ability on review summarization, as shown in Table 6. For this task, P5-S clearly outperforms T0 and GPT-2 on both the Beauty and Toys datasets. It is worth noting that GPT-2 and T0 have 1.5B and 11B parameters, respectively. This shows that P5 can achieve better performance than these competitive baselines with a much smaller model size.

Table 8: Statistics on domain transfer evaluation sets.

Dataset  | Sports | Beauty | Toys
#Users   | 290    | 439    | 487
#Items   | 381    | 586    | 886
#Reviews | 478    | 1,237  | 1,183

5.3.5 Direct Recommendation. Finally, Prompts 5-1, 5-4, 5-5, and 5-8 are applied to evaluate the direct recommendation task under the 1-out-of-100 evaluation setting. For binary question prompts (5-1 & 5-4), which are discriminative prompts, we use the softmax generation probability of "yes" to rank the candidate items. For open question prompts (5-5 & 5-8), which are generative prompts, we use beam search (Eq.(2)) to generate the top-𝑘 list. The results are presented in Table 7. From the table, we can see that P5-B and P5-S have great advantages over BPR-MF and BPR-MLP on all three datasets. Compared with SimpleX, we can see that P5 works especially well on top-1 item ranking, where it is more than two times better than SimpleX on HR@1. Besides, P5 also achieves the best results on most of the other metrics. The success of P5 on direct recommendation shows the competence of the sequence-to-sequence generation framework in the recommendation domain.

5.4 Zero-shot Generalization to Unseen Prompts and Items in New Domain (RQ2)
5.4.1 Transfer to Unseen Personalized Prompts. In this section, we transfer the pretrained P5 models to the prompts previously held out during pretraining. These unseen prompts are from the same task families, and the testing items have been seen by P5 during pretraining at least once. The experimental results are also reported in Table 2 to Table 7. As previously discussed in Section 5.3, P5 achieves surprisingly good performances on various task families when challenged by unseen prompts. On some specific datasets, the performances of P5 on unseen prompts even surpass those on seen prompts, e.g., P5-B gets the best performance under Prompt 2-13 on Sports. These results show that multitask prompted pretraining gives P5 enough robustness to understand unseen prompts with wording variations.

5.4.2 Transfer to Items in New Domain. Next, we increase the difficulty level of zero-shot transfer. We collect a group of 741 users that exist in all three domains, together with their interaction and review histories in the other domains. The detailed statistics of these domain transfer evaluation sets are illustrated in Table 8. We then challenge P5-B pretrained on one domain with unseen prompts from Task Family Z, whose item fields are filled with information from a new product domain. For example, we ask the P5 model pretrained on the Toys domain about an existing user's preference towards an item in the Beauty domain. The full results on all six directions are reported in Table 9. From the table, we notice P5 still maintains sufficient performance for rating prediction (Prompts Z-2 & Z-3), like/dislike prediction (Prompts Z-1 & Z-4), as well as explanation generation with a feature word (Prompt Z-6). In contrast, direct explanation generation without a feature word (Prompts Z-5 & Z-7) is very difficult for P5 because it lacks awareness of relevant knowledge in the new domain. In Figure 4, we provide some example explanations generated by P5-B under the setup of zero-shot domain transfer (Prompt Z-6). We can see that P5 is able to catch different users' rating preferences and hint feature words, then integrate them with the knowledge learned from the previous domain to generate plausible explanations.
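The discriminative-prompt scoring described in Section 5.3.5 (ranking candidates by the softmax probability of generating "yes") can be sketched as below. The `(yes_logit, no_logit)` pairs are hypothetical stand-ins for the decoder's scores over the two answer tokens:

```python
import math

def rank_by_yes_probability(candidate_logits):
    """Rank candidate items by the softmax probability of answering "yes".

    candidate_logits maps item id -> (yes_logit, no_logit)."""
    def p_yes(yes_logit, no_logit):
        m = max(yes_logit, no_logit)  # subtract max for numerical stability
        e_yes = math.exp(yes_logit - m)
        e_no = math.exp(no_logit - m)
        return e_yes / (e_yes + e_no)
    scores = {item: p_yes(y, n) for item, (y, n) in candidate_logits.items()}
    return sorted(scores, key=scores.get, reverse=True)
```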

Toys -> Beauty

Case 1
Input: Based on the word color , help user_4549 write a 5-star explanation for this new product : \n title : Bain De Terre Serum Anti-Frizz Recovery Complex 1.7 oz. \n price : 3.86 \n brand : Bain de Terre
Target Output: I have color treated fine hair but lots of it
P5's Output: The colors are bright

Case 2
Input: Based on the word shampoo , help user_1491 write a 1-star explanation for this new product : \n title : CLEAR SCALP & HAIR BEAUTY Volumizing Root Boost Nourishing Shampoo, 12.9 Fluid Ounce \n price : 1.66 \n brand : Clear
Target Output: I tried the shampoo and it hardly foams up
P5's Output: Not worth the price

Beauty -> Sports

Case 3
Input: Based on the word tool , help user_3877 write a 5-star explanation for this new product : \n title : SOG Specialty Knives & Tools TF6-CP Trident Knife with Straight Edge Assisted Folding 3.75-Inch Steel Tanto Shape Blade and GRN Handle, Satin Finish \n price : 57.6 \n brand : SOG Specialty Knives
Target Output: this is an excellent tool
P5's Output: This is a great tool tool

Case 4
Input: Based on the word saw , help user_326 write a 2-star explanation for this new product : \n title : Rome's #1705 Square Pie Iron with Steel and Wood Handles \n price : 20.26 \n brand : Rome Industries
Target Output: After looking closely we saw 2 of the rivots that are supposed to hold it on were missing
P5's Output: I saw no visible improvement

Sports -> Toys

Case 5
Input: Based on the word game , help user_29782 write a 3-star explanation for this new product : \n title : Deal Breaker Coaster Card Game \n price : 13.95 \n brand : Zobmondo!!
Target Output: I think this game could prove interesting if played while drinking
P5's Output: I'm a big fan of this game and it is a great game

Case 6
Input: Based on the word cars , help user_2121 write a 5-star explanation for this new product : \n title : Ridemakerz Marvel Universe Spiderman and Venom Xtreme Customz Swap Set \n price : 14.89 \n brand : RIDEMAKERZ
Target Output: These are great looking little plastic cars
P5's Output: I have several cars and this one is the best

Figure 4: Example cases of zero-shot domain transfer on the Z-6 task. We demonstrate three transfer directions: Toys to Beauty, Beauty to Sports, and Sports to Toys.
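The Z-6 inputs shown in Figure 4 follow a fixed template over the feature word, user ID, star rating, and item metadata. A small helper that assembles such a string (the function name is ours, and the `\n` markers are kept as literal text, as in the figure):

```python
def build_z6_input(feature_word, user_id, stars, title, price, brand):
    """Assemble a Z-6 style input string following the template in Figure 4."""
    return (
        f"Based on the word {feature_word} , help user_{user_id} write a "
        f"{stars}-star explanation for this new product : \\n title : {title} "
        f"\\n price : {price} \\n brand : {brand}"
    )
```

Filling the title, price, and brand fields with metadata from a different product domain than the one P5 was pretrained on is exactly what makes this a zero-shot domain transfer probe.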


Figure 5: Performance comparison among P5-S, P5-SN, and P5-PS on Beauty.

5.5 Ablation on Model Size (RQ3)
In this section, we discuss the influence of model size on the performance of P5 on different recommendation tasks. Here, we train two size variants of P5, namely P5-small and P5-base. The parameter numbers of these two P5 models are 60.75M and 223.28M, respectively. From Table 2 to Table 7, we can see that although P5-S is only 1/4 of the size of P5-B, P5-S can beat P5-B on a series of tasks and datasets. For example, P5-S achieves better sequential recommendation, review preference prediction, and direct recommendation (Prompts 5-5 & 5-8) performance than P5-B on Toys. In contrast, P5-B shows advantages on sequential recommendation and review preference prediction tasks for Sports. Since Sports contains more users, items, and reviews and has a lower sparsity, it requires a model with higher capacity to discover the latent correlations among different personalized factors. The findings indicate that larger P5 models may be needed when the dataset is large, while for smaller datasets, smaller P5 models could be enough. As a result, we should choose an appropriate model size that matches the scale of the training data.

5.6 Ablation on Task Scaling (RQ3)
Moreover, we explore whether multitask prompted pretraining is superior to pretraining on each task family alone. We pretrain P5-small on the Beauty dataset with prompts from every single task family, resulting in five models – P5-S1, P5-S2, P5-S3, P5-S4, and P5-S5. We then compare P5-S on various recommendation tasks with the corresponding single-task P5 model. The performance comparison between P5-S and P5-SN (𝑁 ∈ [1, 2, 3, 4, 5]) is illustrated in Figure 5. As shown in the figure, P5-S achieves comparable or better performance than P5-SN on rating prediction, sequential recommendation, and direct recommendation tasks, while on text generation tasks such as explanation generation (Prompts 3-9 & 3-12) and review summarization (Prompt 4-1), P5-SN is better than P5-S. This indicates that multitask modeling (P5-S) seeks a good balance among tasks and improves recommendation performance by leveraging the power of language understanding. Besides, both P5-S and P5-SN perform better than or comparably with state-of-the-art baselines on all tasks, as shown in Table 2 through Table 7, which demonstrates the power of P5 for recommendation.
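The default leave-one-out pretrain–predict combination referenced throughout these ablations (the last prompt of each task family is held out for zero-shot evaluation; everything else goes into multitask pretraining) can be sketched as:

```python
def split_prompts(task_families):
    """Hold out the last prompt in every task family for zero-shot evaluation;
    pretrain on all remaining prompts."""
    pretrain, zero_shot = [], []
    for prompts in task_families.values():
        pretrain.extend(prompts[:-1])
        zero_shot.append(prompts[-1])
    return pretrain, zero_shot
```

For example, the rating family contributes Prompt 1-10 to the zero-shot set while its other prompts are used for pretraining, matching the setup described in Section 5.3.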



Figure 6: Performance of P5-S and P5-I on Beauty showing the influence of how to implement personalization.

Table 9: Performance on zero-shot domain transfer.

Directions      | Z-1 & Z-4: Accuracy | Z-2 & Z-3: MAE | Z-5 & Z-7 (%): BLEU-2 ROUGE-1 | Z-6 (%): BLEU-2 ROUGE-1
Toys -> Beauty  | 0.7922 | 0.8244 | 0.1940 3.5441 | 0.8623 13.8487
Toys -> Sports  | 0.8682 | 0.6644 | 0.1203 3.7684 | 0.3972 13.7654
Beauty -> Toys  | 0.8073 | 0.7792 | 0.0309 1.4904 | 2.7606 15.1632
Beauty -> Sports| 0.8676 | 0.6838 | 0.0264 1.7033 | 1.6530 13.9460
Sports -> Toys  | 0.8230 | 0.7443 | 0.0060 1.7313 | 4.2334 17.1248
Sports -> Beauty| 0.8057 | 0.8102 | 0.0080 2.0195 | 3.5059 18.5577

5.7 Ablation on Prompt Scaling (RQ3)
As mentioned in the implementation details, our default pretrain–predict task combination follows the leave-one-out strategy. However, do we need so many prompts during pretraining to enable P5's zero-shot generalization ability? In this section, we explore reducing the number of pretraining prompts and then make comparisons with the P5 model pretrained under the default setup. To this end, we choose a collection of pretraining prompts with the minimum number of prompts needed to cover all important personalized fields. Specifically, this combination contains the following 18 personalized prompts: {1-5, 1-6, 1-8, 1-9, 2-1, 2-3, 2-8, 2-11, 3-2, 3-3, 3-6, 3-9, 4-1, 4-2, 4-3, 5-2, 5-5, 5-7}. Similar to the default pretrain–predict combination, the last prompt in each task family is for zero-shot evaluation. We name this prompt-scaling variant of P5-small as P5-PS and then pretrain P5-PS on the Beauty dataset. The performance comparison between P5-S and P5-PS is also presented in Figure 5. From the figure, we can observe that P5-S beats P5-PS on most tasks except for some generation tasks (i.e., Prompts 3-3, 3-9 & 4-1). Interestingly, P5-S outperforms P5-PS on Prompt 3-12 – a zero-shot explanation generation task. In fact, P5-S also shows its superiority on other zero-shot tasks such as Prompts 1-10, 2-13, and 5-8. Overall, we find that a larger number of high-quality personalized prompts can generally help P5 achieve better performance on various recommendation tasks, especially zero-shot tasks with unseen prompts.

5.8 How to Implement Personalization (RQ4)
In this section, we discuss different strategies to implement personalization in P5. The default practice is to use the SentencePiece tokenizer to split personalized fields into multiple sub-word units, while using whole-word embeddings to preserve the field information (Figure 3). A straightforward alternative is creating an independent extra token for each user and item. Here we name this P5-small variant P5-I and also pretrain it on the Beauty dataset. While the former utilizes collaborative learning to implicitly optimize the latent correlations among different sub-word tokens, the latter learns a unique personalized representation for every extra token. The performance comparison between P5-S and P5-I is shown in Figure 6. We can see that P5-I achieves similar performances as P5-S on regression tasks (Prompts 1-6 & 1-10 for rating prediction, Prompts 4-2 & 4-4 for review-based rating regression) and review summarization tasks (Prompt 4-1). Also, P5-I is slightly better than P5-S on explanation generation tasks (Prompts 3-3, 3-9 & 3-12). However, P5-I underperforms P5-S by a large margin on both sequential and direct recommendation tasks (all prompts in Figure 6 (c) & (d)). The reason behind P5-I's lower performance is that the huge number of newly introduced extra tokens and embeddings cannot be well trained compared with the original sub-word units initialized from T5. This shows that our default setting can achieve better recommendation and overall performance with the help of collaborative learning, while keeping a small and constant number of learnable tokens.

6 CONCLUSIONS AND FUTURE WORK
In this paper, we present P5, which unifies different recommendation tasks into a shared language modeling and natural language generation framework. By designing a collection of personalized prompts covering five recommendation task families, we transfer all raw data such as the user–item interactions, user descriptions, item metadata, and user reviews into the same format – input–target text pairs. We then pretrain P5 in a full language environment to help it discover deeper semantics for various recommendation tasks. According to our experiments, P5 can beat or achieve similar performance to several representative approaches on all five task families. Moreover, P5 shows the generalization ability to perform zero-shot transfer to new items, new domains, and new personalized prompts. In the future, we will continue exploring ways to further enlarge the model size of P5 and employ more powerful base models such as GPT-3, OPT, and BLOOM. Besides, P5 is a very flexible paradigm, and it is promising to further extend P5 to diverse modalities and more tasks such as conversational recommendation, comparative recommendation, cross-platform recommendation, or even various search tasks by incorporating user queries into P5. Finally, in this work we designed explicit prompts, since they are intuitive, flexible, and close to the natural way humans communicate with each other, which enables instruction-based recommendation. In the future, we will also investigate prompt search and/or latent prompt techniques to achieve instruction prompts, or leverage retrieval-enhanced generation to further boost P5's performance on downstream tasks.

ACKNOWLEDGMENT
This work was supported in part by NSF IIS 1910154, 2007907, and 2046457. Any opinions, findings, conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect those of the sponsors.

REFERENCES
[1] Vamsi Aribandi, Yi Tay, Tal Schuster, Jinfeng Rao, Huaixiu Steven Zheng, Sanket Vaibhav Mehta, Honglei Zhuang, Vinh Q. Tran, Dara Bahri, Jianmo Ni, Jai Gupta, Kai Hui, Sebastian Ruder, and Donald Metzler. 2022. ExT5: Towards Extreme Multi-Task Scaling for Transfer Learning. In International Conference on Learning Representations.
[2] Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel Ziegler, Jeffrey Wu, Clemens Winter, Chris Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. 2020. Language Models are Few-Shot Learners. In NeurIPS.
[3] Hanxiong Chen, Shaoyun Shi, Yunqi Li, and Yongfeng Zhang. 2021. Neural collaborative reasoning. In Proceedings of the Web Conference 2021. 1516–1527.
[4] Xu Chen, Hanxiong Chen, Hongteng Xu, Yongfeng Zhang, Yixin Cao, Zheng Qin, and Hongyuan Zha. 2019. Personalized fashion recommendation with visual explanations based on multimodal attention network: Towards visually explainable recommendation. In Proceedings of the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval. 765–774.
[5] Heng-Tze Cheng, Levent Koc, Jeremiah Harmsen, Tal Shaked, Tushar Chandra, Hrishi Aradhye, Glen Anderson, Greg Corrado, Wei Chai, Mustafa Ispir, et al. 2016. Wide & deep learning for recommender systems. In Proceedings of the 1st Workshop on Deep Learning for Recommender Systems. 7–10.
[6] Jaemin Cho, Jie Lei, Hao Tan, and Mohit Bansal. 2021. Unifying Vision-and-Language Tasks via Text Generation. In ICML.
[7] Kyunghyun Cho, Bart van Merrienboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. 2014. Learning Phrase Representations using RNN Encoder–Decoder for Statistical Machine Translation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP). 1724–1734.
[8] Konstantina Christakopoulou, Filip Radlinski, and Katja Hofmann. 2016. Towards conversational recommender systems. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 815–824.
[9] Gabriel de Souza Pereira Moreira, Sara Rabhi, Jeong Min Lee, Ronay Ak, and Even Oldridge. 2021. Transformers4Rec: Bridging the Gap between NLP and Sequential/Session-Based Recommendation. In Fifteenth ACM Conference on Recommender Systems. 143–153.
[20] Xinran He, Junfeng Pan, Ou Jin, Tianbing Xu, Bo Liu, Tao Xu, Yanxin Shi, Antoine Atallah, Ralf Herbrich, Stuart Bowers, et al. 2014. Practical lessons from predicting clicks on ads at Facebook. In Proceedings of the Eighth International Workshop on Data Mining for Online Advertising. 1–9.
[21] Balázs Hidasi, Alexandros Karatzoglou, Linas Baltrunas, and Domonkos Tikk. 2016. Session-based Recommendations with Recurrent Neural Networks. In ICLR.
[22] Dietmar Jannach, Ahtsham Manzoor, Wanling Cai, and Li Chen. 2021. A survey on conversational recommender systems. ACM Computing Surveys (CSUR) 54, 5 (2021), 1–36.
[23] Zhengbao Jiang, Frank F Xu, Jun Araki, and Graham Neubig. 2020. How can we know what language models know? Transactions of the Association for Computational Linguistics 8 (2020), 423–438.
[24] Wang-Cheng Kang and Julian McAuley. 2018. Self-attentive sequential recommendation. In 2018 IEEE International Conference on Data Mining (ICDM). IEEE, 197–206.
[25] Yehuda Koren, Robert Bell, and Chris Volinsky. 2009. Matrix factorization techniques for recommender systems. Computer 42, 8 (2009), 30–37.
[26] Xuan Nhat Lam, Thuc Vu, Trong Duc Le, and Anh Duc Duong. 2008. Addressing cold-start problem in recommendation systems. In Proceedings of the 2nd International Conference on Ubiquitous Information Management and Communication. 208–211.
[27] Hoyeop Lee, Jinbae Im, Seongwon Jang, Hyunsouk Cho, and Sehee Chung. 2019. MeLU: Meta-learned user preference estimator for cold-start recommendation. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. 1073–1082.
[28] Brian Lester, Rami Al-Rfou, and Noah Constant. 2021. The power of scale for parameter-efficient prompt tuning. In EMNLP.
[29] Jingjing Li, Mengmeng Jing, Ke Lu, Lei Zhu, Yang Yang, and Zi Huang. 2019. From zero-shot learning to cold-start recommendation. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 33. 4189–4196.
[30] Lei Li, Yongfeng Zhang, and Li Chen. 2020. Generate neural template explanations for recommendation. In Proceedings of the 29th ACM International Conference on Information & Knowledge Management. 755–764.
[31] Lei Li, Yongfeng Zhang, and Li Chen. 2021. Personalized Transformer for Explainable Recommendation. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers). 4947–4957.
[32] Piji Li, Zihao Wang, Zhaochun Ren, Lidong Bing, and Wai Lam. 2017. Neural rating regression with abstractive tips generation for recommendation. In Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval. 345–354.
[33] Xiang Lisa Li and Percy Liang. 2021. Prefix-tuning: Optimizing continuous prompts for generation. In ACL.
[34] Yunqi Li, Hanxiong Chen, Juntao Tan, and Yongfeng Zhang. 2022. Causal factorization machine for robust recommendation. In Proceedings of the 22nd ACM/IEEE Joint Conference on Digital Libraries. 1–9.
[10] Li Dong, Shaohan Huang, Furu Wei, Mirella Lapata, Ming Zhou, and Ke Xu. 2017. [35] Greg Linden, Brent Smith, and Jeremy York. 2003. Amazon. com recommenda-
Learning to generate product reviews from attributes. In EACL. tions: Item-to-item collaborative filtering. IEEE Internet computing 7, 1 (2003),
[11] Avia Efrat and Omer Levy. 2020. The Turking Test: Can Language Models 76–80.
Understand Instructions? arXiv preprint arXiv:2010.11982 (2020). [36] Jiachang Liu, Dinghan Shen, Yizhe Zhang, Bill Dolan, Lawrence Carin, and
[12] Zuohui Fu, Yikun Xian, Shijie Geng, Gerard De Melo, and Yongfeng Zhang. Weizhu Chen. 2021. What Makes Good In-Context Examples for GPT-3? arXiv
2021. Popcorn: Human-in-the-loop Popularity Debiasing in Conversational preprint arXiv:2101.06804 (2021).
Recommender Systems. In Proceedings of the 30th ACM International Conference [37] Pengfei Liu, Weizhe Yuan, Jinlan Fu, Zhengbao Jiang, Hiroaki Hayashi, and Gra-
on Information & Knowledge Management. 494–503. ham Neubig. 2021. Pre-train, prompt, and predict: A systematic survey of prompt-
[13] Zuohui Fu, Yikun Xian, Yongfeng Zhang, and Yi Zhang. 2020. Tutorial on conver- ing methods in natural language processing. arXiv preprint arXiv:2107.13586
sational recommendation systems. In fourteenth ACM conference on recommender (2021).
systems. 751–753. [38] Xiao Liu, Yanan Zheng, Zhengxiao Du, Ming Ding, Yujie Qian, Zhilin Yang, and
[14] Zuohui Fu, Yikun Xian, Yaxin Zhu, Shuyuan Xu, Zelong Li, Gerard De Melo, Jie Tang. 2021. GPT Understands, Too. arXiv preprint arXiv:2103.10385 (2021).
and Yongfeng Zhang. 2021. HOOPS: Human-in-the-Loop Graph Reasoning [39] Ilya Loshchilov and Frank Hutter. 2018. Decoupled Weight Decay Regularization.
for Conversational Recommendation. In Proceedings of the 44th International In International Conference on Learning Representations.
ACM SIGIR Conference on Research and Development in Information Retrieval. [40] Yao Lu, Max Bartolo, Alastair Moore, Sebastian Riedel, and Pontus Stenetorp.
2415–2421. 2022. Fantastically ordered prompts and where to find them: Overcoming few-
[15] Zeno Gantner, Lucas Drumond, Christoph Freudenthaler, Steffen Rendle, and shot prompt order sensitivity. In Proceedings of the 60th Annual Meeting of the
Lars Schmidt-Thieme. 2010. Learning attribute-to-feature mappings for cold-start Association for Computational Linguistics.
recommendations. In 2010 IEEE International Conference on Data Mining. IEEE, [41] Chen Ma, Peng Kang, and Xue Liu. 2019. Hierarchical gating networks for
176–185. sequential recommendation. In Proceedings of the 25th ACM SIGKDD international
[16] Tianyu Gao, Adam Fisch, and Danqi Chen. 2021. Making pre-trained language conference on knowledge discovery & data mining. 825–833.
models better few-shot learners. In ACL-IJCNLP. [42] Tong Man, Huawei Shen, Xiaolong Jin, and Xueqi Cheng. 2017. Cross-domain
[17] Yingqiang Ge, Shuchang Liu, Zuohui Fu, Juntao Tan, Zelong Li, Shuyuan Xu, recommendation: An embedding and mapping approach.. In IJCAI, Vol. 17. 2464–
Yunqi Li, Yikun Xian, and Yongfeng Zhang. 2022. A survey on trustworthy 2470.
recommender systems. arXiv:2207.12515 (2022). [43] Kelong Mao, Jieming Zhu, Jinpeng Wang, Quanyu Dai, Zhenhua Dong, Xi Xiao,
[18] Yuxian Gu, Xu Han, Zhiyuan Liu, and Minlie Huang. 2021. PPT: Pre-trained and Xiuqiang He. 2021. SimpleX: A Simple and Strong Baseline for Collaborative
Prompt Tuning for Few-shot Learning. arXiv preprint arXiv:2109.04332 (2021). Filtering. In Proceedings of the 30th ACM International Conference on Information
[19] Huifeng Guo, Ruiming Tang, Yunming Ye, Zhenguo Li, and Xiuqiang He. 2017. & Knowledge Management. 1243–1252.
DeepFM: a factorization-machine based neural network for CTR prediction. In [44] Michael J Pazzani and Daniel Billsus. 2007. Content-based recommendation
Proceedings of the 26th International Joint Conference on Artificial Intelligence. systems. In The adaptive web. Springer, 325–341.
1725–1731.
11
RecSys ’22, September 18–23, 2022, Seattle, WA, USA Shijie Geng, Shuchang Liu, Zuohui Fu, Yingqiang Ge, Yongfeng Zhang


APPENDIX

In this appendix, we first provide additional experimental results on the Yelp dataset in Section A. Then we collect and report statistics on the training and inference time of P5 variants in Section B. Since personalized prompts constitute a very important part of our work, we offer the full lists of the personalized prompts for the Amazon datasets in Section D and the Yelp dataset in Section E, respectively.

A EXPERIMENTAL RESULTS ON YELP DATASET

For the Yelp dataset, we also follow the default pretrain–predict task combination setup. Based on the personalized prompts presented in Section E, we hold out Prompt 1-10, Prompt 2-13, Prompt 3-10, Prompt 4-3, and Prompt 5-8 for zero-shot evaluation and pretrain a P5-S model with the remaining prompts. We adopt the same baselines as described in Section 5.2 for performance comparison. Again, for each task family, we choose one or more seen prompts to supplement the aforementioned unseen zero-shot prompts when performing evaluations. The performances of P5-S and the relevant baseline models are shown in Tables 10 to 14. As these experimental results indicate, P5 also shows strong capability on Yelp, especially for the sequential recommendation and explanation generation tasks.

Table 10: Performance comparison on rating prediction (Yelp).

Methods       RMSE     MAE
MF            1.2645   1.0426
MLP           1.2951   1.0340
P5-S (1-6)    1.4868   1.0186
P5-S (1-10)   1.4685   1.0054

Table 11: Performance on sequential recommendation (Yelp).

Methods       HR@5     NDCG@5   HR@10    NDCG@10
Caser         0.0151   0.0096   0.0253   0.0129
HGN           0.0186   0.0115   0.0326   0.0159
GRU4Rec       0.0152   0.0099   0.0263   0.0134
BERT4Rec      0.0051   0.0033   0.0090   0.0045
FDSA          0.0158   0.0098   0.0276   0.0136
SASRec        0.0162   0.0100   0.0274   0.0136
S3-Rec        0.0201   0.0123   0.0341   0.0168
P5-S (2-3)    0.0568   0.0402   0.0707   0.0447
P5-S (2-13)   0.0574   0.0403   0.0703   0.0445

Table 12: Performance on explanation generation (%) (Yelp).

Methods       BLEU4    ROUGE1    ROUGE2   ROUGEL
Attn2Seq      0.8031   14.1185   1.9730   10.9220
NRT           0.8128   13.9256   1.9635   10.6980
PETER         0.5938   12.0065   1.8645   9.4645
P5-S (3-2)    1.2717   18.1796   2.9477   13.5465
PETER+        3.2827   27.2366   8.1941   21.1573
P5-S (3-7)    2.9797   27.1860   6.6827   19.6348
P5-S (3-10)   3.0600   27.2990   6.7655   19.6329

Table 13: Performance on review preference prediction (Yelp).

Methods       RMSE     MAE
T0 (4-2)      0.5383   0.2756
T0 (4-3)      0.5359   0.2732
P5-S (4-2)    0.6301   0.3113
P5-S (4-3)    0.6350   0.3150

Table 14: Performance on direct recommendation (Yelp).

Methods       HR@1     HR@5     NDCG@5   HR@10    NDCG@10
BPR-MF        0.0813   0.2251   0.1543   0.3312   0.1886
BPR-MLP       0.0489   0.1876   0.1184   0.3066   0.1566
SimpleX       0.0569   0.3970   0.2538   0.5473   0.3020
P5-S (5-1)    0.0603   0.2097   0.1354   0.3197   0.1708
P5-S (5-4)    0.0567   0.2065   0.1320   0.3190   0.1682
P5-S (5-5)    0.0696   0.2134   0.1423   0.3178   0.1758
P5-S (5-8)    0.0705   0.2136   0.1432   0.3219   0.1780

B STATISTICS ON TRAINING & INFERENCE TIME (RQ5)

In this section, we provide statistics on the training and inference time of P5 models; we collect the running time on the Beauty dataset. As mentioned in Section 5.1, we trained our P5 models on 4 × A5000 GPUs. From our records, P5-S spent 6.7 hours to finish training, while P5-B took 24.3 hours due to its larger number of parameters. For inference, we use a single A5000 GPU to conduct evaluations. The average inference times of P5-S and P5-B on different tasks are presented in Table 15. Among all tasks, the sequential and direct recommendation tasks require much longer inference time than the other tasks. This can be ascribed to the beam search step described in Eq. (2): since we need to generate a list of recommended items, the larger the beam size B is, the longer the decoding will take. Besides, we can also observe that direct recommendation typically takes longer than sequential recommendation. The reason is that 100 candidates are included in the input of the direct recommendation prompts, which usually contain more tokens than the sequential recommendation prompts. Overall, even though the pretraining of P5 takes hours to finish, the inference is very fast. It is also promising to further reduce the training and inference time with the help of efficient Transformer techniques.

Table 15: Average inference time (in milliseconds) of P5 variants on different tasks on the Beauty dataset.

                 per user               per user–item pair                      per review
Models    Sequential   Direct    Rating   Explanation   Summarization   Preference
          (2-13)       (5-8)     (1-10)   (3-12)        (4-1)           (4-4)
P5-S      53.96        57.40     4.93     13.41         8.66            6.25
P5-B      78.19        98.80     13.40    34.71         23.51           14.85
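Section B above attributes the longer inference time of the recommendation tasks to beam search: per decoding step, the decoder scores every vocabulary continuation of every kept hypothesis, so work grows with the beam size B. A toy beam search makes this concrete; the vocabulary and scoring function here are invented for illustration and are not P5's actual decoder:

```python
import math

def beam_search(score_fn, vocab, beam_size, length):
    """Toy beam search: keep the beam_size highest-scoring prefixes at
    every step. Per-step work is (number of beams) * len(vocab), so it
    grows with the beam size."""
    beams = [((), 0.0)]  # (token sequence, cumulative log-probability)
    evaluations = 0
    for _ in range(length):
        candidates = []
        for seq, logp in beams:
            for tok in vocab:
                candidates.append((seq + (tok,), logp + math.log(score_fn(seq, tok))))
                evaluations += 1
        # prune to the top-B hypotheses before the next step
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_size]
    return beams, evaluations

# Illustrative scorer: a fixed distribution that prefers larger token ids
vocab = list(range(10))
score = lambda seq, tok: (tok + 1) / sum(range(1, len(vocab) + 1))

_, evals_small = beam_search(score, vocab, beam_size=2, length=5)
_, evals_large = beam_search(score, vocab, beam_size=20, length=5)
print(evals_small, evals_large)  # larger beams evaluate many more candidates
```

This is why generating a list of recommended items (which needs a large beam to produce many ranked sequences) dominates the inference time reported in Table 15.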

Table 16: Performance comparison on direct recommendation under larger sample size.
              Sports                                      Beauty                                      Toys
Methods       HR@1   HR@5   NDCG@5  HR@10  NDCG@10       HR@1   HR@5   NDCG@5  HR@10  NDCG@10       HR@1   HR@5   NDCG@5  HR@10  NDCG@10
BPR-MF 0.0314 0.1404 0.0848 0.2563 0.1220 0.0311 0.1426 0.0857 0.2573 0.1224 0.0233 0.1066 0.0641 0.2003 0.0940
BPR-MLP 0.0351 0.1520 0.0927 0.2671 0.1296 0.0317 0.1392 0.0848 0.2542 0.1215 0.0252 0.1142 0.0688 0.2077 0.0988
SimpleX 0.0331 0.2362 0.1505 0.3290 0.1800 0.0325 0.2247 0.1441 0.3090 0.1711 0.0268 0.1958 0.1244 0.2662 0.1469
P5-S (5-1) 0.0638 0.2096 0.1375 0.3143 0.1711 0.0600 0.2021 0.1316 0.3121 0.1670 0.0405 0.1538 0.0969 0.2405 0.1248
P5-B (5-1) 0.0245 0.0816 0.0529 0.1384 0.0711 0.0224 0.0904 0.0559 0.1593 0.0780 0.0187 0.0827 0.0500 0.1543 0.0729
P5-S (5-4) 0.0701 0.2241 0.1483 0.3313 0.1827 0.0862 0.2448 0.1673 0.3441 0.1993 0.0413 0.1411 0.0916 0.2227 0.1178
P5-B (5-4) 0.0299 0.1026 0.0665 0.1708 0.0883 0.0506 0.1557 0.1033 0.2350 0.1287 0.0435 0.1316 0.0882 0.2000 0.1102
P5-S (5-5) 0.0766 0.2195 0.1495 0.3187 0.1814 0.0826 0.2378 0.1613 0.3289 0.1906 0.0453 0.1404 0.0933 0.2192 0.1185
P5-B (5-5) 0.0618 0.1812 0.1226 0.2669 0.1501 0.0601 0.1607 0.1113 0.2308 0.1339 0.0438 0.1367 0.0910 0.2106 0.1147
P5-S (5-8) 0.0967 0.2519 0.1763 0.3530 0.2089 0.1020 0.2688 0.1869 0.3665 0.2184 0.0588 0.1793 0.1203 0.2691 0.1491
P5-B (5-8) 0.0631 0.1834 0.1245 0.2643 0.1505 0.0576 0.1572 0.1086 0.2306 0.1321 0.0442 0.1374 0.0914 0.2160 0.1167
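Table 16 varies the number of training sentences sampled per interaction. The construction described in Section C (99 random negatives plus the positive item form a 100-item candidate list, paired with a randomly chosen generative prompt, repeated N times per interaction) can be sketched as follows; the prompt strings are paraphrased stand-ins for Prompts 5-5 to 5-8 and the item ids are made up:

```python
import random

def build_training_sentences(user_items, all_items, prompts,
                             n_samples=10, n_neg=99, seed=0):
    """For each observed user-item interaction, draw n_neg negative items
    the user never interacted with, shuffle them together with the
    positive item into a candidate list, and pair the list with a
    randomly chosen prompt. Repeating n_samples times yields n_samples
    training sentences per interaction."""
    rng = random.Random(seed)
    records = []
    for user, items in user_items.items():
        pool = [i for i in all_items if i not in items]  # negatives only
        for pos in items:
            for _ in range(n_samples):
                candidates = rng.sample(pool, n_neg) + [pos]
                rng.shuffle(candidates)
                text = rng.choice(prompts).format(
                    user=user, candidates=", ".join(candidates))
                records.append({"text": text,
                                "candidates": candidates,
                                "target": pos})  # target is the positive item
    return records

items = [f"item_{i}" for i in range(200)]
interactions = {"user_1": {"item_3"}, "user_2": {"item_5", "item_8"}}
prompts = [  # illustrative stand-ins for the generative direct-rec prompts
    "Which item of the following to recommend for {user}? {candidates}",
    "Select the best item from these candidates for {user}: {candidates}",
]
data = build_training_sentences(interactions, items, prompts)
print(len(data), len(data[0]["candidates"]))  # 30 100
```

With the default sample size of 10 this yields 10 sentences per interaction; raising `n_samples` toward 200 reproduces the larger-sample setting evaluated in Table 16.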

C INFLUENCE OF SAMPLE SIZE

When creating the training sentences for direct recommendation under the generative prompts (Prompt 5-5 to 5-8), for each user-item interaction in the training set, we randomly sample 99 negative items that the user did not interact with. Together with the positive item, a 100-item candidate list is then created. We then randomly select a prompt from Prompt 5-5 to Prompt 5-8 to create a training sentence. For each user-item interaction, we repeat the above process N times and obtain N training sentences. In the default setting, we set the sample size N to 10. We find that increasing the training sample size could benefit the performance of direct recommendation under the generative prompts. We increase the sample size from the default number 10 to 200 and report the performances on Prompts 5-5 and 5-8 in Table 16. From the table, we can see that a larger data sample size can greatly boost the direct recommendation performances.

D FULL LIST OF PERSONALIZED PROMPTS FOR AMAZON DATASETS

D.1 Task Family 1: Rating Prediction

Prompt ID: 1-1
Input template: Which star rating will user_{{user_id}} give item_{{item_id}}? (1 being lowest and 5 being highest)
Target template: {{star_rating}}

Prompt ID: 1-2
Input template: How will user_{{user_id}} rate this product: {{item_title}}? (1 being lowest and 5 being highest)
Target template: {{star_rating}}

Prompt ID: 1-3
Input template: Will user_{{user_id}} give item_{{item_id}} a {{star_rating}}-star rating? (1 being lowest and 5 being highest)
Target template: {{answer_choices[label]}} (yes/no)

Prompt ID: 1-4
Input template: Does user_{{user_id}} like or dislike item_{{item_id}}?
Target template: {{answer_choices[label]}} (like/dislike) – like (4,5) / dislike (1,2,3)

Prompt ID: 1-5
Input template: Predict the user_{{user_id}}'s preference on item_{{item_id}} ({{item_title}})
-1 \n -2 \n -3 \n -4 \n -5
Target template: {{star_rating}}

Prompt ID: 1-6
Input template: What star rating do you think {{user_desc}} will give item_{{item_id}}? (1 being lowest and 5 being highest)
Target template: {{star_rating}}

Prompt ID: 1-7
Input template: How will {{user_desc}} rate this product: {{item_title}}? (1 being lowest and 5 being highest)
Target template: {{star_rating}}

Prompt ID: 1-8
Input template: Will {{user_desc}} give a {{star_rating}}-star rating for {{item_title}}? (1 being lowest and 5 being highest)
Target template: {{answer_choices[label]}} (yes/no)

Prompt ID: 1-9
Input template: Does {{user_desc}} like or dislike {{item_title}}?
Target template: {{answer_choices[label]}} (like/dislike) – like (4,5) / dislike (1,2,3)

Prompt ID: 1-10
Input template: Predict {{user_desc}}'s preference towards {{item_title}} (1 being lowest and 5 being highest)
Target template: {{star_rating}}

D.2 Task Family 2: Sequential Recommendation

Prompt ID: 2-1
Input template: Given the following purchase history of user_{{user_id}}:
{{purchase_history}}
predict next possible item to be purchased by the user?
Target template: {{next_item}}

Prompt ID: 2-2
Input template: I find the purchase history list of user_{{user_id}}:
{{purchase_history}}
I wonder which is the next item to recommend to the user. Can you help me decide?
Target template: {{next_item}}

Prompt ID: 2-3

Input template: Here is the purchase history list of user_{{user_id}}:
{{purchase_history}}
try to recommend next item to the user
Target template: {{next_item}}

Prompt ID: 2-4
Input template: Given the following purchase history of {{user_desc}}:
{{purchase_history}}
predict next possible item for the user
Target template: {{next_item}}

Prompt ID: 2-5
Input template: Based on the purchase history of {{user_desc}}:
{{purchase_history}}
Can you decide the next item likely to be purchased by the user?
Target template: {{next_item}}

Prompt ID: 2-6
Input template: Here is the purchase history of {{user_desc}}:
{{purchase_history}}
What to recommend next for the user?
Target template: {{next_item}}

Prompt ID: 2-7
Input template: Here is the purchase history of user_{{user_id}}:
{{purchase_history}}
Select the next possible item likely to be purchased by the user from the following candidates:
{{candidate_items}}
Target template: {{next_item}}

Prompt ID: 2-8
Input template: Given the following purchase history of {{user_desc}}:
{{purchase_history}}
What to recommend next for the user? Select one from the following items:
{{candidate_items}}
Target template: {{next_item}}

Prompt ID: 2-9
Input template: Based on the purchase history of user_{{user_id}}:
{{purchase_history}}
Choose the next possible purchased item from the following candidates:
{{candidate_items}}
Target template: {{next_item}}

Prompt ID: 2-10
Input template: I find the purchase history list of {{user_desc}}:
{{purchase_history}}
I wonder which is the next item to recommend to the user. Try to select one from the following candidates:
{{candidate_items}}
Target template: {{next_item}}

Prompt ID: 2-11
Input template: User_{{user_id}} has the following purchase history:
{{purchase_history}}
Does the user likely to buy {{candidate_item}} next?
Target template: {{answer_choices[label]}} (yes/no)

Prompt ID: 2-12
Input template: According to {{user_desc}}'s purchase history list:
{{purchase_history}}
Predict whether the user will purchase {{candidate_item}} next?
Target template: {{answer_choices[label]}} (yes/no)

Prompt ID: 2-13
Input template: According to the purchase history of {{user_desc}}:
{{purchase_history}}
Can you recommend the next possible item to the user?
Target template: {{next_item}}

D.3 Task Family 3: Explanation Generation

Prompt ID: 3-1
Input template: Generate an explanation for user_{{user_id}} about this product: {{item_title}}
Target template: {{explanation}}

Prompt ID: 3-2
Input template: Given the following review headline
{{review_headline}}
can you help generate an explanation of user_{{user_id}} for item_{{item_id}}?
Target template: {{explanation}}

Prompt ID: 3-3
Input template: Help user_{{user_id}} generate a {{star_rating}}-star explanation about this product:
{{item_title}}
Target template: {{explanation}}

Prompt ID: 3-4
Input template: Generate an explanation for {{user_desc}} about this product: {{item_title}}
Target template: {{explanation}}

Prompt ID: 3-5
Input template: Based on the following review headline:
{{review_headline}}
Generate {{user_desc}}'s purchase explanation about {{item_title}}
Target template: {{explanation}}

Prompt ID: 3-6
Input template: Help {{user_desc}} generate a {{star_rating}}-star explanation for item_{{item_id}}
Target template: {{explanation}}

Prompt ID: 3-7
Input template: Predict the star rating, then use {{feature_word}} as feature word to generate user_{{user_id}}'s purchase explanation for item_{{item_id}}
Target template: {{star_rating}}, {{explanation}}

Prompt ID: 3-8
Input template: What score will {{user_desc}} rate item_{{item_id}}? Then give an explanation for the rating score. (1 being lowest and 5 being highest)
Target template: {{star_rating}}, {{explanation}}

Prompt ID: 3-9
Input template: Based on the feature word {{feature_word}}, generate an explanation for user_{{user_id}} about this product: {{item_title}}
Target template: {{explanation}}

Prompt ID: 3-10

Input template: Given the word {{feature_word}}, can you help generate an explanation for {{user_desc}} about the product: \n {{item_title}}
Target template: {{explanation}}

Prompt ID: 3-11
Input template: Using the word {{feature_word}}, write a {{star_rating}}-star explanation for user_{{user_id}} about item_{{item_id}}
Target template: {{explanation}}

Prompt ID: 3-12
Input template: According to the feature word {{feature_word}}, generate a {{star_rating}}-star explanation for {{user_desc}} about item_{{item_id}}
Target template: {{explanation}}

D.4 Task Family 4: Review Related

Prompt ID: 4-1
Input template: Write a short sentence to summarize the following product review from user_{{user_id}}:
{{review}}
Target template: {{summary}}

Prompt ID: 4-2
Input template: Given the following review written by user_{{user_id}}:
{{review}}
Can you predict the associated star rating (1 being lowest and 5 being highest)?
Target template: {{star_rating}}

Prompt ID: 4-3
Input template: Give a short sentence describing the following product review from {{user_desc}}:
{{review}}
Target template: {{summary}}

Prompt ID: 4-4
Input template: According to the following review written by {{user_desc}}:
{{review}}
Predict the associated star rating (1 being lowest and 5 being highest)
Target template: {{star_rating}}

D.5 Task Family 5: Direct Recommendation

Prompt ID: 5-1
Input template: Will user_{{user_id}} likely to interact with item_{{item_id}}?
Target template: {{answer_choices[label]}} (yes/no)

Prompt ID: 5-2
Input template: Shall we recommend item_{{item_id}} to {{user_desc}}?
Target template: {{answer_choices[label]}} (yes/no)

Prompt ID: 5-3
Input template: For {{user_desc}}, do you think it is good to recommend {{item_title}}?
Target template: {{answer_choices[label]}} (yes/no)

Prompt ID: 5-4
Input template: I would like to recommend some items for user_{{user_id}}. Is the following item a good choice?
{{item_title}}
Target template: {{answer_choices[label]}} (yes/no)

Prompt ID: 5-5
Input template: Which item of the following to recommend for {{user_desc}}?
{{candidate_items}}
Target template: {{target_item}}

Prompt ID: 5-6
Input template: Choose the best item from the candidates to recommend for {{user_desc}}?
{{candidate_items}}
Target template: {{target_item}}

Prompt ID: 5-7
Input template: Pick the most suitable item from the following list and recommend to user_{{user_id}}:
{{candidate_items}}
Target template: {{target_item}}

Prompt ID: 5-8
Input template: We want to make recommendation for user_{{user_id}}. Select the best item from these candidates:
{{candidate_items}}
Target template: {{target_item}}

D.6 Task Family Z: Zero-Shot Generalization

Prompt ID: Z-1
Input template: Given the facts about the new product, do you think user {{user_id}} will like or dislike it? title: {{item_title}} brand: {{brand}} price: {{price}}
Target template: {{answer_choices[label]}} (like/dislike) – like (4,5) / dislike (1,2,3)

Prompt ID: Z-2
Input template: Here are the details about a new product: title: {{item_title}} brand: {{brand}} price: {{price}} What star will {{user_desc}} probably rate the product?
-1 -2 -3 -4 -5
Target template: {{star_rating}}

Prompt ID: Z-3
Input template: Predict user_{{user_id}}'s preference about the new product (1 being lowest and 5 being highest):
title: {{item_title}} price: {{price}} brand: {{brand}}
Target template: {{star_rating}}

Prompt ID: Z-4
Input template: Will {{user_desc}} like or dislike the following product? title: {{item_title}} price: {{price}} brand: {{brand}}
Target template: {{answer_choices[label]}} (like/dislike) – like (4,5) / dislike (1,2,3)

Prompt ID: Z-5
Input template: Generate a possible explanation for {{user_desc}}'s preference about the following product: title: {{item_title}} brand: {{brand}} price: {{price}}
Target template: {{explanation}}

Prompt ID: Z-6
Recommendation as Language Processing: P5, RecSys ’22, September 18–23, 2022, Seattle, WA, USA

Input template: Based on the word {{feature_word}}, help user_{{user_id}} write a {{star_rating}}-star explanation for this new product: title: {{item_title}} price: {{price}} brand: {{brand}}

Target template: {{explanation}}

Prompt ID: Z-7
Input template: For the new product {{item_title}}, we would like to know whether {{user_desc}} will love it. If you think the user will love it, please help explain why.

Target template: {{explanation}}

E FULL LIST OF PERSONALIZED PROMPTS FOR YELP DATASET

E.1 Task Family 1: Rating Prediction

Prompt ID: 1-1
Input template: Which star rating will user_{{user_id}} give item_{{item_id}}? (1 being lowest and 5 being highest)

Target template: {{star_rating}}

Prompt ID: 1-2
Input template: How will user_{{user_id}} rate this business: {{item_title}}? (1 being lowest and 5 being highest)

Target template: {{star_rating}}

Prompt ID: 1-3
Input template: Will user_{{user_id}} give item_{{item_id}} a {{star_rating}}-star rating? (1 being lowest and 5 being highest)

Target template: {{answer_choices[label]}} (yes/no)

Prompt ID: 1-4
Input template: Does user_{{user_id}} like or dislike item_{{item_id}}?

Target template: {{answer_choices[label]}} (like/dislike) – like (4,5) / dislike (1,2,3)

Prompt ID: 1-5
Input template: Predict the user_{{user_id}}'s preference on item_{{item_id}} ({{item_title}})
-1 \n -2 \n -3 \n -4 \n -5

Target template: {{star_rating}}

Prompt ID: 1-6
Input template: What star rating do you think {{user_desc}} will give item_{{item_id}}? (1 being lowest and 5 being highest)

Target template: {{star_rating}}

Prompt ID: 1-7
Input template: How will {{user_desc}} rate this business: {{item_title}}? (1 being lowest and 5 being highest)

Target template: {{star_rating}}

Prompt ID: 1-8
Input template: Will {{user_desc}} give a {{star_rating}}-star rating for {{item_title}}? (1 being lowest and 5 being highest)

Target template: {{answer_choices[label]}} (yes/no)

Prompt ID: 1-9
Input template: Does {{user_desc}} like or dislike {{item_title}}?

Target template: {{answer_choices[label]}} (like/dislike) – like (4,5) / dislike (1,2,3)

Prompt ID: 1-10
Input template: Predict {{user_desc}}'s preference towards {{item_title}} (1 being lowest and 5 being highest)

Target template: {{star_rating}}

E.2 Task Family 2: Sequential Recommendation

Prompt ID: 2-1
Input template: Given the following visit history of user_{{user_id}}:
{{visit_history}}
predict next possible business to be visited by the user?

Target template: {{next_item}}

Prompt ID: 2-2
Input template: I find the visit history list of user_{{user_id}}:
{{visit_history}}
I wonder which is the next item to recommend to the user. Can you help me decide?

Target template: {{next_item}}

Prompt ID: 2-3
Input template: Here is the visit history list of user_{{user_id}}:
{{visit_history}}
try to recommend next item to the user

Target template: {{next_item}}

Prompt ID: 2-4
Input template: Given the following visit history of {{user_desc}}:
{{visit_history}}
predict next possible business for the user

Target template: {{next_item}}

Prompt ID: 2-5
Input template: Based on the visit history of {{user_desc}}:
{{visit_history}}
Can you decide the next business likely to be visited by the user?

Target template: {{next_item}}

Prompt ID: 2-6
Input template: Here is the visit history of {{user_desc}}:
{{visit_history}}
What to recommend next for the user?

Target template: {{next_item}}

Prompt ID: 2-7
Input template: Here is the visit history of user_{{user_id}}:
{{visit_history}}
Select the next possible business likely to be visited by the user from the following candidates:
{{candidate_items}}

Target template: {{next_item}}

Prompt ID: 2-8
Input template: Given the following visit history of {{user_desc}}:
{{visit_history}}
What to recommend next for the user? Select one from the following items:
{{candidate_items}}

Target template: {{next_item}}

Prompt ID: 2-9
Input template: Based on the visit history of user_{{user_id}}:
{{visit_history}}
Choose the next possible visited business from the following candidates:
{{candidate_items}}

Target template: {{next_item}}

Prompt ID: 2-10
Input template: I find the visit history list of {{user_desc}}:
{{visit_history}}
I wonder which is the next item to recommend to the user. Try to select one from the following candidates:
{{candidate_items}}

Target template: {{next_item}}

Prompt ID: 2-11
Input template: User_{{user_id}} has the following visit history:
{{visit_history}}
Does the user likely to visit {{candidate_item}} next?

Target template: {{answer_choices[label]}} (yes/no)

Prompt ID: 2-12
Input template: According to {{user_desc}}'s visit history list:
{{visit_history}}
Predict whether the user will visit {{candidate_item}} next?

Target template: {{answer_choices[label]}} (yes/no)

Prompt ID: 2-13
Input template: According to the visit history of {{user_desc}}:
{{visit_history}}
Can you recommend the next possible business to the user?

Target template: {{next_item}}

E.3 Task Family 3: Explanation Generation

Prompt ID: 3-1
Input template: Generate an explanation for user_{{user_id}} about this business: {{item_title}}

Target template: {{explanation}}

Prompt ID: 3-2
Input template: Help user_{{user_id}} generate a {{star_rating}}-star explanation about this business: {{item_title}}

Target template: {{explanation}}

Prompt ID: 3-3
Input template: Generate an explanation for {{user_desc}} about this business: {{item_title}}

Target template: {{explanation}}

Prompt ID: 3-4
Input template: Help {{user_desc}} generate a {{star_rating}}-star explanation for item_{{item_id}}

Target template: {{explanation}}

Prompt ID: 3-5
Input template: Predict the star rating, then use {{feature_word}} as feature word to generate user_{{user_id}}'s visit explanation for item_{{item_id}}

Target template: {{star_rating}}, {{explanation}}

Prompt ID: 3-6
Input template: What score will {{user_desc}} rate item_{{item_id}}? Then give an explanation for the rating score. (1 being lowest and 5 being highest)

Target template: {{star_rating}}, {{explanation}}

Prompt ID: 3-7
Input template: Based on the feature word {{feature_word}}, generate an explanation for user_{{user_id}} about this business: {{item_title}}

Target template: {{explanation}}

Prompt ID: 3-8
Input template: Given the word {{feature_word}}, can you help generate an explanation for {{user_desc}} about the business: \n {{item_title}}

Target template: {{explanation}}

Prompt ID: 3-9
Input template: Using the word {{feature_word}}, write a {{star_rating}}-star explanation for user_{{user_id}} about item_{{item_id}}

Target template: {{explanation}}

Prompt ID: 3-10
Input template: According to the feature word {{feature_word}}, generate a {{star_rating}}-star explanation for {{user_desc}} about item_{{item_id}}

Target template: {{explanation}}

E.4 Task Family 4: Review Related

Prompt ID: 4-1
Input template: Predict the associated rating score of the review written by user_{{user_id}} (1 being lowest and 5 being highest):
{{review}}

Target template: {{star_rating}}

Prompt ID: 4-2
Input template: Given the following review written by user_{{user_id}}:
{{review}}
Can you predict the associated star rating (1 being lowest and 5 being highest)?

Target template: {{star_rating}}

Prompt ID: 4-3
Input template: According to the following review written by {{user_desc}}:
{{review}}
Predict the associated star rating (1 being lowest and 5 being highest)

Target template: {{star_rating}}

E.5 Task Family 5: Direct Recommendation

Prompt ID: 5-1
Input template: Will user_{{user_id}} likely to interact with item_{{item_id}}?

Target template: {{answer_choices[label]}} (yes/no)

Prompt ID: 5-2
Input template: Shall we recommend item_{{item_id}} to {{user_desc}}?

Target template: {{answer_choices[label]}} (yes/no)

Prompt ID: 5-3
Input template: For {{user_desc}}, do you think it is good to recommend {{item_title}}?

Target template: {{answer_choices[label]}} (yes/no)

Prompt ID: 5-4
Input template: I would like to recommend some items for user_{{user_id}}. Is the following item a good choice?
{{item_title}}

Target template: {{answer_choices[label]}} (yes/no)

Prompt ID: 5-5

Input template: Which item of the following to recommend for {{user_desc}}?
{{candidate_items}}

Target template: {{target_item}}

Prompt ID: 5-6
Input template: Choose the best item from the candidates to recommend for {{user_desc}}?
{{candidate_items}}

Target template: {{target_item}}

Prompt ID: 5-7
Input template: Pick the most suitable item from the following list and recommend to user_{{user_id}}:
{{candidate_items}}

Target template: {{target_item}}

Prompt ID: 5-8
Input template: We want to make recommendation for user_{{user_id}}. Select the best item from these candidates:
{{candidate_items}}

Target template: {{target_item}}
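For concreteness, each template above is instantiated by substituting the double-brace fields with per-example values before the text is fed to the model. The helper below is an illustrative sketch, not part of P5's released code; in particular, flattening {{visit_history}} into space-separated item tokens is an assumption made for this example.

```python
import re

def fill_template(template: str, fields: dict) -> str:
    """Replace every {{name}} placeholder with its value from `fields`."""
    return re.sub(r"\{\{(\w+)\}\}", lambda m: str(fields[m.group(1)]), template)

# Yelp sequential-recommendation prompt 2-1, written on one line for brevity
template = (
    "Given the following visit history of user_{{user_id}}: "
    "{{visit_history}} predict next possible business to be visited by the user?"
)
fields = {
    "user_id": 42,
    # visit history flattened to space-separated item tokens (an assumption)
    "visit_history": " ".join(f"item_{i}" for i in [13, 25, 7]),
}
print(fill_template(template, fields))
# -> Given the following visit history of user_42: item_13 item_25 item_7
#    predict next possible business to be visited by the user?
```

The same substitution applies to every family; only the set of available fields (e.g. {{review}}, {{feature_word}}, {{candidate_items}}) changes per task.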

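Several prompts above target {{answer_choices[label]}}, with the annotation "like (4,5) / dislike (1,2,3)" defining how a 1-5 star rating is binarized into the answer label. A minimal sketch of that mapping follows; the variable names and the ordering of the choice list are assumptions for illustration, not taken from the paper.

```python
# Mirrors the {{answer_choices[label]}} placeholder: a list of verbalized
# choices indexed by a label derived from the star rating.
answer_choices = ["dislike", "like"]  # assumed ordering

def label_from_rating(star_rating: int) -> int:
    # Per the templates above: like for 4-5 stars, dislike for 1-3 stars
    return 1 if star_rating >= 4 else 0

for rating in (2, 5):
    print(rating, answer_choices[label_from_rating(rating)])
# prints:
# 2 dislike
# 5 like
```

The yes/no prompts (e.g. 5-1 through 5-4) work the same way, with the label instead indicating whether the item is a positive or sampled negative example.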