State Graph Reasoning For Multimodal Conversational Recommendation
Abstract—Conversational recommendation systems (CRS) attract increasing attention in various application domains such as retail and travel. They offer an effective way to capture users' dynamic preferences through multi-turn conversations. However, most current studies center on the recommendation aspect while over-simplifying the conversation process. This neglect of the complexity in data structure and conversation flow hinders their practicality and utility. In reality, various relationships exist among slots and values, and users' requirements may dynamically adjust or change. Moreover, the conversation often involves the visual modality. These characteristics call for a more advanced internal state representation of the dialogue and a proper reasoning scheme to guide the decision-making process. In this paper, we explore multiple facets of multimodal conversational recommendation and address the above-mentioned challenges. In particular, we represent the structured back-end database as a multimodal knowledge graph which captures the various relations and evidence in different modalities. The user preferences expressed via conversation utterances are then gradually updated to the state graph with clear polarity. Based on these, we train an end-to-end State Graph-based Reasoning model (SGR) to perform reasoning over the whole state graph. The prediction of our proposed model benefits from the structure of the graph: it not only allows zero-shot reasoning for items unseen in training conversations, but also provides a natural way to explain the policies. Extensive experiments show that our model achieves better performance compared with existing methods.

Index Terms—Recommendation systems, conversation, knowledge graph

This work was supported in part by the scholarship from China Scholarship Council (CSC) under Grant 202006280325; in part by the NSFC, China under Grants 61902309, 61701391, and 61772407; in part by ShaanXi Province under Grant 2018JM6092; in part by the Fundamental Research Funds for the Central Universities, China (xxj022019003); in part by China Postdoctoral Science Foundation (2020M683496); in part by the National Postdoctoral Innovative Talents Support Program, China (BX20190273); and in part by the Science and Technology Program of Xi'an, China under Grant 21RGZN0017.
Yuxia Wu, Guoshuai Zhao and Xueming Qian are with the Ministry of Education Key Laboratory for Intelligent Networks and Network Security, Xi'an Jiaotong University (e-mail: [email protected]; [email protected]; [email protected]).
Lizi Liao is with the Singapore Management University (e-mail: [email protected]).
Gangyi Zhang is with the University of Science and Technology of China (e-mail: [email protected]).
Wenqiang Lei and Tat-Seng Chua are with the National University of Singapore (e-mail: [email protected]; [email protected]).
Lizi Liao and Xueming Qian are the corresponding authors.

I. INTRODUCTION

Conversational recommendation systems (CRS) have become an emerging research topic in information seeking. They integrate the strengths of recommendation systems and conversation techniques. In general, recommendation systems predict users' preferences towards items by analyzing their past behaviors such as click history, visit logs, and ratings on items, and they are widely applied in many domains. With the help of multi-turn conversations, the system can further capture the detailed and dynamic preferences of users, which may lead to better recommendation results and user experience [28].

There have been many efforts centered on integrating conversation modelling into recommendation systems. From a broader perspective, pioneering works emerged from tag-based interaction between users and systems, where the interaction is mainly realized by tags [5]. To further improve the convenience of the developed systems, more efforts focus on multi-turn conversations with natural language as both the system's input and output [16, 32, 41]. Generally speaking, existing methods have emphasized three broad directions: a) since the key advantage of CRS is being able to ask questions, a line of studies works on learning to ask appropriate attributes/topics/categories of items to narrow down the candidate items [5, 38, 39, 42]; b) another line of efforts targets learning a better strategy for making successful recommendations with fewer turns of interaction [14, 15, 32]; c) there are also works that further delve into in-depth dialogue understanding and response generation [4, 16, 21, 37, 41].

However, these current methods have several shortcomings. First, most of them directly generate responses via action prediction and entity linking, while ignoring the various relationships among slot values and their relation to the explicit dialogue state representations [14, 16], as shown in Fig. 1(a). This leads to a less informative representation of user preference and thus harms the recommendation performance. For example, when a user wants to find a cheap item, it indicates that the user has negative preference for the other price values. Second, the modelling of dynamic user preference change is relatively weak. Although some efforts try to achieve this with graph-based methods [7, 15, 23, 37], as shown in Fig. 1(b), they fail to represent the dynamic change in an effective way. For example, the model in [15] simply deleted the mentioned attributes or items negated in historical turns without updating any user or item representations. Meanwhile, the model in [37] only considered the most updated user requirements for each slot without remembering any former denied ones [7]. Last but not least, most existing works focus on textual conversations. However, there is a growing demand for multimodal conversations to facilitate recommendation in domains like e-commerce retail and travel. Although there are some initial works [19, 25, 28], the over-simplified usage of image information hinders the image modality's contribution to conversational recommendation.

Fig. 1: (a) Corpus-based methods [14, 16]; (b) Graph-based methods [7, 15, 23, 37]; (c) Our method.

To address the above-mentioned issues, we focus on a new scheme, as shown in Fig. 1(c). We explore the complex relationships among slots and values in the back-end item database and represent the structured data as a multimodal knowledge graph. The graph is then updated with the user preferences represented in the dialogue states as the conversation goes on. Based on this, the actions to take, such as recommendation or further inquiry, can be generated by performing reasoning on the state graph. Specifically, the state graph is initiated as a signed graph containing positive and negative links to model the complicated relationships among items, slots and values, as well as the rich modalities of information in the back-end database. We then gradually update the state graph to capture the dynamic user requirements based on the user intention harvested from the conversation history. The state graph is updated in an explicit way by adding, deleting or changing the links between the user and other nodes based on the dynamic user preference. Basically, the state graph keeps track of the conversation progress and serves as a base for reasoning about the entity ranking of the nodes. We then train an end-to-end graph reasoning model, SGR, to perform inference that explicitly differentiates between positive and negative user preferences. Since the prediction results inherit the graph structure, the model caters for cold-start venues and provides natural explanations.

We summarize our contributions as follows:
• We explicitly model users' dynamic preferences and integrate them with a multimodal knowledge graph for better state representation. The update of the graph reflects the real change of users' preferences based on both the textual and visual modalities.
• We design a state graph reasoning model to capture the various evidence in the multimodal knowledge graph and generate more accurate agent behavior predictions.
• Extensive experiments demonstrate the effectiveness of our proposed method. Qualitative results also show that the proposed method not only handles zero-shot situations well but also offers good explainability.

II. RELATED WORK

Our work is closely related to three lines of research: conversational recommendation, graph reasoning and multimodal knowledge graphs. Here, we briefly discuss the connections between these lines of research and emphasize the research gap targeted by this work.

A. Conversational Recommendation

Conversational recommendation aims at providing interactive recommendations through dialogues. Compared with traditional static recommender systems, it has the advantage of capturing users' dynamic preferences from the multi-turn utterances [13].

The existing works on conversational recommender systems fall into three broad categories. One line of efforts was largely question driven. They focused on learning to ask attributes/topics/categories of items to reduce the search space [5, 38, 39, 42]. For example, the Multi-Memory Network (MMN) [39] was a unified model integrating query/item representation learning and conversational search/recommendation. It learned user preference by asking questions. However, it did not contain any special policy network to decide when to ask or recommend. Another line of studies targeted better strategies for making successful recommendations in fewer turns. For instance, the Conversational Recommender Model (CRM) [32] applied reinforcement learning to decide when to ask or recommend items based on the user's current preference learned by a belief tracker. However, it would only recommend once, which would fail if the user gave negative feedback on the recommendation. To solve these problems, Estimation–Action–Reflection (EAR) [14] was proposed to learn user preferences and rank the items and attributes. A policy network was applied to determine whether to ask further attributes or recommend items. The model would also be updated when the user rejects the recommendations. Later on, the authors further devised an interactive path reasoning mechanism to help the ranking of items and attributes [15]. Beyond emphasizing the recommendation part, the third line of efforts delved into in-depth dialogue understanding and response generation [4, 16, 21, 37, 41]. For example, a knowledge graph was introduced in Knowledge-Based Recommender Dialog (KBRD) [4] to bridge the recommender system and the dialogue system in an end-to-end manner. The model linked the entities in the dialogue history to the external knowledge graph to enhance the representation of users' preferences. Then the dialogue system can generate responses that are consistent with users' interests. Similarly, the model in [40] incorporated both word-based and entity-based knowledge graphs (KG) to enhance the semantic representations in CRS. However, these models lack dialogue state management. In our work, we incorporate the knowledge graph into our internal dialogue state representation and perform reasoning on it to yield better results.
B. Graph Reasoning

With knowledge graph structures, there have been many efforts to perform reasoning over the graph to enhance conversational recommendation performance. Graph reasoning has been successfully applied to many tasks such as social network analysis, question answering and recommendation. For conversational recommendation, Open-ended Dialog KG (OpenDialKG) [23] learns the optimal path of the dialogue context within a large common-sense KG for an open-ended dialogue system. The system asks questions about the attributes of items and also chats with the users. It applies an attention-based graph decoder to rank the candidate entities from the KG. During path walking, it prunes unattended paths to effectively reduce the search space. Simple Conversational Path Reasoning (SCPR) [15] is an extension of EAR which utilizes the user's attribute feedback explicitly and converts conversational recommendation into an interactive path reasoning problem over the graph. It relies on users' historical interaction records to learn user and item representations by offline training. Besides, it only considers the one-hop neighbor entities of the attributes or items while ignoring the different slots and values of the items.

To capture users' dynamic preferences, researchers proposed user memory reasoning for conversational recommendation [37]. They constructed a user memory graph from users' past preferences and their current requests during the conversation. When updating the memory graph, they only considered the sentiment relations. A Relational Graph Convolutional Network (R-GCN) was applied to learn the hidden state of each entity and predict the dialogue action. More recently, an adaptive reinforcement learning framework, namely the UNIfied COnversational RecommeNder (UNICORN), was proposed in [7]. The model integrated three separate decision-making processes in conversational recommender systems as a unified policy learning problem. A dynamic weighted graph was designed to capture the sequential information of the dialogue history, which is beneficial for learning the user's preference on items. However, it only uses weighted entropy to select the candidate slots and values, similar to SCPR.

Our work is close to these works but has several key differences. First, the existing works ignore the complicated relations among items, slots and values. The inter-connections among them also provide evidence for reasoning. For instance, when a user prefers a cheap venue, this signals negative tendencies towards the venues which are connected to other price categories. To this end, we propose a signed Graph Convolutional Network (GCN) based method to better model the polarity in user preferences. Second, we update the state graph to model the dynamic change of user preferences in an explicit way. Our model performs reasoning over the global state graph instead of a local one as in [15].

C. Multimodal Knowledge Graph

As we incorporate multimodal information into a knowledge graph to perform reasoning, our work is also closely related to the works on Multimodal Knowledge Graphs (MMKG). An MMKG integrates multimodal data (such as images and texts) into the knowledge graph and treats the image or text as an entity or an attribute of an entity [20, 33]. In general, MMKG representation learning can be divided into two categories: feature-based methods [24, 36] and entity-based methods [26]. The former treats the visual information as features of entities. These methods modify the TransE model [2] to integrate the visual features of entities. However, this kind of method requires each entity to provide visual information, which is not suitable for many tasks. The latter [26] constructs the MMKG by adding extra relations to the original KG, such as hasImage and hasDescription. The multimodal information can be aggregated into its neighbor entity, and a GCN or R-GCN is then applied to learn the representations of the entities. For instance, the researchers in [34] introduced a Knowledge-driven Multimodal Graph Convolutional Network (KMGCN) to model the semantic representations of textual information, knowledge concepts and visual information for fake news detection. [31] was the first work that incorporated an MMKG into recommender systems. In that work, a multimodal knowledge graph attention network was proposed to learn the representations of entities. The experiments demonstrated that multimodal features outperform any single-modal features. However, the application of MMKG in conversational recommendation is currently under-explored. In this work, we aim to make use of the MMKG to boost conversational recommendation performance.

III. METHODOLOGY

The overall framework is illustrated in Fig. 2. The proposed SGR model starts from 1) constructing an MMKG. Then, for each dialogue, it 2) updates the MMKG-based state graph turn by turn; and 3) reasons over the state graph for detailed decision making.

Fig. 2: The overall architecture of the proposed SGR model. It first builds the initial MMKG to capture the back-end database knowledge. Then, based on the gradually updated state graph during the conversation process, it performs reasoning over the state graph to generate agent action decisions.

Specifically, as each conversation begins and goes on, we update the state graph gradually to introduce the user preferences expressed in both the textual and visual modalities. The update module includes add, change and negate operations, which help to capture the user's dynamically changing requirements in a convenient and explicit way. Based on the up-to-date state graph, we conduct reasoning over it via signed graph convolutional neural networks. This integrates evidence from the inter-connections of nodes and captures the preference polarities of the user. Guided by the detailed intent actions predicted via a pre-trained GPT-2 model, the corresponding entities such as slots, values or venues are then ranked via the learned node representations. In what follows, we introduce these modules in detail.

A. Constructing MMKG

Different from current dialogue research that queries a database for the target, we aim at building a graph structure to capture various information for reasoning about the target.
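To make the intended structure concrete, the following is a minimal sketch of such a signed back-end graph, assuming a toy venue database; the schema, helper names and the `EXCLUSIVE_SLOTS` set are illustrative, not the authors' actual implementation.

```python
# Sketch: building a signed knowledge graph from a toy venue database.
# Each venue gets a positive link to its own slot-value pair and, for
# mutually exclusive slots, negative links to the sibling values.

EXCLUSIVE_SLOTS = {"price"}  # slots whose values are mutually exclusive

def build_mmkg(venues, slot_values):
    """venues: {venue: {slot: value}}; slot_values: {slot: [all values]}.
    Returns a signed edge map {(venue, (slot, value)): +1 or -1}."""
    edges = {}
    for venue, attrs in venues.items():
        for slot, value in attrs.items():
            edges[(venue, (slot, value))] = +1
            if slot in EXCLUSIVE_SLOTS:
                # negative links to the other values of an exclusive slot
                for other in slot_values[slot]:
                    if other != value:
                        edges[(venue, (slot, other))] = -1
    return edges

venues = {"Mellben Seafood": {"price": "expensive", "area": "orchard"}}
slot_values = {"price": ["cheap", "moderate", "expensive"],
               "area": ["orchard"]}
g = build_mmkg(venues, slot_values)
print(g[("Mellben Seafood", ("price", "expensive"))])  # 1
print(g[("Mellben Seafood", ("price", "cheap"))])      # -1
```

An image node would be attached in the same way, as an extra node linked to its venue, following the entity-based MMKG construction.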
Hence, we expect it to have the following characteristics: 1) it should have the ability to capture different modalities, such as texts and images; 2) it should be able to represent various relationships in the back-end database, such as exclusive relations among values of the same slot; and 3) we also expect it to be able to handle users' preference polarities such as like or dislike.

To achieve this, we construct a general MMKG to represent the back-end database. Formally, we represent the graph as G = (V, γ), where V denotes the set of nodes and γ represents the links in the graph. Generally, the nodes represent the items and the attributes in the database. The attributes can be textual terms or images belonging to the items. To cover different modalities, we follow the entity-based MMKG method [26] and treat the images as nodes in G. The links connect the items and their attributes. They can be of different types, indicating different relationships among attributes. The links can also represent different polarities, i.e., positive or negative. For example, for the scenario of conversational recommendation on the task of finding places, the nodes are the venues and the slot-value pairs of these venues (here the venues refer to the aforementioned items, and the slot-value pairs refer to attributes such as price-cheap). There exist various relationships among the values of the same slot: 1) slots containing mutually exclusive values (such as price with value candidates cheap, moderate and expensive); 2) slots without mutually exclusive values (such as has image). To represent the exclusive relationships, we update the link combinations between the venue and the slot-values. For example, if the attributes of a venue contain price-cheap, then there will be a positive link between the venue and price-cheap and negative links between the venue and the other price-value candidates.

B. Updating MMKG-based State Graph

As the conversation begins and goes on, it is essential for the agent to understand the user's intention in the textual or visual modality. We thus transform the dialogue states (such as inform-price-expensive, which represents the action, slot and value, respectively) into the state graph initiated by the MMKG and update it gradually as the conversation goes on. We add a node to represent the current user and use signed links to denote the user's preference polarity towards the entities in the MMKG. Note that our constructed MMKG contains information in both modalities, and the user dialogues are also flexible in modality usage. To give a clear view of how the state graph is updated, we illustrate the process for each information modality separately.

1) Update Textual Slots: The textual information conveys users' requirements and preferences. It is natural for users to change their requirements as the conversation goes on. Therefore, we update the state graph dynamically with three operations: add, change and negate. When the user provides new requirements (see turn 2 in Fig. 3), we add signed links between the user and the corresponding slot-value nodes. When the user changes requirements, we also change the links accordingly. It can be seen that the state graph can effectively reflect users' dynamic preferences in an explicit way, and it is convenient to change the links based on the dialogue state. By adding signed edges between the user and the attributes, we can represent users' likes and dislikes clearly in the state graph.

2) Update Intention via Image: In our multimodal conversational recommendation task, users often offer images to express their intention conveniently. To understand users' intention based on images, we apply a layer-by-layer taxonomy-based ResNet [10] classifier to learn the visual features of the images [17]. To update the user's intention into the state graph, we compute the cosine similarities between the user-provided images and the images in the MMKG. If there exist images in the MMKG with similarity scores exceeding a pre-defined threshold, we update the state graph by adding a positive link between the user and that image in the MMKG (see turn 3 in Fig. 3). In this way, the user is quickly connected to a specific venue. Also, if the user negates an image of a venue provided by the agent, we add negative links between the user and the very similar images in the MMKG. Then the related venue node would easily receive negative tendencies from the user via the links.
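The update operations described in this subsection can be sketched as follows; the class and method names, the action labels and the similarity threshold are illustrative assumptions, not the authors' implementation.

```python
# Sketch of the state-graph update step: signed user links are added,
# changed or negated from dialogue-state tuples, and user-provided image
# features attach to MMKG image nodes by cosine similarity.
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

class StateGraph:
    def __init__(self):
        self.user_links = {}  # node -> +1 (like) / -1 (dislike)

    def update_text(self, action, slot, value):
        node = (slot, value)
        if action == "inform":      # add, or change by overwriting
            self.user_links[node] = +1
        elif action == "negate":    # explicit dislike
            self.user_links[node] = -1

    def update_image(self, user_feat, mmkg_images, threshold=0.9, polarity=+1):
        # link the user to every sufficiently similar MMKG image node
        for img_id, feat in mmkg_images.items():
            if cosine(user_feat, feat) >= threshold:
                self.user_links[img_id] = polarity

g = StateGraph()
g.update_text("inform", "price", "cheap")
g.update_text("inform", "price", "expensive")  # changed requirement
g.update_text("negate", "price", "cheap")
g.update_image([1.0, 0.0], {"img_7": [0.99, 0.05]})
print(g.user_links[("price", "expensive")])  # 1
print(g.user_links[("price", "cheap")])      # -1
```

Setting `polarity=-1` in `update_image` covers the case where the user rejects an image shown by the agent, so that the linked venue receives the negative tendency.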
Fig. 3: An example of state graph updating across three turns, showing the signed user links and scores over the candidate nodes cheap, orchard, good and expensive. Utterances: "Hi, I am looking for a restaurant in orchard road." / "What price range do you prefer?"; "I would like expensive ones." / "Is there anything else you would like?"; "I want to try something like this." / "How about the Mellben Seafood?"

C. Reasoning over State Graph

To facilitate the description of the following modules, we define B_t = {b_1, b_2, ..., b_m} as the belief states of turn t. B_t summarizes the dialogue history up to the current turn t. Each state consists of tuples {a, s, v}, where a is the action, s is the slot and v is the value.

For turn t, given the historical dialogue state B_t and the previous state graph G_{t-1} of turn t-1, the target of our model is to predict a series of tuples Y = {y_1, y_2, ..., y_n}, where y_i = {a_i, s_i, v_i}. If a_i = request, v_i is set to null. If a_i = recommend, s_i is set to null. We first predict the actions and then obtain the detailed arguments of the corresponding dialogue actions. We introduce the details of each step in the following sections.

1) Predicting Dialogue Actions: The dialogue actions of the agent have a strong dependence on B_t. Many existing works apply a classification model to predict the dialogue act for each turn. However, this oversimplifies the conversational recommendation scenario, as demonstrated in [18]: human agents often perform more than one action in a single turn. As it is unrealistic to pre-define the number of actions to perform in each turn, we cast action prediction as a sequence generation task, in which the model automatically decides how many actions to perform in one turn based on the context.

With the development of NLP techniques, various pre-trained models have emerged, such as BERT, Transformer and GPT-2 [27]. Inspired by SimpleTOD [12], we recast the action prediction task as a sequence-to-sequence generation problem where B_t is treated as the input and the target actions are treated as the output. The model learns to generate the end token automatically once it is confident enough. The pre-trained GPT-2 model is leveraged.

To adapt our input B_t to the GPT-2 model, we first transfer the dialogue states B_t into a sequence containing a list of triplets: x = "<|belief|> a_1, s_1, v_1; ...; a_m, s_m, v_m <|endofbelief|>". The output is also a sequence: y = "<|action|> a_1, ..., a_n <|endofaction|>". A training sequence is the concatenation [x; y] of x and y. We denote it as g = (g_1, g_2, ..., g_{|g|}). Given a training instance like this, the joint probability of the sequence is calculated as:

p(g) = \prod_{i=1}^{|g|} p(g_i \mid g_{<i}).

Given a dataset with |K| training instances, the loss for training the generator (a neural network with parameters θ) is the negative log-likelihood over the whole training data. We aim to minimize the loss as follows:

L_A = -\sum_{k=1}^{|K|} \sum_{i=1}^{n_k} \log p_\theta(g_i^k \mid g_{<i}^k),

where n_k is the length of the instance g^k.

¹ Here the user-provided image is used for linking the similar image in the MMKG. On the other side, the content of the user-provided image is captured in action prediction.

2) Reasoning for Details: After the action prediction module, the model predicts a set of actions A = (a_1, a_2, ..., a_n). The next step is to enrich these predicted actions with detailed slots and values. In detail, if a_i is inform, then the top-1 slot-value pair (s_i, v_i) is selected to yield a detailed tuple (inform, s_i, v_i). When a_i is request, the tuple will be (request, s_i, null), where s_i is the top-ranked slot. Similarly, when a_i is recommend, the tuple will be (recommend, null, v_i), where v_i is the top-ranked venue.

To do reasoning, we first learn the node representations of the state graph. Considering that the graph has both positive and negative links, it is not suitable to apply a traditional GCN. To deal with this problem, researchers have carried out extensive explorations and proposed the signed GCN [8]. To properly integrate the positive and negative tendencies during the aggregation process, we leverage multiple layers of signed GCN [8] over the graph and obtain the hidden representations of all nodes.

In the signed state graph, each node i has two kinds of neighbors: positive-linked neighbors N_i^+ and negative-linked neighbors N_i^-. It is not sufficient to learn one single representation for each node as in a traditional unsigned GCN. Thus we maintain two kinds of representations for each node. We define h_i^P and h_i^Q as the positive and negative representations of node i, respectively. The representations are aggregated layer by layer. To incorporate the signed information during aggregation, we follow the balance theory mentioned in [8]. The aggregation of the neighbor information comes from two parts: the information from the neighbors N_i^+ and the information from the neighbors N_i^-. The detailed aggregation process is as follows:

h_i^{P(l)} = \sigma\Big(W^{P(l)}\Big[\sum_{j \in N_i^+} \frac{h_j^{P(l-1)}}{|N_i^+|}, \sum_{k \in N_i^-} \frac{h_k^{Q(l-1)}}{|N_i^-|}, h_i^{P(l-1)}\Big]\Big),

h_i^{Q(l)} = \sigma\Big(W^{Q(l)}\Big[\sum_{j \in N_i^+} \frac{h_j^{Q(l-1)}}{|N_i^+|}, \sum_{k \in N_i^-} \frac{h_k^{P(l-1)}}{|N_i^-|}, h_i^{Q(l-1)}\Big]\Big),

where h_i^{P(l)} and h_i^{Q(l)} are the positive and negative representations at layer l, respectively, and W^{P(l)} and W^{Q(l)} are the linear transformation matrices. The aggregation starts from h_i^0, the initial representation of node i.

The final representation of node i is the concatenation of the positive and negative representations:

h_i^l = [h_i^{P(l)}, h_i^{Q(l)}].

Then we calculate the loss L_H of node representation learning as in [8]. The loss is designed to capture the relationships among the nodes. We construct a set M containing triplets (i, j, z), where z ∈ {+, -, ?} denotes a positive link, a negative link and no link, respectively. For each pair of linked nodes (i, j), we sample a non-linked node k. The first term is a weighted multinomial logistic regression (MLG) classifier that classifies the relationship z ∈ {+, -, ?} of two nodes. The second term guarantees that the distance between positive-linked nodes is smaller than that between non-linked nodes, and that the distance between non-linked nodes is smaller than that between negative-linked nodes.
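The two-role aggregation described above can be sketched as follows. This is a toy, dependency-free rendering under stated simplifications: the learned maps W^P, W^Q and the nonlinearity σ are omitted, so a new embedding is just the concatenation of its three aggregated parts.

```python
# One balance-theory aggregation layer of a signed GCN (in the style of
# [8]): every node keeps a "positive" (P) and a "negative" (Q) embedding,
# and negative edges route the neighbor's opposite-role embedding.

def mean(vectors, dim):
    # mean-pool a list of equal-length vectors; zero vector if empty
    if not vectors:
        return [0.0] * dim
    return [sum(v[d] for v in vectors) / len(vectors) for d in range(dim)]

def sgcn_layer(P, Q, pos, neg):
    dim = len(next(iter(P.values())))
    newP, newQ = {}, {}
    for i in P:
        # positive role: P of positive neighbors, Q of negative neighbors, self
        newP[i] = (mean([P[j] for j in pos.get(i, [])], dim)
                   + mean([Q[k] for k in neg.get(i, [])], dim)
                   + P[i])
        # negative role: Q of positive neighbors, P of negative neighbors, self
        newQ[i] = (mean([Q[j] for j in pos.get(i, [])], dim)
                   + mean([P[k] for k in neg.get(i, [])], dim)
                   + Q[i])
    return newP, newQ

P = {"u": [1.0], "a": [2.0], "b": [3.0]}
Q = {"u": [0.5], "a": [0.2], "b": [0.1]}
newP, newQ = sgcn_layer(P, Q, pos={"u": ["a"]}, neg={"u": ["b"]})
h_u = newP["u"] + newQ["u"]  # final node representation [h^P, h^Q]
print(newP["u"])  # [2.0, 0.1, 1.0]
```

Note how the negative edge from u to b feeds b's opposite-role embedding into u, which is exactly how the balance theory enters the aggregation.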
L_H = -\frac{1}{|M|}\sum_{(i,j,z)\in M} w_z \log \frac{\exp([h_i; h_j]\,\theta_z^{MLG})}{\sum_{q\in\{+,-,?\}} \exp([h_i; h_j]\,\theta_q^{MLG})}
    + \lambda\Big[\frac{1}{|M_{(+,?)}|}\sum_{f_{ijk}\in M_{(+,?)}} \max\big(0, \|h_i - h_j\|_2^2 - \|h_i - h_k\|_2^2\big)
    + \frac{1}{|M_{(-,?)}|}\sum_{f_{ijk}\in M_{(-,?)}} \max\big(0, \|h_i - h_k\|_2^2 - \|h_i - h_j\|_2^2\big)\Big]
    + Reg(\theta_W, \theta_{MLG}),

where w_z denotes the weight of class z, and θ_W and θ_{MLG} are the parameters of the signed GCN and the MLG classifier. f_{ijk} denotes the node triplets (i, j, k) in M_{(+,?)} and M_{(-,?)}, which are the sets of paired nodes consisting of the linked nodes (i, j) and the non-linked nodes (i, k). Reg stands for the regularization of the parameters.

After obtaining the node representations, we compute the ranking scores of the corresponding slots and values. For the ranking of slots S, considering that there are only venues and slot-value pairs in our graph, we compute the score of each slot by aggregating the scores of the slot-value pairs belonging to that slot and use softmax to obtain the normalized scores over all slots. For the ranking of slot-values and venues, we apply a multi-layer perceptron (MLP) on the concatenation of the representations of the user and the corresponding nodes in the MMKG. We denote the ranking scores of slots, slot-values and venue names as y_S, y_V and y_C, respectively, computed as follows:

y_S = Softmax\Big(\sum_{v \in S_i} y_v\Big),
y_V = Sigmoid(MLP([h_u, h_V])),
y_C = Sigmoid(MLP([h_u, h_C])),

where MLP is the multi-layer perceptron, and h_u, h_V and h_C are the representations of the user, the slot-values and the venues.

We apply cross entropy to calculate the loss function for ranking:

L_y = -\sum y^* \log y,

where y denotes the ranking result (e.g., y_S, y_V, y_C) and y^* is the corresponding ground-truth result. Finally, we obtain the total ranking loss as L_R = δ_S L_S + δ_V L_V + δ_C L_C, where δ_S, δ_V and δ_C denote the balancing coefficients, and L_S, L_V and L_C are the losses of the slots, slot-values and venues.

3) Joint Training: For joint training, the total loss of the node representation learning and graph reasoning is as follows:

L = αL_H + βL_R,    (2)

where L_H and L_R represent the losses of node representation learning and ranking, respectively, and α and β are the coefficients.

IV. EXPERIMENTS

In this section, we evaluate our proposed model. To better explain and analyze our model, the following questions guide the experiments:
• RQ1. How does our method perform compared with state-of-the-art conversational recommendation methods?
• RQ2. Is our model robust to different settings, and which design of our model has more significant effects?
• RQ3. Can our model handle zero-shot scenarios and provide explanations for decision making?

A. Experiments Setup

1) Dataset: Several multimodal dialogue datasets have been contributed. However, to the best of our knowledge, most of them are not suitable for our conversational recommendation with MMKG scenario. For example, MMD [28] comes with no dialogue state or dialogue act annotation. MDMMD++ [9] looks promising, but the dataset is not publicly available yet. SIMMC [22] comes with state and action annotations, but it is not a recommendation setting. Fortunately, Liao et al. [18] propose a fully annotated task-oriented Multimodal Multi-domain Conversational dataset (MMConv) which provides realistic conversational recommendation scenarios. The dataset contains large-scale multi-turn dialogues covering five domains: food, hotel, nightlife, shopping mall and sightseeing.
|Zgt
i i
∩Zpre |
(
It also contains a structured venue database and annotated , if the predicted actions are correct
EM Ri = |Zgt
i
|
images. During the conversation, both the agent and the user
0, otherwise
can provide images to each other. The statistics of the dataset
is shown in Tab. I. The conversations between the user and i
where EM Ri is the EMR score of the i-th sample. The Zgt
the agent are designed based on real user settings. The goal of i
and Zpre is the ground truth entity set and the top-k predicted
the agent is to recommend the target venues to the user. The entity set for all actions, respectively.
dialogues are fully annotated with dialogue belief states of the IMR stands for dialogue-level item matching rate, which
user and tuples of the agent such as inform-price-cheap. evaluates the predicted venues against the ground-truth across
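For illustration, one annotated turn might carry a belief state and agent act tuples along the following lines. This is a hypothetical sketch: the field names and the exact MMConv annotation schema are assumptions, not the dataset's actual format.

```python
# Hypothetical sketch of one annotated turn; the exact MMConv schema may differ.
turn_annotation = {
    "user_utterance": "Any cheap restaurants around here?",
    "belief_state": {          # the user's accumulated constraints
        "domain": "food",
        "price": "cheap",
    },
    "agent_acts": [            # act tuples such as inform-price-cheap
        ("inform", "price", "cheap"),
        ("request", "location", None),
    ],
}

# An act tuple flattens into the annotation style quoted in the text.
flat = "-".join(turn_annotation["agent_acts"][0])
assert flat == "inform-price-cheap"
```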
TABLE I: Statistics of the Dataset

# Dials | # of Turns | # of venues | # of reviews | # of images
5,106   | 39,759     | 1,772       | 39,772       | 103,773

TABLE II: Statistics of the Dataset Split

Dataset | # of Dials | # of Turns | Avg. # of Turns
Train   | 3,500      | 26,869     | 7.677
Val     | 606        | 4,931      | 8.137
Test    | 1,000      | 7,959      | 7.959

2) Training Details: We split the dataset for training, validation and testing. To facilitate the investigation of the zero-shot situation, we split the dataset by the different goals of the dialogues. We ensure that there are no overlapping goals among the training, validation and testing datasets. The statistics are shown in Tab. II. The input to the action prediction model is tokenized with pretrained BPE codes [30] associated with DistilGPT2 [29]. We use the default hyperparameters for GPT-2 and DistilGPT2 in Huggingface Transformers [35]. Text sequences longer than 1024 tokens are truncated. For the reasoning part, the layer number L of the signed GCN is set to 2. The dimension of the node features in the signed GCN is 128 and the batch size is set to 64. The learning rate is set to 0.001. The initial representations of the entities in the MMKG and of the user in the state graph are set to random vectors. The maximum number of training epochs is 100. The maximum number of turns for online evaluation is set to 15. To define the image similarity threshold used when updating the MMKG via images, we apply a simple greedy search over the candidate thresholds {0.5, 0.7, 0.9}. All the parameters are tuned on the validation set.
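The greedy threshold search described above can be sketched as follows. This is a minimal illustration under assumptions: `validate_with_threshold` is a hypothetical stand-in for running validation with a given image similarity threshold, not a function from the paper's codebase.

```python
def pick_image_similarity_threshold(validate_with_threshold,
                                    candidates=(0.5, 0.7, 0.9)):
    """Return the candidate threshold with the best validation score."""
    best_t, best_score = None, float("-inf")
    for t in candidates:
        score = validate_with_threshold(t)  # e.g., a validation metric
        if score > best_score:
            best_t, best_score = t, score
    return best_t

# Toy validation function that peaks at 0.7.
assert pick_image_similarity_threshold(lambda t: -abs(t - 0.7)) == 0.7
```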
3) Evaluation Metrics: Our goal is to predict the dialogue act of the agent and also provide the detailed content of the act to the user, such as informing a slot value, requesting a slot or recommending a venue. We measure the performance of our model by offline and online evaluation similar to [8].

For offline evaluation, the Act Accuracy is the proportion of correctly predicted samples to the total samples. Considering that the result may contain more than one action, we regard it as correct when all the predicted actions are strictly equal to the ground truth, omitting the order of the actions. EMR stands for turn-level entity matching rate, which compares the predicted entities (slots, values, venues) against the annotated ones when the dialogue act is predicted correctly. Given the testing set with $R$ turns, the EMR is calculated as follows:

$$\mathrm{EMR}@k = \frac{1}{R}\sum_{i=1}^{R} \mathrm{EMR}_i,$$

$$\mathrm{EMR}_i = \begin{cases} \dfrac{|Z_{gt}^{i} \cap Z_{pre}^{i}|}{|Z_{gt}^{i}|}, & \text{if the predicted actions are correct} \\ 0, & \text{otherwise} \end{cases}$$

where $\mathrm{EMR}_i$ is the EMR score of the $i$-th sample, and $Z_{gt}^{i}$ and $Z_{pre}^{i}$ are the ground truth entity set and the top-$k$ predicted entity set for all actions, respectively.

IMR stands for dialogue-level item matching rate, which evaluates the predicted venues against the ground truth across all turns in a dialogue. For each dialogue, we maintain a venue set $J_{pre}^{i}$ which stores the top-1 predicted venue of the turns whose predicted actions contain "recommendation". The IMR is calculated as:

$$\mathrm{IMR} = \frac{1}{N}\sum_{i=1}^{N} \frac{|J_{gt}^{i} \cap J_{pre}^{i}|}{|J_{gt}^{i}|},$$

where $J_{gt}^{i}$ and $J_{pre}^{i}$ are the ground truth item set and the top-1 predicted item set, respectively.

For online evaluation, we use a user simulator to evaluate the recommendation performance like [8]. We simulate the interaction process between the user and the agent. The user randomly informs a slot-value at the first turn, and then the agent provides the response to the user based on the output of our model. After the multi-turn interactions, the dialogue finishes when the agent successfully recommends the target venues or the dialogue reaches the predefined number of turns. We use SR@t to measure the cumulative ratio of conversation completion by turn t of the dialogue. SR is mainly used to evaluate whether the user's ground truth items can be found quickly in an interactive scenario.

4) Baselines: Several baselines on conversational recommendation are used for comparison.

Max Entropy. This method designs rules to perform the actions of the dialogue. When generating questions, it always chooses the attribute that has the maximum entropy among the candidate attributes in each turn. The method makes a recommendation based on the number of candidate item sets with a certain probability.

Abs Greedy [6]. This method only performs recommendation: it recommends items in each turn until it makes a successful recommendation. It updates the model by taking the user-rejected items as negative samples. The method achieves equal or better performance than bandit algorithms such as Upper Confidence Bound [1] and Thompson Sampling [3].

SCPR [15]. It is a graph-based path reasoning method to model multi-turn conversational recommendation. It starts from the user vertex and then walks through the attribute vertices on the graph based on user feedback.

UMGR [37]. This method represents user preferences by a user memory graph and then applies graph reasoning to model multi-turn conversational recommendation. The dialogue acts are predicted based on the hidden states of the user memory graph.

UNICORN [7]. This method is a unified conversational recommendation policy learning method. The authors leverage a dynamic-weighted-graph-based RL method to capture dynamic user preferences and learn the action selection strategies at each conversation turn. They apply preference-based item selection and weighted entropy-based attribute selection strategies to obtain the detailed actions.
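To make the offline metric definitions concrete, a minimal sketch of the turn-level EMR and dialogue-level IMR computations could look as follows. The input shapes (per-turn ground-truth and top-k predicted entity sets, per-dialogue venue sets) are our assumptions for illustration, not the paper's actual data format.

```python
def emr_at_k(turns):
    """turns: list of (act_correct: bool, gt_entities: set, topk_pred: set).
    Averages the per-turn entity matching rate; a turn scores 0 when its
    dialogue acts were predicted incorrectly."""
    scores = [len(gt & pred) / len(gt) if act_correct else 0.0
              for act_correct, gt, pred in turns]
    return sum(scores) / len(scores)

def imr(dialogues):
    """dialogues: list of (gt_items: set, top1_pred_items: set).
    Averages the per-dialogue overlap between ground-truth venues and the
    top-1 venues predicted at 'recommendation' turns."""
    return sum(len(gt & pred) / len(gt) for gt, pred in dialogues) / len(dialogues)

turns = [(True, {"price", "area"}, {"price"}),   # correct acts: 1/2 entities matched
         (False, {"food"}, {"food"})]            # wrong acts: scores 0
assert emr_at_k(turns) == 0.25                   # (0.5 + 0.0) / 2
```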
B. Quantitative Results

1) Main Results: The performance of all methods on the dataset is presented in Tab. III. We can observe that our method outperforms the baselines on most of the metrics. From the results of the offline evaluations, we have the following findings. First of all, our model shows higher Act Accuracy compared to the other baselines. Our model is designed to be more suitable for natural dialogue scenarios: in each turn, it is able to generate several actions at the same time, whereas the baselines can only consider one action per turn. Moreover, our model also obtains better EMR and IMR, especially EMR@1 and EMR@3, which is a significant improvement over all the comparison algorithms. This effectively demonstrates the superiority of our method in offline prediction. Specifically, our model achieves a higher improvement under EMR@1. This indicates that our algorithm can effectively select the most appropriate slot, slot-value pair or venue while maintaining a high act accuracy. This means that our model is not only closer to the natural form of dialogue, but can also ask the most effective questions for the user.

Compared with our algorithm, the other algorithms have lower offline evaluation results. Max Entropy uses a simple rule-based protocol to select the actions of the dialogue with probability. Such a policy has high randomness and does not maintain the coherence of the dialogue. SCPR takes advantage of the structural information of the graph to effectively filter out some useless slot-values, so its EMR and IMR are higher than those of Max Entropy and Abs Greedy. However, compared with our model, SCPR has lower EMR@1, EMR@3 and IMR. That is because our model has better performance on slot and venue ranking. SCPR uses an algorithm similar to max entropy to sort the entities; it cannot well select the most appropriate slots and venues according to the current state of the dialogue. For UMGR, the ranking of entities is only related to the hidden states of the entities without considering the user information, which makes it less sensitive to the exact dialogue situation. The EMR@1 and IMR of UNICORN are higher than those of the other baselines. That is because UNICORN applies a dynamic weighted graph to capture the sequential information of the dialogue history, which is beneficial for learning the user's preferences on items. However, it still uses weighted entropy to select the candidate slots and values, similar to SCPR. It also does not consider the complex relationships among slots and values. Therefore, the performance of UNICORN is worse than that of our model.

From the online evaluation, it can be seen that SCPR achieves superior results to Max Entropy and Abs Greedy because SCPR uses reinforcement learning to learn a well-designed policy that is responsible for performing the appropriate act to interact with the user. Intuitively, reinforcement learning can make better use of feedback in the process of user interaction. The performance of UNICORN is relatively lower than that of the other baselines. The reason is that UNICORN applies weighted entropy to obtain the requested slots, which makes it more difficult to select the candidate items in fewer turns. Our model shows a significant improvement compared with the baselines. It indicates that our model can identify the ground truth item within a shorter number of dialogue turns, which shows that our design can adapt to more flexible interaction scenarios. We suspect that the multi-action prediction also contributes to the better performance over the other baselines.

2) Ablation Studies: (I) Analysis of signed links. To explore the complex relationships among the slots and values of the venues, we design positive links and negative links in the database MMKG and the user state graph. During reasoning, we leverage the signed GCN to learn node features which can better capture the signed relationships among different nodes. To demonstrate the effectiveness of the signed links, we perform an experiment on a variant model which transfers the signed links into unsigned links. We delete the negative links and construct unsigned graphs for the database and the user state graph. Then we apply a widely used GCN-based method named LightGCN [11] to learn the node representations. To be fair, the rest of our model remains the same. The result is shown in Fig. 4, where SGR LightGCN represents the variant model using LightGCN to learn the node features. We can observe that although there is no obvious difference in EMR between the two methods, our SGR performs much better than SGR LightGCN in IMR. It indicates that the signed links help recommend proper venues to the users. That is because the slots with exclusive values increase the number of paths between the user and the venues through positive and negative links. Thus, with the help of signed links, the relationship between the user and the venues is enhanced.

(II) Effectiveness of the ratio of negative links. We conduct experiments under different proportions of negative links for each slot. The proportion ranges from 0 to 1, where 0 means there are no negative links in the MMKG and 1 means that we leverage all the negative links of each slot in the MMKG. As shown in Fig. 5, with the proportion increasing, the performance first improves and then degrades at a larger proportion. We suspect that the negative links provide more evidence about users' preferences. Thus the introduction of negative links with a proper proportion will help the model do
user and the similar image. Note that the image is also linked with the venue it belongs to in the graph. In this way, we link the user and the candidate venue in the state graph. By leveraging the signed GCN, we integrate the information of the venue into the user node, which is beneficial for the model to better rank the entities as well as the venues.
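The image-based linking step can be sketched as follows. This is a hedged illustration under assumptions: image embeddings are taken as given, cosine similarity is used as the similarity measure, and the threshold follows the candidates tuned in the training details; the actual implementation may differ.

```python
import math

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def link_user_to_venues(user_image, venue_images, threshold=0.7):
    """venue_images: dict venue_id -> image embedding.
    Returns the venues whose images are similar enough to the user's
    image to add a user-venue link in the state graph."""
    return [vid for vid, emb in venue_images.items()
            if cosine(user_image, emb) >= threshold]

venues = {"v1": [1.0, 0.0], "v2": [0.0, 1.0]}
assert link_user_to_venues([1.0, 0.0], venues) == ["v1"]
```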
3) The performance of different domains: We conduct experiments on the performance of the different domains in the dataset used in our paper. The results are shown in Tab. IV.
C. Qualitative Results
reinforcement learning," in Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval, 2021, pp. 1431–1441.
[8] T. Derr, Y. Ma, and J. Tang, "Signed graph convolutional networks," in 2018 IEEE International Conference on Data Mining (ICDM), 2018, pp. 929–934.
[9] M. Firdaus, N. Thakur, and A. Ekbal, "Aspect-aware response generation for multimodal dialogue system," ACM Transactions on Intelligent Systems and Technology (TIST), vol. 12, no. 2, pp. 1–33, 2021.
[10] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 770–778.
[11] X. He, K. Deng, X. Wang, Y. Li, Y. Zhang, and M. Wang, "LightGCN: Simplifying and powering graph convolution network for recommendation," in Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval, 2020, pp. 639–648.
[12] E. Hosseini-Asl, B. McCann, C.-S. Wu, S. Yavuz, and R. Socher, "A simple language model for task-oriented dialogue," in Proceedings of the 34th International Conference on Neural Information Processing Systems, 2020, pp. 20179–20191.
[13] W. Lei, X. He, M. de Rijke, and T.-S. Chua, "Conversational recommendation: Formulation, methods, and evaluation," in Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval, 2020, pp. 2425–2428.
[14] W. Lei, X. He, Y. Miao, Q. Wu, R. Hong, M.-Y. Kan, and T.-S. Chua, "Estimation-action-reflection: Towards deep interaction between conversational and recommender systems," in Proceedings of the 13th International Conference on Web Search and Data Mining, 2020, pp. 304–312.
[15] W. Lei, G. Zhang, X. He, Y. Miao, X. Wang, L. Chen, and T.-S. Chua, "Interactive path reasoning on graph for conversational recommendation," in Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, 2020, pp. 2073–2083.
[16] R. Li, S. Kahou, H. Schulz, V. Michalski, L. Charlin, and C. Pal, "Towards deep conversational recommendations," in Proceedings of the 32nd International Conference on Neural Information Processing Systems, 2018, pp. 9748–9758.
[17] L. Liao, L. Kennedy, L. Wilcox, and T.-S. Chua, "Crowd knowledge enhanced multimodal conversational assistant in travel domain," in International Conference on Multimedia Modeling, 2020, pp. 405–418.
[18] L. Liao, L. H. Long, Z. Zhang, M. Huang, and T.-S. Chua, "MMConv: An environment for multimodal conversational search across multiple domains," in Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval, 2021, pp. 675–684.
[19] L. Liao, Y. Ma, X. He, R. Hong, and T.-S. Chua, "Knowledge-aware multimodal dialogue systems," in Proceedings of the 26th ACM International Conference on Multimedia, 2018, pp. 801–809.
[20] Y. Liu, H. Li, A. Garcia-Duran, M. Niepert, D. Onoro-Rubio, and D. S. Rosenblum, "MMKG: Multi-modal knowledge graphs," in European Semantic Web Conference, 2019, pp. 459–474.
[21] Z. Liu, H. Wang, Z.-Y. Niu, H. Wu, W. Che, and T. Liu, "Towards conversational recommendation over multi-type dialogs," in Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, 2020, pp. 1036–1049.
[22] S. Moon, S. Kottur, P. A. Crook, A. De, S. Poddar, T. Levin, D. Whitney, D. Difranco, A. Beirami, E. Cho et al., "Situated and interactive multimodal conversations," in Proceedings of the 28th International Conference on Computational Linguistics, 2020, pp. 1103–1121.
[23] S. Moon, P. Shah, A. Kumar, and R. Subba, "OpenDialKG: Explainable conversational reasoning with attention-based walks over knowledge graphs," in Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, 2019, pp. 845–854.
[24] H. Mousselly-Sergieh, T. Botschen, I. Gurevych, and S. Roth, "A multimodal translation-based approach for knowledge graph representation learning," in Proceedings of the Seventh Joint Conference on Lexical and Computational Semantics, 2018, pp. 225–234.
[25] L. Nie, W. Wang, R. Hong, M. Wang, and Q. Tian, "Multimodal dialog system: Generating responses via adaptive decoders," in Proceedings of the 27th ACM International Conference on Multimedia, 2019, pp. 1098–1106.
[26] P. Pezeshkpour, L. Chen, and S. Singh, "Embedding multimodal relational data for knowledge base completion," in Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, 2018, pp. 3208–3218.
[27] A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, I. Sutskever et al., "Language models are unsupervised multitask learners," OpenAI Blog, vol. 1, no. 8, pp. 1–24, 2019.
[28] A. Saha, M. Khapra, and K. Sankaranarayanan, "Towards building large scale multimodal domain-aware conversation systems," in Proceedings of the AAAI Conference on Artificial Intelligence, 2018, pp. 696–704.
[29] V. Sanh, L. Debut, J. Chaumond, and T. Wolf, "DistilBERT, a distilled version of BERT: Smaller, faster, cheaper and lighter," in the 5th Workshop on Energy Efficient Machine Learning and Cognitive Computing, 2019, pp. 1–5.
[30] R. Sennrich, B. Haddow, and A. Birch, "Neural machine translation of rare words with subword units," in Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, 2016, pp. 1715–1725.
[31] R. Sun, X. Cao, Y. Zhao, J. Wan, K. Zhou, F. Zhang, Z. Wang, and K. Zheng, "Multi-modal knowledge graphs for recommender systems," in Proceedings of the 29th ACM International Conference on Information & Knowledge Management, 2020, pp. 1405–1414.
[32] Y. Sun and Y. Zhang, "Conversational recommender system," in The 41st International ACM SIGIR Conference on Research & Development in Information Retrieval, 2018, pp. 235–244.
[33] M. Wang, H. Wang, G. Qi, and Q. Zheng, "Richpedia: A large-scale, comprehensive multi-modal knowledge graph," Big Data Research, vol. 22, p. 100159, 2020.
[34] Y. Wang, S. Qian, J. Hu, Q. Fang, and C. Xu, "Fake news detection via knowledge-driven multimodal graph convolutional networks," in Proceedings of the 2020 International Conference on Multimedia Retrieval, 2020, pp. 540–547.
[35] T. Wolf, J. Chaumond, L. Debut, V. Sanh, C. Delangue, A. Moi, P. Cistac, M. Funtowicz, J. Davison, S. Shleifer et al., "Transformers: State-of-the-art natural language processing," in Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, 2020, pp. 38–45.
[36] R. Xie, Z. Liu, H. Luan, and M. Sun, "Image-embodied knowledge representation learning," in Proceedings of the 26th International Joint Conference on Artificial Intelligence, 2017, pp. 3140–3146.
[37] H. Xu, S. Moon, H. Liu, B. Liu, P. Shah, and P. S. Yu, "User memory reasoning for conversational recommendation," in Proceedings of the 28th International Conference on Computational Linguistics, 2020, pp. 5288–5308.
[38] Y. Zhang, Z. Ou, and Z. Yu, "Task-oriented dialog systems that consider multiple appropriate responses under the same context," in Proceedings of the AAAI Conference on Artificial Intelligence, 2020, pp. 9604–9611.
[39] Y. Zhang, X. Chen, Q. Ai, L. Yang, and W. B. Croft, "Towards conversational search and recommendation: System ask, user respond," in Proceedings of the 27th ACM International Conference on Information and Knowledge Management, 2018, pp. 177–186.
[40] K. Zhou, W. X. Zhao, S. Bian, Y. Zhou, J.-R. Wen, and J. Yu, "Improving conversational recommender systems via knowledge graph based semantic fusion," in Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, 2020, pp. 1006–1014.
[41] K. Zhou, Y. Zhou, W. X. Zhao, X. Wang, and J.-R. Wen, "Towards topic-guided conversational recommender system," in Proceedings of the 28th International Conference on Computational Linguistics, 2020, pp. 4128–4139.
[42] J. Zou, Y. Chen, and E. Kanoulas, "Towards question-based recommender systems," in Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval, 2020, pp. 881–890.

Xueming Qian (M'10) received the B.S. and M.S. degrees from the Xi'an University of Technology, Xi'an, China, in 1999 and 2004, respectively, and the Ph.D. degree in electronics and information engineering from Xi'an Jiaotong University, Xi'an, China, in 2008. He was a Visiting Scholar with Microsoft Research Asia, Beijing, China, from 2010 to 2011. He was previously an Assistant Professor at Xi'an Jiaotong University, where he was an Associate Professor from 2011 to 2014, and is currently a Full Professor. He is also the Director of the Smiles Laboratory, Xi'an Jiaotong University. His research interests include social media big data mining and search.
Yuxia Wu received the B.S. degree from Zhengzhou University, Henan, China, in 2014, the M.S. degree from the Fourth Military Medical University, Xi'an, China, in 2017, and is currently working toward the Ph.D. degree at Xi'an Jiaotong University, Xi'an, China. She is now a visiting student at the National University of Singapore. Her research interests include social multimedia mining, recommender systems and natural language processing.

Lizi Liao is an assistant professor with Singapore Management University. She received the Ph.D. degree in 2019 from the NUS Graduate School for Integrative Sciences and Engineering at the National University of Singapore. Her research interests include conversational systems, multimedia analysis and recommendation. Her works have appeared in top-tier conferences such as MM, WWW, ICDE, ACL, IJCAI and AAAI, and top-tier journals such as TKDE. She received the Best Paper Award Honorable Mention of ACM MM 2018. Moreover, she has served as a PC member for international conferences including SIGIR, WSDM and ACL, and as an invited reviewer for journals including TKDE, TMM and KBS.

Tat-Seng Chua is the KITHCT Chair Professor at the School of Computing, National University of Singapore. He was the Acting and Founding Dean of the School from 1998 to 2000. Dr Chua's main research interest is in multimedia information retrieval and social media analytics. In particular, his research focuses on the extraction, retrieval and question-answering (QA) of text and rich media arising from the Web and multiple social networks. He is the co-Director of NExT, a joint Center between NUS and Tsinghua University to develop technologies for live social media search.
Dr Chua is the 2015 winner of the prestigious ACM SIGMM award for Outstanding Technical Contributions to Multimedia Computing, Communications and Applications. He is the Chair of the steering committee of the ACM International Conference on Multimedia Retrieval (ICMR) and the Multimedia Modeling (MMM) conference series. Dr Chua is also the General Co-Chair of ACM Multimedia 2005, ACM CIVR (now ACM ICMR) 2005, ACM SIGIR 2008, and ACM Web Science 2015. He serves on the editorial boards of four international journals. Dr. Chua is the co-Founder of two technology startup companies in Singapore. He holds a PhD from the University of Leeds, UK.