State Graph Reasoning For Multimodal Conversational Recommendation
Abstract—Conversational recommendation systems (CRS) attract increasing attention in various application domains such as retail and travel. They offer an effective way to capture users' dynamic preferences through multi-turn conversations. However, most current studies center on the recommendation aspect while over-simplifying the conversation process. This neglect of the complexity in data structure and conversation flow hinders their practicality and utility. In reality, various relationships exist among slots and values, and users' requirements may dynamically adjust or change. Moreover, the conversation often involves the visual modality. These characteristics call for a more advanced internal state representation of the dialogue and a proper reasoning scheme to guide the decision-making process. In this paper, we explore multiple facets of multimodal conversational recommendation and address the above-mentioned challenges. In particular, we represent the structured back-end database as a multimodal knowledge graph which captures the various relations and evidence in different modalities. The user preferences expressed via conversation utterances are then gradually updated to the state graph with clear polarity. Based on these, we train an end-to-end State Graph-based Reasoning model (SGR) to perform reasoning over the whole state graph. The prediction of our proposed model benefits from the structure of the graph: it not only allows zero-shot reasoning for items unseen in training conversations, but also provides a natural way to explain the policies. Extensive experiments show that our model achieves better performance compared with existing methods.

Index Terms—Recommendation systems, conversation, knowledge graph

This work was supported in part by the scholarship from China Scholarship Council (CSC) under Grant 202006280325; in part by the NSFC, China under Grants 61902309, 61701391, and 61772407; in part by ShaanXi Province under Grant 2018JM6092; in part by the Fundamental Research Funds for the Central Universities, China (xxj022019003); in part by China Postdoctoral Science Foundation (2020M683496); in part by the National Postdoctoral Innovative Talents Support Program, China (BX20190273); and in part by the Science and Technology Program of Xi'an, China under Grant 21RGZN0017.
Yuxia Wu, Guoshuai Zhao and Xueming Qian are with the Ministry of Education Key Laboratory for Intelligent Networks and Network Security, Xi'an Jiaotong University (e-mail: [email protected]; [email protected]; [email protected]).
Lizi Liao is with the Singapore Management University (e-mail: [email protected]).
Gangyi Zhang is with the University of Science and Technology of China (e-mail: [email protected]).
Wenqiang Lei and Tat-Seng Chua are with the National University of Singapore (e-mail: [email protected]; [email protected]).
Lizi Liao and Xueming Qian are the corresponding authors.

I. INTRODUCTION

Conversational recommendation systems (CRS) have become an emerging research topic in information seeking. They integrate the strengths of recommendation systems and conversation techniques. In general, recommendation systems predict users' preferences towards items by analyzing their past behaviors such as click history, visit logs, and ratings on items, and they are widely applied in many domains. With the help of multi-turn conversations, the system can further capture the detailed and dynamic preferences of users, which may lead to better recommendation results and user experience [28].

There have been many efforts centered on integrating conversation modelling into recommendation systems. From a broader perspective, pioneering works emerged from tag-based interaction between users and systems, where the interaction is mainly realized by tags [5]. To further improve the convenience of the developed systems, more efforts focus on multi-turn conversations with natural language as both the system's input and output [16, 32, 41]. Generally speaking, existing methods have emphasized three broad directions: a) since the key advantage of CRS is being able to ask questions, a line of studies works on learning to ask appropriate attributes/topics/categories of items to narrow down the candidate items [5, 38, 39, 42]; b) another line of efforts targets learning a better strategy for making successful recommendations with fewer turns of interaction [14, 15, 32]; c) there are also works that further delve into in-depth dialogue understanding and response generation [4, 16, 21, 37, 41].

However, these current methods have several shortcomings. First, most of them directly generate responses via action prediction and entity linking, while ignoring the various relationships among slot values and their relation to the explicit dialogue state representations [14, 16], as shown in Fig. 1(a). This leads to a less informative representation of user preference and thus harms the recommendation performance. For example, when a user wants to find a cheap item, it indicates that the user has negative preference for the other price values. Second, the modelling of dynamic user preference change is relatively weak. Although some efforts try to achieve this with graph-based methods [7, 15, 23, 37], as shown in Fig. 1(b), they fail to represent the dynamic change in an effective way. For example, the model in [15] simply deleted the mentioned attributes or items negated in historical turns without updating any user or item representations. Meanwhile, the model in [37] only considered the most updated user requirements for each slot without remembering any former denied ones [7]. Last but not least, most existing works focus on textual conversations. However, there is a growing demand for multimodal conversations to facilitate recommendation in domains like e-commerce retail and travel. Although there are some initial works [19, 25, 28], the over-simplified usage of image information hinders the image modality's contribution to conversational recommendation.

Fig. 1: (a) Corpus-based methods [14, 16]; (b) Graph-based methods [7, 15, 23, 37]; (c) Our method.

To address the above-mentioned issues, we focus on a new scheme, as shown in Fig. 1(c). We explore the complex relationships among slots and values in the back-end item database and represent the structured data as a multimodal knowledge graph. The graph is then updated with the user preferences represented in the dialogue states as the conversation goes on. Based on this, the actions to take, such as recommendation or further inquiry, can be generated by performing reasoning on the state graph. Specifically, the state graph is initiated as a signed graph containing positive and negative links to model the complicated relationships among items, slots and values, as well as the rich modalities of information in the back-end database. We then gradually update the state graph to capture the dynamic user requirements based on the user intention harvested from the conversation history. The state graph is updated in an explicit way by adding, deleting or changing the links between the user and other nodes based on the dynamic user preference. Basically, the state graph keeps track of the conversation progress and serves as a base for reasoning about the entity ranking of the nodes. We then train an end-to-end graph reasoning model, SGR, to perform inference that explicitly differentiates between positive and negative user preferences. Since the prediction results inherit the graph structure, the model caters for cold-start venues and provides natural explanations.

We summarize our contributions as follows:
• We explicitly model users' dynamic preferences and integrate them with a multimodal knowledge graph for better state representation. The update of the graph reflects the real change of users' preferences based on both the textual and visual modalities.
• We design a state graph reasoning model to capture the various evidence in the multimodal knowledge graph and generate more accurate agent behavior predictions.
• Extensive experiments demonstrate the effectiveness of our proposed method. Qualitative results also show that the proposed method not only handles zero-shot situations well but also offers good explainability.

II. RELATED WORK

Our work is closely related to three lines of research: conversational recommendation, graph reasoning and multimodal knowledge graphs. Here, we briefly discuss the connections between these lines of research and emphasize the research gap targeted by this work.

A. Conversational Recommendation

Conversational recommendation aims at providing interactive recommendations through dialogues. Compared with traditional static recommender systems, it has the advantage of capturing users' dynamic preferences from the multi-turn utterances [13].

The existing works on conversational recommender systems fall into three broad categories. One line of efforts was largely question driven. They focused on learning to ask attributes/topics/categories of items to reduce the search space [5, 38, 39, 42]. For example, the Multi-Memory Network (MMN) [39] was a unified model integrating query/item representation learning and conversational search/recommendation. It learned user preference by asking questions. However, it did not contain any special policy network to decide when to ask or recommend. Another line of studies targeted better strategies for making successful recommendations in fewer turns. For instance, the Conversational Recommender Model (CRM) [32] applied reinforcement learning to decide when to ask or recommend items based on the user's current preference learned by a belief tracker. However, it would only recommend once, which would fail if the user gave negative feedback on the recommendation. To solve these problems, Estimation–Action–Reflection (EAR) [14] was proposed to learn user preferences and rank the items and attributes. A policy network was applied to determine whether to ask further attributes or recommend items. The model would also be updated when the user rejects the recommendations. Later on, the authors further devised an interactive path reasoning mechanism to help the ranking of items and attributes [15]. Beyond emphasizing the recommendation part, the third line of efforts delved into in-depth dialogue understanding and response generation [4, 16, 21, 37, 41]. For example, a knowledge graph was introduced in Knowledge-Based Recommender Dialog (KBRD) [4] to bridge the recommender system and the dialogue system in an end-to-end manner. The model linked the entities in the dialogue history to the external knowledge graph to enhance the representation of users' preferences. Then the dialogue system can generate responses that are consistent with users' interests. Similarly, the model in [40] incorporated both word-based and entity-based knowledge graphs (KG) to enhance the semantic representations in CRS. However, these models lack dialogue state management. In our work, we incorporate the knowledge graph into our internal dialogue state representation and perform reasoning on it to yield better results.
B. Graph Reasoning

With knowledge graph structures, there have been many efforts to perform reasoning over the graph to enhance conversational recommendation performance. Graph reasoning has been successfully applied to many tasks such as social network analysis, question answering and recommendation. For conversational recommendation, Open-ended Dialog KG (OpenDialKG) [23] learns the optimal path of the dialogue context within a large common-sense KG for an open-ended dialogue system. The system asks questions about the attributes of items and also chats with the users. It applies an attention-based graph decoder to rank the candidate entities from the KG. During path walking, it prunes unattended paths to effectively reduce the search space. Simple Conversational Path Reasoning (SCPR) [15] is an extension of EAR which utilizes the user's attribute feedback explicitly and converts conversational recommendation into an interactive path reasoning problem over the graph. It relies on users' historical interaction records to learn user and item representations by offline training. Besides, it only considers the one-hop neighbor entities of the attributes or items while ignoring the different slots and values of the items.

To capture users' dynamic preferences, researchers proposed user memory reasoning for conversational recommendation [37]. They constructed a user memory graph from users' past preferences and their current requests during the conversation. When updating the memory graph, they only considered the sentiment relations. A Relational Graph Convolutional Network (R-GCN) was applied to learn the hidden state of each entity and predict the dialogue action. More recently, an adaptive reinforcement learning framework, namely the UNIfied COnversational RecommeNder (UNICORN), was proposed in [7]. The model integrated three separate decision-making processes in conversational recommender systems as a unified policy learning problem. A dynamic weighted graph was designed to capture the sequential information of the dialogue history, which is beneficial for learning the user's preference on items. However, it only uses weighted entropy to select the candidate slots and values, similar to SCPR.

Our work is close to these works but has several key differences. First, the existing works ignore the complicated relations among items, slots and values. The inter-connections among them also provide evidence for reasoning. For instance, when a user prefers a cheap venue, this signals negative tendencies towards the venues which are connected to other price categories. To this end, we propose a signed Graph Convolutional Network (GCN) based method to better model the polarity in user preferences. Second, we update the state graph to model the dynamic change of user preferences in an explicit way. Our model performs reasoning over the global state graph instead of a local one as in [15].

C. Multimodal Knowledge Graph

As we incorporate multimodal information into a knowledge graph to perform reasoning, our work is also closely related to the works on Multimodal Knowledge Graphs (MMKG). An MMKG integrates multimodal data (such as images and texts) into the knowledge graph and treats the image or text as an entity or an attribute of an entity [20, 33]. In general, MMKG representation learning can be divided into two categories: feature-based methods [24, 36] and entity-based methods [26]. The former treats the visual information as features of entities. These methods modify the TransE model [2] to integrate the visual features of entities. However, this kind of method requires each entity to provide visual information, which is not suitable for many tasks. The latter [26] constructs the MMKG by adding extra relations to the original KG, such as hasImage and hasDescription. The multimodal information can be aggregated into its neighbor entity, and a GCN or R-GCN is then applied to learn the representations of the entities. For instance, the researchers in [34] introduced a Knowledge-driven Multimodal Graph Convolutional Network (KMGCN) to model the semantic representations of textual information, knowledge concepts and visual information for fake news detection. [31] was the first work that incorporated an MMKG into recommender systems. In that work, a multimodal knowledge graph attention network was proposed to learn the representations of entities. The experiments demonstrated that multimodal features outperform any single-modal features. However, the application of MMKG in conversational recommendation is currently under-explored. In this work, we aim to make use of the MMKG to boost conversational recommendation performance.

III. METHODOLOGY

The overall framework is illustrated in Fig. 2. The proposed SGR model starts from 1) constructing an MMKG. Then, for each dialogue, it 2) updates the MMKG-based state graph turn by turn; and 3) reasons over the state graph for detailed decision making.

Fig. 2: The overall architecture of the proposed SGR model. It first builds the initial MMKG to capture the back-end database knowledge. Then, based on the gradually updated state graph during the conversation process, it performs reasoning over the state graph to generate agent action decisions.

Specifically, as each conversation begins and goes on, we update the state graph gradually to introduce the user preferences expressed in both the textual and visual modalities. The update module includes add, change and negate operations, which help to capture the user's dynamically changing requirements in a convenient and explicit way. Based on the up-to-date state graph, we conduct reasoning over it via signed graph convolutional neural networks. This integrates evidence from the inter-connections of nodes and captures the preference polarities of the user. Guided by the detailed intent actions predicted via a pre-trained GPT-2 model, the corresponding entities such as slots, values or venues are then ranked via the learned node representations. In what follows, we introduce these modules in detail.

A. Constructing MMKG

Different from current dialogue research that queries a database for the target, we aim at building a graph structure to capture various information for reasoning about the target.
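To make the intended structure concrete, the following is a minimal sketch of such a signed back-end graph, assuming a toy venue database; the schema, helper names and the `EXCLUSIVE_SLOTS` set are illustrative, not the authors' actual implementation.

```python
# Sketch: building a signed knowledge graph from a toy venue database.
# Each venue gets a positive link to its own slot-value pair and, for
# mutually exclusive slots, negative links to the sibling values.

EXCLUSIVE_SLOTS = {"price"}  # slots whose values are mutually exclusive

def build_mmkg(venues, slot_values):
    """venues: {venue: {slot: value}}; slot_values: {slot: [all values]}.
    Returns a signed edge map {(venue, (slot, value)): +1 or -1}."""
    edges = {}
    for venue, attrs in venues.items():
        for slot, value in attrs.items():
            edges[(venue, (slot, value))] = +1
            if slot in EXCLUSIVE_SLOTS:
                # negative links to the other values of an exclusive slot
                for other in slot_values[slot]:
                    if other != value:
                        edges[(venue, (slot, other))] = -1
    return edges

venues = {"Mellben Seafood": {"price": "expensive", "area": "orchard"}}
slot_values = {"price": ["cheap", "moderate", "expensive"],
               "area": ["orchard"]}
g = build_mmkg(venues, slot_values)
print(g[("Mellben Seafood", ("price", "expensive"))])  # 1
print(g[("Mellben Seafood", ("price", "cheap"))])      # -1
```

An image node would be attached in the same way, as an extra node linked to its venue, following the entity-based MMKG construction.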
Hence, we expect it to have the following characteristics: 1) it should have the ability to capture different modalities, such as texts and images; 2) it should be able to represent various relationships in the back-end database, such as exclusive relations among values of the same slot; and 3) we also expect it to be able to handle users' preference polarities such as like or dislike.

To achieve this, we construct a general MMKG to represent the back-end database. Formally, we represent the graph as G = (V, γ), where V denotes the set of nodes and γ represents the links in the graph. Generally, the nodes represent the items and the attributes in the database. The attributes can be textual terms or images belonging to the items. To cover different modalities, we follow the entity-based MMKG method [26] and treat the images as nodes in G. The links connect the items and their attributes. They can be of different types, indicating different relationships among attributes. The links can also represent different polarities, i.e., positive or negative. For example, for the scenario of conversational recommendation on the task of finding places, the nodes are the venues and the slot-value pairs of these venues (here the venues refer to the aforementioned items, and the slot-value pairs refer to attributes such as price-cheap). There exist various relationships among the values of the same slot: 1) slots containing mutually exclusive values (such as price with value candidates cheap, moderate and expensive); 2) slots without mutually exclusive values (such as has image). To represent the exclusive relationships, we update the link combinations between the venue and the slot-values. For example, if the attributes of a venue contain price-cheap, then there will be a positive link between the venue and price-cheap and negative links between the venue and the other price-value candidates.

B. Updating MMKG-based State Graph

As the conversation begins and goes on, it is essential for the agent to understand the user's intention in the textual or visual modality. We thus transform the dialogue states (such as inform-price-expensive, which represents the action, slot and value, respectively) into the state graph initiated by the MMKG and update it gradually as the conversation goes on. We add a node to represent the current user and use signed links to denote the user's preference polarity towards the entities in the MMKG. Note that our constructed MMKG contains information in both modalities, and the user dialogues are also flexible in modality usage. To give a clear view of how the state graph is updated, we illustrate the process for each information modality separately.

1) Update Textual Slots: The textual information conveys users' requirements and preferences. It is natural for users to change their requirements as the conversation goes on. Therefore, we update the state graph dynamically with three operations: add, change and negate. When the user provides new requirements (see turn 2 in Fig. 3), we add signed links between the user and the corresponding slot-value nodes. When the user changes requirements, we also change the links accordingly. It can be seen that the state graph can effectively reflect users' dynamic preferences in an explicit way, and it is convenient to change the links based on the dialogue state. By adding signed edges between the user and the attributes, we can represent users' likes and dislikes clearly in the state graph.

2) Update Intention via Image: In our multimodal conversational recommendation task, users often offer images to express their intention conveniently. To understand users' intention based on images, we apply a layer-by-layer taxonomy-based ResNet [10] classifier to learn the visual features of the images [17]. To update the user's intention into the state graph, we compute the cosine similarities between the user-provided images and the images in the MMKG. If there exist images in the MMKG with similarity scores exceeding a pre-defined threshold, we update the state graph by adding a positive link between the user and that image in the MMKG (see turn 3 in Fig. 3). In this way, the user is quickly connected to a specific venue. Also, if the user negates an image of a venue provided by the agent, we add negative links between the user and the very similar images in the MMKG. Then the related venue node would easily receive negative tendencies from the user via the links.
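The update operations described in this subsection can be sketched as follows; the class and method names, the action labels and the similarity threshold are illustrative assumptions, not the authors' implementation.

```python
# Sketch of the state-graph update step: signed user links are added,
# changed or negated from dialogue-state tuples, and user-provided image
# features attach to MMKG image nodes by cosine similarity.
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

class StateGraph:
    def __init__(self):
        self.user_links = {}  # node -> +1 (like) / -1 (dislike)

    def update_text(self, action, slot, value):
        node = (slot, value)
        if action == "inform":      # add, or change by overwriting
            self.user_links[node] = +1
        elif action == "negate":    # explicit dislike
            self.user_links[node] = -1

    def update_image(self, user_feat, mmkg_images, threshold=0.9, polarity=+1):
        # link the user to every sufficiently similar MMKG image node
        for img_id, feat in mmkg_images.items():
            if cosine(user_feat, feat) >= threshold:
                self.user_links[img_id] = polarity

g = StateGraph()
g.update_text("inform", "price", "cheap")
g.update_text("inform", "price", "expensive")  # changed requirement
g.update_text("negate", "price", "cheap")
g.update_image([1.0, 0.0], {"img_7": [0.99, 0.05]})
print(g.user_links[("price", "expensive")])  # 1
print(g.user_links[("price", "cheap")])      # -1
```

Setting `polarity=-1` in `update_image` covers the case where the user rejects an image shown by the agent, so that the linked venue receives the negative tendency.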
Fig. 3: An example of state graph updating across three turns, showing the signed user links and scores over the candidate nodes cheap, orchard, good and expensive. Utterances: "Hi, I am looking for a restaurant in orchard road." / "What price range do you prefer?"; "I would like expensive ones." / "Is there anything else you would like?"; "I want to try something like this." / "How about the Mellben Seafood?"

C. Reasoning over State Graph

To facilitate the description of the following modules, we define B_t = {b_1, b_2, ..., b_m} as the belief states of turn t. B_t summarizes the dialogue history up to the current turn t. Each state consists of tuples {a, s, v}, where a is the action, s is the slot and v is the value.

For turn t, given the historical dialogue state B_t and the previous state graph G_{t-1} of turn t-1, the target of our model is to predict a series of tuples Y = {y_1, y_2, ..., y_n}, where y_i = {a_i, s_i, v_i}. If a_i = request, v_i is set to null. If a_i = recommend, s_i is set to null. We first predict the actions and then obtain the detailed arguments of the corresponding dialogue actions. We introduce the details of each step in the following sections.

1) Predicting Dialogue Actions: The dialogue actions of the agent have a strong dependence on B_t. Many existing works apply a classification model to predict the dialogue act for each turn. However, this oversimplifies the conversational recommendation scenario, as demonstrated in [18]: human agents often perform more than one action in a single turn. As it is unrealistic to pre-define the number of actions to perform in each turn, we cast action prediction as a sequence generation task, in which the model automatically decides how many actions to perform in one turn based on the context.

With the development of NLP techniques, various pre-trained models have emerged, such as BERT, Transformer and GPT-2 [27]. Inspired by SimpleTOD [12], we recast the action prediction task as a sequence-to-sequence generation problem where B_t is treated as the input and the target actions are treated as the output. The model learns to generate the end token automatically once it is confident enough. The pre-trained GPT-2 model is leveraged.

To adapt our input B_t to the GPT-2 model, we first transfer the dialogue states B_t into a sequence containing a list of triplets: x = "<|belief|> a_1, s_1, v_1; ...; a_m, s_m, v_m <|endofbelief|>". The output is also a sequence: y = "<|action|> a_1, ..., a_n <|endofaction|>". A training sequence is the concatenation [x; y] of x and y. We denote it as g = (g_1, g_2, ..., g_{|g|}). Given a training instance like this, the joint probability of the sequence is calculated as:

p(g) = \prod_{i=1}^{|g|} p(g_i \mid g_{<i}).

Given a dataset with |K| training instances, the loss for training the generator (a neural network with parameters θ) is the negative log-likelihood over the whole training data. We aim to minimize the loss as follows:

L_A = -\sum_{k=1}^{|K|} \sum_{i=1}^{n_k} \log p_\theta(g_i^k \mid g_{<i}^k),

where n_k is the length of the instance g^k.

¹ Here the user-provided image is used for linking the similar image in the MMKG. On the other side, the content of the user-provided image is captured in action prediction.

2) Reasoning for Details: After the action prediction module, the model predicts a set of actions A = (a_1, a_2, ..., a_n). The next step is to enrich these predicted actions with detailed slots and values. In detail, if a_i is inform, then the top-1 slot-value pair (s_i, v_i) is selected to yield a detailed tuple (inform, s_i, v_i). When a_i is request, the tuple will be (request, s_i, null), where s_i is the top-ranked slot. Similarly, when a_i is recommend, the tuple will be (recommend, null, v_i), where v_i is the top-ranked venue.

To do reasoning, we first learn the node representations of the state graph. Considering that the graph has both positive and negative links, it is not suitable to apply a traditional GCN. To deal with this problem, researchers have carried out extensive explorations and proposed the signed GCN [8]. To properly integrate the positive and negative tendencies during the aggregation process, we leverage multiple layers of signed GCN [8] over the graph and obtain the hidden representations of all nodes.

In the signed state graph, each node i has two kinds of neighbors: positive-linked neighbors N_i^+ and negative-linked neighbors N_i^-. It is not sufficient to learn one single representation for each node as in a traditional unsigned GCN. Thus we maintain two kinds of representations for each node. We define h_i^P and h_i^Q as the positive and negative representations of node i, respectively. The representations are aggregated layer by layer. To incorporate the signed information during aggregation, we follow the balance theory mentioned in [8]. The aggregation of the neighbor information comes from two parts: the information from the neighbors N_i^+ and the information from the neighbors N_i^-. The detailed aggregation process is as follows:

h_i^{P(l)} = \sigma\Big(W^{P(l)}\Big[\sum_{j \in N_i^+} \frac{h_j^{P(l-1)}}{|N_i^+|}, \sum_{k \in N_i^-} \frac{h_k^{Q(l-1)}}{|N_i^-|}, h_i^{P(l-1)}\Big]\Big),

h_i^{Q(l)} = \sigma\Big(W^{Q(l)}\Big[\sum_{j \in N_i^+} \frac{h_j^{Q(l-1)}}{|N_i^+|}, \sum_{k \in N_i^-} \frac{h_k^{P(l-1)}}{|N_i^-|}, h_i^{Q(l-1)}\Big]\Big),

where h_i^{P(l)} and h_i^{Q(l)} are the positive and negative representations at layer l, respectively, and W^{P(l)} and W^{Q(l)} are the linear transformation matrices. The aggregation starts from h_i^0, the initial representation of node i.

The final representation of node i is the concatenation of the positive and negative representations:

h_i^l = [h_i^{P(l)}, h_i^{Q(l)}].

Then we calculate the loss L_H of node representation learning as in [8]. The loss is designed to capture the relationships among the nodes. We construct a set M containing triplets (i, j, z), where z ∈ {+, -, ?} denotes a positive link, a negative link and no link, respectively. For each pair of linked nodes (i, j), we sample a non-linked node k. The first term is a weighted multinomial logistic regression (MLG) classifier that classifies the relationship z ∈ {+, -, ?} of two nodes. The second term guarantees that the distance between positive-linked nodes is smaller than that between non-linked nodes, and that the distance between non-linked nodes is smaller than that between negative-linked nodes.
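The two-role aggregation described above can be sketched as follows. This is a toy, dependency-free rendering under stated simplifications: the learned maps W^P, W^Q and the nonlinearity σ are omitted, so a new embedding is just the concatenation of its three aggregated parts.

```python
# One balance-theory aggregation layer of a signed GCN (in the style of
# [8]): every node keeps a "positive" (P) and a "negative" (Q) embedding,
# and negative edges route the neighbor's opposite-role embedding.

def mean(vectors, dim):
    # mean-pool a list of equal-length vectors; zero vector if empty
    if not vectors:
        return [0.0] * dim
    return [sum(v[d] for v in vectors) / len(vectors) for d in range(dim)]

def sgcn_layer(P, Q, pos, neg):
    dim = len(next(iter(P.values())))
    newP, newQ = {}, {}
    for i in P:
        # positive role: P of positive neighbors, Q of negative neighbors, self
        newP[i] = (mean([P[j] for j in pos.get(i, [])], dim)
                   + mean([Q[k] for k in neg.get(i, [])], dim)
                   + P[i])
        # negative role: Q of positive neighbors, P of negative neighbors, self
        newQ[i] = (mean([Q[j] for j in pos.get(i, [])], dim)
                   + mean([P[k] for k in neg.get(i, [])], dim)
                   + Q[i])
    return newP, newQ

P = {"u": [1.0], "a": [2.0], "b": [3.0]}
Q = {"u": [0.5], "a": [0.2], "b": [0.1]}
newP, newQ = sgcn_layer(P, Q, pos={"u": ["a"]}, neg={"u": ["b"]})
h_u = newP["u"] + newQ["u"]  # final node representation [h^P, h^Q]
print(newP["u"])  # [2.0, 0.1, 1.0]
```

Note how the negative edge from u to b feeds b's opposite-role embedding into u, which is exactly how the balance theory enters the aggregation.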
L_H = -\frac{1}{|M|}\sum_{(i,j,z)\in M} w_z \log \frac{\exp([h_i; h_j]\,\theta_z^{MLG})}{\sum_{q\in\{+,-,?\}} \exp([h_i; h_j]\,\theta_q^{MLG})}
    + \lambda\Big[\frac{1}{|M_{(+,?)}|}\sum_{f_{ijk}\in M_{(+,?)}} \max\big(0, \|h_i - h_j\|_2^2 - \|h_i - h_k\|_2^2\big)
    + \frac{1}{|M_{(-,?)}|}\sum_{f_{ijk}\in M_{(-,?)}} \max\big(0, \|h_i - h_k\|_2^2 - \|h_i - h_j\|_2^2\big)\Big]
    + Reg(\theta_W, \theta_{MLG}),

where w_z denotes the weight of class z, and θ_W and θ_{MLG} are the parameters of the signed GCN and the MLG classifier. f_{ijk} denotes the node triplets (i, j, k) in M_{(+,?)} and M_{(-,?)}, which are the sets of paired nodes consisting of the linked nodes (i, j) and the non-linked nodes (i, k). Reg stands for the regularization of the parameters.

After obtaining the node representations, we compute the ranking scores of the corresponding slots and values. For the ranking of slots S, considering that there are only venues and slot-value pairs in our graph, we compute the score of each slot by aggregating the scores of the slot-value pairs belonging to that slot and use softmax to obtain the normalized scores over all slots. For the ranking of slot-values and venues, we apply a multi-layer perceptron (MLP) on the concatenation of the representations of the user and the corresponding nodes in the MMKG. We denote the ranking scores of slots, slot-values and venue names as y_S, y_V and y_C, respectively, computed as follows:

y_S = Softmax\Big(\sum_{v \in S_i} y_v\Big),
y_V = Sigmoid(MLP([h_u, h_V])),
y_C = Sigmoid(MLP([h_u, h_C])),

where MLP is the multi-layer perceptron, and h_u, h_V and h_C are the representations of the user, the slot-values and the venues.

We apply cross entropy to calculate the loss function for ranking:

L_y = -\sum y^* \log y,

where y denotes the ranking result (e.g., y_S, y_V, y_C) and y^* is the corresponding ground-truth result. Finally, we obtain the total ranking loss as L_R = δ_S L_S + δ_V L_V + δ_C L_C, where δ_S, δ_V and δ_C denote the balancing coefficients, and L_S, L_V and L_C are the losses of the slots, slot-values and venues.

3) Joint Training: For joint training, the total loss of the node representation learning and graph reasoning is as follows:

L = αL_H + βL_R,    (2)

where L_H and L_R represent the losses of node representation learning and ranking, respectively, and α and β are the coefficients.

IV. EXPERIMENTS

In this section, we evaluate our proposed model. To better explain and analyze our model, the following questions guide the experiments:
• RQ1. How does our method perform compared with state-of-the-art conversational recommendation methods?
• RQ2. Is our model robust to different settings, and which design of our model has more significant effects?
• RQ3. Can our model handle zero-shot scenarios and provide explanations for decision making?

A. Experiments Setup

1) Dataset: Several multimodal dialogue datasets have been contributed. However, to the best of our knowledge, most of them are not suitable for our conversational recommendation with MMKG scenario. For example, MMD [28] comes with no dialogue state or dialogue act annotation. MDMMD++ [9] looks promising, but the dataset is not publicly available yet. SIMMC [22] comes with state and action annotations, but it is not a recommendation setting. Fortunately, Liao et al. [18] propose a fully annotated task-oriented Multimodal Multi-domain Conversational dataset (MMConv) which provides realistic conversational recommendation scenarios. The dataset contains large-scale multi-turn dialogues covering five domains: food, hotel, nightlife, shopping mall and sightseeing.
|Zgt
i i
∩Zpre |
(
It also contains a structured venue database and annotated , if the predicted actions are correct
EM Ri = |Zgt
i
|
images. During the conversation, both the agent and the user
0, otherwise
can provide images to each other. The statistics of the dataset
is shown in Tab. I. The conversations between the user and i
where EM Ri is the EMR score of the i-th sample. The Zgt
the agent are designed based on real user settings. The goal of i
and Zpre is the ground truth entity set and the top-k predicted
the agent is to recommend the target venues to the user. The entity set for all actions, respectively.
dialogues are fully annotated with dialogue belief states of the IMR stands for dialogue-level item matching rate, which
user and tuples of the agent such as inform-price-cheap. evaluates the predicted venues against the ground-truth across
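For illustration, one annotated turn might carry a belief state and agent act tuples along the following lines. This is a hypothetical sketch: the field names and the exact MMConv annotation schema are assumptions, not the dataset's actual format.

```python
# Hypothetical sketch of one annotated turn; the exact MMConv schema may differ.
turn_annotation = {
    "user_utterance": "Any cheap restaurants around here?",
    "belief_state": {          # the user's accumulated constraints
        "domain": "food",
        "price": "cheap",
    },
    "agent_acts": [            # act tuples such as inform-price-cheap
        ("inform", "price", "cheap"),
        ("request", "location", None),
    ],
}

# An act tuple flattens into the annotation style quoted in the text.
flat = "-".join(turn_annotation["agent_acts"][0])
assert flat == "inform-price-cheap"
```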
TABLE I: Statistics of the Dataset

# Dials | # of Turns | # of venues | # of reviews | # of images
5,106   | 39,759     | 1,772       | 39,772       | 103,773

TABLE II: Statistics of the Dataset Split

Dataset | # of Dials | # of Turns | Avg. # of Turns
Train   | 3,500      | 26,869     | 7.677
Val     | 606        | 4,931      | 8.137
Test    | 1,000      | 7,959      | 7.959

2) Training Details: We split the dataset for training, validation and testing. To facilitate the investigation of the zero-shot situation, we split the dataset by the different goals of the dialogues. We ensure that there are no overlapping goals among the training, validation and testing datasets. The statistics are shown in Tab. II. The input to the action prediction model is tokenized with pretrained BPE codes [30] associated with DistilGPT2 [29]. We use the default hyperparameters for GPT-2 and DistilGPT2 in Huggingface Transformers [35]. Text sequences longer than 1024 tokens are truncated. For the reasoning part, the layer number L of the signed GCN is set to 2. The dimension of the node features in the signed GCN is 128 and the batch size is set to 64. The learning rate is set to 0.001. The initial representations of the entities in the MMKG and of the user in the state graph are set to random vectors. The maximum number of training epochs is 100. The maximum number of turns for online evaluation is set to 15. To define the image similarity threshold used when updating the MMKG via images, we apply a simple greedy search over the candidate thresholds {0.5, 0.7, 0.9}. All the parameters are tuned on the validation set.
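The greedy threshold search described above can be sketched as follows. This is a minimal illustration under assumptions: `validate_with_threshold` is a hypothetical stand-in for running validation with a given image similarity threshold, not a function from the paper's codebase.

```python
def pick_image_similarity_threshold(validate_with_threshold,
                                    candidates=(0.5, 0.7, 0.9)):
    """Return the candidate threshold with the best validation score."""
    best_t, best_score = None, float("-inf")
    for t in candidates:
        score = validate_with_threshold(t)  # e.g., a validation metric
        if score > best_score:
            best_t, best_score = t, score
    return best_t

# Toy validation function that peaks at 0.7.
assert pick_image_similarity_threshold(lambda t: -abs(t - 0.7)) == 0.7
```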
3) Evaluation Metrics: Our goal is to predict the dialogue act of the agent and also provide the detailed content of the act to the user, such as informing a slot value, requesting a slot or recommending a venue. We measure the performance of our model by offline and online evaluation similar to [8].

For offline evaluation, the Act Accuracy is the proportion of correctly predicted samples to the total samples. Considering that the result may contain more than one action, we regard it as correct when all the predicted actions are strictly equal to the ground truth, omitting the order of the actions. EMR stands for turn-level entity matching rate, which compares the predicted entities (slots, values, venues) against the annotated ones when the dialogue act is predicted correctly. Given the testing set with $R$ turns, the EMR is calculated as follows:

$$\mathrm{EMR}@k = \frac{1}{R}\sum_{i=1}^{R} \mathrm{EMR}_i,$$

$$\mathrm{EMR}_i = \begin{cases} \dfrac{|Z_{gt}^{i} \cap Z_{pre}^{i}|}{|Z_{gt}^{i}|}, & \text{if the predicted actions are correct} \\ 0, & \text{otherwise} \end{cases}$$

where $\mathrm{EMR}_i$ is the EMR score of the $i$-th sample, and $Z_{gt}^{i}$ and $Z_{pre}^{i}$ are the ground truth entity set and the top-$k$ predicted entity set for all actions, respectively.

IMR stands for dialogue-level item matching rate, which evaluates the predicted venues against the ground truth across all turns in a dialogue. For each dialogue, we maintain a venue set $J_{pre}^{i}$ which stores the top-1 predicted venue of the turns whose predicted actions contain "recommendation". The IMR is calculated as:

$$\mathrm{IMR} = \frac{1}{N}\sum_{i=1}^{N} \frac{|J_{gt}^{i} \cap J_{pre}^{i}|}{|J_{gt}^{i}|},$$

where $J_{gt}^{i}$ and $J_{pre}^{i}$ are the ground truth item set and the top-1 predicted item set, respectively.

For online evaluation, we use a user simulator to evaluate the recommendation performance like [8]. We simulate the interaction process between the user and the agent. The user randomly informs a slot-value at the first turn, and then the agent provides the response to the user based on the output of our model. After the multi-turn interactions, the dialogue finishes when the agent successfully recommends the target venues or the dialogue reaches the predefined number of turns. We use SR@t to measure the cumulative ratio of conversation completion by turn t of the dialogue. SR is mainly used to evaluate whether the user's ground truth items can be found quickly in an interactive scenario.

4) Baselines: Several baselines on conversational recommendation are used for comparison.

Max Entropy. This method designs rules to perform the actions of the dialogue. When generating questions, it always chooses the attribute that has the maximum entropy among the candidate attributes in each turn. The method makes a recommendation based on the number of candidate item sets with a certain probability.

Abs Greedy [6]. This method only performs recommendation: it recommends items in each turn until it makes a successful recommendation. It updates the model by taking the user-rejected items as negative samples. The method achieves equal or better performance than bandit algorithms such as Upper Confidence Bound [1] and Thompson Sampling [3].

SCPR [15]. It is a graph-based path reasoning method to model multi-turn conversational recommendation. It starts from the user vertex and then walks through the attribute vertices on the graph based on user feedback.

UMGR [37]. This method represents user preferences by a user memory graph and then applies graph reasoning to model multi-turn conversational recommendation. The dialogue acts are predicted based on the hidden states of the user memory graph.

UNICORN [7]. This method is a unified conversational recommendation policy learning method. The authors leverage a dynamic-weighted-graph-based RL method to capture dynamic user preferences and learn the action selection strategies at each conversation turn. They apply preference-based item selection and weighted entropy-based attribute selection strategies to obtain the detailed actions.
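To make the offline metric definitions concrete, a minimal sketch of the turn-level EMR and dialogue-level IMR computations could look as follows. The input shapes (per-turn ground-truth and top-k predicted entity sets, per-dialogue venue sets) are our assumptions for illustration, not the paper's actual data format.

```python
def emr_at_k(turns):
    """turns: list of (act_correct: bool, gt_entities: set, topk_pred: set).
    Averages the per-turn entity matching rate; a turn scores 0 when its
    dialogue acts were predicted incorrectly."""
    scores = [len(gt & pred) / len(gt) if act_correct else 0.0
              for act_correct, gt, pred in turns]
    return sum(scores) / len(scores)

def imr(dialogues):
    """dialogues: list of (gt_items: set, top1_pred_items: set).
    Averages the per-dialogue overlap between ground-truth venues and the
    top-1 venues predicted at 'recommendation' turns."""
    return sum(len(gt & pred) / len(gt) for gt, pred in dialogues) / len(dialogues)

turns = [(True, {"price", "area"}, {"price"}),   # correct acts: 1/2 entities matched
         (False, {"food"}, {"food"})]            # wrong acts: scores 0
assert emr_at_k(turns) == 0.25                   # (0.5 + 0.0) / 2
```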
B. Quantitative Results

1) Main Results: The performance of all methods on the dataset is presented in Tab. III. We can observe that our method outperforms the baselines on most of the metrics. From the results of the offline evaluations, we have the following findings. First of all, our model shows higher Act Accuracy compared to the other baselines. Our model is designed to be more suitable for natural dialogue scenarios: in each turn, it is able to generate several actions at the same time, whereas the baselines can only consider one action per turn. Moreover, our model also obtains better EMR and IMR, especially EMR@1 and EMR@3, which is a significant improvement over all the comparison algorithms. This effectively demonstrates the superiority of our method in offline prediction. Specifically, our model achieves a higher improvement under EMR@1. This indicates that our algorithm can effectively select the most appropriate slot, slot-value pair or venue while maintaining a high act accuracy. This means that our model is not only closer to the natural form of dialogue, but can also ask the most effective questions for the user.

Compared with our algorithm, the other algorithms have lower offline evaluation results. Max Entropy uses a simple rule-based protocol to select the actions of the dialogue with probability. Such a policy has high randomness and does not maintain the coherence of the dialogue. SCPR takes advantage of the structural information of the graph to effectively filter out some useless slot-values, so its EMR and IMR are higher than those of Max Entropy and Abs Greedy. However, compared with our model, SCPR has lower EMR@1, EMR@3 and IMR. That is because our model has better performance on slot and venue ranking. SCPR uses an algorithm similar to max entropy to sort the entities; it cannot well select the most appropriate slots and venues according to the current state of the dialogue. For UMGR, the ranking of entities is only related to the hidden states of the entities without considering the user information, which makes it less sensitive to the exact dialogue situation. The EMR@1 and IMR of UNICORN are higher than those of the other baselines. That is because UNICORN applies a dynamic weighted graph to capture the sequential information of the dialogue history, which is beneficial for learning the user's preferences on items. However, it still uses weighted entropy to select the candidate slots and values, similar to SCPR. It also does not consider the complex relationships among slots and values. Therefore, the performance of UNICORN is worse than that of our model.

From the online evaluation, it can be seen that SCPR achieves superior results to Max Entropy and Abs Greedy because SCPR uses reinforcement learning to learn a well-designed policy that is responsible for performing the appropriate act to interact with the user. Intuitively, reinforcement learning can make better use of feedback in the process of user interaction. The performance of UNICORN is relatively lower than that of the other baselines. The reason is that UNICORN applies weighted entropy to obtain the requested slots, which makes it more difficult to select the candidate items in fewer turns. Our model shows a significant improvement compared with the baselines. It indicates that our model can identify the ground truth item within a shorter number of dialogue turns, which shows that our design can adapt to more flexible interaction scenarios. We suspect that the multi-action prediction also contributes to the better performance over the other baselines.

2) Ablation Studies: (I) Analysis of signed links. To explore the complex relationships among the slots and values of the venues, we design positive links and negative links in the database MMKG and the user state graph. During reasoning, we leverage the signed GCN to learn node features which can better capture the signed relationships among different nodes. To demonstrate the effectiveness of the signed links, we perform an experiment on a variant model which transfers the signed links into unsigned links. We delete the negative links and construct unsigned graphs for the database and the user state graph. Then we apply a widely used GCN-based method named LightGCN [11] to learn the node representations. To be fair, the rest of our model remains the same. The result is shown in Fig. 4, where SGR LightGCN represents the variant model using LightGCN to learn the node features. We can observe that although there is no obvious difference in EMR between the two methods, our SGR performs much better than SGR LightGCN in IMR. It indicates that the signed links help recommend proper venues to the users. That is because the slots with exclusive values increase the number of paths between the user and the venues through positive and negative links. Thus, with the help of signed links, the relationship between the user and the venues is enhanced.

(II) Effectiveness of the ratio of negative links. We conduct experiments under different proportions of negative links for each slot. The proportion ranges from 0 to 1, where 0 means there are no negative links in the MMKG and 1 means that we leverage all the negative links of each slot in the MMKG. As shown in Fig. 5, with the proportion increasing, the performance first improves and then degrades at a larger proportion. We suspect that the negative links provide more evidence about users' preferences. Thus the introduction of negative links with a proper proportion will help the model do
user and the similar image. Note that the image is also linked with the venue it belongs to in the graph. In this way, we link the user and the candidate venue in the state graph. By leveraging the signed GCN, we integrate the information of the venue into the user node, which is beneficial for the model to better rank the entities as well as the venues.
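The image-based linking step can be sketched as follows. This is a hedged illustration under assumptions: image embeddings are taken as given, cosine similarity is used as the similarity measure, and the threshold follows the candidates tuned in the training details; the actual implementation may differ.

```python
import math

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def link_user_to_venues(user_image, venue_images, threshold=0.7):
    """venue_images: dict venue_id -> image embedding.
    Returns the venues whose images are similar enough to the user's
    image to add a user-venue link in the state graph."""
    return [vid for vid, emb in venue_images.items()
            if cosine(user_image, emb) >= threshold]

venues = {"v1": [1.0, 0.0], "v2": [0.0, 1.0]}
assert link_user_to_venues([1.0, 0.0], venues) == ["v1"]
```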
3) The performance of different domains: We conduct experiments on the performance of the different domains in the dataset used in our paper. The results are shown in Tab. IV.
C. Qualitative Results
reinforcement learning," in Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval, 2021, pp. 1431–1441.
[8] T. Derr, Y. Ma, and J. Tang, "Signed graph convolutional networks," in 2018 IEEE International Conference on Data Mining (ICDM), 2018, pp. 929–934.
[9] M. Firdaus, N. Thakur, and A. Ekbal, "Aspect-aware response generation for multimodal dialogue system," ACM Transactions on Intelligent Systems and Technology (TIST), vol. 12, no. 2, pp. 1–33, 2021.
[10] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 770–778.
[11] X. He, K. Deng, X. Wang, Y. Li, Y. Zhang, and M. Wang, "LightGCN: Simplifying and powering graph convolution network for recommendation," in Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval, 2020, pp. 639–648.
[12] E. Hosseini-Asl, B. McCann, C.-S. Wu, S. Yavuz, and R. Socher, "A simple language model for task-oriented dialogue," in Proceedings of the 34th International Conference on Neural Information Processing Systems, 2020, pp. 20179–20191.
[13] W. Lei, X. He, M. de Rijke, and T.-S. Chua, "Conversational recommendation: Formulation, methods, and evaluation," in Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval, 2020, pp. 2425–2428.
[14] W. Lei, X. He, Y. Miao, Q. Wu, R. Hong, M.-Y. Kan, and T.-S. Chua, "Estimation-action-reflection: Towards deep interaction between conversational and recommender systems," in Proceedings of the 13th International Conference on Web Search and Data Mining, 2020, pp. 304–312.
[15] W. Lei, G. Zhang, X. He, Y. Miao, X. Wang, L. Chen, and T.-S. Chua, "Interactive path reasoning on graph for conversational recommendation," in Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, 2020, pp. 2073–2083.
[16] R. Li, S. Kahou, H. Schulz, V. Michalski, L. Charlin, and C. Pal, "Towards deep conversational recommendations," in Proceedings of the 32nd International Conference on Neural Information Processing Systems, 2018, pp. 9748–9758.
[17] L. Liao, L. Kennedy, L. Wilcox, and T.-S. Chua, "Crowd knowledge enhanced multimodal conversational assistant in travel domain," in International Conference on Multimedia Modeling, 2020, pp. 405–418.
[18] L. Liao, L. H. Long, Z. Zhang, M. Huang, and T.-S. Chua, "MMConv: An environment for multimodal conversational search across multiple domains," in Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval, 2021, pp. 675–684.
[19] L. Liao, Y. Ma, X. He, R. Hong, and T.-S. Chua, "Knowledge-aware multimodal dialogue systems," in Proceedings of the 26th ACM International Conference on Multimedia, 2018, pp. 801–809.
[20] Y. Liu, H. Li, A. Garcia-Duran, M. Niepert, D. Onoro-Rubio, and D. S. Rosenblum, "MMKG: Multi-modal knowledge graphs," in European Semantic Web Conference, 2019, pp. 459–474.
[21] Z. Liu, H. Wang, Z.-Y. Niu, H. Wu, W. Che, and T. Liu, "Towards conversational recommendation over multi-type dialogs," in Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, 2020, pp. 1036–1049.
[22] S. Moon, S. Kottur, P. A. Crook, A. De, S. Poddar, T. Levin, D. Whitney, D. Difranco, A. Beirami, E. Cho et al., "Situated and interactive multimodal conversations," in Proceedings of the 28th International Conference on Computational Linguistics, 2020, pp. 1103–1121.
[23] S. Moon, P. Shah, A. Kumar, and R. Subba, "OpenDialKG: Explainable conversational reasoning with attention-based walks over knowledge graphs," in Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, 2019, pp. 845–854.
[24] H. Mousselly-Sergieh, T. Botschen, I. Gurevych, and S. Roth, "A multimodal translation-based approach for knowledge graph representation learning," in Proceedings of the Seventh Joint Conference on Lexical and Computational Semantics, 2018, pp. 225–234.
[25] L. Nie, W. Wang, R. Hong, M. Wang, and Q. Tian, "Multimodal dialog system: Generating responses via adaptive decoders," in Proceedings of the 27th ACM International Conference on Multimedia, 2019, pp. 1098–1106.
[26] P. Pezeshkpour, L. Chen, and S. Singh, "Embedding multimodal relational data for knowledge base completion," in Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, 2018, pp. 3208–3218.
[27] A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, I. Sutskever et al., "Language models are unsupervised multitask learners," OpenAI Blog, vol. 1, no. 8, pp. 1–24, 2019.
[28] A. Saha, M. Khapra, and K. Sankaranarayanan, "Towards building large scale multimodal domain-aware conversation systems," in Proceedings of the AAAI Conference on Artificial Intelligence, 2018, pp. 696–704.
[29] V. Sanh, L. Debut, J. Chaumond, and T. Wolf, "DistilBERT, a distilled version of BERT: Smaller, faster, cheaper and lighter," in the 5th Workshop on Energy Efficient Machine Learning and Cognitive Computing, 2019, pp. 1–5.
[30] R. Sennrich, B. Haddow, and A. Birch, "Neural machine translation of rare words with subword units," in Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, 2016, pp. 1715–1725.
[31] R. Sun, X. Cao, Y. Zhao, J. Wan, K. Zhou, F. Zhang, Z. Wang, and K. Zheng, "Multi-modal knowledge graphs for recommender systems," in Proceedings of the 29th ACM International Conference on Information & Knowledge Management, 2020, pp. 1405–1414.
[32] Y. Sun and Y. Zhang, "Conversational recommender system," in The 41st International ACM SIGIR Conference on Research & Development in Information Retrieval, 2018, pp. 235–244.
[33] M. Wang, H. Wang, G. Qi, and Q. Zheng, "Richpedia: A large-scale, comprehensive multi-modal knowledge graph," Big Data Research, vol. 22, p. 100159, 2020.
[34] Y. Wang, S. Qian, J. Hu, Q. Fang, and C. Xu, "Fake news detection via knowledge-driven multimodal graph convolutional networks," in Proceedings of the 2020 International Conference on Multimedia Retrieval, 2020, pp. 540–547.
[35] T. Wolf, J. Chaumond, L. Debut, V. Sanh, C. Delangue, A. Moi, P. Cistac, M. Funtowicz, J. Davison, S. Shleifer et al., "Transformers: State-of-the-art natural language processing," in Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, 2020, pp. 38–45.
[36] R. Xie, Z. Liu, H. Luan, and M. Sun, "Image-embodied knowledge representation learning," in Proceedings of the 26th International Joint Conference on Artificial Intelligence, 2017, pp. 3140–3146.
[37] H. Xu, S. Moon, H. Liu, B. Liu, P. Shah, and P. S. Yu, "User memory reasoning for conversational recommendation," in Proceedings of the 28th International Conference on Computational Linguistics, 2020, pp. 5288–5308.
[38] Y. Zhang, Z. Ou, and Z. Yu, "Task-oriented dialog systems that consider multiple appropriate responses under the same context," in Proceedings of the AAAI Conference on Artificial Intelligence, 2020, pp. 9604–9611.
[39] Y. Zhang, X. Chen, Q. Ai, L. Yang, and W. B. Croft, "Towards conversational search and recommendation: System ask, user respond," in Proceedings of the 27th ACM International Conference on Information and Knowledge Management, 2018, pp. 177–186.
[40] K. Zhou, W. X. Zhao, S. Bian, Y. Zhou, J.-R. Wen, and J. Yu, "Improving conversational recommender systems via knowledge graph based semantic fusion," in Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, 2020, pp. 1006–1014.
[41] K. Zhou, Y. Zhou, W. X. Zhao, X. Wang, and J.-R. Wen, "Towards topic-guided conversational recommender system," in Proceedings of the 28th International Conference on Computational Linguistics, 2020, pp. 4128–4139.
[42] J. Zou, Y. Chen, and E. Kanoulas, "Towards question-based recommender systems," in Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval, 2020, pp. 881–890.

Xueming Qian (M'10) received the B.S. and M.S. degrees from the Xi'an University of Technology, Xi'an, China, in 1999 and 2004, respectively, and the Ph.D. degree in electronics and information engineering from Xi'an Jiaotong University, Xi'an, China, in 2008. He was a Visiting Scholar with Microsoft Research Asia, Beijing, China, from 2010 to 2011. He was previously an Assistant Professor at Xi'an Jiaotong University, where he was an Associate Professor from 2011 to 2014, and is currently a Full Professor. He is also the Director of the Smiles Laboratory, Xi'an Jiaotong University. His research interests include social media big data mining and search.
Yuxia Wu received the B.S. degree from Zhengzhou University, Henan, China, in 2014, the M.S. degree from the Fourth Military Medical University, Xi'an, China, in 2017, and is currently working toward the Ph.D. degree at Xi'an Jiaotong University, Xi'an, China. She is now a visiting student at the National University of Singapore. Her research interests include social multimedia mining, recommender systems and natural language processing.

Lizi Liao is an assistant professor with Singapore Management University. She received the Ph.D. degree in 2019 from the NUS Graduate School for Integrative Sciences and Engineering at the National University of Singapore. Her research interests include conversational systems, multimedia analysis and recommendation. Her works have appeared in top-tier conferences such as MM, WWW, ICDE, ACL, IJCAI and AAAI, and top-tier journals such as TKDE. She received the Best Paper Award Honorable Mention of ACM MM 2018. Moreover, she has served as a PC member for international conferences including SIGIR, WSDM and ACL, and as an invited reviewer for journals including TKDE, TMM and KBS.

Tat-Seng Chua is the KITHCT Chair Professor at the School of Computing, National University of Singapore. He was the Acting and Founding Dean of the School from 1998 to 2000. Dr Chua's main research interest is in multimedia information retrieval and social media analytics. In particular, his research focuses on the extraction, retrieval and question-answering (QA) of text and rich media arising from the Web and multiple social networks. He is the co-Director of NExT, a joint Center between NUS and Tsinghua University to develop technologies for live social media search.
Dr Chua is the 2015 winner of the prestigious ACM SIGMM award for Outstanding Technical Contributions to Multimedia Computing, Communications and Applications. He is the Chair of the steering committee of the ACM International Conference on Multimedia Retrieval (ICMR) and the Multimedia Modeling (MMM) conference series. Dr Chua is also the General Co-Chair of ACM Multimedia 2005, ACM CIVR (now ACM ICMR) 2005, ACM SIGIR 2008, and ACM Web Science 2015. He serves on the editorial boards of four international journals. Dr. Chua is the co-Founder of two technology startup companies in Singapore. He holds a PhD from the University of Leeds, UK.