Deep Reinforcement Learning Framework For Category-Based Item Recommendation
Abstract—Deep reinforcement learning (DRL)-based recommender systems have recently come into the limelight due to their ability to optimize long-term user engagement. A significant challenge in DRL-based recommender systems is the large action space required to represent a variety of items. The large action space weakens the sampling efficiency and thereby affects the recommendation accuracy. In this article, we propose a DRL-based method called deep hierarchical category-based recommender system (DHCRS) to handle the large action space problem. In DHCRS, categories of items are used to reconstruct the original flat action space into a two-level category-item hierarchy. DHCRS uses two deep Q-networks (DQNs): 1) a high-level DQN for selecting a category and 2) a low-level DQN to choose an item in this category for the recommendation. Hence, the action space of each DQN is significantly reduced. Furthermore, the categorization of items helps capture the users' preferences more effectively. We also propose a bidirectional category selection (BCS) technique, which explicitly considers the category-item relationships. The experiments show that DHCRS can significantly outperform state-of-the-art methods in terms of hit rate and normalized discounted cumulative gain for long-term recommendations.

Index Terms—Deep reinforcement learning (DRL), hierarchy, large action space, recommender system.

Manuscript received 15 July 2020; revised 17 April 2021; accepted 11 June 2021. Date of publication 16 August 2021; date of current version 17 October 2022. This work was supported in part by the China Postdoctoral Science Foundation under Grant 2020M67318; in part by the National Key Research and Development Program of China under Grant 2018AAA0100202; and in part by the National Science Foundation of China under Grant 61976043. This article was recommended by Associate Editor B. Ribeiro. (Corresponding author: Hong Qu.)
Mingsheng Fu is with the School of Computer Science and Engineering, University of Electronic Science and Technology of China, Chengdu 610054, China, and also with the School of Electrical and Electronic Engineering, Nanyang Technological University, Singapore (e-mail: [email protected]).
Anubha Agrawal and Athirai A. Irissappane are with the School of Engineering and Technology, University of Washington, Tacoma, WA 98402 USA (e-mail: [email protected]; [email protected]).
Jie Zhang is with the School of Computer Science and Engineering, Nanyang Technological University, Singapore (e-mail: [email protected]).
Liwei Huang and Hong Qu are with the School of Computer Science and Engineering, University of Electronic Science and Technology of China, Chengdu 610054, China (e-mail: [email protected]; [email protected]).
Color versions of one or more figures in this article are available at https://doi.org/10.1109/TCYB.2021.3089941.
Digital Object Identifier 10.1109/TCYB.2021.3089941

I. INTRODUCTION

Recommender systems play a significant role in recommending products or items to users. Essentially, a recommendation is a process of continuous interaction between a recommender system and a user. The overall recommendation accuracy over all interactions (also called the long-term reward) is crucial in designing these recommender systems. Most recommender systems function by estimating the correlation between the user and the items, measured using immediate user feedback, such as ratings or clicks, which represents the immediate rewards. However, optimizing a system only using such immediate feedback cannot guarantee the maximization of the overall recommendation accuracy or long-term rewards [1].

Recently, reinforcement learning (RL) techniques, specifically deep RL (DRL) algorithms, have been used in recommender systems. DRL works to maximize the long-term rewards, making it suitable for the problem described above. DRL uses a neural network to approximate the optimal policy and has been shown to scale to large state spaces. In most existing DRL-based recommender systems, the process of continuous recommendation between the system and the user is viewed as a Markov decision process (MDP). Here, the system is an RL agent, and the user is the environment. The system recommends an item, also referred to as an action, based on the user's current state. The current state is then updated based on the user's feedback about the item. The recommender system is trained to maximize the long-term return, for example, the number of recommended products clicked over a period of time [2]. Depending on the adopted algorithm, DRL-based recommender systems can be generally divided into value-based [3], [4] and policy-based [5]–[7] methods. In particular, the deep Q-network (DQN) [8] is the most popular value-based DRL method. In DQN, the policy is deterministic, and the selected action is the one with the largest Q-value. Here, the Q-value represents the quality of performing an action at a given state, and it is approximated using a neural network.

One of the key challenges faced by DRL-based recommender systems is a large and discrete action space. In most recommender systems, the actions representing the items are at least in the order of thousands, if not millions. This is not the case for most other domains, where the size of the action space rarely exceeds a few hundred or thousand [8], [9]. As a result, DRL-based recommender systems require more training samples to achieve statistical efficiency [10]. A large action space can also easily lead the learning algorithm to converge to a suboptimal policy [11]. These issues can significantly affect efficiency and even limit prediction accuracy.
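To illustrate the greedy value-based policy just described, the following minimal sketch (our own illustrative code, not from the paper) picks the action with the largest estimated Q-value; the Q-values are toy numbers.

```python
# Minimal sketch: greedy action selection in a value-based DRL recommender,
# where the policy picks the item with the largest estimated Q-value for the
# current user state. The Q-values below are made-up toy numbers.
import numpy as np

def greedy_action(q_values: np.ndarray) -> int:
    """Return the index of the action (item) with the largest Q-value."""
    return int(np.argmax(q_values))

q = np.array([0.12, -0.30, 0.85, 0.40, 0.07])  # Q-values for five toy items
print(greedy_action(q))  # -> 2
```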
To handle the issue of large action space, we propose the deep hierarchical category-based recommender system (DHCRS). In DHCRS, the original action space is reconstructed as a two-level hierarchical action space based on the category-item relationship. In this hierarchy, there are several category nodes at the higher level, and each category node links to some item nodes at the lower level. The edge between a category node and an item node indicates that the item belongs to that category. A two-level DQN architecture is used: the high-level DQN selects the recommended category, and the low-level DQN chooses the recommended item in the selected category. This reduces the size of the action space for both the high-level and low-level DQNs. The size of the action space for the high-level DQN is the total number of categories. Similarly, the size of the action space for the low-level DQN is reduced to the number of items belonging to each category.

Another advantage of considering category information is that the search space of items can be effectively constrained by identifying the general preference of a user for a particular category. Intuitively, most users are only interested in a limited number of categories. By heuristically shortlisting the candidates into a specific category, the items preferred by the user can be determined easily. To select the preferred category effectively, capturing the relationship between the category and the item is important. A user's preference for a category is directly correlated with the user's preferences for the items in this category. For example, a user's preference for the Star Wars series may hint at the user's interest in the SCI-FI category. For this, we propose a bidirectional category selection (BCS) strategy that considers both the user's category and item preferences.

The experiments on four real-world benchmarks show that DHCRS can significantly improve recommendation accuracy. The experimental results also demonstrate that our model can outperform state-of-the-art DRL-based recommender systems in terms of hit rate (HR) and normalized discounted cumulative gain (NDCG). In summary, the main contributions of this work are as follows: 1) we propose DHCRS, a deep hierarchical category-based recommender system, which addresses the problem of large action space using a two-level DQN based on the category-item relationship; 2) to effectively model the category-item relationship, we propose a bidirectional decision-making strategy for selecting the preferred category as well as the item, along with an auxiliary learning algorithm to improve the effectiveness of category selection; and 3) to demonstrate the effectiveness of the proposed model, we compare the long-term recommendation accuracy of DHCRS with state-of-the-art methods in terms of HR and NDCG. The experiments are conducted on four datasets: MovieLens (ML) 1M, 10M, and 20M, and Netflix. We evaluate the results on each user and then report the average over all users. The experimental results show that the proposed model outperforms the best baseline by an average of 3.07% and 1.74% in terms of HR and NDCG, respectively.

The remainder of this article is organized as follows. In Section II, we briefly review the related literature. We present the details of our proposed DHCRS method in Section III. We discuss the findings and show the experiment results in Section IV. Finally, Section V presents the conclusion.

II. RELATED WORK

The collaborative filtering (CF) methods using matrix factorization [12], [13] and deep learning [14]–[17] have dominated recommender systems in the last decade. For example, [14] uses a feedforward neural network to make recommendations, [15] employs a recurrent neural network for sequential recommendation, and [17] studies an effective embedding learning algorithm through a generative adversarial network. Besides, many works also study the utilization of category information, such as involving the category hierarchy as side information [18] and constructing category-aware transitions [19], [20]. Specifically, similar to our framework, the models with category-aware transitions use the idea of divide and conquer: they first predict the category and then derive the recommendation result based on the predicted category. However, traditional CF methods, including the ones with category-aware transitions, treat recommendations as one-shot predictions such that the results of previous recommendations do not affect subsequent recommendations. Unlike these traditional methods, most emerging DRL-based methods treat continuous recommendations as an MDP [21]. They formalize an interactive procedure between a user and the recommendation agent, where the user's feedback for the last recommendation can significantly affect the result of the next recommendation.

Existing DRL-based recommender systems focus on topics such as state representation [22], [23]; choice of RL algorithm [3], [6], [24]; simulation environments [25], [26]; and top-N recommendation [1], [27]. Some works also try to employ hierarchical RL (HRL) in recommender systems [28], [29]. Zhao et al. [28] described a typical application of traditional HRL with temporal abstraction [30]: there is a high-level agent with a long-term goal, and a low-level agent performs a series of actions to achieve this goal. In [29], the DRL-related module is not directly correlated to the final item recommendation; instead, it is used as a filtering component for traditional memory-based or sequence-based CF models. It uses a two-level hierarchy, where the high-level agent decides whether to filter historical items, and the low-level agent decides which historical items to use. Similar to these methods, our proposed framework also uses HRL to build the recommendation agent. However, the difference between our framework and the above methods is that our motivation is to optimize the action selection procedure at each step. Besides, our action space varies based on user preferences, while their action space is always fixed.

There are few works that attempt to address the large action space problem in recommender systems, and they are mainly based on the policy-gradient algorithm [6], [31]. In [31], the discrete action space is first converted into a continuous action space. Then, the deep deterministic policy gradient (DDPG) [32], a continuous-action policy-gradient method, is applied to generate an optimal continuous action. Finally, the discrete action for recommendation is retrieved according to the generated continuous action using an approximate nearest-neighbor search.
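As a concrete illustration of the continuous-to-discrete conversion described for [31], the sketch below retrieves the discrete item whose embedding is nearest to a continuous "proto-action"; the embedding matrix, its dimensionality, and the proto-action are illustrative assumptions, not the setup used in [31].

```python
# Illustrative sketch of the retrieval step in DDPG-KNN-style methods [31]:
# the actor outputs a continuous proto-action, and the recommended discrete
# item is the catalog item whose embedding is nearest to it. All values here
# are toy data; this is not the implementation used in [31].
import numpy as np

def nearest_item(proto_action: np.ndarray, item_embeddings: np.ndarray) -> int:
    """Return the index of the item embedding closest to the proto-action."""
    distances = np.linalg.norm(item_embeddings - proto_action, axis=1)
    return int(np.argmin(distances))

item_embeddings = np.random.RandomState(0).randn(1000, 8)  # 1000 items, dim 8
proto_action = np.random.RandomState(1).randn(8)            # actor output
print(nearest_item(proto_action, item_embeddings))
```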
TABLE I
MAIN NOTATIONS IN THIS ARTICLE
from the user about i_{u,t} is sent back to the agent, the state is updated to s_{u,t+1} during the next time step t+1, and a new item i_{u,t+1} is recommended. The difference between our method and traditional DRL-based methods is the inclusion of the category information, where the category is determined before recommending an item, as illustrated in Fig. 1. We formalize the process of recommending an item as an MDP denoted by a tuple (S, I, C, T, R, γ), representing the state space S, two different action spaces I and C, the transition function T, the item reward R, and the discount factor γ.

I_{u,t} denotes the set of candidate items for user u at time step t. Similar to [6], we do not allow an item to be recommended repeatedly. The item set for user u in the first time step consists of the original item set, that is, I_{u,1} = I. Then, for every next time step t+1, the candidate set becomes I_{u,t+1} = I_{u,t} \ {i_{u,t}}.

Transition: T is the function to obtain the new state s_{u,t+1} given the previous state s_{u,t} after recommending item i_{u,t}. Specifically, s_{u,t+1} = [H_{u,t+1}, F_{u,t+1}] is updated with item i_{u,t} such that H_{u,t+1} = H_{u,t} ∪ {i_{u,t}} and F_{u,t+1} = F_{u,t} ∪ {f_{u,t}}. Initially, H_{u,0} = ∅ and F_{u,0} = ∅.
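The following minimal sketch (our own illustration, not the authors' code) shows the bookkeeping implied by the transition T and the candidate item set above, together with the reward scheme used later in the experiments (+1 for positive feedback, −0.2 for negative feedback).

```python
# Minimal sketch of the MDP bookkeeping described above (illustrative only):
# the state is the pair (H, F) of recommended items and their feedback, and
# the candidate item set shrinks so no item is recommended twice.
def step(state, candidates, item, feedback):
    """Apply the transition T: record (item, feedback) and remove the item
    from the candidate set. `feedback` is assumed boolean (True = positive)."""
    H, F = state
    new_state = (H + [item], F + [feedback])
    new_candidates = candidates - {item}
    # Reward scheme used in the experiments: +1 for positive, -0.2 for negative.
    reward = 1.0 if feedback else -0.2
    return new_state, new_candidates, reward

state, candidates = ([], []), {"i1", "i2", "i3"}
state, candidates, r = step(state, candidates, "i2", True)
print(state, candidates, r)  # (['i2'], [True]) {'i1', 'i3'} 1.0
```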
Fig. 2. Illustration of the category-item hierarchy, where the red lines indicate a path to reach the recommended item 5 through category 3 using a two-step selection process, that is, DQN^c and DQN^i.

Reward: R(s_{u,t}, i_{u,t}) is an immediate reward function to evaluate the quality of the recommended item i_{u,t} at state s_{u,t}. The value of R(s_{u,t}, i_{u,t}) is directly related to the feedback f_{u,t}, where a positive feedback results in a positive reward and a negative feedback in a negative reward.

Discount Factor: γ ∈ [0, 1) is a factor to balance the immediate reward and the future cumulative rewards.

B. Two-Level DQN Hierarchy

To avoid choosing directly from all items, we convert the flat action space of items into a two-level hierarchical structure based on the category-item relationship. In this two-level hierarchy, the high-level nodes denote the categories, and the low-level nodes represent the items. The edge between a high-level category node and a low-level item node indicates that the item belongs to the category. To select an item, one first needs to choose a category and then select an item from the chosen category. The primary motivation for using such a category-item hierarchy is to obtain a smaller action space and to model user preferences for the different categories. We use two separate DQNs, that is: 1) DQN^c for category selection and 2) DQN^i for item selection within the chosen category. The two-level DQN hierarchy is illustrated in Fig. 2, and the details are given as follows.

1) Low-Level Item DQN: DQN^i(s_{u,t}, i_{u,t}; I, θ^i) is the low-level DQN for selecting items. We divide the network architecture of DQN^i(s_{u,t}, i_{u,t}; I, θ^i) into two major components: 1) the state extraction network shown in Fig. 3(a) and 2) the Q-network shown in Fig. 3(b). The state extraction network Snet(s_{u,t}; θ^i) is responsible for obtaining a latent representation of a given state. Like [3], we employ two separate RNNs with gated recurrent units (GRUs) to handle the historical items with positive and negative feedback, respectively. First, the historical items H_{u,t} are divided into two subsequences based on positive and negative feedback. These subsequences are ordered by time such that PI_{u,t} consists of items with positive feedback, PI_{u,t} = {pi_{u,1}, pi_{u,2}, ..., pi_{u,N}}, and NI_{u,t} consists of items with negative feedback, NI_{u,t} = {ni_{u,1}, ni_{u,2}, ..., ni_{u,N}}. We adopt RNNs with fixed length T (T = 10 in this article). Incomplete PI_{u,t} or NI_{u,t} vectors are filled with zeros. When PI_{u,t} or NI_{u,t} contain more than T elements, the newer items are used. The overall procedure to obtain the high-level representation h_{u,t} is shown in (1) as

    ph_{u,t} = GRU_p(PI_{u,t})
    nh_{u,t} = GRU_n(NI_{u,t})
    h_{u,t} = Concat(ph_{u,t}, nh_{u,t})    (1)

where GRU_p(PI_{u,t}) and GRU_n(NI_{u,t}) are RNNs with GRU cells for the positive sequence PI_{u,t} and the negative sequence NI_{u,t}, respectively. As shown in Fig. 3(a), for GRU_p(PI_{u,t}), the RNN input at each time step is the embedding of item pi_{u,p}, and the output ph_{u,t} is the RNN state vector of the last time step. GRU_n(NI_{u,t}) is defined similarly. We concatenate ph_{u,t} and nh_{u,t} to obtain h_{u,t}.

The Q-network Qnet(h_{u,t}, i_{u,t}; I, θ^i) takes the extracted representation as input and predicts the Q-values for the actions. We feed the state h_{u,t} from Snet(s_{u,t}; θ^i), as shown in Fig. 3(b), to acquire the Q-value Q^i_{DQN^i}(s_{u,t}, i_{u,t}; θ^i) of item i_{u,t} in a given action set I

    Q^i_{DQN^i}(s_{u,t}, i_{u,t}; θ^i) = Qnet(h_{u,t}, i_{u,t}; I, θ^i)
                                      = Qnet(Snet(s_{u,t}; θ^i), i_{u,t}; I, θ^i).    (2)

2) High-Level Category DQN: DQN^c(s_{u,t}, c_{u,t}; C, θ^c) is the high-level DQN used for selecting category c_{u,t}. It uses the same neural network architecture shown in Fig. 3, but with the category action space C instead of the item action space I. In addition, it has different learnable parameters θ^c. The Q-value Q^c_{DQN^c}(s_{u,t}, c_{u,t}; θ^c) for a category c_{u,t} is denoted as

    Q^c_{DQN^c}(s_{u,t}, c_{u,t}; θ^c) = Qnet(h_{u,t}, c_{u,t}; C, θ^c)
                                       = Qnet(Snet(s_{u,t}; θ^c), c_{u,t}; C, θ^c).    (3)

Once the category c_{u,t} is selected (using the strategies mentioned below), the action space I^c_{u,t} for the low-level item DQN is determined, that is, I^c_{u,t} = I_{u,t} ∩ I^c. The recommended item i_{u,t} with the maximum Q-value is obtained as follows:

    i_{u,t} = argmax_{j ∈ I^c_{u,t}} Q^i_{DQN^i}(s_{u,t}, j; θ^i).    (4)

C. Category Selection Strategies

We introduce three kinds of category selection strategies: 1) strategy I; 2) strategy II; and 3) the BCS strategy. For the reader's convenience, in this section, we omit the learnable parameters, for example, θ^i in Q^i_{DQN^i}(s_{u,t}, i_{u,t}; θ^i) and θ^c in Q^c_{DQN^c}(s_{u,t}, c_{u,t}; θ^c), and the subscripts for state, item, and category, that is, s ← s_{u,t}, i ← i_{u,t}, and c ← c_{u,t}. We also denote the next state s_{u,t+1} as s̃.

1) Strategy I: It uses the category network DQN^c to choose the category c with the largest Q-value

    c = argmax_{j ∈ C} Q^c_{DQN^c}(s, j).    (5)

2) Strategy II: It uses the item network DQN^i to select the category with the largest Q-value. The Q-function Q^i_{DQN^i}(s, i) for the item DQN^i is given by

    Q^i_{DQN^i}(s, i) = R(s, i) + γ Σ_{s̃} P(s̃ | s, i) V(s̃)    (6)

where P(s̃ | s, i) is the transition probability from state s to state s̃ by executing action i, V(s) is the value function representing the quality of the overall state s, and R(s, i) is the reward function for recommended item i.
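To make the state extraction and Q-network structure in (1)–(3) concrete, here is a compact PyTorch sketch under illustrative assumptions (layer sizes, a single shared Q-head); it is not the exact architecture of Fig. 3.

```python
# Illustrative PyTorch sketch of the Snet/Qnet idea in (1)-(3): two GRUs encode
# the positively and negatively rated item histories, their final hidden states
# are concatenated into h_{u,t}, and a small Q-head scores every action.
import torch
import torch.nn as nn

class SnetQnet(nn.Module):
    def __init__(self, num_items, num_actions, emb_dim=64, rnn_dim=64):
        super().__init__()
        self.item_emb = nn.Embedding(num_items + 1, emb_dim, padding_idx=0)
        self.gru_pos = nn.GRU(emb_dim, rnn_dim, batch_first=True)
        self.gru_neg = nn.GRU(emb_dim, rnn_dim, batch_first=True)
        self.q_head = nn.Sequential(
            nn.Linear(2 * rnn_dim, 128), nn.ReLU(), nn.Linear(128, num_actions)
        )

    def forward(self, pos_hist, neg_hist):
        # pos_hist, neg_hist: (batch, T) item ids, zero-padded to length T.
        _, ph = self.gru_pos(self.item_emb(pos_hist))   # ph: (1, batch, rnn_dim)
        _, nh = self.gru_neg(self.item_emb(neg_hist))
        h = torch.cat([ph[-1], nh[-1]], dim=-1)          # h_{u,t} as in (1)
        return self.q_head(h)                            # Q-values for all actions

# The same module can be instantiated twice: with num_actions = |I| for DQN^i
# and with num_actions = |C| for DQN^c.
net = SnetQnet(num_items=1000, num_actions=18)
q = net(torch.randint(1, 1001, (2, 10)), torch.randint(1, 1001, (2, 10)))
print(q.shape)  # torch.Size([2, 18])
```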
We also derive the Q-value function and the transition function based on the category. Due to the relationship between categories and items, R^c(s, c) and P^c(s̃ | s, c) can be defined as

    R^c(s, c) = Σ_{i ∈ I^c} p(i | s, c) R(s, i)    (8)

    P^c(s̃ | s, c) = Σ_{i ∈ I^c} p(i | s, c) P(s̃ | s, i)    (9)

where p(i | s, c) is the probability of selecting item i at state s. Using (8) and (9), (7) can be rewritten such that Q^c_{DQN^i}(s, c) can be directly obtained from the Q-values of the items belonging to category c

    Q^c_{DQN^i}(s, c) = R^c(s, c) + γ Σ_{s̃} P^c(s̃ | s, c) V(s̃)
                      = Σ_{i ∈ I^c} p(i | s, c) [ R(s, i) + γ Σ_{s̃} P(s̃ | s, i) V(s̃) ]
                      = Σ_{i ∈ I^c} p(i | s, c) Q^i_{DQN^i}(s, i).    (10)

Similar to [1], a learnable conditional choice model is adopted to approximate p(i | s, c): it is based on a score function v(s, i; θ^p) with learnable parameters θ^p, so that p(i | s, c) ∝ e^{v(s, i; θ^p)}, normalized over the items in category c.

The BCS strategy combines the two strategies above. The combined category score Q^c(s, c) is computed as

    Q^c(s, c) = α · e^{Q^c_{DQN^c}(s, c)} / Σ_{j ∈ C} e^{Q^c_{DQN^c}(s, j)}
              + (1 − α) · e^{Q^c_{DQN^i}(s, c)} / Σ_{j ∈ C} e^{Q^c_{DQN^i}(s, j)}.    (15)

Here, α ∈ [0, 1] is a scalar to balance the effect between the Q-values Q^c_{DQN^c}(s, c) and Q^c_{DQN^i}(s, c), obtained from strategy I and strategy II, respectively. The detailed BCS approach is described in Algorithm 1. We first compute Q^c_{DQN^i}(s, c) using (10) (lines 1–3). Then, Q^c(s, c) is computed as a combination of Q^c_{DQN^c}(s, c) and Q^c_{DQN^i}(s, c) (lines 4–6). Finally, the category c with the largest Q^c(s, c) value and the item i with the largest Q^i_{DQN^i}(s, i) value in the chosen category c are selected (lines 7 and 8).
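The following sketch illustrates the BCS selection logic of Algorithm 1 as described above; the softmax normalization of the two category scores in (15) and all numeric values are illustrative assumptions, not the authors' implementation.

```python
# Illustrative sketch of the BCS selection step (our own code): category scores
# from the two strategies are normalized, mixed with the scalar alpha, the best
# category is chosen, and the item with the largest Q-value inside that
# category is recommended.
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

def bcs_select(q_cat_dqnc, q_cat_dqni, q_items, item_category, alpha=0.5):
    """q_cat_dqnc / q_cat_dqni: per-category scores from strategy I / II
    (the latter, e.g., computed from item Q-values via (10));
    q_items: per-item Q-values from DQN^i; item_category: category id per item."""
    combined = alpha * softmax(q_cat_dqnc) + (1.0 - alpha) * softmax(q_cat_dqni)
    c = int(np.argmax(combined))                              # high-level choice
    candidates = np.where(item_category == c)[0]              # items in category c
    item = int(candidates[np.argmax(q_items[candidates])])    # low-level choice
    return c, item

item_category = np.array([0, 0, 1, 1, 2])                     # toy catalog
q_items = np.array([0.2, 0.9, 0.1, 0.4, 0.3])
q_cat_dqnc = np.array([0.5, 0.1, 0.2])
q_cat_dqni = np.array([0.4, 0.3, 0.1])
print(bcs_select(q_cat_dqnc, q_cat_dqni, q_items, item_category))  # (0, 1)
```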
D. Training Procedure for DHCRS

Our overall model requires the training of three components: 1) DQN^c(s_{u,t}, c_{u,t}; C, θ^c) for selecting the category; 2) DQN^i(s_{u,t}, i_{u,t}; I, θ^i) for selecting the item; and 3) the score function v for estimating the probability p.

1) Training DQN^c(s_{u,t}, c_{u,t}; C, θ^c): The learnable parameters θ^c are trained via Q-learning by minimizing a temporal-difference error L^c(θ^c).
The category reward used in the training target is expanded, following (8), as

    R^c(s_{u,t}, c_{u,t}) = p(i_{u,t} | s_{u,t}, c_{u,t}) R(s_{u,t}, i_{u,t})
                          + Σ_{j ∈ I^c_{u,t} \ {i_{u,t}}} p(j | s_{u,t}, c_{u,t}) R̂(s_{u,t}, j)    (18)

where R̂(s_{u,t}, j) denotes the approximated reward. We assume that similar items may receive similar feedback from the same user. Therefore, R̂(s_{u,t}, j) can be obtained based on the similarity function sim(j, i_{u,t}) between the item j and the recommended item i_{u,t} as follows:

    R̂(s_{u,t}, j) = sim(j, i_{u,t}) R(s_{u,t}, i_{u,t})    (19)

where sim(j, i_{u,t}) ∈ [0, 1] is the normalized Pearson correlation coefficient [34] between items j and i_{u,t}. Finally, θ^c is updated using gradient descent with learning rate η

    θ^c = θ^c − η ∇_{θ^c} L^c(θ^c).    (20)

2) Training DQN^i(s_{u,t}, i_{u,t}; I, θ^i): This includes two separate learning procedures: 1) the regular Q-learning method for the recommended item (with rewards) and 2) an auxiliary learning method for the other items (to tackle the large action space). Similar to DQN^c(s_{u,t}, c_{u,t}; C, θ^c), the loss function is the temporal-difference error L^i(θ^i)

    L^i(θ^i) = E_{(s_{u,t}, c_{u,t}, i_{u,t}, f_{u,t}, s_{u,t+1}) ∼ D} [ ( y^i_{u,t} − Q^i_{DQN^i}(s_{u,t}, i_{u,t}; θ^i) )^2 ].    (21)

The target y^i_{u,t} is given by

    y^i_{u,t} = R(s_{u,t}, i_{u,t}) + γ max_{j ∈ I^c_{u,t}} Q^i_{DQN^i}(s_{u,t+1}, j; θ̂^i).    (22)

Finally, the following update rule is used:

    θ^i = θ^i − η ∇_{θ^i} L^i(θ^i).    (23)

During the above training process using Q-learning, sufficient sampling of state–item pairs is necessary. However, when the number of training interactions is limited, it cannot sufficiently cover all possible state–item pairs, especially when the action space is large. As a result, the unseen state–item pairs are erroneously estimated to have unrealistic values [35]. This problem can affect the BCS strategy, as Q^c_{DQN^i}(s_{u,t}, c) for a category c depends on the Q-values of all items in the category. To alleviate this problem, we propose an auxiliary learning algorithm for the items other than the recommended item. Supposing that the Q-value Q^c_{DQN^c}(s_{u,t}, c_{u,t}; θ^c) from DQN^c for category c_{u,t} is given, Q^c_{DQN^i}(s_{u,t}, c_{u,t}; θ^i) should be close to it, since both of them are related to the Q-value of category c_{u,t}. Based on this assumption, the objective of the auxiliary learning is to minimize the difference between Q^c_{DQN^c}(s_{u,t}, c_{u,t}; θ^c) and Q^c_{DQN^i}(s_{u,t}, c_{u,t}; θ^i) as follows:

    Q^c_{DQN^i}(s_{u,t}, c_{u,t}; θ^i) − Q^c_{DQN^c}(s_{u,t}, c_{u,t}; θ^c)
      = Σ_{j ∈ I^c_{u,t}} p(j | s_{u,t}, c_{u,t}) Q^i_{DQN^i}(s_{u,t}, j; θ^i) − Q^c_{DQN^c}(s_{u,t}, c_{u,t}; θ^c)
      = Σ_{j ∈ I^c_{u,t}, j ≠ i_{u,t}} p(j | s_{u,t}, c_{u,t}) Q^i_{DQN^i}(s_{u,t}, j; θ^i) − ŷ^i_{u,t}    (24)

with the training target

    ŷ^i_{u,t} = Q^c_{DQN^c}(s_{u,t}, c_{u,t}; θ^c) − p(i_{u,t} | s_{u,t}, c_{u,t}) Q^i_{DQN^i}(s_{u,t}, i_{u,t}; θ^i).

Note that auxiliary learning is only applied to the items in the same category c_{u,t} as the recommended item i_{u,t} (excluding it). Therefore, the Q-value Q^i_{DQN^i}(s_{u,t}, i_{u,t}; θ^i) of the recommended item i_{u,t} is only involved as a part of the target and does not generate a gradient. The loss function for auxiliary learning L̂^i(θ^i) is defined by

    L̂^i(θ^i) = E_{(s_{u,t}, c_{u,t}, i_{u,t}, f_{u,t}, s_{u,t+1}) ∼ D}
               [ ( ŷ^i_{u,t} − Σ_{j ∈ I^c_{u,t}, j ≠ i_{u,t}} p(j | s_{u,t}, c_{u,t}) Q^i_{DQN^i}(s_{u,t}, j; θ^i) )^2 ]    (25)

    θ^i = θ^i − λη ∇_{θ^i} L̂^i(θ^i).    (26)

The learnable parameter θ^i is then updated as shown in (26), where the learning rate for auxiliary learning is λ × η and λ is a scalar proportion factor.
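The toy computation below illustrates the two targets used to train DQN^i: the Q-learning target in (22) and the auxiliary target in (24)–(25). All numbers are made up and the code is our own sketch, not the authors' implementation.

```python
# Numerical sketch of the two targets used to train DQN^i: the Q-learning
# target y in (22) for the recommended item, and the auxiliary target y_hat in
# (24) that ties the other items of the chosen category to the high-level
# category Q-value. All values are toy numbers.
import numpy as np

gamma = 0.9
r = 1.0                                    # reward of the recommended item
q_next = np.array([0.7, 0.4, 0.6])         # target-network Q-values at s_{t+1}
y = r + gamma * q_next.max()               # Q-learning target, as in (22)

q_cat = 1.8                                # Q^c_{DQN^c}(s_t, c_t; theta^c)
p = np.array([0.5, 0.3, 0.2])              # p(j | s_t, c_t) over items in c_t
q_items = np.array([1.6, 1.1, 0.9])        # Q^i_{DQN^i}(s_t, j; theta^i)
rec = 0                                    # index of the recommended item
y_hat = q_cat - p[rec] * q_items[rec]      # auxiliary target, as in (24)
pred = np.dot(p[1:], q_items[1:])          # weighted sum over the other items
aux_loss = (y_hat - pred) ** 2             # squared error in (25); scaled by lambda in (26)
print(y, y_hat, aux_loss)
```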
3) Learning the Score Function v: We use the historical interactions in the replay buffer D to obtain v for estimating the probability p. Its loss function L^p(θ^p) is based on the cross-entropy associated with (12) and is shown as follows:

    L^p(θ^p) = E_{(s_{u,t}, c_{u,t}, i_{u,t}, f_{u,t}, s_{u,t+1}) ∼ D}
               [ −y^p_{u,t} log σ(v(s_{u,t}, i_{u,t}; θ^p)) − (1 − y^p_{u,t}) log(1 − σ(v(s_{u,t}, i_{u,t}; θ^p))) ]    (27)

where σ is the sigmoid function, and the label y^p_{u,t} = 1 if f_{u,t} is positive and y^p_{u,t} = 0 if f_{u,t} is negative. This loss function encourages the score of an item with positive feedback to be higher than that of an item with negative feedback. Finally, v is optimized by minimizing L^p(θ^p) as follows:

    θ^p = θ^p − η ∇_{θ^p} L^p(θ^p).    (28)

Algorithm 2: Overall DHCRS Training Process
Require: training user set U, interaction length T, number of training epochs N, frequency K of updating the delay parameters.
1: Initialize θ^c, θ^i, θ̂^c, θ̂^i, and θ^p
2: Create empty replay buffers D and I, count = 0
3: for epoch = 1, N do
4:   I ← ∅
5:   for u in U do
6:     Initialize s_{u,1}
7:     for t = 1, T do
8:       count += 1
9:       Obtain Q^i_{DQN^i}(s_{u,t}, j; θ^i) for each item j in I
10:      Obtain Q^c_{DQN^c}(s_{u,t}, l; θ^c) for each category l in C
11:      Get item i_{u,t} and category c_{u,t} via Algorithm 1
12:      Acquire feedback f_{u,t} from user u
13:      Update s_{u,t} to s_{u,t+1} according to f_{u,t} and i_{u,t}
14:      Store the transition (s_{u,t}, c_{u,t}, i_{u,t}, f_{u,t}, s_{u,t+1}) in D
15:      Store the transition (s_{u,t}, c_{u,t}, i_{u,t}, f_{u,t}, s_{u,t+1}) in I
16:      if |D| > threshold then
17:        Sample a minibatch B from D
18:        Train DQN^c to obtain the updated θ^c via (20)
19:        Train DQN^i to obtain the updated θ^i via (23)
20:        Train DQN^i with auxiliary learning to obtain the updated θ^i via (26)
21:      end if
22:      if count mod K == 0 then
23:        Update the delay parameters: θ̂^c ← θ^c and θ̂^i ← θ^i
24:      end if
25:    end for
26:  end for
27:  Calculate the probability p
28:  while |I| > 0 do
29:    Sample a minibatch B from I
30:    Update θ^p via (28) with B
31:    I ← I − B
32:  end while
33: end for

4) Overall DHCRS Training Process: Algorithm 2 describes the overall training process for a given user set U. We first initialize the learnable parameters θ^p, θ^c, and θ^i, and the delayed ones θ̂^c and θ̂^i. The training process begins by obtaining the category and item Q-values for a given state s_{u,t} (line 8). Using these values in Algorithm 1, an item i_{u,t} in category c_{u,t} is recommended to user u and a feedback f_{u,t} is received (lines 9 and 10). The state s_{u,t} is then updated to s_{u,t+1} based on item i_{u,t} and its feedback (line 11). This state transition, consisting of the current state s_{u,t}, category c_{u,t}, recommended item i_{u,t}, its feedback f_{u,t}, and the new state s_{u,t+1}, is stored in a replay buffer D (line 12). When the replay buffer reaches a threshold, a minibatch B is sampled from D to update the category DQN (DQN^c) and the item DQN (DQN^i) using Q-learning (lines 14–16). DQN^i is trained again using auxiliary learning to obtain θ^i for the unobserved items without rewards (line 17). The delay parameters θ̂^c and θ̂^i are updated after every K intervals (lines 19–21). This learning process is executed for every user u in U (lines 4–23). Next, the item selection probability p is calculated (line 24). θ^p is updated using the minibatches B stored in the current epoch (lines 26–31).

IV. EXPERIMENTS

TABLE II
STATISTICS OF DATASETS

For our experiments, we use four public datasets: Netflix and the three provided by MovieLens (ML), namely ML 1M, 10M, and 20M (see Table II). In the ML datasets, the categories of the movies are provided, and there are at least 20 interactions for each user. In the Netflix dataset, the information on the categories of the movies is not provided. Therefore, we crawl the category information from the IMDB website relying on the name and the release date of each movie. We removed the users who have fewer than 20 interactions since they cannot be used to effectively evaluate the long-term recommendation performance.

In our experiments, we use implicit feedback. Following the typical setting for implicit feedback [14], we divide the feedback in the dataset into two classes: 1) positive and 2) negative. Positive feedback for an item is considered when the user has given an arbitrary rating to the item. Otherwise, if the user ignores or does not interact with an item, we treat such behavior as negative feedback. We directly give a positive reward (equal to 1) to the recommender system if its recommended item gets positive feedback. Otherwise, a negative reward (equal to −0.2) is given for a recommended item with negative feedback. The absolute value of the negative reward is set smaller than the positive reward because an interaction without a rating does not strongly indicate the user's disinterest in the item.

Similar to [6], we randomly select 10% of the users as test users and the remaining users as training users. The purpose of this dataset division is to facilitate the evaluation of multiple recommendation steps (long-term recommendation performance). Here, each user has a historical rating set including all items that have been rated by the user. For a user, we obtain the long-term recommendation performance by evaluating how many items in his historical rating set can be hit by multiple recommendation steps. Note that our dataset division is different from the traditional leave-one-out protocol [36], which holds out the last interaction of each user as the test set while the remaining interactions form the training set. Leave-one-out cannot be used for evaluating the multistep long-term recommendation performance as it only considers the one-step recommendation result. For each test user, we evaluate the overall performance of 20 recommendation steps instead of one recommendation step. This number of recommendation steps is equal to the minimum length of the interaction history of a user in the datasets, ensuring that the value range of the HR for any user is in [0, 1]. At each time step t, the recommendation agent recommends an item i_{u,t} to a target user u, and none of the items are allowed to be recommended repeatedly. The differences between our settings and [6] are the implicit feedback and the constrained 20 recommendation steps. All results are obtained by averaging over four runs in which the training and test sets are randomized.

We adopt two evaluation metrics, 1) HR and 2) NDCG, to evaluate the recommendation performance. Let L = {i_{u,1}, i_{u,2}, ..., i_{u,K}} be the recommendation sequence predicted by the agent for a target user u, with associated labels Y = {y_{u,1}, y_{u,2}, ..., y_{u,K}}, where y_{u,j} = 1 indicates that the recommended item i_{u,j} is in the user's interaction history, and y_{u,j} = 0 indicates otherwise.

HR for the K-step recommendation can be obtained by

    HR@K = (1/|U|) Σ_{u ∈ U} (1/K) Σ_{k=1}^{K} y_{u,k}.    (29)

NDCG for the K-step recommendation is defined as

    NDCG@K = (1/|U|) Σ_{u ∈ U} DCG@K(u) / IDCG@K(u)    (30)

where DCG@K(u) = Σ_{k=1}^{K} y_{u,k} / log_2(k+1) and IDCG@K(u) = Σ_{k=1}^{K} 1 / log_2(k+1). We consider that in the ideal sequence all items are engaged items when calculating IDCG@K(u).
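As a reference sketch of the metrics in (29) and (30) for a single user (our own illustration, not the authors' evaluation script):

```python
# Per-user HR@K and NDCG@K as defined in (29)-(30); the paper averages these
# values over all test users. y is the 0/1 label sequence over K steps.
import numpy as np

def hr_at_k(y):
    y = np.asarray(y, dtype=float)
    return y.mean()                                    # per-user term of (29)

def ndcg_at_k(y):
    y = np.asarray(y, dtype=float)
    k = np.arange(1, len(y) + 1)
    dcg = np.sum(y / np.log2(k + 1))
    idcg = np.sum(1.0 / np.log2(k + 1))                # ideal: every step hits
    return dcg / idcg                                  # per-user term of (30)

y = [1, 0, 1, 1, 0]                                    # K = 5 toy labels
print(hr_at_k(y), ndcg_at_k(y))
```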
For both HR@K and NDCG@K, larger values are better. When the HR is the same but the NDCG is higher, the system achieves successful recommendations earlier.

We compare our model with state-of-the-art baselines, including traditional supervised deep learning methods (GRU4Rec, Caser, and SASRec) and recent DRL methods (DDPG-KNN, DQN-R, and TPGR). GRU4Rec [15] (code: https://github.com/Songweiping/GRU4Rec_TensorFlow) uses a GRU with a ranking-based loss to model the users' sequential behaviors. Caser [37] (code: https://github.com/graytowne/caser_pytorch) employs a CNN in both the horizontal and vertical directions to model the high-order behaviors of users. SASRec [36] (code: https://github.com/kang205/SASRec) uses a left-to-right transformer model to capture users' sequential behaviors. DDPG-KNN [31] is an actor-critic-based method for addressing large-scale discrete action space problems. It reduces the action space by converting the discrete action space to a continuous one, and it is adapted for discrete recommendation by an approximate KNN method. DQN-R [3] is a value-based method using DQN. It uses two separate RNNs to capture users' behavior for both positive and negative feedback and utilizes a DQN to estimate the Q-value of each item at the current state. TPGR [6] (code: https://github.com/chenhaokun/TPGR) is a policy-gradient-based method learned by the REINFORCE algorithm. It reduces the action space by employing a hierarchical tree constructed by clustering the items. DHCRS is our proposed model, which uses two-level DQNs for selecting categories and items, respectively, and incorporates the BCS strategy along with the auxiliary learning algorithm.

The neural network architecture of our DHCRS model can be divided into the state extraction network Snet and the Q-network Qnet. This architecture is the same as [3], with five layers. The Snet is a one-layer RNN with GRU cells, and the Qnet has four layers. For the Qnet, the first two layers are separate for the positive and negative states extracted from the Snet and output high-level features; the next two layers concatenate the previously separated features and output the Q-values. We test different settings of the neural network hyperparameters, for example, the embedding size from [8, 16, 32, 64], the RNN layer size from [8, 16, 32, 64], the hidden dense layer size from [16, 32, 64, 128], and the learning rate from [1e−2, 1e−3, 5e−4, 2.5e−4, 1e−4, 1e−5] for Q-learning and for learning the score function v. The best parameters are determined via the relatively small dataset ML 1M: the final embedding size is 64, the RNN layer size is 64, the dense layer size is 128, and the learning rate is 2.5e−4. In addition, we use the Adam optimizer with the commonly used batch size of 256, and the discount factor γ is set to 0.9 to strengthen the consideration of long-term returns. Note that our category DQN and item DQN use the same hyperparameters as above. In addition to the common hyperparameters for the neural network or RL, there are three unique hyperparameters in our model: 1) λ in (26) is 0.01; 2) the scalar α = 0.5 balances the effect between Q^c_{DQN^c}(s_{u,t}, c_{u,t}; θ^c) and Q^c_{DQN^i}(s_{u,t}, c_{u,t}; θ^i) in (15); and 3) the scalar β = 0.2 balances the effect between the historical items with negative feedback and those with positive feedback in (12). Detailed analyses of these three parameters are given in Section IV-G.

For a fair comparison, we use the same embedding size (64) and hidden layer size (128) for all baselines, and the other hyperparameters follow the suggestions of the methods' authors or are tuned using cross-validation. For GRU4Rec, Caser, SASRec, and TPGR, we use the code provided by the authors. Similar to our model, we tuned the hyperparameters of the different baselines via ML 1M. For GRU4Rec, Caser, SASRec, and TPGR, we test the common hyperparameter learning rate from [1e−2, 1e−3, 1e−4, 1e−5, 1e−6]; the optimal learning rates for GRU4Rec, Caser, SASRec, and TPGR are 1e−6, 1e−6, 1e−5, and 1e−5, respectively. We also tune some key method-specific hyperparameters. For GRU4Rec, only the learning rate needs to be tuned. For Caser, the optimal sequence length is 5 out of [5, 20], and the number of targets is 3 out of [3, 5]; for the other parameters of Caser, we use the default values, for example, the number of vertical filters (4) and the number of horizontal filters (16). For SASRec, the optimal number of blocks is 2 out of [1, 2, 3], and the number of heads is 1 out of [1, 2]. For TPGR, the optimal L2 factor is 1e−3 out of [1e−2, 1e−3, 1e−4], and the depth of the hierarchy is 2, which is equal to that of our model.

For DQN-R, we use the same DQN architecture as our model since the main distinctions between our model and DQN-R are the additional DQN for the category and a different category selection procedure. For DDPG-KNN, the corresponding critic and policy neural networks also use the same architecture as our model and DQN-R, since the corresponding paper does not provide the specific architecture of the neural networks. Note that the influence of the network architecture lies in the way the state is extracted, and the major difference between DQN and DDPG is the learning algorithm. Therefore, keeping an identical network architecture provides a fairer comparison between our model and DDPG-KNN. Moreover, we report the results based on the full-size KNN, since a larger k (the number of nearest neighbors) leads to better performance [6]. For DQN-R and DDPG-KNN, the hyperparameters are almost the same as ours, but we tune the learning rate from [1e−2, 5e−3, 1e−3, 5e−4, 2.5e−4, 1e−4, 1e−5], and the optimal learning rates are 2.5e−4 and 5e−3 for DQN-R and DDPG-KNN, respectively. For the deep learning baselines, for example, Caser and SASRec, the results reported in the corresponding papers are obtained through the leave-one-out protocol. Hence, we cannot directly use their results. Thus, we employ the same learning method provided by the corresponding papers during training; however, in the test phase, we focus on their long-term recommendation performance instead of just predicting the next recommendation.

A. Overall Performance

The comparison results on the four datasets are presented in Table III. We observe the following.
TABLE III
COMPARISON OF HR AND NDCG WITH 20 STEPS OF RECOMMENDATIONS ON 1M, 10M, AND 20M, AND NETFLIX. THE BEST PERFORMANCE IS HIGHLIGHTED IN BOLD, AND "*" INDICATES THAT THE P-VALUE IS LESS THAN 0.05 FOR THE SIGNIFICANCE TEST (TWO-SIDED T-TEST) OVER THE BEST BASELINE
Fig. 6. Scatter plots of the mean (x-axis) and variance (y-axis) of the Q-values of items in the categories "comedy" and "drama" of ML 1M at recommendation steps 5 and 20 within training epochs 5, 10, and 50. The points are obtained by three different methods: 1) DQN; 2) DHCRS with auxiliary learning; and 3) DHCRS without auxiliary learning. Panels (a)–(l) cover each combination of category ("comedy" or "drama"), recommendation step (5 or 20), and training epoch (5, 10, or 50).

Fig. 8. Performance curves of the methods using different category rewards on ML. (a) 1M. (b) 10M.

In "uniform," the probability of selecting an item in a category obeys the uniform distribution. "Ours w/o neg" is similar to our adopted method, but it does not involve the negative interactions. The performance in terms of HR and NDCG is shown in Fig. 9. From Fig. 9, we observe the following.
1) "Uniform" also achieves promising results, which can exceed the results obtained by the baselines, as shown in Table III. By using the distribution learned from historical interactions instead of a straightforward uniform distribution, the performance can be further enhanced.
2) Although "ours w/o neg" can improve the performance, its increment over "uniform" on ML 10M with respect to HR@20 is small.

G. Impact of Hyperparameters

There are three key hyperparameters in our model: 1) α balances the effect of Q^c_{DQN^i}(s_{u,t}, c_{u,t}; θ^i) and Q^c_{DQN^c}(s_{u,t}, c_{u,t}; θ^c) in (15); 2) β denotes the negative-importance ratio in (12); and 3) λ is the ratio of the learning rate of our auxiliary learning to that of Q-learning. Specifically, α controls the proportion of the different category selection strategies (strategies I and II) in BCS. β affects the estimated probability p. As Q-learning and auxiliary learning update the Q-value of a given item from different learning perspectives, different learning rates for them, that is, different λ, can result in different Q-values. Below, we investigate the effects of different settings of these hyperparameters on ML 1M.

We show the results with different α from the range [0.1, 0.9] stepped by 0.2 in Fig. 10(a). We observe that α = 0.3 achieves the best performance, and the performance with the middle values 0.3, 0.5, and 0.7 is better than that with the boundary values 0.1 and 0.9. These results suggest that a moderate α can effectively balance the effect of the different category selection strategies and improve the recommendation accuracy. We vary the value of β from 0.0 to 1.0 stepped by 0.2 and present the results in Fig. 10(b).
[11] T. Zahavy, M. Haroush, N. Merlis, D. J. Mankowitz, and S. Mannor, "Learn what not to learn: Action elimination with deep reinforcement learning," in Advances in Neural Information Processing Systems. Red Hook, NY, USA: Curran Assoc., Inc., 2018, pp. 3562–3573.
[12] Y. Koren, R. Bell, and C. Volinsky, "Matrix factorization techniques for recommender systems," Computer, vol. 42, no. 8, pp. 30–37, 2009.
[13] S. Rendle, C. Freudenthaler, Z. Gantner, and L. Schmidt-Thieme, "BPR: Bayesian personalized ranking from implicit feedback," in Proc. 25th Conf. Uncertainty Artif. Intell., 2009, pp. 452–461.
[14] X. He, L. Liao, H. Zhang, L. Nie, X. Hu, and T.-S. Chua, "Neural collaborative filtering," in Proc. 26th Int. Conf. World Wide Web, 2017, pp. 173–182.
[15] B. Hidasi, A. Karatzoglou, L. Baltrunas, and D. Tikk, "Session-based recommendations with recurrent neural networks," 2015. [Online]. Available: arXiv:1511.06939.
[16] M. Fu, H. Qu, Z. Yi, L. Lu, and Y. Liu, "A novel deep learning-based collaborative filtering model for recommendation system," IEEE Trans. Cybern., vol. 49, no. 3, pp. 1084–1096, Mar. 2019.
[17] Q. Liu, C. Long, J. Zhang, M. Xu, and P. Lv, "TriATNE: Tripartite adversarial training for network embeddings," IEEE Trans. Cybern., early access, Mar. 17, 2021, doi: 10.1109/TCYB.2021.3061771.
[18] Z. Sun et al., "Research commentary on recommendations with side information: A survey and research directions," Electron. Commerce Res. Appl., vol. 37, Sep./Oct. 2019, Art. no. 100879.
[19] J. He, X. Li, and L. Liao, "Category-aware next point-of-interest recommendation via listwise Bayesian personalized ranking," in Proc. 26th Int. Joint Conf. Artif. Intell. (IJCAI), vol. 17, 2017, pp. 1837–1843.
[20] L. Zhang, Z. Sun, J. Zhang, H. Kloeden, and F. Klanner, "Modeling hierarchical category transition for next POI recommendation with uncertain check-ins," Inf. Sci., vol. 515, pp. 169–190, Apr. 2020.
[21] G. Shani, D. Heckerman, and R. I. Brafman, "An MDP-based recommender system," J. Mach. Learn. Res., vol. 6, pp. 1265–1295, Sep. 2005.
[22] Y. Lei, Z. Wang, W. Li, and H. Pei, "Social attentive deep Q-network for recommendation," in Proc. 42nd Int. ACM SIGIR Conf. Res. Develop. Inf. Retrieval, 2019, pp. 1189–1192.
[23] R. Gao, H. Xia, J. Li, D. Liu, S. Chen, and G. Chun, "DRCGR: Deep reinforcement learning framework incorporating CNN and GAN-based for interactive recommendation," in Proc. IEEE Int. Conf. Data Min. (ICDM), Beijing, China, 2019, pp. 1048–1053.
[24] X. Zhao, L. Xia, D. Yin, and J. Tang, "Whole-chain recommendations," 2019. [Online]. Available: arXiv:1902.03987.
[25] X. Bai, J. Guan, and H. Wang, "A model-based reinforcement learning with adversarial training for online recommendation," in Advances in Neural Information Processing Systems. Vancouver, BC, Canada: Neural Inf. Process. Syst. Found., Inc., 2019, pp. 10735–10746.
[26] X. Chen, S. Li, H. Li, S. Jiang, Y. Qi, and L. Song, "Generative adversarial user model for reinforcement learning based recommendation system," in Proc. Int. Conf. Mach. Learn., 2019, pp. 1052–1061.
[27] M. Chen, A. Beutel, P. Covington, S. Jain, F. Belletti, and E. H. Chi, "Top-K off-policy correction for a REINFORCE recommender system," in Proc. 12th ACM Int. Conf. Web Search Data Min., 2019, pp. 456–464.
[28] D. Zhao, L. Zhang, B. Zhang, L. Zheng, Y. Bao, and W. Yan, "Deep hierarchical reinforcement learning based recommendations via multi-goals abstraction," 2019. [Online]. Available: arXiv:1903.09374.
[29] J. Zhang, B. Hao, B. Chen, C. Li, H. Chen, and J. Sun, "Hierarchical reinforcement learning for course recommendation in MOOCs," in Proc. AAAI Conf. Artif. Intell., vol. 33, 2019, pp. 435–442.
[30] T. D. Kulkarni, K. R. Narasimhan, A. Saeedi, and J. B. Tenenbaum, "Hierarchical deep reinforcement learning: Integrating temporal abstraction and intrinsic motivation," in Advances in Neural Information Processing Systems. Red Hook, NY, USA: Curran, 2016, pp. 3675–3683.
[31] G. Dulac-Arnold et al., "Deep reinforcement learning in large discrete action spaces," 2015. [Online]. Available: arXiv:1512.07679.
[32] T. P. Lillicrap et al., "Continuous control with deep reinforcement learning," 2015. [Online]. Available: arXiv:1509.02971.
[33] A. Tavakoli, F. Pardo, and P. Kormushev, "Action branching architectures for deep reinforcement learning," in Proc. 32nd AAAI Conf. Artif. Intell., 2018, pp. 4131–4138.
[34] P. Resnick, N. Iacovou, M. Suchak, P. Bergstrom, and J. Riedl, "GroupLens: An open architecture for collaborative filtering of netnews," in Proc. ACM Conf. Comput. Supported Cooperative Work, 1994, pp. 175–186.
[35] S. Fujimoto, D. Meger, and D. Precup, "Off-policy deep reinforcement learning without exploration," in Proc. Int. Conf. Mach. Learn., 2019, pp. 2052–2062.
[36] W.-C. Kang and J. McAuley, "Self-attentive sequential recommendation," in Proc. IEEE Int. Conf. Data Min. (ICDM), Singapore, 2018, pp. 197–206.
[37] J. Tang and K. Wang, "Personalized top-N sequential recommendation via convolutional sequence embedding," in Proc. 11th ACM Int. Conf. Web Search Data Min., 2018, pp. 565–573.

Mingsheng Fu received the Ph.D. degree in computer science from the University of Electronic Science and Technology of China, Chengdu, China, in 2019.
From 2019 to 2021, he was a Research Fellow with the School of Electrical and Electronic Engineering, Nanyang Technological University, Singapore. He is also a Postdoctoral Researcher with the School of Computer Science and Engineering, University of Electronic Science and Technology of China. His current research interests are neural networks, reinforcement learning, and recommender systems.

Anubha Agrawal is currently pursuing the M.S. degree in computer science and systems with the University of Washington, Tacoma, WA, USA.
Her current research interests relate to deep neural networks, machine learning, and opinion spam detection.

Athirai A. Irissappane received the master's degree from the National University of Singapore, Singapore, in 2012, and the Ph.D. degree from Nanyang Technological University, Singapore, in 2016.
She is currently an Assistant Professor with the University of Washington, Tacoma, WA, USA. She has worked as a Researcher with Dimensional Mechanics, the Rolls-Royce@NTU corporate lab, and Toshiba, Singapore. Her current research interests are in reinforcement learning and planning, focusing on deep reinforcement learning techniques and their applications toward real-world problems.

Jie Zhang received the Ph.D. degree from the Cheriton School of Computer Science, University of Waterloo, Waterloo, ON, Canada, in 2009.
He is an Associate Professor with the School of Computer Science and Engineering, Nanyang Technological University, Singapore. He is also an Associate with the Singapore Institute of Manufacturing Technology, Singapore. During his Ph.D. studies, he held the prestigious NSERC Alexander Graham Bell Canada Graduate Scholarship, which is awarded to top Ph.D. students across Canada. His papers have been published in top journals and conferences and have won several best paper awards.
Dr. Zhang was a recipient of the Alumni Gold Medal at the 2009 Convocation Ceremony; the Gold Medal is awarded once a year to honor the top Ph.D. graduate from the University of Waterloo. He is also active in serving research communities.
Liwei Huang received the Ph.D. degree in computer science from the University of Electronic Science and Technology of China, Chengdu, China, in 2019.
She is currently a Postdoctoral Fellow with the University of Electronic Science and Technology of China. Her interests focus on reinforcement learning, robotics, and coordination among agents.

Hong Qu (Member, IEEE) received the Ph.D. degree in computer science from the University of Electronic Science and Technology of China, Chengdu, China, in 2006.
From 2007 to 2008, he was a Postdoctoral Fellow with the Advanced Robotics and Intelligent Systems Lab, School of Engineering, University of Guelph, Guelph, ON, Canada. From 2014 to 2015, he worked as a Visiting Scholar with the Potsdam Institute for Climate Impact Research, Potsdam, Germany, and the Humboldt University of Berlin, Berlin, Germany. He is currently a Professor with the Computational Intelligence Laboratory, School of Computer Science and Engineering, University of Electronic Science and Technology of China. His research interests include neural networks, machine learning, and big data.