
IEEE TRANSACTIONS ON CYBERNETICS, VOL. 52, NO. 11, NOVEMBER 2022

Deep Reinforcement Learning Framework for Category-Based Item Recommendation

Mingsheng Fu, Anubha Agrawal, Athirai A. Irissappane, Jie Zhang, Liwei Huang, and Hong Qu, Member, IEEE

Abstract—Deep reinforcement learning (DRL)-based recommender systems have recently come into the limelight due to their ability to optimize long-term user engagement. A significant challenge in DRL-based recommender systems is the large action space required to represent a variety of items. The large action space weakens the sampling efficiency and thereby affects the recommendation accuracy. In this article, we propose a DRL-based method called deep hierarchical category-based recommender system (DHCRS) to handle the large action space problem. In DHCRS, categories of items are used to reconstruct the original flat action space into a two-level category-item hierarchy. DHCRS uses two deep Q-networks (DQNs): 1) a high-level DQN for selecting a category and 2) a low-level DQN to choose an item in this category for the recommendation. Hence, the action space of each DQN is significantly reduced. Furthermore, the categorization of items helps capture the users' preferences more effectively. We also propose a bidirectional category selection (BCS) technique, which explicitly considers the category-item relationships. The experiments show that DHCRS can significantly outperform state-of-the-art methods in terms of hit rate and normalized discounted cumulative gain for long-term recommendations.

Index Terms—Deep reinforcement learning (DRL), hierarchy, large action space, recommender system.

Manuscript received 15 July 2020; revised 17 April 2021; accepted 11 June 2021. Date of publication 16 August 2021; date of current version 17 October 2022. This work was supported in part by the China Postdoctoral Science Foundation under Grant 2020M67318; in part by the National Key Research and Development Program of China under Grant 2018AAA0100202; and in part by the National Science Foundation of China under Grant 61976043. This article was recommended by Associate Editor B. Ribeiro. (Corresponding author: Hong Qu.)
Mingsheng Fu is with the School of Computer Science and Engineering, University of Electronic Science and Technology of China, Chengdu 610054, China, and also with the School of Electrical and Electronic Engineering, Nanyang Technological University, Singapore (e-mail: [email protected]).
Anubha Agrawal and Athirai A. Irissappane are with the School of Engineering and Technology, University of Washington, Tacoma, WA 98402 USA (e-mail: [email protected]; [email protected]).
Jie Zhang is with the School of Computer Science and Engineering, Nanyang Technological University, Singapore (e-mail: [email protected]).
Liwei Huang and Hong Qu are with the School of Computer Science and Engineering, University of Electronic Science and Technology of China, Chengdu 610054, China (e-mail: [email protected]; [email protected]).
Color versions of one or more figures in this article are available at https://doi.org/10.1109/TCYB.2021.3089941.
Digital Object Identifier 10.1109/TCYB.2021.3089941

I. INTRODUCTION

Recommender systems play a significant role in recommending products or items to users. Essentially, a recommendation is a process of continuous interaction between a recommender system and a user. The overall recommendation accuracy over all interactions (also called long-term rewards) is crucial in designing these recommender systems. Most recommender systems function by estimating the correlation between the user and the items, measured using immediate user feedback, such as ratings or clicks, which represent the immediate rewards. However, optimizing a system using only such immediate feedback cannot guarantee the maximization of the overall recommendation accuracy or long-term rewards [1].

Recently, reinforcement learning (RL) techniques, specifically deep RL (DRL) algorithms, have been used in recommender systems. DRL works to maximize the long-term rewards, making it suitable for the problem described above. DRL uses a neural network to approximate the optimal policy and is shown to scale to large state spaces. In most existing DRL-based recommender systems, the process of continuous recommendation between the system and the user is viewed as a Markov decision process (MDP). Here, the system is an RL agent, and the user is the environment. The system recommends an item, also referred to as an action, based on the user's current state. The current state is then updated based on the user's feedback about the item. The recommender system is trained to maximize the long-term return, for example, the number of recommended products clicked over a period of time [2]. Depending on the adopted algorithm, DRL-based recommender systems can be generally divided into value-based [3], [4] and policy-based [5]-[7] methods. In particular, the deep Q-network (DQN) [8] is the most popular value-based DRL method. In DQN, the policy is deterministic, and the selected action is the one with the largest Q-value. Here, the Q-value represents the quality of performing an action at a given state, and it is approximated using a neural network.

One of the key challenges faced by DRL-based recommender systems is a large and discrete action space. In most recommender systems, the actions representing the items are at least in the order of thousands, if not millions. This is not the case for most other domains where DRL is applied, in which the size of the action space rarely exceeds a few hundred or thousand actions [8], [9]. As a result, DRL-based recommender systems require more training samples to achieve statistical efficiency [10]. A large action space can also easily lead the learning algorithm to converge to a suboptimal policy [11]. These issues can significantly affect efficiency and even limit prediction accuracy.

To handle the issue of large action space, we propose the deep hierarchical category-based recommender system (DHCRS). In DHCRS, the original action space is reconstructed as a two-level hierarchical action space based on the category-item relationship. In this hierarchy, there are several category nodes at the higher level, and each category node links to some item nodes at the lower level. The edge between a category node and an item node indicates that the item belongs to that category. A two-level DQN architecture is used: the high-level DQN selects the recommended category, and the low-level DQN chooses the recommended item in the selected category. This reduces the size of the action space for both DQNs. The size of the action space for the high-level DQN is the total number of categories; similarly, the size of the action space for the low-level DQN is reduced to the number of items belonging to each category.

Another advantage of considering category information is that the search space of items can be effectively constrained by identifying the general preference of a user for a particular category. Intuitively, most users are only interested in a limited number of categories. By heuristically shortlisting the candidates into a specific category, the items preferred by the user can be determined easily. To select the preferred category effectively, capturing the relationship between the category and its items is important. A user's preference for a category is directly correlated with the user's preferences for the items in that category. For example, a user's preference for the Star Wars series may hint at the user's interest in the SCI-FI category. For this, we propose a bidirectional category selection (BCS) strategy that considers both the user's category and item preferences.

The experiments on four real-world benchmarks show that DHCRS can significantly improve recommendation accuracy. The experimental results also demonstrate that our model can outperform state-of-the-art DRL-based recommender systems in terms of hit rate (HR) and normalized discounted cumulative gain (NDCG). In summary, the main contributions of this work are as follows: 1) we propose DHCRS, a deep hierarchical category-based recommender system that addresses the problem of large action space using a two-level DQN based on the category-item relationship; 2) to effectively model the category-item relationship, we propose a bidirectional decision-making strategy for selecting the preferred category as well as the item; in addition, an auxiliary learning algorithm is proposed to improve the effectiveness of category selection; and 3) to demonstrate the effectiveness of the proposed model, we compare the long-term recommendation accuracy of DHCRS with state-of-the-art methods in terms of HR and NDCG. The experiments are conducted on four datasets: MovieLens (ML) 1M, 10M, and 20M, and Netflix. We evaluate the results on each user and then report the average results over all users. The experimental results show that the proposed model can outperform the best baseline by an average of 3.07% and 1.74% in terms of HR and NDCG, respectively.

The remainder of this article is organized as follows. In Section II, we briefly review the related literature. We present the details of our proposed DHCRS method in Section III. We discuss the findings and show the experiment results in Section IV. Finally, Section V presents the conclusion.

II. RELATED WORK

Collaborative filtering (CF) methods using matrix factorization [12], [13] and deep learning [14]-[17] have dominated recommender systems in the last decade. For example, [14] uses a feedforward neural network to make recommendations, [15] employs a recurrent neural network for sequential recommendation, and [17] studies an effective embedding learning algorithm through a generative adversarial network. Besides, many works also study the utilization of category information, such as involving the category hierarchy as side information [18] or constructing category-aware transitions [19], [20]. Specifically, similar to our framework, the models with category-aware transitions use the idea of divide and conquer, which first predicts the category and then derives the recommendation result based on the predicted category. However, traditional CF methods, including the ones with category-aware transitions, treat recommendations as one-shot predictions such that the results of previous recommendations do not affect subsequent recommendations. Unlike these traditional methods, most emerging DRL-based methods treat continuous recommendations as an MDP [21]. They formalize an interactive procedure between a user and the recommendation agent, where the user's feedback for the last recommendation can significantly affect the result of the next recommendation.

Existing DRL-based recommender systems focus on topics such as state representation [22], [23]; choice of RL algorithm [3], [6], [24]; simulation environments [25], [26]; and top-N recommendation [1], [27]. Some works also try to employ hierarchical RL (HRL) in recommender systems [28], [29]. Zhao et al. [28] described a typical application of traditional HRL with temporal abstraction [30]: there is a high-level agent with a long-term goal, and a low-level agent performs a series of actions to achieve this goal. In [29], the DRL-related module is not directly correlated to the final item recommendation; instead, it is used as a filtering component for traditional memory-based or sequential CF models. It uses a two-level hierarchy, where the high-level agent decides whether to filter historical items, and the low-level agent decides which historical items to use. Similar to these methods, our proposed framework also uses HRL to build the recommendation agent. However, the difference between our framework and the above methods is that our motivation is to optimize the action selection procedure at each step. Besides, our action space varies based on user preferences, while their action space is always fixed.

There are few works that attempt to address the large action space problem in recommender systems, and they are mainly based on the policy-gradient algorithm [6], [31]. In [31], the discrete action space is first converted into a continuous action space. Then, the deep deterministic policy gradient (DDPG) [32], a continuous-action policy-gradient method, is applied to generate an optimal continuous action. Finally, the discrete action for recommendation is retrieved according to the similarity between the embeddings of the discrete actions and the continuous action.
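As a concrete illustration of the retrieval step of [31] described above, the continuous "proto" action produced by the actor can be mapped back to a discrete item by a nearest-neighbor search over item embeddings. The following numpy sketch only illustrates that idea; the function name, embeddings, and dimensions are our own assumptions, not code from [31]:

import numpy as np

def retrieve_discrete_action(proto_action, item_embeddings, k=1):
    # proto_action:    (d,) continuous action produced by an actor network
    # item_embeddings: (num_items, d) learned item embeddings
    # returns indices of the k most similar items (cosine similarity)
    a = proto_action / (np.linalg.norm(proto_action) + 1e-8)
    e = item_embeddings / (np.linalg.norm(item_embeddings, axis=1, keepdims=True) + 1e-8)
    scores = e @ a                       # cosine similarity to every item
    return np.argsort(-scores)[:k]       # top-k nearest discrete actions

With millions of items, the exact top-k scan is typically replaced by an approximate KNN index.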


TABLE I: MAIN NOTATIONS IN THIS ARTICLE.

This strategy of converting discrete actions into a continuous action space is widely used in many works based on policy-gradient and actor-critic methods. However, such methods suffer from inconsistency issues [33]. Chen et al. [6] proposed a policy-gradient-based method, tree-structured policy gradient recommendation (TPGR), which integrates a hierarchical tree to reduce the action space. However, it ignores the relationship between the category and its items. In contrast to the above methods, our DHCRS model employs a value-based (not policy-based) DRL algorithm and uses a BCS strategy to explicitly consider the correlation between items and categories.

III. PROPOSED APPROACH

Here, we first describe the MDP formulation used for recommendation. To tackle the issue of large action space, we then introduce the DHCRS framework. The main notations used in this article are defined in Table I.

Fig. 1. Interaction between recommendation agent and user. (a) Traditional recommendation MDP. (b) Our recommendation MDP.

A. MDP Formulation for DHCRS

In the DRL-based recommender system, the recommendations are viewed as sequential interactions between the recommender system (agent) and the user (environment). At each time step t, the agent recommends an item i_{u,t} to user u based on the current state s_{u,t}. When feedback f_{u,t} (e.g., a click) about i_{u,t} is sent back to the agent by the user, the state is updated to s_{u,t+1} for the next time step t+1, and a new item i_{u,t+1} is recommended. The difference between our method and traditional DRL-based methods is the inclusion of the category information, where the category is determined before recommending an item, as illustrated in Fig. 1. We formalize the process of recommending an item as an MDP denoted by a tuple ⟨S, I, C, T, R, γ⟩, representing the state space S, two different action spaces I and C, the transition function T, the item reward R, and the discount factor γ.

States: s_{u,t} ∈ S is a state representing the historical interactions of user u and the feedback obtained before time step t [3]. s_{u,t} includes two sets H_{u,t} and F_{u,t}, that is, s_{u,t} = [H_{u,t}, F_{u,t}]. Specifically, H_{u,t} = {i_{u,1}, i_{u,2}, ..., i_{u,t-1}} consists of the items recommended before time step t, and F_{u,t} = {f_{u,1}, f_{u,2}, ..., f_{u,t-1}} consists of the corresponding feedback for those items, that is, f_{u,1} represents the feedback for i_{u,1}. The feedback can be positive or negative, where positive feedback indicates that user u clicked the recommended item.

Actions: There are two separate action sets: 1) a category set C and 2) an item set I. At each time step t, DQN_c, the high-level DQN, determines the category c_{u,t} ∈ C for user u. Based on c_{u,t}, the item set I^c, including all items in c_{u,t}, is obtained. DQN_i, the low-level DQN, then selects an item i_{u,t} ∈ I^c_{u,t}, where I^c_{u,t} = I_{u,t} ∩ I^c, in which I_{u,t} is the set of candidate items for user u at time step t. Similar to [6], we do not allow an item to be recommended repeatedly. The item set for user u at the first time step is the original item set, that is, I_{u,1} = I. Then, for every next time step t+1, the set becomes I_{u,t+1} = I_{u,t} − {i_{u,t}}.

Transition: T is the function that yields the new state s_{u,t+1} given the previous state s_{u,t} after recommending item i_{u,t}. Specifically, s_{u,t+1} = [H_{u,t+1}, F_{u,t+1}] is updated with item i_{u,t} such that H_{u,t+1} = H_{u,t} ∪ {i_{u,t}} and F_{u,t+1} = F_{u,t} ∪ {f_{u,t}}. Initially, H_{u,0} = ∅ and F_{u,0} = ∅.
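The bookkeeping implied by these state, action, and transition definitions can be summarized in a short sketch. This is only an illustration with names of our choosing (the reward values +1 and −0.2 are the ones reported later in Section IV):

from dataclasses import dataclass, field

@dataclass
class UserState:
    # s_{u,t} = [H_{u,t}, F_{u,t}]: previously recommended items and their feedback
    items: list = field(default_factory=list)      # H_{u,t}
    feedback: list = field(default_factory=list)   # F_{u,t} (1 = positive, 0 = negative)

def step(state, candidate_items, item, fb, pos_reward=1.0, neg_reward=-0.2):
    # Apply the transition T and the reward R after recommending `item`.
    reward = pos_reward if fb == 1 else neg_reward
    next_state = UserState(state.items + [item], state.feedback + [fb])
    next_candidates = candidate_items - {item}     # I_{u,t+1} = I_{u,t} - {i_{u,t}}
    return next_state, next_candidates, reward

# usage: start from the empty state and the full item set
s0 = UserState()
s1, remaining, r = step(s0, {1, 2, 3}, item=2, fb=1)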

Reward: R(s_{u,t}, i_{u,t}) is an immediate reward function that evaluates the quality of the recommended item i_{u,t} at state s_{u,t}. The value of R(s_{u,t}, i_{u,t}) is directly related to the feedback f_{u,t}: positive feedback results in a positive reward and negative feedback in a negative reward.

Discount Factor: γ ∈ [0, 1) balances the immediate reward and the future cumulative rewards.

B. Two-Level DQN Hierarchy

To avoid choosing directly from all items, we convert the flat action space of items into a two-level hierarchical structure based on the category-item relationship. In this two-level hierarchy, the high-level nodes denote the categories, and the low-level nodes represent the items. The edge between a high-level category node and a low-level item node indicates that the item belongs to the category. To select an item, one first needs to choose a category and then select an item from the chosen category. The primary motivation for using such a category-item hierarchy is to obtain a smaller action space and to model user preferences over the different categories. We use two separate DQNs: 1) DQN_c for category selection and 2) DQN_i for item selection within the chosen category. The two-level DQN hierarchy is illustrated in Fig. 2, and the details are given as follows.

Fig. 2. Illustration of the category-item hierarchy, where the red lines indicate a path to reach the recommended item 5 through category 3 using a two-step selection process, that is, DQN_c and then DQN_i.

1) Low-Level Item DQN: DQN_i(s_{u,t}, i_{u,t}; I, θ^i) is the low-level DQN for selecting items. We divide the network architecture of DQN_i(s_{u,t}, i_{u,t}; I, θ^i) into two major components: 1) the state extraction network shown in Fig. 3(a) and 2) the Q-network shown in Fig. 3(b). The state extraction network Snet(s_{u,t}; θ^i) is responsible for obtaining a latent representation of a given state. Like [3], we employ two separate RNNs with gated recurrent units (GRUs) to handle historical items with positive and negative feedback, respectively. First, the historical items H_{u,t} are divided into two subsequences based on positive and negative feedback. These subsequences are ordered by time such that PI_{u,t} = {pi_{u,1}, pi_{u,2}, ..., pi_{u,N}} consists of items with positive feedback and NI_{u,t} = {ni_{u,1}, ni_{u,2}, ..., ni_{u,N}} consists of items with negative feedback. We adopt RNNs with fixed length T (T = 10 in this article). Incomplete PI_{u,t} or NI_{u,t} vectors are filled with zeros. When PI_{u,t} or NI_{u,t} contain more than T elements, the newest items are used. The overall procedure to obtain the high-level representation h_{u,t} is shown in (1) as

ph_{u,t} = GRU_p(PI_{u,t}),  nh_{u,t} = GRU_n(NI_{u,t}),  h_{u,t} = Concat(ph_{u,t}, nh_{u,t})    (1)

where GRU_p(PI_{u,t}) and GRU_n(NI_{u,t}) are RNNs with GRU cells for the positive sequence PI_{u,t} and the negative sequence NI_{u,t}, respectively. As shown in Fig. 3(a), for GRU_p(PI_{u,t}), the RNN input at each time step is the embedding of item pi_{u,p}, and the output ph_{u,t} is the RNN state vector of the last time step. GRU_n(NI_{u,t}) is defined similarly. We concatenate ph_{u,t} and nh_{u,t} to obtain h_{u,t}.

The Q-network Qnet(h_{u,t}, i_{u,t}; I, θ^i) takes the extracted representation as input and predicts the Q-values for the actions. We feed the state h_{u,t} from Snet(s_{u,t}; θ^i), as shown in Fig. 3(b), to acquire the Q-value Q^i_{DQN_i}(s_{u,t}, i_{u,t}; θ^i) of item i_{u,t} in a given action set I:

Q^i_{DQN_i}(s_{u,t}, i_{u,t}; θ^i) = Qnet(h_{u,t}, i_{u,t}; I, θ^i) = Qnet(Snet(s_{u,t}; θ^i), i_{u,t}; I, θ^i).    (2)

2) High-Level Category DQN: DQN_c(s_{u,t}, c_{u,t}; C, θ^c) is the high-level DQN used for selecting the category c_{u,t}. It uses the same neural network architecture shown in Fig. 3, but with the category action space C instead of the item action space I. In addition, it has different learnable parameters θ^c. The Q-value Q^c_{DQN_c}(s_{u,t}, c_{u,t}; θ^c) for a category c_{u,t} is given by

Q^c_{DQN_c}(s_{u,t}, c_{u,t}; θ^c) = Qnet(h_{u,t}, c_{u,t}; C, θ^c) = Qnet(Snet(s_{u,t}; θ^c), c_{u,t}; C, θ^c).    (3)

Once the category c_{u,t} is selected (using the strategies described below), the action space I^c_{u,t} for the low-level item DQN is determined, that is, I^c_{u,t} = I_{u,t} ∩ I^c. The recommended item i_{u,t} with the maximum Q-value is then obtained as

i_{u,t} = \arg\max_{j ∈ I^c_{u,t}} Q^i_{DQN_i}(s_{u,t}, j; θ^i).    (4)

C. Category Selection Strategies

We introduce three category selection strategies: 1) strategy I; 2) strategy II; and 3) the BCS strategy. For the reader's convenience, in this section we omit the learnable parameters, for example, θ^i in Q^i_{DQN_i}(s_{u,t}, i_{u,t}; θ^i) and θ^c in Q^c_{DQN_c}(s_{u,t}, c_{u,t}; θ^c), as well as the subscripts for state, item, and category, that is, s ← s_{u,t}, i ← i_{u,t}, and c ← c_{u,t}. We also denote the next state s_{u,t+1} as s̃.

1) Strategy I: It uses the category network DQN_c to choose the category c with the largest Q-value

c = \arg\max_{j ∈ C} Q^c_{DQN_c}(s, j).    (5)

2) Strategy II: It uses the item network DQN_i to select the category with the largest Q-value. The Q-function Q^i_{DQN_i}(s, i) of the item DQN_i is given by

Q^i_{DQN_i}(s, i) = R(s, i) + γ \sum_{s̃} P(s̃|s, i) V(s̃)    (6)

where P(s̃|s, i) is the transition probability from state s to state s̃ when executing action i, V(s) is the value function representing the quality of the overall state s, and R(s, i) is the reward function for the recommended item i.
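Both DQNs share the Snet/Qnet structure of Fig. 3. The following PyTorch-style sketch is only a simplified illustration under assumed layer sizes (embedding 64, GRU 64, hidden 128, matching the settings reported in Section IV); in particular, the authors' Qnet processes the positive and negative features in separate branches before concatenating them, which this sketch collapses into a single MLP:

import torch
import torch.nn as nn

class Snet(nn.Module):
    # State extraction: two GRUs over positive / negative item histories (Fig. 3(a)).
    def __init__(self, num_items, emb_dim=64, rnn_dim=64):
        super().__init__()
        self.emb = nn.Embedding(num_items + 1, emb_dim, padding_idx=0)  # 0 = padding
        self.gru_pos = nn.GRU(emb_dim, rnn_dim, batch_first=True)
        self.gru_neg = nn.GRU(emb_dim, rnn_dim, batch_first=True)

    def forward(self, pos_hist, neg_hist):           # both: (batch, T) item ids
        _, ph = self.gru_pos(self.emb(pos_hist))     # ph: (1, batch, rnn_dim)
        _, nh = self.gru_neg(self.emb(neg_hist))
        return torch.cat([ph.squeeze(0), nh.squeeze(0)], dim=-1)   # h_{u,t}

class Qnet(nn.Module):
    # Maps the extracted state to one Q-value per action (items or categories).
    def __init__(self, state_dim, num_actions, hidden=128):
        super().__init__()
        self.layers = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, num_actions))

    def forward(self, h):
        return self.layers(h)                        # (batch, num_actions)

Instantiating one Snet/Qnet pair over the item space and another over the category space gives the two Q-functions used by the selection strategies below.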


Fig. 3. (a) Architecture of the state extraction network (Snet). (b) Architecture of the Q-network (Qnet).

We also derive the Q-value Q^c_{DQN_i}(s, c) of category c using the item network DQN_i as follows:

Q^c_{DQN_i}(s, c) = R^c(s, c) + γ \sum_{s̃} P^c(s̃|s, c) V(s̃)    (7)

where V(s) is the same value function as in (6) since the states are identical, and R^c(s, c) and P^c(s̃|s, c) are the reward function and the transition function on the category level. Due to the relationship between categories and items, R^c(s, c) and P^c(s̃|s, c) can be defined as

R^c(s, c) = \sum_{i ∈ I^c} p(i|s, c) R(s, i)    (8)

P^c(s̃|s, c) = \sum_{i ∈ I^c} p(i|s, c) P(s̃|s, i)    (9)

where p(i|s, c) is the probability of selecting item i at state s given category c. Using (8) and (9), (7) can be rewritten such that Q^c_{DQN_i}(s, c) is obtained directly from the Q-values of the items belonging to category c:

Q^c_{DQN_i}(s, c) = R^c(s, c) + γ \sum_{s̃} P^c(s̃|s, c) V(s̃)
    = \sum_{i ∈ I^c} p(i|s, c) [ R(s, i) + γ \sum_{s̃} P(s̃|s, i) V(s̃) ]
    = \sum_{i ∈ I^c} p(i|s, c) Q^i_{DQN_i}(s, i).    (10)

Similar to [1], a learnable conditional choice model is adopted to approximate p(i|s, c) as

p(i|s, c) = e^{v(s, i; θ^p)} / \sum_{b ∈ I^c} e^{v(s, b; θ^p)}    (11)

where the score function v(s, i; θ^p) is given by

v(s, i; θ^p) = x_i^T ( \sum_{a ∈ PI} x_a − β \sum_{b ∈ NI} x_b )    (12)

where x_i, x_a, x_b ∈ θ^p are the learnable embeddings of items i, a, and b. The scalar β balances the effect of the historical items with negative and positive feedback. Finally, the optimal category is obtained using

c = \arg\max_{j ∈ C} Q^c_{DQN_i}(s, j).    (13)

In contrast to strategy I, strategy II can effectively leverage the correlation between the category and its items.

3) Bidirectional Category Selection Strategy: Combining the previous two strategies, BCS uses a combination of Q^c_{DQN_c} and Q^c_{DQN_i} to select the category

c = \arg\max_{j ∈ C} Q^c(s, j),  where    (14)

Q^c(s, c) = α e^{Q^c_{DQN_c}(s, c)} / \sum_{j ∈ C} e^{Q^c_{DQN_c}(s, j)} + (1 − α) e^{Q^c_{DQN_i}(s, c)} / \sum_{j ∈ C} e^{Q^c_{DQN_i}(s, j)}.    (15)

Here, α ∈ [0, 1] is a scalar that balances the effect of the Q-values Q^c_{DQN_c}(s, c) and Q^c_{DQN_i}(s, c) obtained from strategy I and strategy II, respectively. The detailed BCS approach is described in Algorithm 1. We first compute Q^c_{DQN_i}(s, c) using (10) (lines 1-3). Then, Q^c(s, c) is computed using a combination of Q^c_{DQN_c}(s, c) and Q^c_{DQN_i}(s, c) (lines 4-6). Finally, the category c with the largest Q^c(s, c) value and the item i with the largest Q^i_{DQN_i}(s, i) value in the chosen category c are selected (lines 7 and 8).

Algorithm 1: BCS Strategy
Require: Q-values of all items, Q^i_{DQN_i}(s, j), for j = 1 to |I|; Q-values of all categories, Q^c_{DQN_c}(s, l), for l = 1 to |C|.
1: for c in C do
2:   Q^c_{DQN_i}(s, c) ← \sum_{i ∈ I^c} p(i|s, c) Q^i_{DQN_i}(s, i)
3: end for
4: for c in C do
5:   Q^c(s, c) ← α e^{Q^c_{DQN_c}(s, c)} / \sum_{j ∈ C} e^{Q^c_{DQN_c}(s, j)} + (1 − α) e^{Q^c_{DQN_i}(s, c)} / \sum_{j ∈ C} e^{Q^c_{DQN_i}(s, j)}
6: end for
7: select category c ← \arg\max_{c ∈ C} Q^c(s, c)
8: select action i ← \arg\max_{i ∈ I^c_{u,t}} Q^i_{DQN_i}(s, i)
9: return i and c
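Operationally, Algorithm 1 only needs the two Q-vectors, the item-to-category map, and the choice probabilities from (11). The following numpy sketch is a rough, illustrative translation written by us (it restricts the sums to the user's remaining candidate items and renormalizes p(i|s, c) over them, and it only softmax-normalizes over categories that still contain candidates):

import numpy as np

def softmax(x):
    z = np.exp(x - np.max(x))
    return z / z.sum()

def bcs_select(q_item, q_cat, item_to_cat, candidates, p_item, alpha=0.5):
    # q_item      (num_items,)  Q-values from the item DQN for the current state
    # q_cat       (num_cats,)   Q-values from the category DQN
    # item_to_cat (num_items,)  category index of each item
    # candidates  iterable      items still available for this user (I_{u,t})
    # p_item      (num_items,)  choice probabilities p(i|s, c) from eq. (11)
    candidates = np.asarray(list(candidates))
    num_cats = q_cat.shape[0]
    q_cat_from_items = np.full(num_cats, -np.inf)    # strategy II scores, eq. (10)
    members_by_cat = {}
    for c in range(num_cats):
        members = candidates[item_to_cat[candidates] == c]
        if members.size:
            w = p_item[members] / p_item[members].sum()   # renormalize over candidates
            q_cat_from_items[c] = np.dot(w, q_item[members])
            members_by_cat[c] = members
    usable = np.isfinite(q_cat_from_items)
    blended = np.full(num_cats, -np.inf)
    blended[usable] = (alpha * softmax(q_cat[usable])
                       + (1 - alpha) * softmax(q_cat_from_items[usable]))  # eq. (15)
    c_star = int(np.argmax(blended))
    members = members_by_cat[c_star]
    i_star = int(members[np.argmax(q_item[members])])                      # eq. (4)
    return c_star, i_star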

D. Training Procedure for DHCRS

Our overall model requires training three components: 1) DQN_c(s_{u,t}, c_{u,t}; C, θ^c) for selecting the category; 2) DQN_i(s_{u,t}, i_{u,t}; I, θ^i) for selecting the item; and 3) the score function v for estimating the probability p.

1) Training DQN_c(s_{u,t}, c_{u,t}; C, θ^c): The learnable parameters θ^c are trained via Q-learning by minimizing the temporal-difference error

ℓ^c(θ^c) = E_{(s_{u,t}, c_{u,t}, i_{u,t}, f_{u,t}, s_{u,t+1}) ∼ D} [ ( y^c_{u,t} − Q^c_{DQN_c}(s_{u,t}, c_{u,t}; θ^c) )^2 ]    (16)

y^c_{u,t} = R^c(s_{u,t}, c_{u,t}) + γ \max_{j ∈ C} Q^c_{DQN_c}(s_{u,t+1}, j; θ̂^c)    (17)

where y^c_{u,t} is the target and θ̂^c are the delay parameters of θ^c. As shown in (8), R^c(s_{u,t}, c_{u,t}) is obtained from the rewards of the items belonging to the category. Since we only know the reward R(s_{u,t}, i_{u,t}) of the recommended item i_{u,t} from the feedback, we approximate the rewards of the other items. We rewrite (8) as

R^c(s_{u,t}, c_{u,t}) = \sum_{j ∈ I^c_{u,t}} p(j|s_{u,t}, c_{u,t}) R(s_{u,t}, j)
    = p(i_{u,t}|s_{u,t}, c_{u,t}) R(s_{u,t}, i_{u,t}) + \sum_{j ∈ I^c_{u,t} \setminus \{i_{u,t}\}} p(j|s_{u,t}, c_{u,t}) R̂(s_{u,t}, j)    (18)

where R̂(s_{u,t}, j) denotes the approximated reward. We assume that similar items tend to receive similar feedback from the same user. Therefore, R̂(s_{u,t}, j) can be obtained from the similarity sim(j, i_{u,t}) between item j and the recommended item i_{u,t} as

R̂(s_{u,t}, j) = sim(j, i_{u,t}) R(s_{u,t}, i_{u,t})    (19)

where sim(j, i_{u,t}) ∈ [0, 1] is the normalized Pearson correlation coefficient [34] between items j and i_{u,t}. Finally, θ^c is updated using gradient descent with learning rate η:

θ^c = θ^c − η ∇_{θ^c} ℓ^c(θ^c).    (20)

2) Training DQN_i(s_{u,t}, i_{u,t}; I, θ^i): This includes two separate learning procedures: 1) the regular Q-learning method for the recommended item (with rewards) and 2) an auxiliary learning method for the other items (to tackle the large action space). Similar to DQN_c(s_{u,t}, c_{u,t}; C, θ^c), the loss function is the temporal-difference error ℓ^i(θ^i)

ℓ^i(θ^i) = E_{(s_{u,t}, c_{u,t}, i_{u,t}, f_{u,t}, s_{u,t+1}) ∼ D} [ ( y^i_{u,t} − Q^i_{DQN_i}(s_{u,t}, i_{u,t}; θ^i) )^2 ].    (21)

The target y^i_{u,t} is given by

y^i_{u,t} = R(s_{u,t}, i_{u,t}) + γ \max_{j ∈ I^c_{u,t}} Q^i_{DQN_i}(s_{u,t+1}, j; θ̂^i).    (22)

Finally, the following update rule is used:

θ^i = θ^i − η ∇_{θ^i} ℓ^i(θ^i).    (23)

During the above Q-learning process, sufficient sampling of state-item pairs is necessary. However, when the number of training interactions is limited, it cannot sufficiently cover all possible state-item pairs, especially when the action space is large. As a result, the unseen state-item pairs are erroneously estimated to have unrealistic values [35]. This problem can affect the BCS strategy, as Q^c_{DQN_i}(s_{u,t}, c) for a category c depends on the Q-values of all items in the category. To alleviate this problem, we propose an auxiliary learning algorithm for the items other than the recommended item. Supposing that the Q-value Q^c_{DQN_c}(s_{u,t}, c_{u,t}; θ^c) from DQN_c for category c_{u,t} is given, Q^c_{DQN_i}(s_{u,t}, c_{u,t}; θ^i) should be close to it, since both are related to the Q-value of category c_{u,t}. Based on this assumption, the objective of the auxiliary learning is to minimize the difference between Q^c_{DQN_c}(s_{u,t}, c_{u,t}; θ^c) and Q^c_{DQN_i}(s_{u,t}, c_{u,t}; θ^i), which we expand as

Q^c_{DQN_i}(s_{u,t}, c_{u,t}; θ^i) − Q^c_{DQN_c}(s_{u,t}, c_{u,t}; θ^c)
    = \sum_{j ∈ I^c_{u,t}} p(j|s_{u,t}, c_{u,t}) Q^i_{DQN_i}(s_{u,t}, j; θ^i) − Q^c_{DQN_c}(s_{u,t}, c_{u,t}; θ^c)
    = \sum_{j ∈ I^c_{u,t}, j ≠ i_{u,t}} p(j|s_{u,t}, c_{u,t}) Q^i_{DQN_i}(s_{u,t}, j; θ^i) − [ Q^c_{DQN_c}(s_{u,t}, c_{u,t}; θ^c) − p(i_{u,t}|s_{u,t}, c_{u,t}) Q^i_{DQN_i}(s_{u,t}, i_{u,t}; θ^i) ]    (24)

where the bracketed term, denoted ŷ^i_{u,t}, is the training target. Note that auxiliary learning is only applied to the items in the same category c_{u,t} as the recommended item i_{u,t} (excluding it). Therefore, the Q-value Q^i_{DQN_i}(s_{u,t}, i_{u,t}; θ^i) of the recommended item i_{u,t} is only involved as a part of the target and does not generate a gradient. The loss function for auxiliary learning, ℓ̂^i(θ^i), is defined by

ℓ̂^i(θ^i) = E_{(s_{u,t}, c_{u,t}, i_{u,t}, f_{u,t}, s_{u,t+1}) ∼ D} [ ( ŷ^i_{u,t} − \sum_{j ∈ I^c_{u,t}, j ≠ i_{u,t}} p(j|s_{u,t}, c_{u,t}) Q^i_{DQN_i}(s_{u,t}, j; θ^i) )^2 ]    (25)

θ^i = θ^i − λη ∇_{θ^i} ℓ̂^i(θ^i).    (26)

The learnable parameters θ^i are then updated as shown in (26), where the learning rate for auxiliary learning is λ × η, and λ is a proportionality factor.

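The gradient bookkeeping of (24)-(26) is easy to get wrong, so the following numpy sketch (ours, for illustration only) shows how the shared auxiliary target and loss can be formed for one transition; in a real implementation the Q-values would come from the networks and the squared error would be backpropagated only through q_others:

import numpy as np

def auxiliary_targets(q_cat_value, q_item_rec, p_rec, p_others, q_others):
    # q_cat_value: Q^c_{DQNc}(s_t, c_t) from the category DQN (treated as fixed)
    # q_item_rec:  Q^i_{DQNi}(s_t, i_t) of the recommended item (no gradient)
    # p_rec:       p(i_t | s_t, c_t) for the recommended item
    # p_others:    (m,) probabilities p(j | s_t, c_t) of the other items in the category
    # q_others:    (m,) current Q-values of those items (the quantities being trained)
    y_hat = q_cat_value - p_rec * q_item_rec              # target, eq. (24)
    loss = (y_hat - np.dot(p_others, q_others)) ** 2      # auxiliary loss, eq. (25)
    return y_hat, loss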

3) Learning the Score Function v: We use the historical interactions in the replay buffer D to obtain v for estimating the probability p. Its loss function ℓ^p(θ^p) is based on the cross-entropy associated with (12) and is given by

ℓ^p(θ^p) = E_{(s_{u,t}, c_{u,t}, i_{u,t}, f_{u,t}, s_{u,t+1}) ∼ D} [ −y^p_{u,t} \log σ(v(s_{u,t}, i_{u,t}; θ^p)) − (1 − y^p_{u,t}) \log(1 − σ(v(s_{u,t}, i_{u,t}; θ^p))) ]    (27)

where σ is the sigmoid function, and the label y^p_{u,t} = 1 if f_{u,t} is positive and y^p_{u,t} = 0 if f_{u,t} is negative. This loss encourages the score of an item with positive feedback to be higher than that of an item with negative feedback. Finally, v is optimized by minimizing ℓ^p(θ^p) as follows:

θ^p = θ^p − η ∇_{θ^p} ℓ^p(θ^p).    (28)

4) Overall DHCRS Training Process: Algorithm 2 describes the overall training process for a given user set U. We first initialize the learnable parameters θ^p, θ^c, and θ^i, and the delay parameters θ̂^c and θ̂^i. The training process begins by obtaining the category and item Q-values for a given state s_{u,t} (lines 9 and 10). Using these values in Algorithm 1, an item i_{u,t} in category c_{u,t} is recommended to user u and feedback f_{u,t} is received (lines 11 and 12). The state s_{u,t} is then updated to s_{u,t+1} based on item i_{u,t} and its feedback (line 13). The transition consisting of the current state s_{u,t}, category c_{u,t}, recommended item i_{u,t}, its feedback f_{u,t}, and the new state s_{u,t+1} is stored in the replay buffers (lines 14 and 15). When the replay buffer D reaches a threshold, a minibatch B is sampled from D to update the category DQN (DQN_c) and the item DQN (DQN_i) using Q-learning (lines 16-19). DQN_i is trained again using auxiliary learning to update θ^i for the unobserved items without rewards (line 20). The delay parameters θ̂^c and θ̂^i are updated every K steps (lines 22-24). This learning process is executed for every user u in U (lines 5-26). Next, the item selection probability p is calculated (line 27), and θ^p is updated using the minibatches stored in the current epoch (lines 28-32).

Algorithm 2: Overall DHCRS Training Process
Require: Training user set U, interaction length T, number of training epochs N, frequency of updating delay parameters K.
1: Initialize θ^c, θ^i, θ̂^c, θ̂^i, and θ^p
2: Create empty replay buffers D and I; count = 0
3: for epoch = 1, N do
4:   I ← ∅
5:   for u in U do
6:     Initialize s_{u,1}
7:     for t = 1, T do
8:       count += 1
9:       Obtain Q^i_{DQN_i}(s_{u,t}, j; θ^i) for each item j in I
10:      Obtain Q^c_{DQN_c}(s_{u,t}, l; θ^c) for each category l in C
11:      Get item i_{u,t} and category c_{u,t} via Algorithm 1
12:      Acquire feedback f_{u,t} from user u
13:      Update s_{u,t} to s_{u,t+1} according to f_{u,t} and i_{u,t}
14:      Store transition (s_{u,t}, c_{u,t}, i_{u,t}, f_{u,t}, s_{u,t+1}) into D
15:      Store transition (s_{u,t}, c_{u,t}, i_{u,t}, f_{u,t}, s_{u,t+1}) into I
16:      if |D| > threshold then
17:        Sample a minibatch B from D
18:        Train DQN_c to obtain updated θ^c via (20)
19:        Train DQN_i to obtain updated θ^i via (23)
20:        Train DQN_i with auxiliary learning to obtain updated θ^i via (26)
21:      end if
22:      if count mod K == 0 then
23:        Update the delay parameters: θ̂^c ← θ^c and θ̂^i ← θ^i
24:      end if
25:    end for
26:  end for
27:  Calculate probability p
28:  while |I| > 0 do
29:    Sample a minibatch B from I
30:    Update θ^p via (28) with B
31:    I ← I − B
32:  end while
33: end for

IV. EXPERIMENTS

TABLE II: STATISTICS OF DATASETS.

For our experiments, we use four public datasets: Netflix and the three MovieLens (ML) datasets 1M, 10M, and 20M (see Table II). In the ML datasets, the categories of movies are provided, and there are at least 20 interactions for each user. In the Netflix dataset, the category information of movies is not provided; therefore, we crawl the category information from the IMDB website relying on the name and the release date of each movie. We removed users who have fewer than 20 interactions, since they cannot be used to effectively evaluate the long-term recommendation performance.

In our experiments, we use implicit feedback. Following the typical setting for implicit feedback [14], we divide the feedback in the dataset into two classes: 1) positive and 2) negative. Feedback for an item is considered positive when the user has given an arbitrary rating to the item. Otherwise, if the user ignores or does not interact with an item, we treat such behavior as negative feedback. We directly give a positive reward (equal to 1) to the recommender system if its recommended item receives positive feedback. Otherwise, a negative reward (equal to −0.2) is given for a recommended item with negative feedback. The absolute value of the negative reward is set smaller than the positive reward because an interaction without a rating does not strongly indicate the user's disinterest in the item.

Similar to [6], we randomly select 10% of the users as test users and the remaining users as training users. The purpose of this dataset division is to facilitate the evaluation of multiple recommendation steps (long-term recommendation performance). Here, each user has a historical rating set including all items that have been rated by the user. For a user, we obtain the long-term recommendation performance by evaluating how many items in this historical rating set are hit by multiple recommendation steps. Note that our dataset division is different from the traditional leave-one-out protocol [36], which holds out the last interaction of each user as the test set while the remaining interactions form the training set. Leave-one-out cannot be used for evaluating the multistep long-term recommendation performance, as it only considers the one-step recommendation result. For each test user, we evaluate the overall performance of 20 recommendation steps instead of a single recommendation step. The number of recommendation steps is equal to the minimum length of the interaction history of a user in the datasets, ensuring that the value range of the HR for any user is in [0, 1]. At each time step t, the recommendation agent recommends an item i_{u,t} to a target user u, and no item is allowed to be recommended repeatedly. The differences between our settings and [6] are the implicit feedback and the constrained 20 recommendation steps. All results are obtained by averaging over four runs in which the training and test sets are randomized.

We adopt two evaluation metrics, 1) HR and 2) NDCG, to evaluate the recommendation performance. Let L = {i_{u,1}, i_{u,2}, ..., i_{u,K}} be the recommendation sequence predicted by the agent for a target user u, with associated labels Y = {y_{u,1}, y_{u,2}, ..., y_{u,K}}, where y_{u,j} = 1 indicates that the recommended item i_{u,j} is in the user's interaction history, and y_{u,j} = 0 indicates otherwise.

HR for the K-step recommendation is

HR@K = (1/|U|) \sum_{u ∈ U} (1/K) \sum_{k=1}^{K} y_{u,k}.    (29)

NDCG for the K-step recommendation is defined as

NDCG@K = (1/|U|) \sum_{u ∈ U} DCG@K(u) / IDCG@K(u)    (30)

where DCG@K(u) = \sum_{k=1}^{K} y_{u,k} / (\log_2 k + 1) and IDCG@K(u) = \sum_{k=1}^{K} 1 / (\log_2 k + 1). For calculating IDCG@K(u), we consider an ideal sequence in which all items are engaged items.
Authorized licensed use limited to: DELHI TECHNICAL UNIV. Downloaded on October 12,2023 at 14:40:32 UTC from IEEE Xplore. Restrictions apply.
FU et al.: DRL FRAMEWORK FOR CATEGORY-BASED ITEM RECOMMENDATION 12035

For both HR@K and NDCG@K, having large values is bet- QcDQNi (su,t , cu,t ; θ i ) in (15); and 3) scalar β = 0.2 to balance
ter. When the HR is the same but NDCG is higher, then it indi- the effect between the historical items with negative feedback
cates that the system can achieve successful recommendations and the positive feedback in (12). The detailed analyses about
early. these three parameters are given in Section IV-G.
We compare our model with state-of-the-art baselines, For a fair comparison, we use the same embedding size
including traditional supervised deep learning methods (64), and hidden layer size (128) for all baselines, and
(GRU4Rec, Caser, and SASRec) and recent DRL methods the other hyperparameters have been followed as per the
(DDPG-KNN, DQN-R, and TPGR). GRU4Rec1 [15] uses suggestion from the methods’ authors or tuned using cross-
GRU with the ranking-based loss to model the users’ sequen- validation. For GRU4Rec, Caser, SASRec, and TPGR, we
tial behaviors. Caser2 [37] employs CNN in both the horizontal use the codes provided by the authors. Similar to our model,
and vertical ways to model the high-order behaviors of users. we tuned the hyperparameters of different baselines via ML
SASRec3 [36] uses a left-to-right transformer model to capture 1M. For GRU4rec, Caser, SASRec, and TPGR with provided
users’ sequential behaviors. DDPG-KNN [31] is an actor– codes, we test the common hyperparameter learning rate from
critic-based method for addressing large-scale discrete action [1e − 2, 1e − 3, 1e − 4, 1e − 5, 1e − 6], and the optimal learn-
space problems. It reduces action space by converting dis- ing rates for GRU4rec, Caser, SASRec, and TPGR are 1e − 6,
crete action space to continuous action space, and it is adapted 1e − 6, 1e − 5, and 1e − 5, respectively. We also tune some
for discrete recommendation by an approximate KNN method. key specific hyperparameters for these methods. For GRU4rec,
DQN-R [3] is a value-based method using DQN. It uses two only the learning rate is required to be justified. For Caser, the
separate RNNs to capture users’ behavior for both positive and optimal length of the sequence is 5 out of [5, 20], and the num-
negative feedback. It utilizes a DQN to estimate the Q-value ber of targets is 3 out of [3, 5]. For the other parameters of
for each item at a given current state. TPGR4 [6] is a policy- Caser, we use the default values, for example, the number of
gradient-based method learned by the REINFORCE algorithm. vertical filters (4) and the number of horizontal filters (16). For
It can reduce the action space by employing a hierarchical SASRec, the optimal number of blocks is 2 out of [1, 2, 3],
tree that is constructed by clustering the items. DHCRS is our and the number of heads is 1 out of [1, 2]. For TPGR, the
proposed model, which uses two-level DQNs for selecting cat- optimal L2 factor is 1e − 3 out of [1e − 2, 1e − 3, 1e − 4], and
egories and items, respectively. It incorporates a BCS strategy the deep of hierarchy is 2, which is equal to our model.
along with an auxiliary learning algorithm. For DQN-R, we use the same DQN architecture as our
The neural network architecture of our DHCRS model model since the main distinctions between our model and
can be divided into the state extraction network Snet and DQN-R are additional DQN for the category and a different
Q-network Qnet, respectively. This neural network architec- category selection procedure. For DDPG-KNN, the corre-
ture is the same as [3] with five layers. The Snet is a sponding critic and policy neural networks also use the same
1 layer RNN with GRU cell, and the Qnet has four lay- architecture as our model and DQN-R, since the corresponding
ers. For the Qnet, the first two layers are separated for paper does not provide the specific architecture of the neural
the positive and negative states extracted from the Snet and networks. Note that the influence of the network architecture is
output the high-level features. The next two layers concate- the way to extract state and the major difference between DQN
nate previously separated features and output the Q-values. and DDPG is the learning algorithm. Therefore, keeping iden-
We test different settings of hyperparameters of the neural tical network architecture can provide a more fair comparison
network, for example, the embedding size from [8, 16, 32, 64], between our model and DDPG-KNN. Moreover, we report the
the RNN layer size from [8, 16, 32, 64], the hidden dense results based on the full-size KNN since a larger k (the num-
layer size from [16, 32, 64, 128], and the learning rate from ber of the nearest neighbors) leads to a better performance [6].
[1e − 2, 1e − 3, 5e − 4, 2.5e − 4, 1e − 4, 1e − 5] for Q-learning For DQN-R and DDPG-KNN, the hyperparameters are almost
and learning score function v. The best parameters are deter- the same as ours, but we tune the learning rate from [1e −
mined via the relatively small dataset ML 1M, and the final 2, 5e − 3, 1e − 3, 5e − 4, 2.5e − 4, 1e − 4, 1e − 5], and the
embedding size is 64, the RNN layer size is 64, the dense layer optimal learning rates are 2.5e − 4 and 5e − 3 for DQN-R and
size is 128, and the learning rate is 2.5e − 4. In addition, we DDPG-KNN, respectively. For the deep learning baselines,
use the Adam optimizer to learn our model with commonly for example, Caser and SASRec, the results reported in the
used batch size 256, and the discount factor γ is set to 0.9 to corresponding papers are obtained through the leave-one-out
strengthen the consideration of long-term returns. Note that our protocol. Hence, we cannot directly use their results. Thus, we
category DQN and item DQN use the same hyperparameters employ the same learning method provided by the correspond-
as shown above. In addition to the common hyperparameters ing papers during training. However, for the test phase, we
for the neural network or RL, there are three unique hyper- focus on their long-term recommendation performance instead
parameters in our models: 1) λ in (26) is 0.01; 2) scalar of just predicting the next recommendation.
α = 0.5 to balance the effect between QcDQNc (su,t , cu,t ; θ c ) and

1 https://github.com/Songweiping/GRU4Rec_TensorFlow
2 https://github.com/graytowne/caser_pytorch
A. Overall Performance
3 https://github.com/kang205/SASRec The comparison results on the four datasets are presented
4 https://github.com/chenhaokun/TPGR in Table III. We observe the following.
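For convenience, the DHCRS settings reported above can be collected in one place. The following is merely an editor's summary written as a Python dictionary, not a configuration file shipped by the authors:

# Reported DHCRS hyperparameters, gathered from Sections III and IV.
DHCRS_CONFIG = {
    "embedding_size": 64,
    "rnn_layer_size": 64,
    "dense_layer_size": 128,
    "learning_rate": 2.5e-4,        # Adam optimizer
    "batch_size": 256,
    "discount_gamma": 0.9,
    "history_length_T": 10,         # fixed RNN length for the PI/NI sequences
    "aux_learning_lambda": 0.01,    # eq. (26)
    "bcs_alpha": 0.5,               # eq. (15)
    "score_beta": 0.2,              # eq. (12)
    "positive_reward": 1.0,
    "negative_reward": -0.2,
    "recommendation_steps": 20,
}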


TABLE III: COMPARISON OF HR AND NDCG WITH 20 STEPS OF RECOMMENDATIONS ON 1M, 10M, AND 20M, AND NETFLIX. THE BEST PERFORMANCE IS HIGHLIGHTED IN BOLD, AND "*" INDICATES THAT THE P-VALUE IS LESS THAN 0.05 FOR THE SIGNIFICANCE TEST (TWO-SIDED T-TEST) OVER THE BEST BASELINE.

A. Overall Performance

The comparison results on the four datasets are presented in Table III. We observe the following.
1) The recommendation accuracy of the DRL-based methods is generally better than that of the methods using supervised learning, since the long-term recommendation accuracy is explicitly optimized by RL. This result demonstrates the superiority of DRL-based methods for long-term recommendation.
2) Our proposed DHCRS consistently outperforms the other DRL-based methods across all metrics on the four datasets with a relatively large margin. Specifically, DDPG-KNN and TPGR are the other DRL-based methods for addressing large-scale action problems. The better performance of DHCRS is due to the hierarchical selection process employed to reduce the action space of items; another reason is that the correlation between the high-level category and the low-level item is explicitly considered.
3) DQN and DHCRS are similar in that they use the same neural network architecture and value-based Q-learning algorithm for training. The distinctions between DQN and DHCRS are the additional consideration of the category information before the action selection and the learning algorithm for addressing large-scale action problems. Therefore, the improvement of DHCRS over DQN suggests that taking the category into account can significantly improve the accuracy of long-term recommendation with a massive action space.
4) The average improvements of DHCRS over the best baseline across the datasets are 3.07% and 1.74% for HR and NDCG, respectively. The relatively large improvement for HR is because the reward in DHCRS is directly correlated with HR. The performance of NDCG, which is affected by the recommendation position, can also be improved even without explicit consideration of the recommendation position.

B. Bidirectional Category Selection Strategy Effectiveness

The additional step of category selection is the main difference between our model and the other DRL-based methods. To investigate the effectiveness of the proposed BCS strategy, we show the test performance curves of the different methods for HR@20, that is, the HR of 20 recommendation steps, throughout training on the 1M and 10M datasets. The performance curves are shown in Fig. 4, which includes the baseline methods DQN and DDPG with no category selection, and variants of our method with different category selection strategies: strategy I ("Only Category Q"), strategy II ("Only Item Q"), and the BCS strategy.

Fig. 4. Performance curves of the methods using different category selection strategies on ML. (a) 1M. (b) 10M.

From Fig. 4, we observe the following.
1) All DRL-based methods can converge effectively.
2) The different variants of our method generally outperform the baselines without category selection throughout training. This indicates not only that our final category selection strategy BCS is effective, but also that performance can be improved by applying a hierarchical category-item selection mechanism.
3) Among the different category selection strategies, BCS consistently exceeds the others by a relatively large gap. This result suggests that considering both kinds of category selection strategies together can improve recommendation accuracy.
4) In contrast with DQN, BCS obtains a relatively high HR quickly; it only needs about 20 epochs to achieve the best performance that DQN can achieve (around 46% and 49% for 1M and 10M). This indicates that the training efficiency of BCS is better than that of DQN. We believe the improvement is due to the enhanced sampling efficiency during training brought by the hierarchical category-item selection approach.

C. Effectiveness of Auxiliary Learning

To investigate the effectiveness of auxiliary learning, we show our model's test performance curves (Fig. 5) using BCS category selection with and without auxiliary learning, as well as the DQN baseline, on the 1M and 10M datasets.


As shown in Fig. 5, the performance of our DHCRS without auxiliary learning cannot significantly outperform the DQN baseline on ML 10M. Moreover, the performance of DHCRS without auxiliary learning in the early stage of training is close to that obtained by DQN. These results indicate that merely incorporating different category selection strategies cannot effectively improve the recommendation performance. This problem arises because the Q-values of some items of a particular category are probably inaccurate. By incorporating auxiliary learning, the recommendation performance and training efficiency can be improved, since the insufficiently sampled actions (items) during Q-learning can be fine-tuned by auxiliary learning. The advantage of auxiliary learning is that it helps obtain reasonable Q-values for rarely sampled actions, such that the obtained Q-values are close to the corresponding category Q-values. With auxiliary learning, even the Q-values of actions without sufficient sampling can reveal a general preference at the category level. As a result, the variance of the Q-values of items in the same category is reduced, since the corresponding Q-values are forced to be close to a specific value. Moreover, the low variance of the Q-values of items in the same category is consistent with the fact that the values of recommended items in the same category for a particular user should be close. Consequently, auxiliary learning can improve the process of category selection.

Fig. 5. Performance curves of the methods with or without auxiliary learning on ML. (a) 1M. (b) 10M. Aux-learning is short for auxiliary learning.

D. Case Study of Auxiliary Learning

To further study the effect of auxiliary learning on the Q-values of the items in the same category, we record the mean and the variance of the Q-values of the items in a given category for different predictions on ML 1M. More specifically, the two most popular categories, 1) "comedy" and 2) "drama", are considered, and the recorded predictions correspond to different test users at recommendation steps 5 and 20 within training epochs 5, 10, and 50. The corresponding results obtained by three different methods, DHCRS with and without auxiliary learning, and DQN, are visualized in Fig. 6. For example, in Fig. 6(a), each scatter point is the mean and variance of the estimated Q-values of the items in "comedy" for a certain user at step 5 within epoch 5.

From Fig. 6, we can observe the following.
1) The points in the figures for step 20 are more dispersed than those in the figures for step 5. This indicates that the personalization of DRL-based agents can be enhanced as the interaction with the user increases.
2) In each figure, the variances of the points for DHCRS with auxiliary learning are generally lower than those for the other two methods, especially for epochs 5 and 10. To show the change of variance, we exhibit the average variance of the mean Q-values in the categories "comedy" and "drama" for different users in Fig. 7. Here, we can see that the variances for DHCRS with auxiliary learning are kept at a low level, and the change curve is more stable than those of the other two methods. These results indicate that auxiliary learning can control the variance of items in the same category.
3) As training progresses, the distributions of points for DHCRS with auxiliary learning and the other two methods become very different. With more epochs, the points for DQN and DHCRS without auxiliary learning are concentrated near the y-axis. On the other hand, the points for DHCRS with auxiliary learning are still dispersed along the x-axis, which can result in better distinguishing of different users. For example, the values of comedy movies for different users may be different; if the Q-values of "comedy" for these users are too close, it will be difficult to distinguish the users.

E. Effectiveness of Category Reward

The rewards on the item level can be observed or obtained directly. However, the reward for selecting a category is hidden. To address this problem, we propose leveraging the similarity between the recommended item and the other items within the same category to obtain the category reward. To investigate the effectiveness of the proposed category reward, we also show the test performance curves of different category rewards in Fig. 8, in which there are three kinds of category rewards. "Item reward" uses the reward of the recommended item as the category reward. "Unnormalized Pearson reward" uses the traditional Pearson coefficient to calculate the similarity between items. "Normalized Pearson reward" is the one used in our model, which normalizes the range of the Pearson coefficient into [0, 1] instead of [−1, 1].

From Fig. 8, we observe the following: 1) the performance obtained by "unnormalized Pearson reward" is worse than the others, as a positive reward can be transformed into a negative one and vice versa; this makes the rewards inconsistent between the category and item sides, and normalizing the Pearson coefficient improves the performance; 2) in contrast, "normalized Pearson reward" can exceed "item reward" on ML 1M, while their performance is closer on ML 10M. However, the problem of "item reward" lies in its stability. As shown in the performance curves of "unnormalized Pearson reward" on the 1M and 10M datasets, the performance of the model fluctuates dramatically, especially on the 10M dataset. The problem comes from using the reward on the item side directly, which may overestimate the value of the category reward. For example, a user disliking "Star Wars" does not mean that the user must dislike "Sci-Fi" movies.

F. Effectiveness of Estimating Probability p

In our model, we employ an SVD-based method with additional negative interactions to estimate the probability p. To investigate our method's effectiveness, we compare our method with two alternative methods: 1) "uniform" and 2) "ours w/o neg."


Fig. 6. Scatter plot of mean (x-axis) and variance (y-axis) of Q-values of items in the categories “comedy” and “drama” of ML 1M at recommendation steps 5 and 20 within training epochs 5, 10, and 50. These points are obtained by three different methods: 1) DQN; 2) DHCRS with auxiliary learning; and 3) DHCRS without auxiliary learning. (a) “comedy”, step 5, epoch 5. (b) “drama”, step 5, epoch 5. (c) “comedy”, step 5, epoch 10. (d) “drama”, step 5, epoch 10. (e) “comedy”, step 5, epoch 50. (f) “drama”, step 5, epoch 50. (g) “comedy”, step 20, epoch 5. (h) “drama”, step 20, epoch 5. (i) “comedy”, step 20, epoch 10. (j) “drama”, step 20, epoch 10. (k) “comedy”, step 20, epoch 50. (l) “drama”, step 20, epoch 50.
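Each scatter point in Fig. 6 is a (mean, variance) pair of the Q-values of the items in one category for one user state. A minimal sketch of how such statistics could be collected is given below; the array layout and the category-to-item index map are illustrative assumptions rather than the paper’s implementation.

import numpy as np

def category_q_statistics(q_values, category_to_item_indices):
    # q_values: 1-D array of Q-values over all items for a single user state
    # category_to_item_indices: dict mapping a category name to the indices of its items
    stats = {}
    for category, indices in category_to_item_indices.items():
        q = np.asarray(q_values)[np.asarray(indices)]
        stats[category] = (float(q.mean()), float(q.var()))
    return stats

# Example: stats = category_q_statistics(q, {"comedy": comedy_idx, "drama": drama_idx})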

Fig. 7. Variance change curves for the categories (a) “comedy” and (b) “drama” on ML 1M.

Fig. 8. Performance curves of the methods using different category rewards on ML (a) 1M and (b) 10M.

Fig. 9. Comparison of different methods for estimating probability p on ML (a) 1M and (b) 10M.

In “uniform,” the probability of selecting an item in a category obeys the uniform distribution. “Ours w/o neg” is similar to our adopted method, but it does not involve the negative interactions. The performance in terms of HR and NDCG is shown in Fig. 9. From Fig. 9, we observe the following.
1) “Uniform” also achieves promising results, which can exceed the results obtained by the baselines, as shown in Table III. By using the distribution learned from historical interactions instead of a straightforward uniform distribution, the performance can be further enhanced.
2) Although “ours w/o neg” can improve the performance, its increment over “uniform” on ML 10M with respect to HR@20 is small. By incorporating the negative interactions, the performance can be further improved.

G. Impact of Hyperparameters

There are three key hyperparameters in our model: 1) α balances the effect of Q^c_{DQN_i}(s_{u,t}, c_{u,t}; θ^i) and Q^c_{DQN_c}(s_{u,t}, c_{u,t}; θ^c) in (15); 2) β denotes the negative importance ratio in (12); and 3) λ is the ratio of the learning rate of our auxiliary learning to that of the Q-learning. Specifically, α controls the balance between the two category selection strategies (strategies I and II) in BCS, and β also affects the estimated probability p. As Q-learning and auxiliary learning update the Q-value of a certain item from different learning perspectives, different learning rates for them, that is, different λ, can result in different Q-values. Below, we investigate the effects of different settings of these hyperparameters on ML 1M.
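For illustration only, assuming (as one plausible reading, not the paper’s stated form) that (15) combines the two category-level Q-values as a convex combination weighted by α, the roles of α and λ can be sketched as follows; the default values merely echo the ranges explored below on ML 1M.

def combined_category_value(q_dqn_c, q_dqn_i, alpha=0.3):
    # Assumed reading of (15): convex combination of the category-level Q-values
    # from the category DQN and the one derived from the item DQN.
    return alpha * q_dqn_c + (1.0 - alpha) * q_dqn_i

def auxiliary_learning_rate(q_learning_rate, lam=0.05):
    # lambda is the ratio between the auxiliary-learning and Q-learning rates.
    return lam * q_learning_rate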

We show the results with different α from the range [0.1, 0.9], stepped by 0.2, in Fig. 10(a). We observe that α = 0.3 achieves the best performance, and the performance with the middle values 0.3, 0.5, and 0.7 is better than that with the boundary values 0.1 and 0.9. These results suggest that a moderate α can effectively balance the effect of the different category selection strategies and improve the recommendation accuracy. We vary the value of β from 0.0 to 1.0, stepped by 0.2, and present the results in Fig. 10(b). We observe that involving the effect of the negative interactions consistently outperforms the setting without them, that is, β = 0.0, and β = 0.4 reaches the best performance. However, as β grows beyond 0.4, the performance begins to decline, which suggests that β also requires a moderate value. We show the results for different λ ∈ {1.0, 0.5, 0.1, 0.05, 0.01} in Fig. 10(c). Here, the performance with the large value λ = 1.0 is very poor and cannot even exceed 40% in terms of HR@20. This is because a large learning rate for auxiliary learning can force the Q-values of the items in the same category to be too close, losing the variety of different items. As λ decreases, the performance is significantly improved and finally becomes stable. Combining Fig. 10(c) with Fig. 7, we can see that even with a small learning rate, that is, a small λ, auxiliary learning can still limit the variance of the Q-values of the items in a category. These results suggest that a small λ should be used.

Fig. 10. Comparison of different settings of the hyperparameters on ML 1M. (a) Balance ratio α for the different category selection strategies. (b) Importance ratio β for the negative interactions. (c) Proportion λ between the learning rates of Q-learning and auxiliary learning.

TABLE IV. Time Cost of Our Model.

H. Computational Cost

Our experiments are conducted on a GPU (Nvidia RTX 6000) with an Intel Gold 5217 CPU @ 3.00 GHz. The computational costs, with respect to training and recall time per epoch on the different datasets, are shown in Table IV. The training time is relatively large on the large-scale datasets, ML 20M and Netflix, due to the vast number of users. However, the recommendations for the test users only take minutes.

V. CONCLUSION AND FUTURE WORK

In this article, we proposed DHCRS. DHCRS employs a two-level category-item hierarchy using two DQNs to address the recommender system’s large action space issue. To further improve the recommendation accuracy, we proposed a BCS strategy with additional auxiliary learning to increase the sampling efficiency during Q-learning. We conducted numerous experiments on four different datasets. The experimental results demonstrate that our method can consistently achieve better recommendation accuracy than the best baseline for long-term recommendation.

Nevertheless, there are some issues in real-world recommendation scenarios that can be studied in the future. The current DHCRS relies on an explicit category classification, while some real-world recommendation scenarios come without well-defined category information. To address this issue, we intend to study how to learn a latent item-category hierarchy, for example, by grouping the items based on pretrained item embeddings via a hierarchical clustering algorithm, and then apply the learned hierarchy to DHCRS. In addition, our model is only evaluated in environments where items have a limited number of categories. In real-world applications, however, one item may belong to many different categories or hashtags. How to address such large-scale multilabel problems is a challenge for our DHCRS framework. For this problem, we plan to study how to reduce the scale of the categories. More specifically, we intend to treat the category information as features, use feature selection techniques to find relatively independent and informative categories, and then use the reduced category set in DHCRS.
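As a sketch of the first future direction above, and not part of DHCRS itself, pretrained item embeddings could be grouped into latent categories with an off-the-shelf hierarchical clustering routine; the embedding source and the number of clusters are assumptions.

import numpy as np
from sklearn.cluster import AgglomerativeClustering

def latent_item_categories(item_embeddings, n_categories=20):
    # item_embeddings: (num_items, dim) array of pretrained item vectors
    clustering = AgglomerativeClustering(n_clusters=n_categories, linkage="ward")
    labels = clustering.fit_predict(np.asarray(item_embeddings))
    return labels  # labels[i] is the latent category assigned to item i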
VI. ACKNOWLEDGMENT

This work was conducted within the Delta-NTU Corporate Lab for Cyber-Physical Systems, Nanyang Technological University.

Mingsheng Fu received the Ph.D. degree in computer science from the University of Electronic Science and Technology of China, Chengdu, China, in 2019.
From 2019 to 2021, he was a Research Fellow with the School of Electrical and Electronic Engineering, Nanyang Technological University, Singapore. He is also a Postdoctoral Researcher with the School of Computer Science and Engineering, University of Electronic Science and Technology of China. His current research interests are neural networks, reinforcement learning, and recommender systems.

Anubha Agrawal is currently pursuing the M.S. degree in computer science and systems with the University of Washington, Tacoma, WA, USA.
Her current research interests relate to deep neural networks, machine learning, and opinion spam detection.

Athirai A. Irissappane received the master’s degree from the National University of Singapore, Singapore, in 2012, and the Ph.D. degree from Nanyang Technological University, Singapore, in 2016.
She is currently an Assistant Professor with the University of Washington, Tacoma, WA, USA. She has worked as a Researcher with Dimensional Mechanics, the Rolls-Royce@NTU Corporate Lab, and Toshiba, Singapore. Her current research interests are in reinforcement learning and planning, focusing on deep reinforcement learning techniques and their applications toward real-world problems.

Jie Zhang received the Ph.D. degree from the Cheriton School of Computer Science, University of Waterloo, Waterloo, ON, Canada, in 2009.
He is an Associate Professor with the School of Computer Science and Engineering, Nanyang Technological University, Singapore. He is also an Associate with the Singapore Institute of Manufacturing Technology, Singapore. During his Ph.D. studies, he held the prestigious NSERC Alexander Graham Bell Canada Graduate Scholarship, which is awarded to top Ph.D. students across Canada. His papers have been published in top journals and conferences and have won several best paper awards.
Dr. Zhang was a recipient of the Alumni Gold Medal at the 2009 Convocation Ceremony. The Gold Medal is awarded once a year to honor the top Ph.D. graduate from the University of Waterloo. He is also active in serving research communities.

Liwei Huang received the Ph.D. degree in computer science from the University of Electronic Science and Technology of China, Chengdu, China, in 2019.
She is currently a Postdoctoral Fellow with the University of Electronic Science and Technology of China. Her interests focus on reinforcement learning, robotics, and coordination among agents.

Hong Qu (Member, IEEE) received the Ph.D. degree in computer science from the University of Electronic Science and Technology of China, Chengdu, China, in 2006.
From 2007 to 2008, he was a Postdoctoral Fellow with the Advanced Robotics and Intelligent Systems Lab, School of Engineering, University of Guelph, Guelph, ON, Canada. From 2014 to 2015, he worked as a Visiting Scholar with the Potsdam Institute for Climate Impact Research, Potsdam, Germany, and the Humboldt University of Berlin, Berlin, Germany. He is currently a Professor with the Computational Intelligence Laboratory, School of Computer Science and Engineering, University of Electronic Science and Technology of China. His research interests include neural networks, machine learning, and big data.
