Advanced Self-Attentive Sequential Recommendation

Trang Dang Duc Tin∗, Pham Quoc Buu†, Duong Thanh Trieu‡, and Ngo Quang Minh§
∗ 22125106, 22APCS1, HCMUS
Email: [email protected]
† 22125012, 22APCS1, HCMUS
Email: [email protected]
‡ 22125112, 22APCS1, HCMUS
Email: [email protected]
§ 22125057, 22APCS1, HCMUS
Email: [email protected]

Abstract- Recommender systems are now widely used in a variety of fields, and many of them are built on the concept of Sequential Recommendation (SR). The main idea of SR is to predict the next items that users are interested in, or most likely to buy, by modeling the sequential dependencies over their previous interactions with items. Currently, there are two main approaches to capturing a user's action patterns: Markov Chains (MCs) and Recurrent Neural Networks (RNNs). MCs assume that the user's next action can be predicted from their last or last few actions, while RNNs can capture longer-term semantics. Each has its strengths: RNNs work better on dense datasets, where higher model complexity is affordable, whereas MCs perform better on extremely sparse datasets, where parsimony matters most. SASREC aims to balance these two approaches by capturing long-term semantics while using an attention mechanism to base its predictions on a few recent actions. With SASREC as our base model, we introduce an advanced variant called Advanced SASREC (ASASREC). In this paper, we try to improve the SASREC model by re-modeling the function and adding a timestamp input in order to increase the accuracy, speed, and efficiency of the model. We tested it on several large datasets; on the Steam dataset it reaches 0.8879 in Hit@10 and 0.6456 in NDCG@10, which shows the effectiveness and flexibility of the model and suggests its applicability in real-world recommendation scenarios.

I. INTRODUCTION

Recommendation systems play an important role in various fields, from e-commerce to media platforms, increasing user satisfaction and experience. A recommendation system enhances the user's experience by showing choices that suit their preferences, and it helps businesses improve their sales. One of the most popular methodologies of recommendation is Sequential Recommendation (SR), which is used to predict a user's next behaviors based on their historical interaction sequence. Since the input space of SR grows exponentially with the number of past actions used as context, it is challenging to exploit the patterns in sequential dynamics, and much research has therefore gone into finding ways to capture them.

Currently, there are two popular approaches to capturing such patterns: Markov Chains (MCs) and Recurrent Neural Networks (RNNs). MCs assume that the user's next behavior depends only on the last or last few actions; they work best on sparse datasets, where model parsimony is crucial. RNNs, on the other hand, use all of the user's previous actions and summarize them to capture longer-term semantics; they perform better on dense datasets, where higher model complexity is affordable. Each of these approaches performs well in specific cases and has its own limitations: MCs tend to underperform in more complex scenarios, while RNNs require a large amount of data.

SASREC is a recently introduced model that aims to combine these two methods. It applies self-attention, an efficient mechanism for uncovering syntactic and semantic patterns between words in a sentence, to SR. This approach addresses the problems of both methods mentioned above: it can capture the full context of a user's past behaviors while still making predictions based on only a few previous actions. SASREC performs better than many MC/CNN/RNN-based methods. However, it still struggles in some specific cases, because it uses only the user's ID and the item's ID as inputs, which leads to some mispredictions.

Inspired by SASREC, and intending to upgrade the model and increase its accuracy, we propose an improvement of SASREC called ASASREC. By using the timestamp as an additional input to the model, and by re-modeling the mathematics of the model together with a change of regularization method, we improve on the accuracy of the original SASREC. The results suggest that our work in improving SASREC into ASASREC will contribute to solving real-world problems in building an efficient and accurate recommender system.

II. RELATED WORKS

A. Stochastic Shared Embeddings

Regularization techniques are used to manage model complexity and prevent over-fitting. ℓ2 regularization is the most popular approach and has been widely used in various matrix factorization models within recommender systems. On the other hand, ℓ1 regularization is used when a sparse model is preferred. For deep neural networks, it has been shown that ℓp regularizations are often too weak, while dropout is more effective in practice. There are also numerous other regularization techniques, including parameter sharing, max-norm regularization, gradient clipping, etc.

Figure 1: A simplified diagram showing the training process of SASREC. At each time step, the model considers all previous items and uses attention to 'focus' on items relevant to the next action.

In this paper, we utilize a regularization method called Stochastic Shared Embeddings (SSE), which is a tested data-driven technique for deep neural networks without the need for dropout. In brief, the technique is data-driven in that the loss influences the regularization step (the method stochastically replaces an embedding with another embedding with some pre-defined probability during SGD). Additionally, the original paper [40] has experimented with various architectures, including recommendation systems, with encouraging results.
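
As an illustration of the replacement step described above, the following minimal sketch swaps each embedding index in a batch for a different, uniformly drawn index with probability p before the embedding lookup. It is only a sketch of the simplest form of the idea: the function name `sse_replace`, the uniform choice of the replacement, and the values of `p` and `num_embeddings` are assumptions made for this example, not the configuration used in ASASREC or prescribed by [40].

```python
import random


def sse_replace(indices, num_embeddings, p, rng):
    """Return a copy of `indices` in which each entry is, with probability p,
    swapped for a different index drawn uniformly at random.

    Hypothetical helper for illustration only.
    """
    replaced = []
    for idx in indices:
        if rng.random() < p:
            # Draw uniformly from the other (num_embeddings - 1) indices:
            # sample from a range one shorter, then skip over `idx`.
            new_idx = rng.randrange(num_embeddings - 1)
            if new_idx >= idx:
                new_idx += 1
            replaced.append(new_idx)
        else:
            replaced.append(idx)
    return replaced


# Example: perturb a small batch of item indices (values are illustrative).
rng = random.Random(0)
batch = [3, 7, 7, 1, 4]
noisy = sse_replace(batch, num_embeddings=10, p=0.1, rng=rng)
```

In such a setup, the perturbed indices would stand in for the original ones in the embedding lookup during training only, and evaluation would use the original indices unchanged.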

B. Self-Attentive Sequential Recommendation

Sequential dynamics capture the context of users' next behaviors based on their recent actions. Two methods have arisen: MCs, which predict a user's next action based solely on their recent history, and Recurrent Neural Networks (RNNs), which consider the entire action history to uncover long-term semantics. While MCs perform best on extremely sparse datasets where model simplicity is important, RNNs work better on denser datasets, which require higher model complexity.

SASREC was introduced to make use of the advantages of these two approaches: through an attention mechanism, it combines the ability to capture long-term semantics, akin to an RNN, with the efficiency of making predictions based on a limited number of actions, similar to an MC. At each time step, SASREC identifies 'relevant' items from a user's action history and uses them to predict the next item. Studies show that this method outperforms various state-of-the-art sequential models (including MC/CNN/RNN-based approaches) on both sparse and dense datasets. Moreover, the model is more efficient than comparable CNN/RNN-based models.

However, our model ASASREC outperforms SASREC by almost 0.05 in terms of NDCG@10 (Normalized Discounted Cumulative Gain at rank 10) on real-world datasets like Steam and Amazon. The improved performance of ASASREC demonstrates its superiority in personalized recommendation accuracy.

Figure 2 (notation summary)

III. METHODOLOGY

To begin with, a Sequential Recommendation system normally takes a user's action sequence as input:

$$S^u = (S^u_1, S^u_2, \ldots, S^u_{|S^u|}) \tag{1}$$

where $|S^u|$ is the number of actions, and the system predicts the next item. During the prediction process, at step t, the model predicts the next item based on the previous t items. As shown in Figure 1, conceptually, the input of the model can be thought of as $(S^u_1, S^u_2, \ldots, S^u_{|S^u|-1})$ and its output as $(S^u_2, S^u_3, \ldots, S^u_{|S^u|})$, which is the input shifted by one position. In this section, we describe the construction of a sequential recommendation model via an embedding layer and several self-attention blocks, together with some improvements to the respective architecture. The notation is summarized in Figure 2.

A. Embedding layer

We transform the training sequence $(S^u_1, S^u_2, \ldots, S^u_{|S^u|-1})$ into a fixed-length sequence $s = (s_1, s_2, \ldots, s_n)$, where n is the maximum length that the model can handle. Two cases arise when the sequence length is not n. First, if its length is greater than n, we keep the most recent n actions (e.g. $(s_2, \ldots, s_{n+1})$ of $(s_1, s_2, \ldots, s_{n+1})$). Second, if the sequence length is smaller than n, we add a padding item to the left until its length
is n. Then, we create an item embedding matrix $M \in \mathbb{R}^{|I| \times d}$, where d is the latent dimensionality, and retrieve the input embedding matrix $E \in \mathbb{R}^{n \times d}$, where $E_i = M_{s_i}$. A constant zero vector $\mathbf{0}$ is used as the embedding for the padding item.

• Positional Embedding: Because the self-attention model does not include any recurrent or convolutional module, it is not aware of the positions of the previous items. Hence we inject a learnable position embedding $P \in \mathbb{R}^{n \times d}$ into the input embedding:

$$\widehat{E} = \begin{bmatrix} M_{s_1} + P_1 \\ M_{s_2} + P_2 \\ \vdots \\ M_{s_n} + P_n \end{bmatrix} \tag{2}$$

B. Self-Attention Block

First, let us start with the scaled dot-product attention:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^T}{\sqrt{d}}\right)V \tag{3}$$

where Q represents the queries, K the keys and V the values (each row represents an item). Intuitively, the attention layer calculates a weighted sum of all values, where the weight between query i and value j relates to the interaction between query i and key j. The scale factor $\sqrt{d}$ avoids overly large values of the inner product when the dimensionality is high.

• Self-Attention Layer: This is one element of the self-attention block. The self-attention operation takes the embedding matrix $\widehat{E}$ as input, converts it to three matrices through linear projections, and then feeds these matrices into the attention layer:

$$S = \mathrm{SA}(\widehat{E}) = \mathrm{Attention}(\widehat{E}W^Q, \widehat{E}W^K, \widehat{E}W^V) \tag{4}$$

where the projection matrices $W^Q, W^K, W^V \in \mathbb{R}^{d \times d}$. The projections make the model more flexible. For example, the model can learn asymmetric interactions (i.e. (query i, key j) and (query j, key i) can have different interactions).

• Point-Wise Feed-Forward Network: Self-attention is able to aggregate all previous items' embeddings with adaptive weights. However, it is still a linear model. In order to endow the model with nonlinearity and to consider interactions between different latent dimensions, we apply a point-wise two-layer feed-forward network to all $S_i$ identically:

$$F_i = \mathrm{FFN}(S_i) = \mathrm{ReLU}(S_i W^{(1)} + b^{(1)}) W^{(2)} + b^{(2)} \tag{5}$$

where $W^{(1)}, W^{(2)}$ are $d \times d$ matrices and $b^{(1)}, b^{(2)}$ are d-dimensional vectors. Note that there is no interaction between $S_i$ and $S_j$ ($j \ne i$), which means that information is not leaked from back to front.

C. Stacking Self-Attention Blocks

With the first self-attention block, $F_i$ aggregates all previous items' embeddings (i.e. $\widehat{E}_j$ with $j \le i$). However, it may be useful to find out more item transitions via another self-attention block based on F. This means that we stack self-attention blocks (each consisting of a self-attention layer and a feed-forward network), so that the output of one block is the input of the next. The b-th (b > 1) block is defined as:

$$S^{(b)} = \mathrm{SA}(F^{(b-1)}), \qquad F^{(b)}_i = \mathrm{FFN}(S^{(b)}_i), \quad \forall i \in \{1, 2, \ldots, n\} \tag{6}$$

and the first block is defined as $S^{(1)} = S$ and $F^{(1)} = F$.

However, some issues occur when the network goes deeper:
– The increased model capacity leads to overfitting.
– The training process becomes unstable (due to vanishing gradients etc.).
– Models with more parameters often require more training time.

We will discuss some ideas to handle these issues.

• Residual Connections: The core idea behind residual networks is to propagate low-layer features to higher layers through residual connections. Hence, if low-layer features are useful, the model can easily propagate them to the final layer. For example, in our case, the last visited item plays a key role in predicting the next item. However, after several self-attention blocks, the embedding of the last visited item is entangled with all previous items. Adding residual connections to propagate the last visited item's embedding to the final layer makes it much easier for the model to leverage low-layer information.

• Layer Normalization: Layer normalization is used to normalize the inputs across features (i.e., zero-mean and unit-variance), which is beneficial for stabilizing and accelerating neural network training. Assuming the input is a vector x which contains all features of a sample, the operation is defined as:

$$\mathrm{LayerNorm}(x) = \alpha \odot \frac{x - \mu}{\sqrt{\sigma^2 + \epsilon}} + \beta \tag{7}$$

where $\odot$ is an element-wise product (i.e. the Hadamard product), $\mu$ and $\sigma^2$ are the mean and the variance of x, and $\alpha$ and $\beta$ are learned scaling factors and bias terms.

• Dropout: Dropout is a regularization technique to alleviate overfitting problems in deep neural networks. The idea of dropout is simple: randomly 'turn off' neurons with probability p during training, and use all neurons when testing.

Combining these ideas, the following operation can reduce the problems listed above:

$$g(x) = x + \mathrm{Dropout}(g(\mathrm{LayerNorm}(x))) \tag{8}$$

where g(x) represents the self-attention layer or the feed-forward network. That is to say, for layer g in each block, we apply layer normalization on the input x before feeding it into g, apply dropout on g's output, and add the input x to the final output. These are the three operations introduced above.

D. Prediction Layer

After b self-attention blocks that adaptively and hierarchically extract information of previously consumed items,
we predict the next item (given the first t items) based on $F^{(b)}_t$. In particular, we adopt a Matrix Factorization layer to predict the relevance of item i:

$$r_{i,t} = F^{(b)}_t N_i^T \tag{9}$$

where $r_{i,t}$ is the relevance of item i being the next item given the first t items (i.e. $s_1, s_2, \ldots, s_t$) and $N \in \mathbb{R}^{|I| \times d}$ is an item embedding matrix. Hence, a high interaction score $r_{i,t}$ means a high relevance, and we can generate recommendations by ranking the scores.

• Shared Item Embedding: To reduce the model size and alleviate overfitting, we consider another scheme that only uses a single item embedding M:

$$r_{i,t} = F^{(b)}_t M_i^T \tag{10}$$

Note that $F^{(b)}_t$ can be represented as a function depending on the item embedding M: $F^{(b)}_t = f(M_{s_1}, M_{s_2}, \ldots, M_{s_t})$.

• Explicit User Modeling: In order to provide personalized recommendations, existing methods often take one of two approaches: 1) learn an explicit user embedding representing user preference; 2) consider the user's previous actions, and induce an implicit user embedding from embeddings of visited items. Our method falls into the latter category, as we generate an embedding $F^{(b)}_n$ by considering all actions of a user. However, we can also insert an explicit user embedding at the last layer, for example via addition:

$$r_{u,i,t} = (U_u + F^{(b)}_t) M_i^T \tag{11}$$

where U is a user embedding matrix.

E. Network Training

Recall that we convert each user sequence (excluding the last action) $(S^u_1, \ldots, S^u_{|S^u|-1})$ to $s = (s_1, s_2, \ldots, s_n)$ via truncation or padding. We define $o_t$ as the expected output at time step t:

$$o_t = \begin{cases} \texttt{<pad>} & \text{if } s_t \text{ is a padding item} \\ s_{t+1} & 1 \le t < n \\ S^u_{|S^u|} & t = n \end{cases} \tag{12}$$

where <pad> indicates a padding item. Our model takes a sequence s as input and the corresponding sequence o as expected output, and we adopt the binary cross entropy loss as the objective function:

$$-\sum_{S^u \in \mathcal{S}} \sum_{t \in [1, 2, \ldots, n]} \left[ \log(\sigma(r_{o_t,t})) + \sum_{j \notin S^u} \log(1 - \sigma(r_{j,t})) \right] \tag{13}$$

Note that we ignore the terms where $o_t$ = <pad>.

IV. EXPERIMENTS

In this section, we present our experimental setup and empirical results using various means and metrics.

A. Datasets

We assess our methodologies using two real-world datasets from different applications. These datasets differ significantly in their domains, platforms, and levels of sparsity:

• Amazon: This dataset, presented in the VisualSIGIR paper [47], comprises product reviews collected by crawling Amazon.com. Each main product category on Amazon is treated as a distinct dataset. We focus on two categories, namely 'Beauty' and 'Games'. This dataset is particularly notable for its high sparsity and variability.
• Steam: This dataset contains information from October 2010 to January 2018 and includes 2,567,538 users, 15,474 games, and 7,793,069 English reviews. It also provides additional information that could be valuable for future research, such as users' play hours, pricing details, media scores, categories, developers, and more.

For all datasets, we treat the presence of a review or rating as implicit feedback, indicating that the user has interacted with the item. We use timestamps to determine the chronological order of these actions. Users and items with fewer than 5 related actions are excluded from the analysis. To partition the data, we divide the historical sequence $S^u$ of each user u into three parts: (1) the most recent action $S^u_{|S^u|}$ for testing, (2) the second most recent action $S^u_{|S^u|-1}$ for validation, and (3) all remaining actions for training. Note that during testing, the input sequences contain both training actions and validation actions.

Data statistics are shown in Table I. From the table, it is clear that the Amazon datasets have the fewest actions per user and per item (on average), while Steam has a high average number of actions per item.

Table I: Dataset statistics (after preprocessing)
| Dataset | #users | #items | avg. actions/user | avg. actions/item | #actions |
| Amazon Beauty | 52,024 | 57,289 | 7.6 | 6.9 | 0.4M |
| Amazon Games | 31,013 | 23,715 | 9.3 | 12.1 | 0.3M |
| Steam | 334,730 | 13,047 | 11.0 | 282.5 | 3.7M |

B. Evaluation Metrics

After reviewing a variety of metrics from past papers, we chose two of the most common ranking metrics, NDCG@K and Hit@K, to evaluate the top-K recommendations:

• NDCG@K: defined as:

$$\mathrm{NDCG@}K = \frac{1}{n} \sum_{i=1}^{n} \frac{\mathrm{DCG@}K(i, \Pi_i)}{\mathrm{DCG@}K(i, \Pi_i^*)} \tag{14}$$

where i represents the i-th user (here n denotes the number of users) and

$$\mathrm{DCG@}K(i, \Pi_i) = \sum_{l=1}^{K} \frac{2^{R_{i\Pi_{il}}} - 1}{\log_2(l + 1)} \tag{15}$$

In the DCG definition, $\Pi_{il}$ represents the index of the l-th ranked item for user i in the test data based on the learned score matrix X. R is the rating matrix and $R_{ij}$ is the rating given to item j by user i. $\Pi_i^*$ is the ordering provided by the ground truth rating.

• Hit@K: defined as the fraction of positive items retrieved by the top K recommendations the model makes:

$$\mathrm{Hit@}K = \frac{\sum_{i=1}^{n} \mathbb{1}\{\exists\, 1 \le l \le K : R_{i\Pi_{il}} = 1\}}{n} \tag{16}$$

Here we assume that there is only a single positive item that the user will engage with next, and the indicator function $\mathbb{1}\{\exists\, 1 \le l \le K : R_{i\Pi_{il}} = 1\}$ indicates whether the positive item falls into the top K positions of our ranked list.

In the ranking setting, at time point t, the rating matrix R can be constructed in two ways. One approach is to include all ratings after t, while the other is to include only the ratings at time point t + 1. To keep a setting that allows for convenient performance measurement, we adopt the latter approach.

When dealing with a large dataset that consists of numerous users and items, the evaluation process can become computationally intensive. Specifically, computing the ranking of all items based on their predicted scores for every individual user, as described in (14), would be quite time-consuming. To speed up the evaluation process, we employ a sampling strategy: we sample a fixed number C of negative candidates and also include the positive item that we know the user will engage with next. By doing so, both the set of item candidates $R_{ij}$ and the set of item predictions $\Pi_i$ are reduced to a smaller subset. Consequently, prediction scores only need to be computed for this reduced set of items through a single forward pass of the neural network.

Our ideal scenario is to achieve NDCG and Hit values of exactly 1. NDCG@K = 1 indicates that the positive item is consistently ranked at the top position of the top-K recommendation list. Similarly, Hit@K = 1 signifies that the positive item is always included in the top-K recommendations generated by the model.

During our evaluation procedures, increasing the value of C or decreasing the value of K makes the recommendation problem harder. This is because a larger candidate pool and a higher ranking quality are desired, which requires the model to make more accurate and precise recommendations.

To show the effectiveness of our method, we include three groups of recommendation methods.

The first group includes general recommendation methods which only consider user feedback without considering the sequence order of actions:
• PopRec

The second group contains methods based on first-order Markov chains, which consider the last visited item:
• Factorized Personalized Markov Chains (FPMC) [1]
• Translation-based Recommendation (TransRec) [19]

The last group contains deep-learning-based sequential recommender systems, which consider several (or all) previously visited items:
• GRU4Rec [2]
• GRU4Rec+ [26]

C. Implementation Details

For the architecture in the default version of ASASREC, we use the same setting as its predecessor SASREC, which has two self-attention blocks (b = 2), and we use the learned positional embedding. Item embeddings in the embedding layer and prediction layer are shared. We implement ASASREC with TensorFlow. The optimizer is the Adam optimizer [42], the learning rate is set to 0.001, and the batch size is 128. The dropout rate of turning off neurons is 0.5 for the three datasets due to their sparsity. The maximum sequence length n is set to 50 for the datasets, i.e., roughly proportional to the mean number of actions per user. The performance of variants and different hyper-parameters is examined below.

D. Recommendation Performance

Table II: Recommendation performance. The best-performing method in each row is boldfaced, and the second-best method in each row is underlined. Improvements over non-neural and neural approaches are shown in the last two columns respectively.
| Dataset | Metric | (a) PopRec | (b) FPMC | (c) TransRec | (d) GRU4Rec | (e) GRU4Rec+ | (f) ASASREC |
| Beauty | Hit@10 | 0.4053 | 0.4360 | 0.4657 | 0.2175 | 0.3999 | 0.4904 |
| Beauty | NDCG@10 | 0.2327 | 0.2941 | 0.3070 | 0.1253 | 0.2606 | 0.3269 |
| Games | Hit@10 | 0.4774 | 0.6852 | 0.6888 | 0.2988 | 0.6749 | 0.7460 |
| Games | NDCG@10 | 0.2829 | 0.4730 | 0.4607 | 0.1887 | 0.4809 | 0.5410 |
| Steam | Hit@10 | 0.7222 | 0.7760 | 0.7674 | 0.4240 | 0.8168 | 0.8879 |
| Steam | NDCG@10 | 0.4585 | 0.5061 | 0.4902 | 0.2746 | 0.5745 | 0.6456 |

Table II presents the recommendation performance of all methods on the three datasets. Considering the second-best methods across all datasets, a general pattern emerges, with non-neural methods performing better on sparse datasets and neural approaches performing better on denser datasets. Presumably this is because neural approaches have more parameters to capture high-order transitions (i.e., they are expressive but easily overfit), whereas carefully designed but simpler models are more effective in high-sparsity settings. Our method ASASREC outperforms all baselines on both sparse and dense datasets, and gains 6.9% Hit Rate and 9.6% NDCG improvements (on average) against the strongest baseline. One likely reason is
that our model can adaptively attend to items within different ranges on different datasets (e.g. only the previous item on sparse datasets and more items on dense datasets).

Table III: Comparing SASREC and ASASREC on Amazon Games Dataset while varying dimension of embeddings.
| METHODS | NDCG@10 | HIT@10 | USER DIM | ITEM DIM |
| SASREC | 0.5936 | 0.8233 | N/A | 50 |
| SASREC | 0.5919 | 0.8202 | N/A | 100 |
| ASASREC | 0.6221 | 0.8129 | 50 | 50 |
| ASASREC | 0.6292 | 0.8389 | 50 | 100 |

Table IV: Comparing SASREC and ASASREC on Amazon Games Dataset while varying the maximum length allowed.
| METHODS | NDCG@10 | RECALL@10 | MAX LEN | USER DIM | ITEM DIM |
| SASREC | 0.5919 | 0.8202 | 200 | N/A | 100 |
| ASASREC | 0.6281 | 0.8341 | 200 | 50 | 100 |
| SASREC | 0.5769 | 0.8045 | 100 | N/A | 100 |
| ASASREC | 0.6186 | 0.8318 | 100 | 50 | 100 |

E. Scalability

As with standard MF methods, ASASREC scales linearly with the total number of users, items and actions, which is roughly the same as the original SASREC. A potential scalability concern is the maximum length n; however, the computation can be effectively parallelized with GPUs. Here we measure the training time and performance of ASASREC with different n, empirically study its scalability, and analyze whether it can handle sequential recommendation in most cases. Table V shows the performance and efficiency of ASASREC with various sequence lengths. Performance is better with larger n, up to around n = 500, at which point performance saturates (possibly because 99.8% of actions have been covered). However, even with n = 600, the model can be trained in under 2,000 seconds, which is still faster than Caser and GRU4Rec+. Hence, our model can easily scale to user sequences whose length is a few hundred, which we think is sufficient for most cases, especially for feedback types like reviews and purchases. We plan to investigate approaches for handling very long sequences in the future.

Table V: Scalability: performance and training time with different maximum length n on Steam dataset.
| n | 10 | 50 | 100 | 200 | 300 | 400 | 500 | 600 |
| Time(s) | 75 | 101 | 157 | 341 | 613 | 965 | 1406 | 1895 |
| NDCG@10 | 0.480 | 0.557 | 0.571 | 0.587 | 0.593 | 0.594 | 0.596 | 0.595 |

V. CONCLUSION

In this work, we proposed an extension of the self-attention-based sequential model SASREC for next-item recommendation. ASASREC models the entire user sequence (without any recurrent or convolutional operations), and adaptively considers consumed items for prediction. Compared with the original SASREC model, our model adds complexity in how it handles the user/item interaction history. In the future, we plan to extend the model by incorporating context information (e.g. dwell time, action types, rating, devices, etc.), and possibly CTR (i.e. click-through rate).

REFERENCES

[1] S. Rendle, C. Freudenthaler, and L. Schmidt-Thieme, “Factorizing personalized markov chains for next-basket recommendation,” in WWW, 2010.
[2] B. Hidasi, A. Karatzoglou, L. Baltrunas, and D. Tikk, “Session-based recommendations with recurrent neural networks,” in ICLR, 2016.
[3] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin, “Attention is all you need,” in NIPS, 2017.
[4] Y. Hu, Y. Koren, and C. Volinsky, “Collaborative filtering for implicit feedback datasets,” in ICDM, 2008.
[5] S. Rendle, C. Freudenthaler, Z. Gantner, and L. Schmidt-Thieme, “BPR: bayesian personalized ranking from implicit feedback,” in UAI, 2009.
[6] F. Ricci, L. Rokach, B. Shapira, and P. Kantor, Recommender systems handbook. Springer US, 2011.
[7] Y. Koren and R. Bell, “Advances in collaborative filtering,” in Recommender Systems Handbook. Springer, 2011.
[8] S. Kabbur, X. Ning, and G. Karypis, “Fism: factored item similarity models for top-n recommender systems,” in SIGKDD, 2013.
[9] S. Zhang, L. Yao, and A. Sun, “Deep learning based recommender system: A survey and new perspectives,” arXiv, vol. abs/1707.07435, 2017.
[10] S. Wang, Y. Wang, J. Tang, K. Shu, S. Ranganath, and H. Liu, “What your images reveal: Exploiting visual contents for point-of-interest recommendation,” in WWW, 2017.
[11] W. Kang, C. Fang, Z. Wang, and J. McAuley, “Visually-aware fashion recommendation and design with generative image models,” in ICDM, 2017.
[12] H. Wang, N. Wang, and D. Yeung, “Collaborative deep learning for recommender systems,” in SIGKDD, 2015.
[13] D. H. Kim, C. Park, J. Oh, S. Lee, and H. Yu, “Convolutional matrix factorization for document context-aware recommendation,” in RecSys, 2016.
[14] X. He, L. Liao, H. Zhang, L. Nie, X. Hu, and T. Chua, “Neural collaborative filtering,” in WWW, 2017.
[15] S. Sedhain, A. K. Menon, S. Sanner, and L. Xie, “Autorec: Autoencoders meet collaborative filtering,” in WWW, 2015.
[16] Y. Koren, “Collaborative filtering with temporal dynamics,” Communications of the ACM, 2010.
[17] C. Wu, A. Ahmed, A. Beutel, A. J. Smola, and H. Jing, “Recurrent recommender networks,” in WSDM, 2017.
[18] L. Xiong, X. Chen, T.-K. Huang, J. Schneider, and J. G. Carbonell, “Temporal collaborative filtering with bayesian probabilistic tensor factorization,” in SDM, 2010.
[19] R. He, W. Kang, and J. McAuley, “Translation-based recommendation,” in RecSys, 2017.
[20] R. He, C. Fang, Z. Wang, and J. McAuley, “Vista: A visually, socially, and temporally-aware model for artistic recommendation,” in RecSys, 2016.
[21] R. He and J. McAuley, “Fusing similarity models with markov chains for sparse sequential recommendation,” in ICDM, 2016.
[22] J. Tang and K. Wang, “Personalized top-n sequential recommendation via convolutional sequence embedding,” in WSDM, 2018.
[23] H. Jing and A. J. Smola, “Neural survival recommender,” in WSDM, 2017.
[24] Q. Liu, S. Wu, D. Wang, Z. Li, and L. Wang, “Context-aware sequential recommendation,” in ICDM, 2016.
[25] A. Beutel, P. Covington, S. Jain, C. Xu, J. Li, V. Gatto, and E. H. Chi, “Latent cross: Making use of context in recurrent recommender systems,” in WSDM, 2018.
[26] B. Hidasi and A. Karatzoglou, “Recurrent neural networks with top-k gains for session-based recommendations,” CoRR, vol. abs/1706.03847, 2017.
[27] K. Xu, J. Ba, R. Kiros, K. Cho, A. C. Courville, R. Salakhutdinov, R. S. Zemel, and Y. Bengio, “Show, attend and tell: Neural image caption generation with visual attention,” in ICML, 2015.
[28] D. Bahdanau, K. Cho, and Y. Bengio, “Neural machine translation by jointly learning to align and translate,” in ICLR, 2015.
[29] J. Chen, H. Zhang, X. He, L. Nie, W. Liu, and T. Chua, “Attentive
collaborative filtering: Multimedia recommendation with item- and
component-level attention,” in SIGIR, 2017.
[30] J. Xiao, H. Ye, X. He, H. Zhang, F. Wu, and T. Chua, “Attentional
factorization machines: Learning the weight of feature interactions via
attention networks,” in IJCAI, 2017.
[31] S. Wang, L. Hu, L. Cao, X. Huang, D. Lian, and W. Liu, “Attention-
based transactional context embedding for next-item recommendation,”
in AAAI, 2018.
[32] Y. Wu, M. Schuster, Z. Chen, Q. V. Le, M. Norouzi, W. Macherey,
M. Krikun, Y. Cao, Q. Gao, K. Macherey et al., “Google’s neural
machine translation system: Bridging the gap between human and
machine translation,” arXiv preprint arXiv:1609.08144, 2016.
[33] J. Zhou, Y. Cao, X. Wang, P. Li, and W. Xu, “Deep recurrent models
with fast-forward connections for neural machine translation,” TACL,
2016.
[34] M. D. Zeiler and R. Fergus, “Visualizing and understanding convolutional
networks,” in ECCV, 2014.
[35] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image
recognition,” in CVPR, 2016.
[36] L. J. Ba, R. Kiros, and G. E. Hinton, “Layer normalization,” CoRR, vol.
abs/1607.06450, 2016.
[37] S. Ioffe and C. Szegedy, “Batch normalization: Accelerating deep network
training by reducing internal covariate shift,” in ICML, 2015.
[38] N. Srivastava, G. E. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhut-
dinov, “Dropout: a simple way to prevent neural networks from
overfitting,” JMLR, 2014.
[39] D. Warde-Farley, I. J. Goodfellow, A. C. Courville, and Y. Bengio, “An
empirical analysis of dropout in piecewise linear networks,” CoRR, vol.
abs/1312.6197, 2013.
[40] L. Wu, S. Li, C.-J. Hsieh, and J. Sharpnack, “Stochastic shared embeddings: Data-driven regularization of embedding layers,” 2020.
[41] Y. Koren, R. Bell, and C. Volinsky, “Matrix factorization techniques for
recommender systems,” Computer, 2009.
[42] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,”
in ICLR, 2015.
[43] D. Povey, H. Hadian, P. Ghahremani, K. Li, and S. Khudanpur, “A
time-restricted self-attention layer for asr,” in ICASSP, 2018.
[44] P. Wang, J. Guo, Y. Lan, J. Xu, S. Wan, and X. Cheng, “Learning
hierarchical representation model for next basket recommendation,” in
SIGIR, 2015.
[45] S. Bai, J. Z. Kolter, and V. Koltun, “An empirical evaluation of generic
convolutional and recurrent networks for sequence modeling,” CoRR,
vol. abs/1803.01271, 2018.
[46] S. Hochreiter, Y. Bengio, P. Frasconi, J. Schmidhuber et al., “Gradient
flow in recurrent nets: the difficulty of learning long-term dependencies,”
2001.
[47] J. J. McAuley, C. Targett, Q. Shi, and A. van den Hengel, “Image-based
recommendations on styles and substitutes,” in SIGIR, 2015.
[48] S. Feng, X. Li, Y. Zeng, G. Cong, Y. M. Chee, and Q. Yuan, “Personalized
ranking metric embedding for next new poi recommendation,” in IJCAI,
2015.
[49] Y. Koren, “Factorization meets the neighborhood: a multifaceted collab-
orative filtering model,” in SIGKDD, 2008.
