Advance Self Attentive Sequential Recommendation
Trang Dang Duc Tin∗ , Pham Quoc Buu† , Duong Thanh Trieu‡ , and Ngo Quang Minh§
∗ 22125106, 22APCS1, HCMUS
Email: [email protected]
† 22125012, 22APCS1, HCMUS
Email: [email protected]
‡ 22125112, 22APCS1, HCMUS
Email:[email protected]
§ 22125057, 22APCS1, HCMUS
Email:[email protected]
Abstract- Recommender systems are now widely used in a variety of fields, and many of them are built on the Sequential Recommendation (SR) concept. The main idea of SR is to predict the next items that users are interested in, or most likely to buy, by modeling the sequential dependencies over their previous interactions with items. There are currently two main approaches to capturing a user’s action patterns: Markov Chains (MCs) and Recurrent Neural Networks (RNNs). MCs assume that the user’s next action can be predicted from their last or last few actions, while RNNs can capture longer-term semantics. Each has a regime in which it performs better: RNNs work better on dense datasets, where a more complex model is affordable, whereas MCs work better on extremely sparse datasets, where model parsimony matters most. SASREC balances these two approaches by capturing long-term semantics while using an attention mechanism to base its predictions on relatively few past actions. With SASREC as our base model, we introduce Advanced SASREC (ASASREC). In this paper, we try to improve the SASREC model by re-modeling its function and adding timestamps as an input, in order to increase the accuracy, speed, and efficiency of the model. We have tested it on several large datasets; on the Steam dataset it reaches 0.8879 in Hit@10 and 0.6456 in NDCG@10, which shows the effectiveness and flexibility of the model and suggests its applicability in real-world recommendation scenarios.

I. INTRODUCTION
Recommendation systems play an important role in various fields, from e-commerce to media platforms, increasing user satisfaction and experience. A recommendation system enhances the user’s experience by showing choices that suit their preferences, and helps businesses improve their sales. One of the most popular methodologies of recommendation is Sequential Recommendation (SR), which predicts a user’s next behaviors based on their historical interaction sequence. Since the space of SR inputs grows exponentially with the number of past actions used as context, it is challenging to exploit the patterns in sequential dynamics. Much research has therefore sought ways to capture these sequential dynamics.

Currently, there are two popular approaches to capturing such patterns: Markov Chains (MCs) and Recurrent Neural Networks (RNNs). MCs assume that the user’s next behavior depends only on the last or last few actions, which works best on sparse datasets, where model parsimony is crucial. RNNs, on the other hand, summarize all of the user’s past actions to capture longer-term semantics, and perform better on dense datasets, where a more complex model is affordable. Each approach performs well in some specific cases but has its own limitations: MCs tend to underperform in more complex scenarios, while RNNs require a large amount of data.

SASREC is a recently introduced model that aims to combine these two methods. It applies self-attention, an efficient mechanism for uncovering syntactic and semantic patterns between words in a sentence, to SR. This approach can address the problems of both methods: it can capture the full context of the user’s past behaviors, yet it can also make predictions based on just a few previous actions. SASREC performs better than many MC/CNN/RNN-based methods. However, SASREC still struggles in some specific cases, given that it uses only the user’s ID and the item’s ID as inputs, which leads to some mispredictions.

Inspired by SASREC, and intending to increase its accuracy, we propose an improvement of SASREC called ASASREC. By using timestamps as an additional input, re-modeling the mathematics of the model, and changing the regularization method, we improve on the accuracy of the original SASREC. The results suggest that ASASREC can contribute to solving real-world problems in building an efficient and accurate recommender system.
II. RELATED WORKS
A. Stochastic Shared Embeddings
Regularization techniques are used to manage model complexity and prevent over-fitting. ℓ2 regularization is the most popular approach and has been widely used in various matrix factorization models within recommender systems. On the other hand, ℓ1 regularization is used when a sparse model is preferred. For deep neural networks, it has been shown that ℓp regularization often lacks potency, while dropout is more effective in practical scenarios. Additionally, there are numerous other regularization techniques, including parameter sharing, max-norm regularization, and gradient clipping.
Figure 1: A simplified diagram showing the training process of SASREC. At each time step, the model considers all previous items and uses attention to ’focus’ on items relevant to the next action.

In this paper, we utilize a regularization method called Stochastic Shared Embeddings (SSE), a tested data-driven technique for deep neural networks without the need for dropout. In brief, the technique is data-driven in that the loss influences the regularization step: during SGD, the method stochastically replaces an embedding with another embedding with some pre-defined probability. Additionally, the original paper [40] experimented with various architectures, including recommendation systems, with encouraging results.
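The replacement step described above can be illustrated with a minimal sketch. It is not the implementation used in this paper or in [40]: the function name `sse_replace`, the uniform choice of the replacement index, and the example values are all assumptions made for illustration, and the sketch operates on integer item ids rather than on embedding tensors.

```python
import random


def sse_replace(item_ids, num_items, p, rng):
    """Stochastically swap embedding indices (illustrative SSE-style step).

    Each id in item_ids (assumed to lie in range(num_items), with
    num_items >= 2) is kept with probability 1 - p; otherwise it is
    replaced by a different id drawn uniformly from the other ids.
    """
    replaced = []
    for item in item_ids:
        if rng.random() < p:
            # Draw from the num_items - 1 other ids, skipping `item` itself.
            other = rng.randrange(num_items - 1)
            if other >= item:
                other += 1
            replaced.append(other)
        else:
            replaced.append(item)
    return replaced


rng = random.Random(0)          # fixed seed so the sketch is repeatable
sequence = [3, 7, 7, 1, 4]      # hypothetical item-id sequence for one user
noisy = sse_replace(sequence, num_items=10, p=0.1, rng=rng)
# `noisy` would be passed to the embedding lookup in place of `sequence`
# for this training step only; evaluation would use the original ids.
```

Because the swap happens on the indices before the embedding lookup, the gradient of the loss for that step flows into the substituted embedding, which is one way to read the statement that the loss influences the regularization step.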
score matrix X. R is the rating matrix and $R_{ij}$ is the rating given to item j by user i. $\Pi_i^*$ is the ordering provided by the ground truth rating.

• Hit@K: defined as the fraction of positive items retrieved by the top K recommendations the model makes:

\[ \mathrm{Hit@}K = \frac{\sum_{i=1}^{n} \mathbb{1}\{\exists\, 1 \le l \le K : R_{i\Pi_{il}} = 1\}}{n}, \tag{16} \]

where we assume there is only a single positive item that the user will engage with next, and the indicator function $\mathbb{1}\{\exists\, 1 \le l \le K : R_{i\Pi_{il}} = 1\}$ indicates whether the positive item falls into the top K positions of the ranked list we obtain.

In the ranking setting, at time point t, the rating matrix R can be constructed in two ways. One approach is to include all ratings after t, while the other approach is to only include ratings at time point t + 1. To maintain a similar setting that allows for convenient performance measurement, we adopt the latter approach.

When dealing with a large dataset that consists of numerous users and items, the evaluation process can become computationally intensive. Specifically, computing the ranking of all items based on their predicted scores for every individual user, as described in (14), would be quite time-consuming. To speed up the evaluation process, we employ a sampling strategy: we sample a fixed number C of negative candidates while also including the positive item that we know the user will engage with next. By doing so, both the set of item candidates $R_{ij}$ and the set of item predictions $\Pi_i$ are reduced to a smaller subset. Consequently, prediction scores only need to be computed for this reduced set of items through a single forward pass of the neural network.

Our ideal scenario is to achieve NDCG and Hit values of exactly 1. NDCG@K = 1 indicates that the positive item is consistently ranked at the top position of the top-K recommendation list. Similarly, Hit@K = 1 signifies that the positive item is always included in the top-K recommendations generated by the model.

During our evaluation procedures, increasing the value of C or decreasing the value of K makes the recommendation problem harder: a larger C enlarges the candidate pool and a smaller K demands higher ranking quality, so the model must make more accurate and precise recommendations.

To show the effectiveness of our method, we include three groups of recommendation methods.

The first group includes general recommendation methods, which only consider user feedback without considering the sequence order of actions:
• PopRec

The second group contains methods based on first-order Markov chains, which consider the last visited item:
• Factorized Personalized Markov Chains (FPMC) [1]
• Translation-based Recommendation (TransRec) [19]

The last group contains deep-learning-based sequential recommender systems, which consider several (or all) previously visited items:
• GRU4Rec [26]
• GRU4Rec+ [26]

C. Implementation Details
For the architecture in the default version of ASASREC, we use the same settings as its predecessor SASREC: two self-attention blocks (b = 2) and learned positional embeddings. Item embeddings in the embedding layer and prediction layer are shared. We implement ASASREC with TensorFlow. The optimizer is Adam [42], the learning rate is set to 0.001, and the batch size is 128. The dropout rate is 0.5 for the three datasets due to their sparsity. The maximum sequence length n is set to 50 for the datasets, i.e., roughly proportional to the mean number of actions per user. The performance of variants and different hyper-parameters is examined below.

D. Recommendation Performance
Table II presents the recommendation performance of all methods on the three datasets. By considering the second-best methods across all datasets, a general pattern emerges, with non-neural methods performing better on sparse datasets and neural approaches performing better on denser datasets. Presumably this owes to neural approaches having more parameters to capture high-order transitions (i.e., they are expressive but easily overfit), whereas carefully designed but simpler models are more effective in high-sparsity settings. Our method ASASREC outperforms all baselines on both sparse and dense datasets, and gains 6.9% Hit Rate and 9.6% NDCG improvements (on average) against the strongest baseline. One likely reason is
that our model can adaptively attend to items within different ranges on different datasets (e.g., only the previous item on sparse datasets and more on dense datasets).

Table III: Comparing SASREC and ASASREC on the Amazon Games dataset while varying the dimension of the embeddings.

METHODS    NDCG@10    HIT@10    USER DIM    ITEM DIM
SASREC     0.5936     0.8233    N/A         50
SASREC     0.5919     0.8202    N/A         100
ASASREC    0.6221     0.8129    50          50
ASASREC    0.6292     0.8389    50          100

Table IV: Comparing SASREC and ASASREC on the Amazon Games dataset while varying the maximum length allowed.

METHODS    NDCG@10    RECALL@10    MAX LEN    USER DIM    ITEM DIM
SASREC     0.5919     0.8202       200        N/A         100
ASASREC    0.6281     0.8341       200        50          100
SASREC     0.5769     0.8045       100        N/A         100
ASASREC    0.6186     0.8318       100        50          100

E. Scalability
As with standard MF methods, ASASREC scales linearly with the total number of users, items and actions, which is roughly the same as the original SASREC. A potential scalability concern is the maximum length n; however, the computation can be effectively parallelized with GPUs. Here we measure the training time and performance of ASASREC with different n, empirically study its scalability, and analyze whether it can handle sequential recommendation in most cases. Table V shows the performance and efficiency of ASASREC with various sequence lengths. Performance is better with larger n, up to around n = 500, at which point performance saturates (possibly because 99.8% of actions have been covered). However, even with n = 600, the model can be trained in under 2,000 seconds, which is still faster than Caser and GRU4Rec+. Hence, our model can easily scale to user sequences whose lengths are a few hundred, which we think is sufficient for most cases, especially for feedback types like reviews and purchases. We plan to investigate approaches for handling very long sequences in the future.

Table V: Scalability: performance and training time with different maximum length n on the Steam dataset.

n          10       50       100      200      300      400      500      600
Time(s)    75       101      157      341      613      965      1406     1895
NDCG@10    0.480    0.557    0.571    0.587    0.593    0.594    0.596    0.595

V. CONCLUSION
In this work, we proposed an extension of the self-attention-based sequential model SASREC for next-item recommendation. ASASREC models the entire user sequence (without any recurrent or convolutional operations) and adaptively considers consumed items for prediction. Compared with the original SASREC model, our model adds complexity in its handling of the user/item interaction history. In the future, we plan to extend the model by incorporating context information (e.g., dwell time, action types, rating, devices, etc.), and possibly click-through rate (CTR).

REFERENCES
[1] S. Rendle, C. Freudenthaler, and L. Schmidt-Thieme, “Factorizing personalized markov chains for next-basket recommendation,” in WWW, 2010.
[2] B. Hidasi, A. Karatzoglou, L. Baltrunas, and D. Tikk, “Session-based recommendations with recurrent neural networks,” in ICLR, 2016.
[3] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin, “Attention is all you need,” in NIPS, 2017.
[4] Y. Hu, Y. Koren, and C. Volinsky, “Collaborative filtering for implicit feedback datasets,” in ICDM, 2008.
[5] S. Rendle, C. Freudenthaler, Z. Gantner, and L. Schmidt-Thieme, “BPR: bayesian personalized ranking from implicit feedback,” in UAI, 2009.
[6] F. Ricci, L. Rokach, B. Shapira, and P. Kantor, Recommender systems handbook. Springer US, 2011.
[7] Y. Koren and R. Bell, “Advances in collaborative filtering,” in Recommender Systems Handbook. Springer, 2011.
[8] S. Kabbur, X. Ning, and G. Karypis, “Fism: factored item similarity models for top-n recommender systems,” in SIGKDD, 2013.
[9] S. Zhang, L. Yao, and A. Sun, “Deep learning based recommender system: A survey and new perspectives,” arXiv, vol. abs/1707.07435, 2017.
[10] S. Wang, Y. Wang, J. Tang, K. Shu, S. Ranganath, and H. Liu, “What your images reveal: Exploiting visual contents for point-of-interest recommendation,” in WWW, 2017.
[11] W. Kang, C. Fang, Z. Wang, and J. McAuley, “Visually-aware fashion recommendation and design with generative image models,” in ICDM, 2017.
[12] H. Wang, N. Wang, and D. Yeung, “Collaborative deep learning for recommender systems,” in SIGKDD, 2015.
[13] D. H. Kim, C. Park, J. Oh, S. Lee, and H. Yu, “Convolutional matrix factorization for document context-aware recommendation,” in RecSys, 2016.
[14] X. He, L. Liao, H. Zhang, L. Nie, X. Hu, and T. Chua, “Neural collaborative filtering,” in WWW, 2017.
[15] S. Sedhain, A. K. Menon, S. Sanner, and L. Xie, “Autorec: Autoencoders meet collaborative filtering,” in WWW, 2015.
[16] Y. Koren, “Collaborative filtering with temporal dynamics,” Communications of the ACM, 2010.
[17] C. Wu, A. Ahmed, A. Beutel, A. J. Smola, and H. Jing, “Recurrent recommender networks,” in WSDM, 2017.
[18] L. Xiong, X. Chen, T.-K. Huang, J. Schneider, and J. G. Carbonell, “Temporal collaborative filtering with bayesian probabilistic tensor factorization,” in SDM, 2010.
[19] R. He, W. Kang, and J. McAuley, “Translation-based recommendation,” in RecSys, 2017.
[20] R. He, C. Fang, Z. Wang, and J. McAuley, “Vista: A visually, socially, and temporally-aware model for artistic recommendation,” in RecSys, 2016.
[21] R. He and J. McAuley, “Fusing similarity models with markov chains for sparse sequential recommendation,” in ICDM, 2016.
[22] J. Tang and K. Wang, “Personalized top-n sequential recommendation via convolutional sequence embedding,” in WSDM, 2018.
[23] H. Jing and A. J. Smola, “Neural survival recommender,” in WSDM, 2017.
[24] Q. Liu, S. Wu, D. Wang, Z. Li, and L. Wang, “Context-aware sequential recommendation,” in ICDM, 2016.
[25] A. Beutel, P. Covington, S. Jain, C. Xu, J. Li, V. Gatto, and E. H. Chi, “Latent cross: Making use of context in recurrent recommender systems,” in WSDM, 2018.
[26] B. Hidasi and A. Karatzoglou, “Recurrent neural networks with top-k gains for session-based recommendations,” CoRR, vol. abs/1706.03847, 2017.
[27] K. Xu, J. Ba, R. Kiros, K. Cho, A. C. Courville, R. Salakhutdinov, R. S. Zemel, and Y. Bengio, “Show, attend and tell: Neural image caption generation with visual attention,” in ICML, 2015.
[28] D. Bahdanau, K. Cho, and Y. Bengio, “Neural machine translation by jointly learning to align and translate,” in ICLR, 2015.
[29] J. Chen, H. Zhang, X. He, L. Nie, W. Liu, and T. Chua, “Attentive
collaborative filtering: Multimedia recommendation with item- and
component-level attention,” in SIGIR, 2017.
[30] J. Xiao, H. Ye, X. He, H. Zhang, F. Wu, and T. Chua, “Attentional
factorization machines: Learning the weight of feature interactions via
attention networks,” in IJCAI, 2017.
[31] S. Wang, L. Hu, L. Cao, X. Huang, D. Lian, and W. Liu, “Attention-
based transactional context embedding for next-item recommendation,”
in AAAI, 2018.
[32] Y. Wu, M. Schuster, Z. Chen, Q. V. Le, M. Norouzi, W. Macherey,
M. Krikun, Y. Cao, Q. Gao, K. Macherey et al., “Google’s neural
machine translation system: Bridging the gap between human and
machine translation,” arXiv preprint arXiv:1609.08144, 2016.
[33] J. Zhou, Y. Cao, X. Wang, P. Li, and W. Xu, “Deep recurrent models
with fast-forward connections for neural machine translation,” TACL,
2016.
[34] M. D. Zeiler and R. Fergus, “Visualizing and understanding convolutional
networks,” in ECCV, 2014.
[35] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image
recognition,” in CVPR, 2016.
[36] L. J. Ba, R. Kiros, and G. E. Hinton, “Layer normalization,” CoRR, vol.
abs/1607.06450, 2016.
[37] S. Ioffe and C. Szegedy, “Batch normalization: Accelerating deep network
training by reducing internal covariate shift,” in ICML, 2015.
[38] N. Srivastava, G. E. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhut-
dinov, “Dropout: a simple way to prevent neural networks from
overfitting,” JMLR, 2014.
[39] D. Warde-Farley, I. J. Goodfellow, A. C. Courville, and Y. Bengio, “An
empirical analysis of dropout in piecewise linear networks,” CoRR, vol.
abs/1312.6197, 2013.
[40] L. Wu, S. Li, C.-J. Hsieh, and J. Sharpnack, “Stochastic shared
embeddings: Data-driven regularization of embedding layers,” 2020.
[41] Y. Koren, R. Bell, and C. Volinsky, “Matrix factorization techniques for
recommender systems,” Computer, 2009.
[42] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,”
in ICLR, 2015.
[43] D. Povey, H. Hadian, P. Ghahremani, K. Li, and S. Khudanpur, “A
time-restricted self-attention layer for asr,” in ICASSP, 2018.
[44] P. Wang, J. Guo, Y. Lan, J. Xu, S. Wan, and X. Cheng, “Learning
hierarchical representation model for next basket recommendation,” in
SIGIR, 2015.
[45] S. Bai, J. Z. Kolter, and V. Koltun, “An empirical evaluation of generic
convolutional and recurrent networks for sequence modeling,” CoRR,
vol. abs/1803.01271, 2018.
[46] S. Hochreiter, Y. Bengio, P. Frasconi, J. Schmidhuber et al., “Gradient
flow in recurrent nets: the difficulty of learning long-term dependencies,”
2001.
[47] J. J. McAuley, C. Targett, Q. Shi, and A. van den Hengel, “Image-based
recommendations on styles and substitutes,” in SIGIR, 2015.
[48] S. Feng, X. Li, Y. Zeng, G. Cong, Y. M. Chee, and Q. Yuan, “Personalized
ranking metric embedding for next new poi recommendation,” in IJCAI,
2015.
[49] Y. Koren, “Factorization meets the neighborhood: a multifaceted collab-
orative filtering model,” in SIGKDD, 2008.