Embarrassingly Shallow Auto-Encoders for Dynamic Collaborative Filtering
https://doi.org/10.1007/s11257-021-09314-7
Received: 4 March 2021 / Accepted in revised form: 28 November 2021 / Published online: 13 January 2022
© The Author(s), under exclusive licence to Springer Nature B.V. 2022
Abstract
Recent work has shown that despite their simplicity, item-based models optimised
through ridge regression can attain highly competitive results on collaborative filtering
tasks. As these models are analytically computable and thus forgo the need for often
expensive iterative optimisation procedures, they have become an attractive choice
for practitioners. Computing the closed-form ridge regression solution consists of
inverting the Gramian item-item matrix, which is known to be a costly operation
that scales poorly with the size of the item catalogue. Because of this bottleneck,
the adoption of these methods is restricted to a specific set of problems where the
number of items is modest. This can become especially problematic in real-world
dynamical environments, where the model needs to keep up with incoming data to
combat issues of cold start and concept drift. In this work, we propose Dynamic easer:
an algorithm based on the Woodbury matrix identity that incrementally updates an
existing regression model when new data arrives, either exactly or approximately. By
exploiting a widely accepted low-rank assumption for the user-item interaction data,
this allows us to target those parts of the resulting model that need updating, and avoid
a costly inversion of the entire item-item matrix with every update. We theoretically
and empirically show that our newly proposed methods can entail significant efficiency
gains in the right settings, broadening the scope of problems for which closed-form
models are an appropriate choice.
Jan Van Balen: work done while the author was at Adrem Data Lab.
Olivier Jeunen
[email protected]
1 Adrem Data Lab, Department of Computer Science, University of Antwerp, Antwerp, Belgium
2 Spotify, Brussels, Belgium
3 Faculty of Information Technology, Monash University, Clayton, VIC, Australia
1 Introduction
Recommender systems are information retrieval applications that aim to mitigate the
problem of “information overload”, by matching users to certain items (Borchers
et al. 1998). They have become ubiquitous on the world wide web and have found
applications in many different areas where these items can represent anything from
news articles and musical artists to retail products and social media accounts. Most
modern approaches to recommendation are based on some form of collaborative filter-
ing (Ekstrand et al. 2011), a family of methods that aim to model user preferences and
learn them from a dataset of user behaviour. These methods have known widespread
success over the years, and are the cornerstone of modern recommender systems
research. As a consequence, the quest for more effective collaborative filtering algo-
rithms is a very active research area, where significant strides forward are being made
every year. Many novel methods are based on deep and nonlinear neural networks, and
the expressiveness of this model class has made them ubiquitous in the field (Liang
et al. 2018; Elahi et al. 2019; Shenbin et al. 2020). Recent work casts doubt on the
reproducibility of evaluation strategies that are often adopted to empirically validate
research findings (Dacrema et al. 2019; Rendle 2019; Rendle et al. 2020), making
it harder to conclude whether these complex model classes are what the field needs
moving forward.
In a parallel line of research, the effectiveness of simpler linear models for the col-
laborative filtering task has been shown time and again (Ning and Karypis 2011; Levy
and Jack 2013; Sedhain et al. 2016; Steck 2019b, c; Steck et al. 2020). Most notably
and recently, Embarrassingly Shallow Auto-Encoders (reversed: easer ) have been
shown to yield highly competitive results with the state of the art, whilst often being
much easier to implement, and much more efficient to compute (Steck 2019a). The
closed-form solution that is available for ridge regression models is at the heart of these
major advantages, as easer effectively optimises a regularised least-squares problem.
Recently, easer has been extended to incorporate item metadata into two variants:
ceaser and add- easer (Jeunen et al. 2020). These extensions improve the capabili-
ties of closed-form linear models to deal with issues such as the “long tail” (very few
items account for the large majority of interactions) and “cold start” (new items do not
have any interactions) (Schein et al. 2002; Park and Tuzhilin 2008; Shi et al. 2014).
The main benefit of easer and its variants over competing approaches is their
computational efficiency. As the core algorithm consists of a single inversion of the
Gramian item-item matrix, it is often many times more efficient to compute than mod-
els relying on iterative optimisation techniques. As reported in the original paper, the
algorithm can be implemented in just a few lines of Python and is typically com-
puted in the order of minutes on various often used publicly available benchmark
datasets (Steck 2019a). Nevertheless, matrix inversion is known to scale poorly for
large matrices, and easer ’s reliance on it does inhibit its adoption in use-cases with
large item catalogues. In such cases, methods that rely on gradient-based optimisation
techniques are still preferable.
To add insult to injury, real-world systems rarely rely on a single model that is
computed once and then deployed. To make this concrete: suppose we operate a
hypothetical retail website, and we wish to send out an e-mail with a top-N list of
personalised recommendations to our subscribed users every few days. Naturally, the
model that generates these recommendation lists should evolve over time, preferably
incorporating new user-item interactions that have occurred over the past days. The
importance of having such a dynamic model is threefold: (1) It will generate more
novel and diverse recommendations than its static counterpart (Castells et al. 2015),
(2) it will be able to combat concept drift in the data (due to shifting item popularity or
seasonality trends in preferences) (Gama et al. 2014), and (3) it will have the means to
handle cold-start problems when either new items or new users appear (Schein
et al. 2002).
Many modern digital systems generate new data at increasingly fast rates, and this is
no different for our hypothetical retail website. This is important to take into account
when choosing a recommendation algorithm. Models that are already inefficient to
compute initially will only see these problems exacerbated when the predominant
approach is to recompute them from scratch every few days on ever-growing amounts of data.
This puts a theoretical limit on how often we can update the model, and incurs a
computational cost that we would like to reduce. Instead, it would be preferable to
have models that can be updated with new information when it arrives, but do not
require a full retraining of untouched parameters for every new batch of data that
comes in. This is not an easy feat, and the field of “online recommender systems”
that are able to handle model updates more elegantly has seen much interest in recent
years (Vinagre et al. 2020). More generally, the problem of “lifelong” or “continual”
learning in the machine learning field deals with similar issues (Chen 2018).
In this work, we present a novel algorithm to incrementally update the state-of-
the-art item-based linear model easer , which is naturally extended to include recent
variants that exploit side-information: ceaser and add- easer . easer consists of two
major computation steps: (1) the generation of the Gramian item-item matrix, and (2)
the inversion of this matrix that yields the solution to the regression problem.
We propose Dynamic easer (dyn- easer ), consisting of incremental update
rules for these two steps that leverage the recently proposed Dynamic Index algo-
rithm (Jeunen et al. 2019) and the well-known Woodbury matrix identity (Hager 1989),
respectively. As such, dyn- easer provides a way to efficiently update an existing
easer -like model without the need of recomputing the entire regression model from
scratch with every data update.
A theoretical analysis of the proposed algorithm shows that the highest efficiency
gains can be expected when the rank of the update to the Gramian is low, an assumption
that has been widely adopted in the recommender systems literature before (Koren
et al. 2009). We show how this quantity can be bounded using simple summary statis-
tics from the new batch of data, and support our findings with empirical results. Further
experiments confirm that dyn- easer is able to significantly cut down on computa-
tion time compared to iteratively retrained easer , in a variety of recommendation
domains. Finally, we show how we can update the model with low-rank approxima-
tions when the new batch of data itself is not low-rank; providing a tuneable trade-off
between the exactness of the solution and the efficiency with which it can be kept
up-to-date. Empirical observations show how this approximate variant of dyn- easer
still yields highly competitive recommendation performance, with greatly improved
update speed, and how the low-rank assumption can even improve on recommendation accuracy in certain settings.
We first formalise our use-case and present relevant mathematical notation used
throughout the rest of this work. We are interested in the “binary, positive-only”
implicit feedback setting (Verstrepen et al. 2017), where we have access to a dataset
consisting of preference indications from users in U over items in I at time t ∈ N,
collected in a set of interaction data P ⊆ U × I × N. Ignoring temporal information,
these preferences can be represented in a binary user-item matrix X ∈ {0, 1}|U |×|I | ,
where X u,i = 1 if we have a click, view, purchase,…for user u and item i in P,
and X u,i = 0 otherwise. With Pt , we denote the set of all interactions up to time t:
{(u, i, t′) ∈ P | t′ < t}. Consequently, X t is the user-item matrix constructed from the
set of interactions Pt . We will refer to the set of all items seen by user u as Iu ⊆ I,
and vice versa Ui ⊆ U for an item i. The Gramian of the user-item matrix is defined
as G := X^⊤ X; it is an item-item matrix that holds the co-occurrence count for items
i and j at index G i, j . The goal at hand for a recommendation algorithm is to predict
which zeroes in the user-item matrix X actually shouldn’t be zeroes, and thus imply
that the item would in some way “fit” the user’s tastes and consequently make for a
good item to be shown as a recommendation.
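To make the notation concrete, the following minimal sketch (our own code, with illustrative names) builds the binary user-item matrix X and its Gramian G from a set of interactions using SciPy:

```python
import numpy as np
from scipy.sparse import csr_matrix

def build_matrices(interactions, n_users, n_items):
    """Build the binary user-item matrix X and its Gramian G = X^T X.

    interactions: iterable of (user_index, item_index, timestamp) triples (P).
    Repeated (user, item) pairs are collapsed into a single binary entry.
    """
    rows, cols = zip(*{(u, i) for u, i, _ in interactions})
    X = csr_matrix((np.ones(len(rows)), (rows, cols)),
                   shape=(n_users, n_items), dtype=np.float64)
    G = (X.T @ X).toarray()    # item-item co-occurrence counts
    return X, G
```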
In some cases, additional information about items can be available. Such “side-
information” or “metadata” often comes in the form of discrete tags, which can, for
example, be a release year, genre or director for a movie, an artist or genre for a song, a
writer for a book, or many more. Incorporating item metadata in the modelling process
can help mitigate cold-start and long-tail issues, where the preference information for
a given item is limited (Schein et al. 2002; Park and Tuzhilin 2008). We will refer to
the set of all such tags as the vocabulary V. In a similar fashion to the user-item matrix
X, a tag-item matrix T ∈ R|V |×|I | is constructed. Note that this matrix is real-valued,
as it will often contain pre-computed values such as tf-idf weights instead of binary
indicators.
In what follows, we present a brief introduction to item-based recommendation
models, most notably item- knn (Sarwar et al. 2001), slim (Ning and Karypis 2011)
and easer (Steck 2019a). We then additionally introduce ceaser and add- easer as
extensions of easer that incorporate item side-information whilst retaining a closed-
form solution (Jeunen et al. 2020), as these are most relevant to the dynamic easer algorithm we propose in this work.
Here, X u,· denotes the u th row of X. Note that computing recommendation scores
for all training users and all items simply consists of computing the matrix multipli-
cation X S, an operation that is made more efficient when the matrix S is restricted to
be sparse. Scores for items i already present in the user history Iu are often ignored,
and the remaining items are ranked and presented in a top-N recommendation list
or slate to the user. Early seminal works would define the similarity matrix S as all
pairwise cosine similarities among items in the high-dimensional but sparse user-item
matrix X (Sarwar et al. 2001). This has then been extended to include slightly more
advanced notions of similarity such as Pearson’s correlation or conditional probabili-
ties (Deshpande and Karypis 2004). Recent work has introduced the “Dynamic Index”
algorithm to incrementally compute the Gramian of X, additionally showing that sev-
eral conventional similarity metrics such as cosine similarity or Jaccard index can be
readily computed from G when it is up-to-date (Jeunen et al. 2019).
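As an illustration, cosine similarities can be read off an up-to-date Gramian directly; the sketch below is our own code and assumes G is a dense NumPy array:

```python
import numpy as np

def cosine_from_gramian(G):
    """Item-item cosine similarities derived from the Gramian G = X^T X.

    For binary X, G[i, j] is the co-occurrence count of items i and j and
    G[i, i] = |U_i|, so cos(i, j) = G[i, j] / sqrt(G[i, i] * G[j, j]).
    """
    norms = np.sqrt(np.diag(G))        # per-item occurrence counts, square-rooted
    norms[norms == 0.0] = 1.0          # guard against division by zero
    S = G / np.outer(norms, norms)     # normalise rows and columns
    np.fill_diagonal(S, 0.0)           # ignore self-similarity
    return S
```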
Methods for actually learning an optimal item-item similarity matrix have been pro-
posed for the task of rating prediction (Koren 2008), as well as for pairwise learning
from implicit feedback (Rendle et al. 2009). Ning and Karypis were the first to pro-
pose to learn a sparse weight matrix S through a pointwise optimisation procedure,
aptly dubbing their approach the Sparse LInear Method (slim) (Ning and Karypis
2011). slim optimises a least-squares regression model with elastic net regularisation,
constrained to positive weights:
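$$\hat{S} = \underset{S}{\arg\min}\; \tfrac{1}{2}\lVert X - XS \rVert_F^2 + \tfrac{\beta}{2}\lVert S \rVert_F^2 + \lambda \lVert S \rVert_{1,1}, \quad \text{subject to } S \ge 0,\ \operatorname{diag}(S) = 0.$$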
The restriction of the diagonal to zero avoids the trivial solution where S = I. Many
extensions of slim have been proposed in recent years, and it has become a widely
used method for the collaborative filtering task (Ning and Karypis 2012; Levy and
Jack 2013; Christakopoulou and Karypis 2014; Sedhain et al. 2016; Christakopoulou
and Karypis 2016; Steck 2019a, c; Steck et al. 2020; Chen et al. 2020). In practice,
the slim optimisation problem is often decomposed into |I| independent problems
(one per target item). Although these can then be solved in an embarrassingly parallel
fashion, this renders the approach intractable for very large item catalogues. Indeed,
as they aim to solve |I| regression problems, their computational complexity is in the
order of O(|I| · (|I| − 1)^{2.373}), assuming they exploit the recent advances in efficient
matrix multiplication and inversion (Le Gall 2014; Alman and Vassilevska W. 2021).
The computational cost of the original slim approach is a known impediment for its
adoption in certain use-cases; related work has reported that hyper-parameter tuning
took several weeks on even medium-sized datasets (Liang et al. 2018).1
Steck studied whether the restrictions of slim to only allow positive item-item
weights and their l1 -regularisation-induced sparsity were necessary for the resulting
model to remain competitive, and concluded that this was not always the case (Steck
2019a; Steck et al. 2020). The resulting Tikhonov-regularised least-squares problem
can then be formalised as:
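$$\hat{S} = \underset{S}{\arg\min}\; \lVert X - X S \rVert_F^2 + \lambda \lVert S \rVert_F^2, \quad \text{subject to } \operatorname{diag}(S) = 0. \tag{3}$$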
The main advantage of simplifying the optimisation problem at hand is that the well-
known closed-form solutions for ordinary least squares (OLS) and ridge regression
can now be adopted. Including the zero-diagonal constraint via Lagrange multipliers
yields the Embarrassingly Shallow Auto-Encoder (easer ) model:
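$$\hat{P} = \left(X^\top X + \lambda I\right)^{-1}, \qquad \hat{S} = I - \hat{P} \cdot \operatorname{diagMat}\!\left(1 \oslash \operatorname{diag}(\hat{P})\right), \tag{4}$$

where diag(·) extracts the diagonal of a matrix as a vector, diagMat(·) builds a diagonal matrix from a vector, and ⊘ denotes element-wise division.

As a concrete illustration, this closed form takes only a few lines of Python; the sketch below follows Steck (2019a), with function and variable names that are our own:

```python
import numpy as np

def ease_r(X, lam=500.0):
    """Closed-form easer: ridge regression with a zero-diagonal constraint (Eq. 4)."""
    G = X.T @ X                          # item-item Gramian X^T X
    if hasattr(G, "toarray"):
        G = G.toarray()                  # densify if X was a sparse matrix
    G = np.asarray(G, dtype=np.float64)
    G[np.diag_indices_from(G)] += lam    # G + lambda * I
    P = np.linalg.inv(G)                 # P-hat
    S = P / (-np.diag(P))                # divide column j by -P_jj
    np.fill_diagonal(S, 0.0)             # enforce diag(S) = 0
    return S
```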
1 It should be noted that the authors have since released a more performant coordinate-descent-based
implementation of their method (Ning et al. 2019).
The second method, add- easer , treats the regression problem on the user-item
matrix X and the one on the tag-item matrix T as two fully independent problems to
solve in parallel; combining the two resulting item-item weight matrices S X and S T
in an additive fashion later down the line.
$$S^* = \alpha \cdot \underset{S_X}{\arg\min}\; \lVert W_X (X - X S_X) \rVert_F^2 + \lambda_X \lVert S_X \rVert_F^2 \;+\; (1 - \alpha) \cdot \underset{S_T}{\arg\min}\; \lVert W_T (T - T S_T) \rVert_F^2 + \lambda_T \lVert S_T \rVert_F^2, \tag{7}$$
$$\text{subject to } \operatorname{diag}(S_X) = \operatorname{diag}(S_T) = 0.$$
add- easer doubles the number of parameters used by easer and ceaser , increasing
its degrees of freedom at learning time at the cost of having to solve two regression
problems instead of one. Note, however, that these are fully independent and can
be computed in parallel. Equation 8 shows the analytical formulas to obtain the two
independent models, and combine them with a blending parameter 0 ≤ α ≤ 1.
The computational complexity of ceaser and add- easer remains in the order of
O(|I|^{2.373}), which is equivalent to the original easer approach. As such, these methods
allow item side-information to be included into the model without a significant added
cost in terms of computational complexity. The main reason for this is that we adapt
the entries in the Gramian G, but do not alter its dimensions.
Collaborative filtering techniques that can be incrementally updated when new data
arrives are a lively research area in itself. Vinagre et al. (2014) propose incremental
Stochastic Gradient Descent (SGD) as a way to dynamically update matrix factori-
sation models based on positive-only implicit feedback. Their methodology has first
been extended to include negative feedback (Vinagre et al. 2015), and then to a co-
factorisation model that is more complex than traditional matrix factorisation, but also
leads to superior recommendation accuracy (Anyosa et al. 2018). He et al. (2016)
propose an incremental optimisation procedure based on Alternating Least Squares
(ALS), and also show how it can be applied to efficiently and effectively update matrix
factorisation models. More recently, Ferreira et al. propose a method that personalises
learning rates on a user-basis, reporting further improvements. In contrast, our work
focuses on item-based similarity models that come with closed-form solutions, as
these have been shown to be highly competitive with the state of the art in many
collaborative filtering use-cases.
Instead of just incorporating new data into the model, Matuszyk et al. (2018) pro-
pose to forget older data that has become obsolete, reporting significantly improved
performance for collaborative filtering approaches. The dynamic easer method we
propose in Sect. 3 fits perfectly into this paradigm, as it can incorporate new data
just as easily as it can forget irrelevant information in a targeted manner. This type
of decremental learning has the additional advantage of being able to avoid complete
retraining in privacy-sensitive application areas, where specific user histories need to
be removed from the model upon request.
We have given a brief history of item-based collaborative filtering models, and have
discussed why easer and its variants are computationally often more efficient than
their counterparts based on slim. For very large item catalogues, however, its more
than quadratic computational complexity in the number of items still becomes a very
tangible issue. Because of this, the demand for an algorithm that can efficiently
update easer -like models when new data arrives is still very real, and a necessity
for these methods to obtain widespread adoption in practice. Recent work proposes
the “Dynamic Index” algorithm as a way to incrementally update item similarities
in neighbourhood-based models that adopt cosine similarity (Jeunen et al. 2019).
A crucial building block of this metric and the algorithm is the efficient and incre-
mental computation of the Gramian matrix G = X^⊤ X. By storing G in low-overhead
sparse data-structures such as inverted indices, they minimise memory overhead whilst
still allowing for an amortised constant lookup time when querying Iu , which is
a requirement for incremental updates. From Eqs. 4, 6 and 8 , it is clear to see
that easer and its variants are dependent on this Gramian matrix as well. In fact,
it is the only building block needed to be able to compute the resulting item-item
weight matrix Ŝ. As such, we adopt parts of the Dynamic Index algorithm pro-
posed by Jeunen et al. to first efficiently compute and then incrementally update
the Gramian matrix G. Once we have an up-to-date matrix G, we need to compute
its inverse to obtain P̂ and the eventual model Ŝ from that. The matrix inversion
to go from G to P̂ is the workhorse behind easer that takes up the large major-
ity of the computation time, as this step corresponds to solving the least-squares
problem formulated in Eq. 3. Iterative re-computation of this matrix inverse every
time we wish to incorporate new data into the model is thus to be avoided whenever possible.
Equation 9 shows the Woodbury matrix identity, which posits that the inverse of a
rank-k correction to some n × n matrix A can be computed by performing a rank-k
correction on the inverse of the original matrix (Hager 1989).
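$$(A + U C V)^{-1} = A^{-1} - A^{-1} U \left(C^{-1} + V A^{-1} U\right)^{-1} V A^{-1}. \tag{9}$$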
So, given A−1 , U, C and V , there is no need to re-compute the inversion on the
update of A, but it is sufficient to multiply a few matrices and compute the inverse
of (C^{−1} + V A^{−1} U) ∈ R^{k×k}. Naturally, for large n and k ≪ n, the efficiency gains
coming from this reformulation will be most significant. Although this is no require-
ment for Eq. 9 to hold, we assume C ∈ Rk×k to be a diagonal matrix. As a result, the
inversion of C becomes trivial and consists of just k operations.
In our setting, suppose we have an up-to-date model at a certain time t with X t , G t ,
P̂ t and Ŝt . At a given time t + 1, suppose we have an updated user-item matrix X t+1 ,
but we wish to compute G t+1 , P̂ t+1 and the resulting Ŝt+1 as efficiently as possible.
As we mentioned before, computing G t+1 incrementally can be achieved easily and
efficiently by adopting parts of the Dynamic Index algorithm. In fact, because of
the incremental nature of the algorithm, we can easily just store the difference in the
Gramian matrix instead of its entirety: G_Δ = G_{t+1} − G_t = X_{t+1}^⊤ X_{t+1} − X_t^⊤ X_t. Given
a set of user-item interactions PΔ ⊂ U × I to include into the model and an inverted
index Lt mapping users u to their histories Iu , Algorithm 1 shows how to achieve this.
Note that the indices holding Iu are just a sparse representation of the user-item matrix
X and don’t require any additional memory consumption. Furthermore, Algorithm 1 is
easily parallelisable through the same MapReduce-like paradigm adopted by Jeunen
et al. (2019). Naturally, an efficient implementation will exploit the symmetry of the
Gramian G Δ to decrease memory consumption as well as the number of increments
needed at every update.
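A minimal Python sketch of this incremental co-occurrence update (our own rendering of the idea behind Algorithm 1; it ignores the parallelisation and symmetry optimisations mentioned above):

```python
from collections import defaultdict

def update_gramian_delta(P_delta, user_index):
    """Accumulate G_delta = G_{t+1} - G_t from a batch of new user-item
    interactions, using an inverted index of current user histories.

    P_delta:    iterable of (user, item) pairs to incorporate.
    user_index: dict mapping each user to the set of items in their history (L_t).
    Returns a sparse dict-of-counts representation of G_delta.
    """
    G_delta = defaultdict(int)
    for u, i in P_delta:
        history = user_index.setdefault(u, set())
        if i in history:
            continue                      # repeated interactions are ignored
        for j in history:                 # new co-occurrences with u's existing history
            G_delta[(i, j)] += 1
            G_delta[(j, i)] += 1
        G_delta[(i, i)] += 1              # diagonal: item occurrence count
        history.add(i)
    return G_delta
```

The quantity we ultimately need is the updated inverse of the regularised Gramian:

$$\hat{P}_{t+1} = (G_{t+1} + \lambda I)^{-1} = (G_t + \lambda I + G_\Delta)^{-1}. \tag{10}$$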
The form on the right-hand side already begins to resemble Woodbury’s formula in
Eq. 9. All that’s left is to decompose G Δ ∈ Rn×n into matrices U ∈ Rn×k , C ∈ Rk×k
and V ∈ Rk×n . As G Δ is the difference of two real symmetric matrices G t+1 and G t ,
it will always be a real symmetric matrix as well. This means that the eigenvectors
of G_Δ can be chosen to be orthogonal to each other: Q^⊤ ≡ Q^{−1}. Consequently, an
eigendecomposition always exists, where k is the rank of G_Δ:

$$G_\Delta = Q \Lambda Q^{-1} = Q \Lambda Q^\top = \sum_{i=1}^{k} \Lambda_{ii}\, Q_{\cdot,i}\, Q_{\cdot,i}^\top. \tag{11}$$
$$\hat{P}_{t+1} = (G_t + \lambda I + G_\Delta)^{-1} = (G_t + \lambda I + Q \Lambda Q^\top)^{-1} = \hat{P}_t - \hat{P}_t Q \left(\Lambda^{-1} + Q^\top \hat{P}_t Q\right)^{-1} Q^\top \hat{P}_t. \tag{12}$$
The full dyn- easer procedure is presented in Algorithm 2. If the updates to the
Gramian matrix are low-rank, this procedure will be much more computationally
efficient than re-computing the inverse of the entire Gramian matrix from scratch, as
we will show in the following subsection. The assumption that the data-generating
process behind user-item interactions is generally low-rank has been exploited far
and wide in the recommender systems literature (Koren et al. 2009).
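A compact sketch of the resulting update step in Python (our own rendering of the idea behind Algorithm 2; for simplicity it uses a dense eigendecomposition, whereas the text below discusses sparse, symmetric eigensolvers):

```python
import numpy as np

def dyn_ease_r_update(P, G_delta, tol=1e-10):
    """Exact dyn-easer update of P = (G + lambda*I)^{-1} via the Woodbury identity (Eq. 12).

    P:       current inverse regularised Gramian P_t (dense, |I| x |I|).
    G_delta: symmetric update G_{t+1} - G_t (dense, |I| x |I|).
    Returns (P_{t+1}, S_{t+1}).
    """
    eigvals, Q = np.linalg.eigh(G_delta)          # G_delta = Q Lambda Q^T
    keep = np.abs(eigvals) > tol                  # drop (numerically) zero eigenvalues
    eigvals, Q = eigvals[keep], Q[:, keep]

    PQ = P @ Q                                    # Woodbury correction (Eq. 12)
    inner = np.diag(1.0 / eigvals) + Q.T @ PQ
    P_new = P - PQ @ np.linalg.solve(inner, PQ.T)

    S_new = P_new / (-np.diag(P_new))             # re-derive item-item weights (Eq. 4)
    np.fill_diagonal(S_new, 0.0)
    return P_new, S_new
```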
It is interesting to note that easer does not follow the low-rank assumption that
motivates the popular family of latent factor models for collaborative filtering. Indeed,
easer is a full-rank model, combatting overfitting with Gaussian priors on its parame-
ters rather than reducing the dimensionality of the problem. The low-rank assumption
we adopt here is on the update to the Gramian G Δ , instead of the full Gramian G. As
we will show further on, both theoretically and empirically, this assumption holds in
a variety of settings.
The fact that G Δ is symmetric and will often be very sparse in nature can be
exploited when computing the eigendecomposition on line 3 of Algorithm 2, as
we will show in the following section. Many modern software packages for sci-
entific computing implement very efficient procedures specifically for such cases
(e.g. SciPy (Virtanen et al. 2020)). Note that alternative algorithms to factorise G Δ
into lower-dimensional matrices exist, often relying on randomised sampling pro-
cedures (Martinsson et al. 2011; Halko et al. 2011). These algorithms are reportedly
more efficient to compute than the traditional eigendecomposition, but often not geared
specifically toward the high-dimensional yet sparse use-case we tackle in this work,
or not equipped to exploit the symmetric structure that is typical for the Gramian.
As they compute two dense matrices of Q’s dimensions, their improvement in computation
time comes at the cost of increased memory consumption. Furthermore,
these methods are often focused on approximate matrix reconstructions whereas we
are interested in an exact decomposition of the update to the Gramian. As the eigende-
composition fulfils our needs, the study of alternative factorisation methods falls out
of the scope of the present work.
Throughout this section, we have focused on dyn- easer as a general extension of
easer . Naturally, our approach is trivially extended to include ceaser , add- easer
or a weight matrix W different from the identity matrix I as well, as these variants
only change the input to Algorithms 1 and 2, but bear no impact on the procedures
themselves.
2 In our experiments, we use an efficient SciPy implementation of a variant called the Implicitly Restarted
Lanczos Method (Lehoucq et al. 1998; Virtanen et al. 2020); the analysis is equivalent.
$$G_{t+1} = G_t - X_t^\top X_t + X_{t+1}^\top X_{t+1} = G_t - X_{[\mathcal{U}_\Delta,\cdot],t}^\top X_{[\mathcal{U}_\Delta,\cdot],t} + X_{[\mathcal{U}_\Delta,\cdot],t+1}^\top X_{[\mathcal{U}_\Delta,\cdot],t+1}.$$
The update then becomes: $G_\Delta = X_{[\mathcal{U}_\Delta,\cdot],t+1}^\top X_{[\mathcal{U}_\Delta,\cdot],t+1} - X_{[\mathcal{U}_\Delta,\cdot],t}^\top X_{[\mathcal{U}_\Delta,\cdot],t}$.
Lemma 1 Given a |U| × |I| user-item matrix X, its Gramian matrix G, and updates
to X; the rank of the update of the Gramian matrix G Δ can be upper bounded by two
times the number of unique, nonzero rows in X Δ : rank(G Δ ) ≤ 2|UΔ |.
Proof As the rank of a matrix is defined as its number of linearly independent row
or column vectors, a (possibly loose) upper bound for rank(X [UΔ ,·] ) is given by its
number of nonzero rows |UΔ |. Consequently, the rank of the Gramian matrix has the
same bound: rank(X_{[U_Δ,·]}^⊤ X_{[U_Δ,·]}) ≤ |U_Δ|. It is well known that the rank of the sum
of two matrices is less than or equal to the sum of the ranks of the individual matrices.
Bringing those together, we have that rank(G Δ ) ≤ 2|UΔ |.
3 In the SciPy package for Python, an implementation of the randomised method presented by Liberty et
al. can be found under scipy.linalg.interpolative.estimate_rank (Liberty et al. 2007; Virtanen et al. 2020).
This upper bound on rank(G Δ ) holds for any update to X. When users in the update
are disjoint of those in X t , the bound can be tightened to |UΔ |. For general-purpose
use cases, it is not feasible to ensure that users in the update do not appear with
partial histories in previous iterations of the model. For specific applications such as
session-based recommendation, however, it is common practice to train models on
the session-item matrix, which satisfies this assumption by definition (Ludewig and
Jannach 2018).
Lemma 2 Given a |U| × |I| user-item matrix X, its Gramian matrix G, and updates
to X that only consist of adding new rows or altering previously zero-rows; the rank
of the update of the Gramian matrix G Δ can be upper bounded by the number of rows
being added or altered: rank(G Δ ) ≤ |UΔ |.
Proof When the update only pertains to new users, this means that X [UΔ ,·],t = 0,
which ensures that rank(G Δ ) = rank(X Δ ). Because rank(X Δ ) is bounded by |UΔ |
per definition, so is rank(G Δ ): rank(G Δ ) ≤ |UΔ |.
We have provided bounds for rank(G Δ ) by focusing on the number of users that
have contributed interactions in the new batch of data that we wish to include into
the model. Analogously, in some settings, it might be easier to bound the number of
unique items that are being interacted with. In a news recommendation setting, for
example, a new batch of data might consist of only a very limited number of items
(in the order of hundreds) being read by a much higher number of users (hundreds of
thousands). In this case, we can straightforwardly extend Lemmas 1 and 2 to bound
the rank by the number of independent columns in X as opposed to its rows. The
further reasoning and results follow trivially, bounding rank(G Δ ) by 2|IΔ | and |IΔ |,
respectively. Whereas the original easer approach and the need to iteratively retrain
would make it a poor choice for applications with possibly vast item catalogues but
smaller active item catalogues, such as catalogues of news articles, the presented
upper bounds theoretically show why dyn- easer can provide an efficient updating
mechanism.
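In practice, such bounds can be computed cheaply from a new batch before deciding how to update the model; a small sketch (our own code):

```python
def rank_upper_bound(P_delta, new_users=False, new_items=False):
    """Cheap upper bound on rank(G_delta) for a batch of (user, item) pairs,
    following Lemmas 1-2 and their item-focused analogues."""
    n_users = len({u for u, _ in P_delta})
    n_items = len({i for _, i in P_delta})
    user_bound = n_users if new_users else 2 * n_users   # Lemma 2 vs. Lemma 1
    item_bound = n_items if new_items else 2 * n_items   # item-focused analogues
    return min(user_bound, item_bound)
```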
Naturally, the rank of the update will not always be low in general recommendation
use-cases. The easiest counter-example to think of is the case where we wish to include
k user-item interactions that pertain to k new and unique users as well as k unique
items. This will lead to a diagonal-like structure of X Δ and rank(X Δ ) = k, which
is problematic for large values of k. However, it is also not hard to see that incorpo-
rating such a batch of data into our model will not affect any of our personalisation
capabilities. Indeed, as easer exploits signal from item co-occurrences, data where
no item co-occurrences are present is practically useless, even though it is full-rank.
Although this is a contrived example, it serves to illustrate that the rank of the update
is not necessarily synonymous with its informational value.
In these cases, we can still resort to updating our model P̂ with a low-rank approx-
imation of G Δ without hurting the performance of the updated model. Instead of
computing the rank and a full eigendecomposition of the Gramian as shown in Algo-
rithm 2, we can choose the rank k at which we wish to truncate, and update P̂ with
a low-rank approximation G̃ Δ instead of the real thing. The resulting algorithm is
shown in Algorithm 3, and it provides a tuneable trade-off between the exactness of
the acquired solution and the efficiency of incremental updates.
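A minimal sketch of such a truncated update (our own rendering of the idea behind Algorithm 3, using SciPy's sparse symmetric eigensolver; names are ours):

```python
import numpy as np
from scipy.sparse.linalg import eigsh

def dyn_ease_r_update_approx(P, G_delta, k):
    """Approximate dyn-easer update with a rank-k truncation of G_delta.

    P:       current dense inverse regularised Gramian P_t.
    G_delta: scipy.sparse symmetric update to the Gramian.
    k:       cut-off rank of the truncated eigendecomposition.
    """
    # The k eigenpairs of largest magnitude give the best rank-k approximation of G_delta.
    eigvals, Q = eigsh(G_delta.asfptype(), k=k, which="LM")

    PQ = P @ Q                                    # Woodbury correction, as in the exact case
    inner = np.diag(1.0 / eigvals) + Q.T @ PQ
    P_new = P - PQ @ np.linalg.solve(inner, PQ.T)

    S_new = P_new / (-np.diag(P_new))
    np.fill_diagonal(S_new, 0.0)
    return P_new, S_new
```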
Interestingly, this type of approximate update is closely related to yet another
extension of the slim paradigm: Factored Item Similarity Models (fism) (Kabbur
et al. 2013). In fism, the similarity matrix S is modelled as the product of two
lower-dimensional latent factor matrices. The resulting low-rank model is shown to
be increasingly effective as the sparsity in the user-item interactions it learns from
increases, highlighting that this type of approximation does not necessarily imply a
decrease in recommendation accuracy. In approximate dyn- easer , we do not directly
model the similarity matrix S as factorised, but we update S with a factorised version
of the update to the Gramian G Δ . Factorised models such as fism or approximate
dyn- easer also bear resemblance to models that are often used in natural language
processing applications. Indeed, the well-known word2vec algorithm to learn word
embeddings for natural language processing applications implicitly learns to factorise
a matrix holding the (shifted positive) pointwise mutual information between word-
context pairs (Mikolov et al. 2013; Levy and Goldberg 2014).
Although our motivations for approximate dyn- easer are rooted in improving the
computational cost of exact dyn- easer , the advantages of transitivity that come from
adopting low-rank representations can significantly impact recommendation perfor-
mance as well. Imagine items a, b, c ∈ I where (a, b) and (b, c) co-occur in the
training data of user histories, but (a, c) does not. Full-rank easer cannot infer a cor-
relation between a and c in such a setting, whereas low-rank models can learn a latent
factor that unifies a, b and c. This explains the advantage that low-rank models have
in sparse data environments. For further insights on the advantages, differences and
analogies between full-rank and low-rank models, we refer the interested reader to the
work of Van Balen and Goethals (2021).
As we are factorising G Δ by its truncated eigendecomposition, we are guaranteed to
end up with the optimal rank-k approximation with respect to the mean squared error
between G̃ Δ and G Δ . Naturally, with the highly sparse nature of G Δ , this optimal
approximation will focus on reconstructing entries with large values, and rows or
columns with many nonzero values. This corresponds to focusing on the items that
occur most often in the new incoming batch of user-item interactions PΔ . Because of
this, we can expect approximate dyn- easer to favour recently popular items, which
can give an additional performance boost in the right application areas. Nevertheless,
an in-depth discussion or validation of the efficacy of factorised easer -like models
falls outside the scope of this work, as we focus on the efficiency with which the
model can be updated. If the cut-off rank k is lower than the true rank of the update,
approximate dyn- easer guarantees an improvement in terms of the computational
complexity of the update procedure.

Table 1 Datasets we adopt throughout the experiments presented in this work, along with their source and
summary statistics that describe the user-item interactions and their sparsity
The goal of this section is to validate that the methods we proposed in earlier sections
of this manuscript work as expected, and to investigate whether expectations grounded
in theory can be substantiated with empirical observations. Concretely, the research
questions we wish to answer are the following:
RQ1 Can exact dyn- easer provide more efficient model updates in comparison with
iteratively retrained easer ?
RQ2 Can our theoretical analysis on the correlation between rank(G Δ ) and the run-
time of dyn- easer set realistic expectations in practice?
RQ3 Do the phenomena we describe for bounding rank(G Δ ) occur in real-world
session-based or news recommendation datasets?
RQ4 Can approximate dyn- easer provide a sensible trade-off between recommen-
dation efficiency and effectiveness?
Table 1 shows the publicly available datasets we use throughout our experiments
in an attempt to provide empirical answers to the above-mentioned research ques-
tions. The well-known MovieLens dataset (Harper and Konstan 2015) consists of
explicit ratings (on a 1–5 scale) that users have given to movies, along with the time
of rating. We drop ratings lower than 3.5 and treat the remainder as binary prefer-
ence expressions. Additionally, we only keep users and items that appear at least 3
times throughout the dataset. This type of pre-processing is common, and ensures
we are left with positive preference expressions that carry enough signal for effec-
tive personalisation (Liang et al. 2018; Beel and Brunel 2019). We take the newest
and largest variant of the dataset as our starting point: MovieLens-25M. Many rec-
ommender systems applications are based on shorter browsing sessions rather than
full user histories that might span years (Ludewig and Jannach 2018). As laid out in
Sect. 3.3, these set-ups can be especially amenable to our approach, as the adoption of
these shorter sessions instead of longer user histories naturally decreases the rank of
the update to the Gramian. We adopt two well-known datasets for session-based rec-
ommender systems: the YooChoose dataset, released in the context of the 2015 ACM
RecSys Challenge (Ben-Shimon et al. 2015); and the RetailRocket dataset (Kaggle
2016). These datasets consist of implicit feedback (clicks) from users on retail prod-
ucts, and we compute the 3-core for users and items in the same manner we did for
MovieLens-25M, after removing repeated user-item interactions. To validate our intu-
itions regarding dyn- easer and the rank of the Gramian in news recommendation
setups, we use the Adressa and Microsoft News datasets (MIND) (Gulla et al. 2017;
Wu et al. 2020). These datasets contain implicit feedback inferred from browsing
behaviour on news websites; we pre-process them analogously to the other datasets.
Some datasets have prohibitively large item catalogues for easer to compute the
inverse Gramian at once. However, the large majority of items are often at the extreme
end of the so-called “long tail”, only being interacted with once or twice. We prune
these items to keep the easer computation feasible but still highlight the advantages
of dyn- easer .
Note that these pruning operations on rare items significantly cut down computation
time for easer (directly dependent on |I|), but do not pose an unfair advantage for
dyn- easer . Items that appear just once in the dataset blow up the size of the Gramian,
but do not significantly impact the rank of the Gramian updates. Indeed, in these
situations we get that k ≪ |I|, and the computational advantages of dyn- easer over
easer become even more pronounced. We adopt such pruning as it is common practice
and keeps the computational needs for reproducing our experiments reasonable. The
reason we do not further explore other commonly known datasets such as the Million
Song Dataset (MSD) (Bertin-Mahieux et al. 2011), is that these do not include logged
timestamps that indicate when the user-item interactions occurred. Because of this,
they are unsuited for evaluating a realistic scenario where models are incrementally
retrained over time.
The final dataset we adopt is the SuperMarket Dataset with Implicit feedback
(SMDI) introduced by Viniski et al. Because this dataset has a comparatively small
item catalogue, the computation time for all easer variants is in the order of seconds
and largely dominated by variance and system overhead. We adopt the SMDI dataset
to study the recommendation performance of approximate dyn- easer , as it exhibits
a distribution shift that is largely absent in the other datasets we consider.
To foster the reproducibility of our work, all source code for the experi-
ments we have conducted is publicly available under an open-source license at
github.com/olivierjeunen/dynamic-easer/. All code is written in Python 3.7 using
SciPy (Virtanen et al. 2020). Reported runtimes are wall-time as measured using
an Intel Xeon processor with 14 cores. The rest of this section is structured to follow
the research questions laid out above.
To verify the gains in runtime from exact dyn- easer over iteratively retrained easer ,
we chronologically split the user-item interactions based on a fixed timestamp t, yield-
ing all user-item interactions up to t, and all those after t. The Microsoft News Dataset
comes with user’s reading histories and clicks on shown recommendations, but the
former type of interaction does not include timestamps. Because of this, we treat these
historical interactions as “early data” included in the original easer computation,
and incorporate the timed clicks chronologically into dyn- easer in the procedure
described below.
We train an easer model on the early batch, and log the runtime in seconds needed
for this computation. This operation is repeated over 5 runs, and we report a 95%
Gaussian confidence interval. As new incoming user-item interactions do not affect
the dimension of the Gramian matrix that needs to be inverted, the runtime needed to
compute easer remains fairly constant when adding new user-interactions.
Over the newer batch of data, we employ a non-overlapping sliding window tech-
nique that chronologically generates batches of data to be included in the existing
model via our proposed exact dyn- easer procedure. The size of this window δ is
varied to study the effects on the runtime of dyn- easer . Larger values of δ imply
larger update batch sizes, which will often lead to an increase in rank(G Δ ). Naturally,
when δ becomes too large, a point is reached where the overhead induced by our
incremental updating method becomes prohibitively large, and it becomes favourable
to fully retrain the easer model. Sensible values of δ come with a restriction: when
the runtime of the model update is larger than δ, this would indicate that the procedure
cannot keep up with incoming data in real-time. We do not encounter this issue for
any of the values of δ explored in our experiments, suggesting that dyn- easer can
be a good fit for various configurations.
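For concreteness, a small sketch of this non-overlapping sliding-window batching (our own code; interactions are assumed to be (user, item, timestamp) triples sorted by timestamp, all occurring after t_start):

```python
def sliding_window_batches(interactions, t_start, delta):
    """Yield consecutive, non-overlapping batches of width delta (in the same
    unit as the timestamps), starting from t_start."""
    batch, window_end = [], t_start + delta
    for u, i, t in interactions:
        while t >= window_end:           # close the current window (possibly empty)
            yield batch
            batch, window_end = [], window_end + delta
        batch.append((u, i))
    if batch:
        yield batch
```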
Figure 1 visualises the resulting runtimes from the procedure laid out above, on all
five considered datasets. The time for the sliding window increases over the x-axis,
and runtime for varying values of δ is shown on the y-axis. The explored values of δ
differ based on the dataset and use-case: for the 25-year spanning MovieLens dataset,
daily updates might be sufficient; for the 3-month spanning news recommendation
dataset Adressa, more frequent 5-minute updates might be more appropriate, to keep
up with the highly dynamic nature of the environment.
We included values of δ that push the runtime for dyn- easer up to that of easer
to highlight the limitations of our approach. Provided that the computing power and
infrastructure is available, however, δ can be decreased to bring dyn- easer ’s runtime
into the desirable range. Note that this limitation on δ is general for online learning
approaches from user-item interactions, and not specific to the methods we propose
in this work.
From the runtime results, we can observe that our proposed method entails signif-
icant performance improvements compared to iterative model retraining, for a wide
range of settings. Over all datasets, we observe a clear trend toward lower runtimes for
shorter sliding windows and more frequent updates, as is expected from our theoretical
results.
Fig. 1 Runtime results for dyn- easer updated with different intervals (sliding window width δ), as com-
pared to iteratively retrained easer over the final N days of the datasets. We observe that for a wide range
of interval widths δ, dyn- easer can provide significant efficiency gains. When δ becomes too large, the
overhead that comes with incremental updates becomes too high and a full retrain with vanilla easer is
favourable
As the MovieLens-25M dataset spans several decades, the amount of new user-item
interactions to be incorporated on a daily basis remains modest. Exploring lower values
of δ would not provide any additional insights into the performance of dyn- easer
because of this. As a consequence, we obtain a clean separation between the runtime
for dyn- easer on batches of different length.
The remaining four datasets represent session- and news-based recommendation
environments, which are known to be much more fast-paced and dynamic. Because we
focus on smaller sliding window lengths δ here, we clearly see daily seasonal patterns
emerging. Indeed, dyn- easer runtime peaks coincide with peaks in website traffic.
As the rank of the update is typically correlated with the number of users or items in
the update, this phenomenon is to be expected. It highlights that dyn- easer is able
to effectively target those model parameters that need updating, and does not spend
unnecessary computing cycles on unchanged parts of the model. Note that δ does not
need to be a fixed constant in real-world applications. An effective use of computing
power might decrease and increase δ during traffic peaks and valleys, respectively.
The runtime of the incremental updates shown in Fig. 1 is visualised against the rank
of the updates in Fig. 2. We clearly observe a strong correlation between the rank of the
update to the Gramian and the runtime of dyn- easer , with a trend that is consistent
over varying values of δ.
We fit a polynomial of the form f(x) = a · x^b + c on a randomly sampled subset
of 90% of measurements, and assess its performance in predicting the runtime for
dyn- easer based on rank(G Δ ) on the remaining 10% of the measurements. Table 2
shows the optimal parameters, the number of samples (runtime measurements) and
the root mean squared error (RMSE) on the test sample for every dataset. Figure 2
qualitatively shows that we are able to predict the runtime for dyn- easer updates with
reasonable accuracy when we know the rank of the update. Combined with the bounds
on this quantity laid out in Sect. 3.3, we can use this to set an expected upper bound
for the computation time of our incremental updates through dyn- easer . Table 2
quantitatively shows the magnitude of the errors, reaffirming our qualitatively obtained
insights. Note that whereas the absolute RMSE increases with the datasets with larger
item catalogues, the relative error of the model remains fairly constant. Indeed, a mean
error of 5 seconds on a prediction of 10 seconds is not equivalent to being 5 seconds off
when the order of magnitude is 1000 seconds. These empirical observations together
with the theoretical analysis presented in Sect. 3 highlight the efficiency and favourable
scalability of the proposed dyn- easer procedure.
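A minimal sketch of this fit with SciPy (our own code; ranks and runtimes denote the measured rank(G_Δ) values and wall-clock times):

```python
import numpy as np
from scipy.optimize import curve_fit

def fit_runtime_model(ranks, runtimes):
    """Fit f(x) = a * x^b + c to (rank, runtime) measurements."""
    f = lambda x, a, b, c: a * np.power(x, b) + c
    (a, b, c), _ = curve_fit(f, ranks, runtimes, p0=(1e-3, 1.5, 1.0), maxfev=10_000)
    return a, b, c
```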
Figure 3 shows the rank of the incremental updates from Fig. 1 compared to summary
statistics for the batches of user-item interactions. This visualisation shows the effec-
tiveness of the upper bounds laid out in Sect. 3.3 in order to assess their utility and
provide a better understanding of the underlying dynamics for every dataset.
Fig. 2 Runtime for incremental updates from Fig. 1 plotted against the rank of the update to the Gramian
matrix G Δ . We observe a strong correlation between higher values of rank(G Δ ) and runtime, as well as a
correlation between δ and higher rank(G Δ ). This result highlights that rank(G Δ ) is the main driving factor
behind dyn- easer ’s computational complexity, and that bounding it can give realistic expectations for
dyn- easer efficiency in practice
Table 2 Resulting polynomial model to predict runtime from rank(G Δ ), along with the root mean squared
error (RMSE) it attains and the number of observations N it was fitted on
Dataset RMSE N a b c
We observe that both for general purpose MovieLens-25M and the session-based
datasets, the user-focused bound performs reasonably well in approximating the rank
of the update to the Gramian. This is in line with our theoretical expectations, and
confirms that the number of unique users in any given batch of user-item interactions
are the main driving factor for rank(G Δ ). We further see that the upper bound becomes
looser as the number of unique users grows. This as well is expected behaviour, as it
becomes less likely for new users’ behaviour to be linearly independent of other users
in the batch as the batch size grows. As mentioned in 3.3, the upper bound of 2|U|
could be tightened to |U| if we did not perform a hard split on time but rather divided
user sessions into a “finished” and “ongoing” set. This phenomenon occurs naturally
for the YooChoose dataset, where we clearly see that the 2|U| bound is much looser.
Note that the tight bounds for MovieLens might change if this dataset would include
timestamps for item consumption rather than rating, as the majority of users might
watch a smaller set of current series or movies. Such a recency bias would decrease
the active item catalogue, favouring the item-based bounds.
In contrast with the user-focused datasets, the bound on the number of unique items
is much tighter for the news datasets, providing an almost perfect approximation in
many cases. This confirms our intuition that the rank of the update in these settings is
fully determined by the number of active items in the catalogue and virtually independent
of the number of users or interactions in a given batch. This in turn makes these
environments especially amenable to our dyn- easer approach.
The number of unique users or items in a batch can give rise to reasonably tight upper
bounds on the rank of the update in realistic scenarios, using real-world datasets. The
absolute number of user-item interactions |P_Δ| provides another (impractical) bound
on the rank of the update; indeed, in a worst-case scenario, every user-item interaction
would pertain to a unique user and a unique item. We include the visualisation of
the relation between rank(G Δ ) and |P_Δ| to intuitively show that our proposed approach
scales favourably with respect to the size of the data, a property that is most appreciated
in highly dynamic environments with ever-growing dataset sizes.
(Figure 3: rank(G_Δ) plotted against |U_Δ|, |I_Δ| and |P_Δ| for window widths δ ∈ {5m, 10m, 15m, 30m, 1h}.)
Fig. 3 Upper bounds for the batches of incremental updates from Fig. 1 plotted against the rank of the update
to the Gramian matrix G Δ . We observe that different applications that imply different data characteristics
bound the rank of the update in different ways. These results confirm our expectations that where the user-
focused upper bound is a good approximator for the MovieLens and RetailRocket datasets, the item-focused
bound is tighter in news recommendation settings. Moreover, we observe favourable behaviour for the rank
of the update in function of the number of interactions, further highlighting dyn- easer ’s scalability
Figure 4 shows computation time for exact dyn- easer , as well as several approximate
model variants with varying cut-off ranks k. In terms of runtime improvements, we
observe very favourable results for approximate updates. As is expected, the compu-
tational cost of dyn- easer ’s updates can largely be attributed to the computation of
all eigen-pairs, and limiting the rank has a significant impact on the efficiency of said
updates. At cut-off rank k = 250, the computational cost for the updates is decreased
by a factor of 3, or 65%.
As we have mentioned above, the computation time for all easer variants on the
SMDI dataset is in the order of seconds and largely dominated by variance and system
overhead. As a result, runtime results on this dataset do not provide significant insights,
and we do not report them.
Fig. 4 Runtime results for exact and approximate dyn- easer variants, with varying cut-off ranks k. We
observe a quick and steady decline in computation time needed for lower values of k, which can be attributed
to less computation time spent finding eigen-pairs
Figure 5 visualises Recall@K for K ∈ {1, 5, 10, 20} over time on the Adressa and
SMDI datasets. The SMDI dataset has large variance in the number of active users over
time, which heavily influences the statistical significance of some of the evaluation
results, as they are based on insufficient samples. We do not include evaluation results
where the evaluation set consisted of fewer than 100 users. The denominator for our
Recall measure is min(K, |I_u^test|) instead of the number of held-out items |I_u^test|, to
ensure that a perfect value of 1 can be attained at all cut-offs K.
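For reference, a minimal implementation of this Recall@K variant (our own code):

```python
def recall_at_k(ranked_items, held_out_items, k):
    """Recall@K with denominator min(K, |held-out items|), so that a perfect
    score of 1 is attainable at every cut-off."""
    hits = len(set(ranked_items[:k]) & set(held_out_items))
    return hits / min(k, len(held_out_items))
```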
Additional to several approximate model variants, we include recommendation
accuracy results for a model that is not updated over time, yielding a lower bound on
the accuracy we can expect from approximate updates. On the Adressa dataset, we
observe that such a model quickly deteriorates over time. This is to be expected, as
the Adressa dataset and news recommendation application are heavily biased toward
recent items and interactions. This bias is less clear on the SMDI dataset at lower cut-off ranks K, but clearly manifests itself for Recall@20 near the end of the evaluation period.
Fig. 5 Recommendation accuracy results for exact and approximate dyn- easer variants, with varying cut-
off ranks k. We additionally report results for a stale easer model that does not incorporate new user-item
data over time. We observe that exact (dyn- )easer ’s recommendation accuracy can largely be retained by
approximate variants, as the recommendation accuracy measurements remain within statistical noise of one
another for a large range of cut-off ranks k. For k = 5, we observe a statistically significant improvement
as time goes on. For k = 1, we observe a decrease in recommendation accuracy with respect to the exact
model
Table 3 Runtime and recommendation accuracy measurements on the Adressa and SMDI datasets, for the experimental procedure laid out in Algorithm 4

Dataset   Model                  k      Runtime        R@1             R@5             R@10            R@20
Adressa   easer (stale)          –      –              0.30 (−25.6%)   0.67 (−33.0%)   1.07 (−34.0%)   1.73 (−36.0%)
          dyn-easer (exact)      –      118s (0%)      0.40 (0%)       1.00 (0%)       1.64 (0%)       2.72 (0%)
          dyn-easer (approx.)    1000   116s (−2%)     0.40 (0%)       1.00 (0%)       1.64 (0%)       2.72 (0%)
                                 500    68s (−43%)     0.40 (0%)       1.00 (0%)       1.64 (0%)       2.73 (0%)
                                 250    45s (−62%)     0.40 (0%)       1.00 (0%)       1.64 (0%)       2.73 (0%)
                                 50     29s (−75%)     0.39 (−3.9%)    0.99 (−0.8%)    1.62 (−0.9%)    2.71 (−0.7%)
                                 5      28s (−76%)     0.56 (+38.9%)   1.21 (+21.0%)   1.87 (+14.0%)   2.96 (+7.7%)
                                 1      27s (−77%)     0.67 (+67.0%)   1.30 (+30.0%)   1.90 (+16.0%)   2.87 (+5.3%)
SMDI      easer (stale)          –      –              9.57 (−6.8%)    8.52 (−16.0%)   9.87 (−16.0%)   12.73 (−18.0%)

The R@K columns show measurements for Recall@K as described in the text; relative differences (in parentheses) are with respect to exact dyn-easer. We compare exact dyn-easer with approximate variants at varying cut-off ranks k, and observe favourable trade-offs.
For both datasets, we observe that the accuracy of approximate dyn- easer variants
for high values of k is not statistically significantly different from that of exact dyn- easer .
This highlights that the N-fold improvement in terms of computational cost can come
with a negligible impact on recommendation accuracy, showing the advantages of
approximate computations. For Adressa, we observe a statistically significant improve-
ment over exact dyn- easer for low values of k. This can be attributed to the reasons
laid out in Sect. 3, as the low-rank approximation handles sparsity, transitivity, and
favours recently popular items. These model characteristics are highly favourable
in news recommendation settings—but might have smaller influence on supermarket
data. Nevertheless, the results highlight that many efficient low-rank updates can yield
highly competitive models compared to more costly full-rank updates.
All runtime and recommendation accuracy measurements are aggregated in Table 3,
providing further insights on the trade-off between runtime and recommendation accu-
racy for approximate dyn- easer . We denote the Recall@K measure as R@K for
improved spacing. On the SMDI dataset, the differences in recommendation accuracy
among exactly or approximately updated model variants were not found to be statisti-
cally significant at the p = 0.05 level. The differences between the stale easer and
dyn- easer models are significant.
The size of the Adressa dataset yields more statistical power, and both the differ-
ences between stale easer and dyn- easer and those between exact dyn- easer and
approximate dyn- easer with k ∈ {1, 5} were found to be statistically significant.
5 Conclusion
Linear item-based models are an attractive choice for many collaborative filtering
tasks due to their conceptual simplicity, interpretability, and recommendation accu-
racy. Recent work has shown that the analytical solution available for ridge
regression can significantly improve the scalability of such methods, with a state-
of-the-art algorithm called easer (Steck 2019a). easer consists of a single matrix
inversion of the item-item Gramian. As its computational complexity depends neither on the num-
ber of users nor on the number of user-item interactions in the training set, it is
particularly well suited to use-cases with many users or interactions, with the sole
constraint that the item catalogue remains modest in size.
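For concreteness, the following NumPy sketch illustrates this closed-form computation as described by Steck (2019a); variable names, the regularisation value, and the dense matrix representation are illustrative choices rather than those of any released implementation.

import numpy as np

def fit_ease(X, reg=500.0):
    # X: (num_users x num_items) binary interaction matrix, dense here for brevity.
    num_items = X.shape[1]
    G = X.T @ X + reg * np.identity(num_items)  # regularised item-item Gramian
    P = np.linalg.inv(G)                        # the single, costly matrix inversion
    B = P / (-np.diag(P))                       # divide column j by -P[j, j]
    np.fill_diagonal(B, 0.0)                    # zero diagonal: items do not explain themselves
    return B                                    # dense item-item weight matrix

Scores for a user are obtained by multiplying their binary history vector with B, after which the highest-scoring unseen items are recommended.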
When deployed in real-world applications, models often need to be periodically
recomputed to incorporate new data and to account for newly available items, shifting
user preferences, and general concept drift. Iteratively retraining an easer-like
model from scratch places additional strain on such applications, putting a
hard upper limit on the attainable frequency of model updates and possibly
driving up computational costs. This especially limits the application of easer in
domains where item recency is an important factor in deciding item relevance, such
as retail or news recommendation.
In this work, we propose a novel and exact updating algorithm for embarrassingly
shallow auto-encoders that combines parts of the Dynamic Index algorithm (Jeunen
et al. 2019) and the Woodbury matrix identity (Hager 1989): Dynamic easer
(dyn-easer). We have provided a thorough theoretical analysis of our proposed
approach, highlighting in which cases it provides a considerable advantage over
iteratively retrained easer, and in which cases it does not. These theoretical insights
are corroborated by the results of extensive experiments, showing that
dyn-easer is well suited for efficient and effective online collaborative filtering in
various real-world applications.
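The following sketch illustrates the core Woodbury step on the inverse of the regularised Gramian, under the assumption that the Gramian update has already been (exactly or approximately) decomposed into eigenvectors U and nonzero eigenvalues; the function name and interface are illustrative and do not mirror the released implementation.

import numpy as np

def woodbury_update(P, U, eigvals):
    # P: previous inverse of the regularised Gramian (n x n, symmetric).
    # U: eigenvectors of the Gramian update (n x r); eigvals: its r nonzero eigenvalues.
    PU = P @ U                                   # n x r
    core = np.diag(1.0 / eigvals) + U.T @ PU     # small r x r system
    return P - PU @ np.linalg.solve(core, PU.T)  # Woodbury-corrected inverse

Only an r x r system is solved, where r is the rank of the update; the item-item weights are then re-derived from the updated inverse exactly as in the closed-form computation sketched above.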
dyn-easer exploits the sparse and symmetric structure of the Gramian to efficiently
compute the eigendecomposition of the Gramian update. When the rank of the update
is large, however, this operation can still become prohibitively expensive. To mitigate
this problem, we have additionally proposed an approximate dyn-easer variant that
uses a low-rank approximation of the Gramian update instead of its exact decom-
position. Empirical results highlight further efficiency improvements at a small cost in
recommendation accuracy. Our work broadens the scope of problems for which item-
based models optimised through ridge regression are an appropriate choice in practice. To foster
reproducibility, the source code for all our experiments is publicly
available under an open-source license at github.com/olivierjeunen/dynamic-easer.
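One way to obtain such a rank-k approximation of the sparse, symmetric Gramian update is a truncated (Lanczos-type) eigendecomposition, sketched below with SciPy's eigsh; whether the released implementation relies on this exact routine is an assumption on our part.

import scipy.sparse.linalg as spla

def truncated_update(delta_G, k):
    # delta_G: sparse, symmetric Gramian update; k: the approximation rank.
    eigvals, U = spla.eigsh(delta_G, k=k, which='LM')  # k eigenpairs of largest magnitude
    return U, eigvals                                  # drop-in input for the Woodbury step above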
A promising area for future work is to further improve dyn-easer's computational
efficiency by investigating alternative (approximate) matrix decompositions that exploit
efficient random sampling (Halko et al. 2011; Martinsson et al. 2011), as the bottleneck
of our current approach lies in the computation of the exact eigendecomposition of
the update to the Gramian. Furthermore, we would like to explore applications of our
efficient incremental updating scheme to more general multi-label regression tasks
beyond the collaborative filtering use-case we tackle in this work.
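As a rough illustration of the direction such future work could take, the sketch below follows the basic randomised range-finder scheme of Halko et al. (2011) for a symmetric matrix; it is not part of dyn-easer, and the oversampling choice and absence of power iterations are illustrative simplifications.

import numpy as np

def randomized_eigh(A, rank, oversample=10, seed=0):
    # Approximate dominant eigenpairs of a symmetric matrix A via a randomised range finder.
    rng = np.random.default_rng(seed)
    test = rng.standard_normal((A.shape[0], rank + oversample))  # random test matrix
    Q, _ = np.linalg.qr(A @ test)                                # orthonormal basis for the range of A
    T = Q.T @ (A @ Q)                                            # small projected eigenproblem
    eigvals, V = np.linalg.eigh(T)
    keep = np.argsort(np.abs(eigvals))[::-1][:rank]              # largest-magnitude eigenpairs
    return Q @ V[:, keep], eigvals[keep]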
Acknowledgements This work received funding from the Flemish Government (AI Research Programme).
References
Alman, J., Vassilevska Williams, V.: A refined laser method and faster matrix multiplication. In: Proceedings of the Thirty-Second Annual ACM-SIAM Symposium on Discrete Algorithms, SODA ’21. Society for Industrial and Applied Mathematics (2021)
Anyosa, S.C., Vinagre, J., Jorge, A.M.: Incremental matrix co-factorization for recommender systems with implicit feedback. In: Companion Proceedings of The Web Conference 2018, WWW ’18, pp. 1413–1418. International World Wide Web Conferences Steering Committee (2018). ISBN 9781450356404
Beel, J., Brunel, V.: Data pruning in recommender systems research: Best practice or malpractice? In: Proceedings of the 13th ACM Conference on Recommender Systems, RecSys ’19 (2019)
Ben-Shimon, D., Tsikinovsky, A., Friedmann, M., Shapira, B., Rokach, L., Hoerle, J.: Recsys challenge 2015
and the yoochoose dataset. In: Proceedings of the 9th ACM Conference on Recommender Systems,
RecSys ’15, pp. 357–358. ACM (2015)
Bertin-Mahieux, T., Ellis, D.P.W., Whitman, B., Lamere, P.: The million song dataset. In: Proceedings of
the 12th International Society for Music Information Retrieval Conference, ISMIR ’11 (2011)
Borchers, A., Herlocker, J., Konstan, J., Riedl, J.: Ganging up on information overload. Computer 31(4),
106–108 (1998). ISSN 0018-9162
Castells, P., Hurley, N.J., Vargas, S.: Novelty and Diversity in Recommender Systems, pp. 881–918. Springer,
US (2015)
Chen, Z., Liu, B.: Lifelong machine learning. Synth. Lect. Artif. Intell. Mach. Learn. 12(3), 1–207 (2018)
Chen, Y., Wang, Y., Zhao, X., Zou, J., de Rijke, M.: Block-aware item similarity models for top-n recom-
mendation. ACM Trans. Inf. Syst. 38(4), 1–26 (2020)
Christakopoulou, E., Karypis, G.: HOSLIM: higher-order sparse linear method for top-n recommender sys-
tems. In: Advances in Knowledge Discovery and Data Mining, pp. 38–49. Springer, New York (2014)
Christakopoulou, E., Karypis, G.: Local item-item models for top-n recommendation. In: Proceedings of
the 10th ACM Conference on Recommender Systems, RecSys ’16, pp. 67–74. ACM (2016). ISBN
978-1-4503-4035-9
Dacrema, M.F., Cremonesi, P., Jannach, D.: Are we really making much progress? A worrying analysis of recent neural recommendation approaches. In: Proceedings of the 13th ACM Conference on Recommender Systems, RecSys ’19, pp. 101–109. ACM (2019). ISBN 978-1-4503-6243-6
Deshpande, M., Karypis, G.: Item-based top-n recommendation algorithms. ACM Trans. Inf. Syst. 22(1),
143–177 (2004)
Ekstrand, M.D., Riedl, J.T., Konstan, J.A.: Collaborative filtering recommender systems. Found. Trends
Hum. Comput. Interact. 4(2), 81–173 (2011). ISSN 1551-3955
Elahi, E., Wang, W., Ray, D., Fenton, A., Jebara, T.: Variational low rank multinomials for collaborative fil-
tering with side-information. In: Proceedings of the 13th ACM Conference on Recommender Systems,
RecSys ’19, pp. 340–347. ACM (2019). ISBN 978-1-4503-6243-6
Ferreira, E.J., Enembreck, F., Barddal, J.P.: Adadrift: An adaptive learning technique for long-history stream-
based recommender systems. In: Proceedings of the 2020 IEEE International Conference on Systems,
Man, and Cybernetics (SMC), pp. 2593–2600 (2020)
Gama, J., Žliobaitė, I., Bifet, A., Pechenizkiy, M., Bouchachia, A.: A survey on concept drift adaptation. ACM Comput. Surv. 46(4) (2014). ISSN 0360-0300
Gulla, J.A., Zhang, L., Liu, P., Özgöbek, Ö., Su, X.: The Adressa dataset for news recommendation. In: Proceedings of the International Conference on Web Intelligence, WI ’17, pp. 1042–1048. Association for Computing Machinery (2017). ISBN 9781450349512
Hager, W.W.: Updating the inverse of a matrix. SIAM Rev. 31(2), 221–239 (1989)
Halko, N., Martinsson, P.G., Tropp, J.A.: Finding structure with randomness: probabilistic algorithms for
constructing approximate matrix decompositions. SIAM Rev. 53(2), 217–288 (2011). https://doi.org/
10.1137/090771806
Harper, F.M., Konstan, J.A.: The MovieLens datasets: history and context. ACM Trans. Interact. Intell. Syst. 5(4), 19:1–19:19 (2015)
He, X., Zhang, H., Kan, M., Chua, T.: Fast matrix factorization for online recommendation with implicit
feedback. In: Proceedings of the 39th International ACM SIGIR Conference on Research and Devel-
opment in Information Retrieval, SIGIR ’16, pp. 549–558. ACM (2016)
Jeunen, O.: Revisiting offline evaluation for implicit-feedback recommender systems. In: Proceedings of
the 13th ACM Conference on Recommender Systems, RecSys ’19, pp. 596–600. ACM, (2019). ISBN
978-1-4503-6243-6
Jeunen, O., Goethals, B.: Pessimistic reward models for off-policy learning in recommendation. In: Pro-
ceedings of the 15th ACM Conference on Recommender Systems, RecSys ’21 (2021)
Jeunen, O., Verstrepen, K., Goethals, B.: Fair offline evaluation methodologies for implicit-feedback recom-
mender systems with mnar data. In: Proceedings of the REVEAL 18 Workshop on Offline Evaluation
for Recommender Systems (RecSys ’18), October (2018)
Jeunen, O., Verstrepen, K., Goethals, B.: Efficient similarity computation for collaborative filtering in dynamic environments. In: Proceedings of the 13th ACM Conference on Recommender Systems, RecSys ’19, pp. 251–259. ACM (2019). ISBN 978-1-4503-6243-6
Jeunen, O., Van Balen, J., Goethals, B.: Closed-form models for collaborative filtering with side-information.
In: Fourteenth ACM Conference on Recommender Systems, RecSys ’20, pp. 651–656, (2020). ISBN
9781450375832
Kabbur, S., Ning, X., Karypis, G.: FISM: Factored item similarity models for top-n recommender systems.
In: Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and
Data Mining, KDD ’13, pp. 659–667, (2013). ISBN 9781450321747
Kaggle: RetailRocket recommender system dataset (2016). URL https://www.kaggle.com/retailrocket/ecommerce-dataset
Khawar, F., Poon, L., Zhang, N.L.: Learning the structure of auto-encoding recommenders. In: Proceedings of The Web Conference 2020, WWW ’20, pp. 519–529. ACM (2020)
Koren, Y.: Factorization meets the neighborhood: a multifaceted collaborative filtering model. In: Proceed-
ings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining,
KDD ’08, pp. 426–434. Association for Computing Machinery, (2008). ISBN 9781605581934
Koren, Y., Bell, R., Volinsky, C.: Matrix factorization techniques for recommender systems. Computer 42(8), 30–37 (2009). ISSN 0018-9162
Lanczos, C.: An iteration method for the solution of the eigenvalue problem of linear differential and integral
operators. J. Res. Natl. Bur. Stand. 45, 255–282 (1950)
Le Gall, F.: Powers of tensors and fast matrix multiplication. In: Proceedings of the 39th International
Symposium on Symbolic and Algebraic Computation, ISSAC ’14, pp. 296–303. Association for
Computing Machinery (2014)
Lehoucq, R.B., Sorensen, D. C., Yang, C.: ARPACK users’ guide: solution of large-scale eigenvalue prob-
lems with implicitly restarted Arnoldi methods. SIAM (1998)
Levy, M., Jack, K.: Efficient top-n recommendation by linear regression. In: RecSys Large Scale Recom-
mender Systems Workshop (2013)
Levy, O., Goldberg, Y.: Neural word embedding as implicit matrix factorization. Adv. Neural Inform.
Process. Syst. 27, 2177–2185 (2014)
Liang, D., Krishnan, R.G., Hoffman, M.D., Jebara, T.: Variational autoencoders for collaborative filtering.
In: Proceedings of the 2018 World Wide Web Conference, WWW ’18, pp. 689–698. International
World Wide Web Conferences Steering Committee, ACM (2018)
Liberty, E., Woolfe, F., Martinsson, P., Rokhlin, V., Tygert, M.: Randomized algorithms for the low-rank
approximation of matrices. Proc. Natl. Acad. Sci. 104(51), 20167–20172 (2007)
Ludewig, M., Jannach, D.: Evaluation of session-based recommendation algorithms. User Model. User-
Adap. Inter. 28(4), 331–390 (2018)
Martinsson, P.G., Rokhlin, V., Tygert, M.: A randomized algorithm for the decomposition of matrices. Appl.
Comput. Harmon. Anal. 30(1), 47–68 (2011). ISSN 1063-5203
Matuszyk, P., Vinagre, J., Spiliopoulou, M., Jorge, A.M., Gama, J.: Forgetting techniques for stream-based
matrix factorization in recommender systems. Knowl. Inf. Syst. 55(2), 275–304 (2018). https://doi.
org/10.1007/s10115-017-1091-8
Mikolov, T., Sutskever, I., Chen, K., Corrado, G.S., Dean, J.: Distributed representations of words and
phrases and their compositionality. Adv. Neural Inform. Process. Syst. 26, 3111–3119 (2013)
Ning, X., Karypis, G.: SLIM: Sparse linear methods for top-n recommender systems. In: Proceedings of the
2011 IEEE 11th International Conference on Data Mining, ICDM ’11, pp. 497–506. IEEE Computer
Society (2011). ISBN 978-0-7695-4408-3
Ning, X., Karypis, G.: Sparse linear methods with side information for top-n recommendations. In: Pro-
ceedings of the 6th ACM Conference on Recommender Systems, RecSys ’12, pp. 155–162, (2012).
ISBN 9781450312707
Ning, X., Nikolakopoulos, A.N., Shui, Z., Sharma, M., Karypis, G.: SLIM Library for Recommender Systems (2019). URL https://github.com/KarypisLab/SLIM
Paige, C.C.: Accuracy and effectiveness of the Lanczos algorithm for the symmetric eigenproblem. Linear
Algebra Appl. 34, 235–258 (1980)
Pan, V.Y., Chen, Z.Q.: The complexity of the matrix eigenproblem. In: Proceedings of the 31st Annual
ACM Symposium on Theory of Computing, STOC ’99, pp. 507–516. Association for Computing
Machinery, (1999)
Park, Y., Tuzhilin, A.: The long tail of recommender systems and how to leverage it. In: Proceedings of the
2008 ACM Conference on Recommender Systems, RecSys ’08, pp. 11–18, (2008). URL https://doi.
org/10.1145/1454008.1454012
Rendle, S.: Evaluation metrics for item recommendation under sampling (2019)
Rendle, S., Freudenthaler, C., Gantner, Z., Schmidt-Thieme, L.: BPR: Bayesian personalized ranking from
implicit feedback. In: Proceedings of the 25th Conference on Uncertainty in Artificial Intelligence,
UAI ’09, pp. 452–461. AUAI Press (2009)
Rendle, S., Krichene, W., Zhang, L., Anderson, J.: Neural collaborative filtering vs. matrix factorization
revisited. In: Fourteenth ACM Conference on Recommender Systems, RecSys ’20, pp. 240–248,
(2020). ISBN 9781450375832
Sarwar, B., Karypis, G., Konstan, J., Riedl, J.: Item-based collaborative filtering recommendation algorithms.
In: Proceedings of the 10th International Conference on World Wide Web, WWW ’01, pp. 285–295.
ACM (2001)
Schein, A.I., Popescul, A., Ungar, L.H., Pennock, D.M.: Methods and metrics for cold-start recommen-
dations. In: Proceedings of the 25th Annual International ACM SIGIR Conference on Research and
Development in Information Retrieval, SIGIR ’02, pp. 253–260. ACM (2002)
Sedhain, S., Menon, A.K., Sanner, S., Braziunas, D.: On the effectiveness of linear models for one-class col-
laborative filtering. In: Proceedings of the 30th AAAI Conference on Artificial Intelligence, AAAI’16,
pp. 229–235 (2016)
Shenbin, I., Alekseev, A., Tutubalina, E., Malykh, V., Nikolenko, S.I.: RecVAE: A new variational autoen-
coder for top-n recommendations with implicit feedback. In: Proceedings of the 13th International
Conference on Web Search and Data Mining, WSDM ’20, pp. 528–536, (2020)
Shi, Y., Larson, M., Hanjalic, A.: Collaborative filtering beyond the user-item matrix: a survey of the state
of the art and future challenges. ACM Comput. Surv. 47(1) (2014)
Steck, H.: Embarrassingly shallow autoencoders for sparse data. In: The World Wide Web Conference,
WWW ’19, pp. 3251–3257 (2019)
Steck, H.: Collaborative filtering via high-dimensional regression. CoRR, abs/1904.13033 (2019)
Steck, H.: Markov random fields for collaborative filtering. Adv. Neural Inform. Process. Syst. 32, 5473–
5484 (2019)
Steck, H., Dimakopoulou, M., Riabov, N., Jebara, T.: ADMM SLIM: Sparse recommendations for many users.
In: Proceedings of the 13th International Conference on Web Search and Data Mining, WSDM ’20,
pp. 555–563 (2020)
Ubaru, S., Saad, Y.: Fast methods for estimating the numerical rank of large matrices. In: Proceedings of
the 33rd International Conference on Machine Learning, Vol. 48, ICML’16, pp. 468–477. JMLR.org
(2016)
Van Balen, J., Goethals, B.: High-dimensional sparse embeddings for collaborative filtering. In: Proceedings
of the Web Conference 2021, WWW ’21, pp. 575–581. ACM (2021)
Verstrepen, K., Bhaduri, K., Cule, B., Goethals, B.: Collaborative filtering for binary, positive-only data. SIGKDD Explor. Newsl. 19(1), 1–21 (2017). ISSN 1931-0145
Vinagre, J., Jorge, A.M., Gama, J.: Fast incremental matrix factorization for recommendation with positive-
only feedback. In: User Modeling, Adaptation, and Personalization, pp. 459–470. Springer, Cham
(2014)
Vinagre, J., Jorge, A.M., Gama, J.: Collaborative filtering with recency-based negative feedback. In: Pro-
ceedings of the 30th Annual ACM Symposium on Applied Computing, SAC ’15, pp. 963–965, (2015).
ISBN 9781450331968
Vinagre, J., Jorge, A.M., Al-Ghossein, M., Bifet, A.: ORSUM - Workshop on Online Recommender Systems
and User Modeling, pp. 619–620. ACM, (2020). ISBN 9781450375832
Viniski, A.D., Barddal, J.P., de Souza Britto Jr., A., Enembreck, F., de Campos, H.V.A.: A case study of batch and incremental recommender systems in supermarket data under concept drifts and cold start. Expert Syst. Appl. 176, 114890 (2021). ISSN 0957-4174
Virtanen, P., Gommers, R., Oliphant, T.E., et al.: SciPy 1.0: fundamental algorithms for scientific computing in Python. Nat. Methods 17(3), 261–272 (2020)
Wu, F., Qiao, Y., Chen, J., Wu, C., Qi, T., Lian, J., Liu, D., Xie, X., Gao, J., Wu, W., Zhou, M.: MIND:
A large-scale dataset for news recommendation. In: Proceedings of the 58th Annual Meeting of the
Association for Computational Linguistics, ACL’20, pp. 3597–3606. Association for Computational
Linguistics, (2020)
Publisher’s Note Springer Nature remains neutral with regard to jurisdictional claims in published maps
and institutional affiliations.
Olivier Jeunen is a postdoctoral scientist at the University of Antwerp, where he received a PhD for his
thesis “Offline Approaches to Recommendation with Online Success” in 2021. He has a track record of
collaborating with prominent industrial research laboratories, and his recent work has been recognised
with the ACM RecSys ’21 Best Student Paper Award.
Jan Van Balen is a senior research scientist at Spotify, interested in recommender systems and music infor-
mation retrieval. He has previously worked at University of Antwerp (Belgium), Apple Music, and Utrecht
University (The Netherlands).
Bart Goethals is a full professor and chair of the department of computer science at the University of
Antwerp in Belgium, as well as an adjunct professor at Monash University in Melbourne, Australia. His
research focuses on pattern mining and recommender systems. Additionally, he co-founded and is now
the CTO of university spin-off company Froomle.